The Reworkd founders launched AgentGPT last year, a free tool for building AI agents that went viral on GitHub and gained over 100,000 daily users in a week. This led them to be accepted into Y Combinator's Summer 2023 class, but the co-founders quickly realized that building a general-purpose AI agent was too broad a scope. So now Reworkd is a web scraping company, specifically building AI agents that extract structured data from the public web.
AgentGPT offered a simple in-browser interface that allowed users to create autonomous AI agents, and soon everyone was hyping agents as the future of computing.
When the tool took off, Asim Shrestha, Adam Watkins, and Srijan Subedi were still living in Canada and Reworkd didn't exist. The huge influx of users caught them off guard. Subedi, now COO at Reworkd, said the tool's API calls were costing them $2,000 a day. So they created Reworkd and had to raise funds, fast. One of the most common use cases for AgentGPT was creating web scrapers, a relatively simple but high-volume task, so Reworkd specialized in that.
Web scrapers have become invaluable in the AI era. According to a new report from Bright Data, the top reason organizations collect public web data in 2024 is to build AI models. The problem is that web scrapers are traditionally built by humans and must be customized for specific web pages, making them costly. Reworkd's AI agents, by contrast, can scrape more of the web with less human intervention.
Customers provide Reworkd with a list of hundreds or even thousands of websites to scrape and specify the type of data they're interested in. Reworkd's AI agents then convert this into structured data using multi-modal code generation. The agents generate unique code to scrape each website and extract that data for the customer to use as they wish.
For example, say you want stats for every player in the NFL, but each team's website has a different layout. Instead of hand-building a scraper for each site, you give Reworkd's agents a link and a description of the data you want, and they generate and run the scrapers for you. With 32 teams, that could save you hours; with 1,000 sites, it could save you weeks.
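To make the approach concrete, here is a minimal, self-contained sketch of the kind of site-specific extraction code such an agent might generate for one hypothetical team page. The markup, the CSS class names, and the `RosterScraper` class are all invented for illustration; Reworkd's actual generated code is not public.

```python
# Sketch of the code-generation scraping pattern: a multimodal model
# looks at a page and emits a parser tailored to that page's layout.
# Below is the sort of deterministic parser it might emit, applied to
# a hard-coded sample page (stdlib only, no network access).
from html.parser import HTMLParser

SAMPLE_PAGE = """
<table id="roster">
  <tr><td class="name">J. Smith</td><td class="yards">1,204</td></tr>
  <tr><td class="name">A. Jones</td><td class="yards">987</td></tr>
</table>
"""

class RosterScraper(HTMLParser):
    """The kind of site-specific parser an agent might generate."""

    def __init__(self):
        super().__init__()
        self.rows = []       # completed records
        self._row = {}       # record currently being assembled
        self._field = None   # field name the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") in ("name", "yards"):
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field:
            self._row[self._field] = data.strip()
            self._field = None
            if len(self._row) == 2:  # both fields seen: row complete
                self.rows.append(self._row)
                self._row = {}

scraper = RosterScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.rows)
# → [{'name': 'J. Smith', 'yards': '1,204'}, {'name': 'A. Jones', 'yards': '987'}]
```

The key point is that the extraction itself is ordinary, deterministic code; the model's job is only to write a parser like this once per site, rather than to read every page itself.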
Reworkd has raised $2.75 million in new seed funding from investors including Paul Graham, AI Grant (Nat Friedman and Daniel Gross' startup accelerator), SV Angel, General Catalyst and Panache Ventures, the startup told TechCrunch exclusively. Combined with $1.25 million in pre-seed funding from Panache Ventures and Y Combinator last year, this brings Reworkd's total funding to date to $4 million.
AI that can utilize the Internet
Shortly after founding Reworkd and relocating to San Francisco, the team hired Rohan Pandey as a founding research engineer, who now lives at AGI House SF, one of the Bay Area's most popular AI-era hacker houses. One investor described Pandey as “a one-man lab within Reworkd.”
“We see it as the culmination of a 30-year-old dream of the Semantic Web,” Pandey said in an interview with TechCrunch, referencing World Wide Web inventor Tim Berners-Lee's vision of the entire internet being readable by computers. “Some websites don't have markup, but LLMs can understand websites the same way a human can, so we can basically expose any website as an API. So in a way, Reworkd is like a universal API layer for the internet.”
Reworkd says it can capture the long tail of customer data needs, meaning its AI agents are particularly well suited to scraping the thousands of small, public websites that larger competitors often ignore. Companies such as Bright Data have already built scrapers for large websites like LinkedIn and Amazon, but hand-building scrapers for every small website may not be worth the effort. Reworkd addresses that gap, though its approach may raise other concerns.
What exactly is “public” web data?
Web scrapers have been around for decades, but they have stirred up controversy in the AI era. Unrestricted scraping of huge amounts of data has landed OpenAI and Perplexity in legal trouble. Press and media outlets have alleged that AI companies are extracting intellectual property from paid content and widely reproducing it without compensation. Reworkd takes precautions to avoid such issues.
“We see this as increasing accessibility to publicly available information,” Reworkd co-founder and CEO Shrestha told TechCrunch in an interview. “We're only allowing publicly available information, and we're not going through any sign-in barriers or anything like that.”
Going a step further, Reworkd avoids news scraping altogether and is selective about who it works with. Watkins, the company's CTO, said news isn't the company's focus because better tools already exist for aggregating news content.
As an example, Reworkd described its work with Axis, a company that helps policy teams comply with government regulations. Axis uses Reworkd's AI to extract data from thousands of government regulatory documents across many countries in the European Union. Axis then trains and fine-tunes AI models based on this data and offers them as products to its customers.
Starting a web scraping company today means treading into risky territory, according to Aaron Fiske, a partner at Silicon Valley-based law firm Gunderson Dettmer. The situation is still fluid, and the jury is out on how freely AI models can use even “public” web data. But Reworkd's approach, in which clients decide which websites to scrape, could potentially shield it from legal liability, Fiske says.
“It's kind of like how the photocopier was invented, and it turns out that the use of making copies is very valuable economically, but very questionable legally,” Fiske told TechCrunch in an interview. “Web scrapers providing services to AI companies aren't necessarily risky, but working with AI companies that are genuinely interested in harvesting copyrighted content is probably problematic.”
That's why Reworkd is being careful about who it works with. Liability for web scrapers in AI-related copyright cases has so far been murky: in the OpenAI case, Fiske notes, the New York Times sued the company that allegedly copied its articles, not the web scraper that collected them. And even there, whether OpenAI's actions amounted to copyright infringement remains an open question.
Amid the AI boom, there is also evidence that web scraping rests on solid legal ground. A court recently ruled in favor of Bright Data, which scraped public Facebook and Instagram profiles. At issue in the case was, among other things, a dataset of 615 million records of Instagram user data that Bright Data was selling for $860,000. Meta sued, claiming this violated its terms of use, but the court ruled that the data was publicly available and therefore fair to scrape.
Investors think Reworkd can become as big as the big players
Reworkd has attracted big-name early investors, from Y Combinator and Paul Graham to Daniel Gross and Nat Friedman, some of whom say that's because Reworkd's technology promises to improve and get cheaper with new models. The startup says that OpenAI's GPT-4o is currently the best fit for multimodal code generation, and that much of Reworkd's technology wasn't possible until just a few months ago.
“I think as a founder, you're going to struggle if you're trying to compete with the rate of advancement in technology, instead of building on top of it,” General Catalyst's Viet Le said in an interview with TechCrunch. “Reworkd has the mindset of building our solutions around the rate of advancement.”
Reworkd's AI agents address a specific gap in the market: as AI advances rapidly, more companies are building custom models tailored to their business, and fine-tuning those models requires large amounts of quality, structured data. As that trend grows, so should Reworkd's customer base.
Reworkd says its approach is “self-healing,” meaning its web scrapers won't break when a page's layout changes. And because the agents generate deterministic code to scrape websites rather than extracting data with an LLM directly, the company claims the approach avoids the hallucination problems that traditionally plague AI models. The AI can still make mistakes and pull incorrect data, so the Reworkd team built an open-source evaluation framework, Banana-lyzer, to periodically measure its accuracy.
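Banana-lyzer's actual interface aside, the core idea of periodically scoring scraper output against hand-labeled ground truth can be sketched in a few lines. The function name, record fields, and sample data below are all hypothetical.

```python
# Minimal sketch of scraper-accuracy evaluation (not Banana-lyzer's
# real API): compare freshly scraped records against a hand-labeled
# "golden" snapshot and report field-level accuracy.

def field_accuracy(scraped, golden):
    """Fraction of golden fields reproduced exactly in the scraped rows."""
    total = correct = 0
    for got, want in zip(scraped, golden):
        for key, expected in want.items():
            total += 1
            if got.get(key) == expected:
                correct += 1
    return correct / total if total else 0.0

golden = [{"name": "J. Smith", "yards": "1204"}]
scraped = [{"name": "J. Smith", "yards": "987"}]  # one field wrong
print(field_accuracy(scraped, golden))  # → 0.5
```

Running a check like this on a schedule is what lets a “self-healing” system notice that a page layout changed and that a scraper needs to be regenerated.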
Reworkd doesn't have many employees — just four people on its team — but there are significant inference costs to run its AI agents. The startup expects its prices to become more competitive as these costs fall. OpenAI just released GPT-4o mini, a miniature version of its industry-leading model with competitive benchmarks. Innovations like this could make Reworkd even more competitive.
Paul Graham and AI Grant did not respond to TechCrunch's request for comment.