Surveys have been used since time immemorial to gain insights about populations, products, and public opinion. And while methodologies may have changed over the millennia, one thing has remained constant: the need for people, lots of people.
But what if you can't find enough people to build a sample group large enough to produce meaningful results? Or what if you could find enough people, but budget constraints limit the number you can source and interview?
This is where Fairgen wants to help. The Israeli startup today launched a platform that uses “statistical AI” to generate synthetic data that it says is just as good as the real thing. The company also announced $5.5 million in new funding from Maverick Ventures Israel, The Creator Fund, Tal Ventures, Ignia, and a handful of angel investors, bringing its total cash raised since inception to $8 million.
“Fake data”
Data may be the lifeblood of AI, but it has also long been the cornerstone of market research. So when the two worlds collide, as they do in Fairgen's world, the need for high-quality data becomes a little more pronounced.
Founded in Tel Aviv, Israel in 2021, Fairgen was previously focused on tackling bias in AI. However, in late 2022, the company pivoted to a new product, Fairboost, which it is now launching out of beta.
Fairboost promises to “boost” small datasets by up to three times, allowing more granular insight into niches that might otherwise be too difficult or too expensive to reach. With it, companies can train a deep machine learning model for each dataset they upload to the Fairgen platform, with statistical AI learning patterns across the different survey segments.
The concept of “synthetic data” (data created artificially rather than from real-world events) is not new. Its roots go back to the early days of computing, when it was used to test software and algorithms and to simulate processes. However, synthetic data as we understand it today has taken on a life of its own, particularly with the advent of machine learning, where it is increasingly used to train models. Using artificially generated data that does not contain sensitive information can address both data scarcity issues and data privacy concerns.
Fairgen is the latest startup to put synthetic data to the test, and it is primarily targeting market research. It's worth noting that Fairgen isn't generating data out of thin air or throwing millions of past studies into an AI-powered melting pot. Market researchers must first run a survey on a small sample of their target market, and from there Fairgen establishes patterns and expands the sample. The company says it can guarantee at least a two-fold boost on the original sample, and that on average it achieves a three-fold boost.
In this way, Fairgen might be able to establish that people of a certain age bracket and/or income level tend to answer a question in a certain way. Or it might combine any number of data points to extrapolate from the original dataset. It's essentially creating what Samuel Cohen, Fairgen's co-founder and CEO, calls “stronger, more robust data segments with less error.”
“The main realization was that people are becoming more and more diverse. Brands need to adapt to that and understand their customer segments,” Cohen explained to TechCrunch. “The segments are very different. Gen Z thinks differently than older adults. And it would take a lot of money and a lot of time and operational resources to be able to understand this market at a segment level. And we realized that was the problem, and that's where synthetic data played a role.”
The obvious criticism, which the company acknowledges and disputes, is that this all sounds like a huge shortcut to avoid having to go out and interview real people and gather real opinions.
Surely any underrepresented group should be concerned that their real voices are being replaced by, well, fake voices?
“Every customer we talk to in the research space has a huge blind spot, an audience that is extremely difficult to reach,” Fernando Zatz, head of growth at Fairgen, told TechCrunch. “The reason they're not actually selling projects is that there aren't enough people available, especially in a world where markets are fragmented and increasingly diverse. Sometimes they can't go into certain countries. They can't target a specific demographic, so they actually end up losing money on projects because they can't meet their quota [of respondents]. And if that number is not reached, the insights will not be sold.”
Fairgen is not the only company applying generative AI to the market research space. Qualtrics announced last year that it would invest $500 million over four years to bring generative AI to its platform, albeit with a substantive focus on qualitative research. Still, it is further evidence that synthetic data is here, and here to stay.
However, validating the results will play an important role in convincing people that this is the real deal and not a cost-cutting measure that will produce suboptimal results. Fairgen does this by comparing a “real” sample boost with a “synthetic” sample boost: it takes a small sample of the dataset, extrapolates it, and puts it side by side with the real thing.
“We do these exact same types of tests on every customer we sign up with,” Cohen said.
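The side-by-side comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Fairgen's test suite: the function names are hypothetical, and the “boosted” answers below are made-up toy values standing in for whatever a real model would produce.

```python
def share(answers, option):
    """Fraction of respondents who chose `option`."""
    return sum(a == option for a in answers) / len(answers)


def compare_against_real(full_segment, small_sample, boosted_sample, option):
    """Line up the small sample and its boosted version against the real segment."""
    truth = share(full_segment, option)
    return {
        "truth": truth,
        "small_error": abs(share(small_sample, option) - truth),
        "boosted_error": abs(share(boosted_sample, option) - truth),
    }


# Toy data (assumed for illustration only, not survey results).
full_segment = ["yes"] * 60 + ["no"] * 40      # the "real thing": 60% yes
small_sample = ["yes"] * 4 + ["no"] * 6        # 10 real respondents: 40% yes
boosted_sample = ["yes"] * 17 + ["no"] * 13    # hypothetical 3x boost: ~57% yes

report = compare_against_real(full_segment, small_sample, boosted_sample, "yes")
print(report)
```

In this toy case the boosted sample lands closer to the real segment than the small sample does, which is the kind of result such a test would be looking for. A real validation would repeat the comparison across many questions and many random subsamples.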
Statistically speaking
Cohen holds a Master's degree in Statistical Science from the University of Oxford and a PhD in Machine Learning from UCL, London, part of which involved a nine-month stint as a research scientist at Meta.
One of the company's co-founders is chairman Benny Schneider, who previously worked in enterprise software and has four exits to his name. These include Qumranet, sold to Red Hat for $107 million in 2008; P-Cube, sold to Cisco for $200 million in 2004; and Pentacom, acquired by Cisco for $118 million in 2000.
And then there is Emmanuel Candes, a professor of statistics and electrical engineering at Stanford University, who serves as Fairgen's chief scientific advisor.
This business and mathematical backbone is a big selling point for a company trying to convince the world that fake data, if applied correctly, is every bit as good as real data. It is also how the company can clearly articulate the thresholds and limits of its technology, i.e. how large a sample needs to be to achieve the optimal boost.
Ideally, a survey should have at least 300 real respondents, Cohen said, and from there Fairboost can boost a segment that makes up no more than 15% of the broader survey.
“If it's less than 15%, we can guarantee an average boost of 3x, based on hundreds of parallel tests,” Cohen said. “Statistically, above 15%, the gains are less dramatic. The data already shows a good confidence level, and the synthetic respondents can only potentially match it or bring a slight uplift. On the business side, there's also no pain point above 15%. Brands can already learn from these groups. They're only stuck at the niche level.”
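To see why small segments are the pain point, it helps to run the numbers. The sketch below uses the textbook margin-of-error formula for a proportion under simple random sampling. It assumes, optimistically, that a 3x boost behaves like tripling the number of independent respondents. That is an assumption made here for illustration, not a definition the company has given.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion with n respondents."""
    return z * math.sqrt(p * (1 - p) / n)


survey_size = 300        # minimum survey size cited by Cohen
segment_share = 0.15     # a segment at the 15% threshold
segment_n = round(survey_size * segment_share)   # 45 real respondents
boosted_n = segment_n * 3                        # assumed effective size after a 3x boost

moe_small = margin_of_error(segment_n)
moe_boosted = margin_of_error(boosted_n)

print(f"segment of {segment_n}: +/- {moe_small:.1%}")
print(f"boosted to {boosted_n}: +/- {moe_boosted:.1%}")
```

Under those assumptions, a 45-person segment carries a margin of error of roughly 14.6 percentage points, while 135 respondents bring it down to roughly 8.4 points. Because the error shrinks with the square root of the sample size, tripling the sample cuts it by a factor of about 1.73, not 3.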
The no-LLM factor
It's worth noting that Fairgen doesn't use large language models (LLMs), and its platform doesn't produce “plain English” responses like ChatGPT. The reason is that an LLM draws on learnings from countless other data sources outside the parameters of the study, increasing the likelihood of introducing biases that are incompatible with quantitative research.
Fairgen is all about statistical models and tabular data, and its training relies solely on the data contained within the uploaded dataset. This effectively allows market researchers to extrapolate from adjacent segments within a survey to generate new synthetic respondents.
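Fairgen has not published its model, so the following is only a toy stand-in for the general idea of borrowing strength from the rest of a survey. It blends a small segment's answer counts with the whole survey's proportions (a simple pseudo-count shrinkage estimate) and then samples synthetic answers from the blended distribution. The function names and the `prior_strength` parameter are hypothetical, and the technique is a textbook substitute, not the company's method.

```python
import random
from collections import Counter


def shrunk_distribution(segment_answers, all_answers, prior_strength=10.0):
    """Blend a segment's answer counts with the whole survey's proportions."""
    options = sorted(set(all_answers))
    overall = Counter(all_answers)
    local = Counter(segment_answers)
    total = len(segment_answers) + prior_strength
    return {
        opt: (local[opt] + prior_strength * overall[opt] / len(all_answers)) / total
        for opt in options
    }


def sample_synthetic(distribution, n_rows, seed=0):
    """Draw synthetic answers from the blended distribution."""
    rng = random.Random(seed)
    options = list(distribution)
    weights = [distribution[o] for o in options]
    return rng.choices(options, weights=weights, k=n_rows)


# Toy data: uses only answers from the "uploaded" survey, nothing external.
all_answers = ["yes"] * 150 + ["no"] * 150       # whole survey: 50% yes
segment_answers = ["yes"] * 8 + ["no"] * 2       # small segment: 80% yes

dist = shrunk_distribution(segment_answers, all_answers)
synthetic = sample_synthetic(dist, n_rows=20)
print(dist)
```

With these inputs the blended estimate for “yes” is 65%, between the segment's raw 80% and the survey-wide 50%. The point of the sketch is only that everything it learns comes from the uploaded data, which is the property the company emphasizes.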
“We don't use LLMs for a very simple reason: if we were to pre-train on a lot of [other] surveys, it would just convey misinformation,” Cohen said. “You'd have cases where it learned something in another survey, and we don't want that. It's all about credibility.”
In terms of business model, Fairgen is sold as SaaS, with businesses uploading their surveys in a structured format (.CSV or .SAV) to Fairgen's cloud-based platform. Depending on the number of questions, it can take up to 20 minutes to train a model on the survey data provided, Cohen said. Then, when the user selects a “segment” (a subset of respondents that shares certain characteristics, for example “Gen Z working in industry x”), Fairgen delivers a new file with exactly the same structure as the original training file: the same questions, just new rows.
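As a rough illustration of that workflow, the sketch below reads a tiny survey in CSV form, picks out a segment, and checks that a delivered synthetic file has the same columns as the original. The column names, the sample rows, and the helper functions are all invented for the example. Fairgen's actual file layout and interface are not described in detail here.

```python
import csv
import io

# Hypothetical uploaded survey and hypothetical delivered file.
ORIGINAL_CSV = """respondent_id,generation,industry,q1
1,Gen Z,retail,yes
2,Gen Z,finance,no
3,Millennial,retail,yes
"""

SYNTHETIC_CSV = """respondent_id,generation,industry,q1
S1,Gen Z,retail,yes
S2,Gen Z,retail,no
"""


def read_rows(text):
    """Return (column names, list of row dicts) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)


def select_segment(rows, **criteria):
    """Keep rows whose columns match every criterion, e.g. generation='Gen Z'."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


original_cols, original_rows = read_rows(ORIGINAL_CSV)
synthetic_cols, synthetic_rows = read_rows(SYNTHETIC_CSV)

segment = select_segment(original_rows, generation="Gen Z", industry="retail")
same_structure = synthetic_cols == original_cols

print(len(segment), "real respondents in segment")
print("same structure:", same_structure)
```

Checking that the column list is identical is the simplest way to confirm the “same questions, just new rows” property before merging synthetic respondents into an analysis.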
Fairgen is already being used by BVA and by French polling and market research firm IFOP, both of which have integrated the startup's technology into their services. IFOP, which is a bit like Gallup in the U.S., is using Fairgen for polling purposes in the European elections, and Cohen believes it could eventually be used in the U.S. elections later this year, too.
“IFOP is basically our stamp of approval because IFOP has been around for about 100 years,” Cohen said. “They validated the technology and were our original design partner. We are also testing or have already integrated with some of the largest market research companies in the world, but we can't talk about that yet.”