Jordan Meyer and Matthew Dryhurst founded Spawning AI to create tools that give artists more control over how their work is used online, and their latest project, called Source.Plus, aims to curate “copyright-free” media for training AI models.
The Source.Plus project's initial effort is a dataset of roughly 40 million images that are either in the public domain or released under the Creative Commons CC0 license, which lets creators waive nearly all legal rights to their work. Meyer claims the Source.Plus dataset is of sufficiently "high quality" to train state-of-the-art image generation models, despite being significantly smaller than other generative AI training datasets.
“At Source.Plus, we're building a universal 'opt-in' platform,” says Meyer. “Our goal is to make it easy for rights holders to provide media for use in generative AI training on their own terms, and for developers to seamlessly incorporate that media into their training workflows.”
Rights Management
The debate over the ethics of training generative AI models, particularly art-generating models like Stable Diffusion and OpenAI’s DALL-E 3, continues unabated and, no matter how it ends, has major implications for artists.
Generative AI models "learn" to create an output (e.g., photorealistic art) by training on large amounts of relevant data (in this case, images). Some developers of these models argue that fair use entitles them to scrape data from public sources regardless of its copyright status. Others try to stay on the right side of the issue by compensating, or at least crediting, content owners for their contributions to the training set.
Meyer, the Spawning CEO, believes no one has yet determined the best approach.
“AI training often defaults to using the easiest data available, but that data doesn't necessarily come from the most impartial or responsible sources,” he told TechCrunch in an interview. “Artists and rights holders have had little control over how their data is used in AI training, and developers have had no high-quality alternatives that are more likely to respect data rights.”
Available in limited beta, Source.Plus builds on Spawning's existing tools for managing art provenance and usage rights.
In 2022, Spawning launched HaveIBeenTrained, a website where creators can opt out of the training datasets used by vendors Spawning partners with, such as Hugging Face and Stability AI. After raising $3 million in venture capital from investors including True Ventures and Seed Club Ventures, Spawning rolled out ai.txt, a way for websites to set permissions for how their content can be used in AI training, and Kudurru, a system to protect against data-scraping bots.
Source.Plus is Spawning's first effort to build a media library and manage it in-house. The initial image dataset, PD/CC0, can be used for commercial or research purposes, Meyer said.
Source.Plus library. Image credit: Spawning
"Source.Plus is not just a repository of training data; it's an enrichment platform with tools to support training pipelines," he continued. "Our goal is to provide high-quality, copyright-free CC0 datasets that can support powerful base AI models within a year."
Organizations like Getty Images, Adobe, Shutterstock and AI startup Bria claim they only use fairly sourced data to train their models (Getty even calls its generative AI products "commercially safe"), but Meyer says Spawning aims to set a "higher bar" for what fairly sourced data means.
Source.Plus filters images according to artists' "opt-out" and other training preferences, displays provenance information about how and where each image was acquired, and excludes images that aren't CC0-licensed, including those under attribution-requiring licenses such as Creative Commons BY. Spawning also says it monitors for copyright infringement on sources such as Wikimedia Commons, where someone other than the creator is often the one indicating the copyright status of a work.
“We meticulously verified the licenses reported for the images we collected and removed any that seemed questionable, a step that is not taken by many 'unbiased' datasets,” Meyer said.
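To make the kind of filtering Spawning describes a bit more concrete, here is a minimal sketch of what a rights-based filter might look like. The field names, license identifiers, and helper function are hypothetical illustrations, not Spawning's actual schema or pipeline:

```python
from dataclasses import dataclass, field

# Hypothetical record structure for illustration only;
# Spawning has not published its actual schema or pipeline.
@dataclass
class ImageRecord:
    url: str
    license: str                 # e.g. "CC0-1.0", "CC-BY-1.0", "unknown"
    source: str                  # e.g. "wikimedia-commons"
    creator_opted_out: bool      # opt-out signal, e.g. via an opt-out registry
    provenance: dict = field(default_factory=dict)  # how/where the image was acquired

# Public domain / CC0 only; attribution licenses like CC BY are excluded.
ALLOWED_LICENSES = {"CC0-1.0", "PDM-1.0"}

def passes_rights_filter(record: ImageRecord) -> bool:
    """Keep an image only if it is CC0/public domain, the creator has not
    opted out, and provenance information is present."""
    if record.creator_opted_out:
        return False
    if record.license not in ALLOWED_LICENSES:
        return False
    if not record.provenance:
        return False
    return True

candidates = [
    ImageRecord("https://example.org/a.jpg", "CC0-1.0", "wikimedia-commons",
                False, {"uploader": "user123", "retrieved": "2024-05-01"}),
    ImageRecord("https://example.org/b.jpg", "CC-BY-1.0", "wikimedia-commons",
                False, {"uploader": "user456", "retrieved": "2024-05-01"}),
]

dataset = [r for r in candidates if passes_rights_filter(r)]  # keeps only a.jpg
```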
Historically, problematic content, such as violent or pornographic imagery and sensitive personal information, has plagued both open and commercial training datasets.
The administrators of the LAION dataset were forced to take one of its libraries offline after reports that it contained medical records and depictions of child sexual abuse. This week, a Human Rights Watch investigation found that one of LAION's repositories contained the faces of Brazilian children included without their consent or knowledge. And Adobe Stock, the stock media library Adobe uses to train generative AI models such as its art-generating Firefly Image model, was found to contain AI-generated images from rivals such as Midjourney.
Artwork from Source.Plus Gallery. Image courtesy of Spawning
Spawning's solution is a classification model trained to detect nudity, gore, personally identifiable information, and other undesirable content in images. Recognizing that classifiers are not perfect, Meyer says Spawning plans to give users "flexible" filtering of the Source.Plus dataset by letting them adjust the classifier's detection threshold.
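The threshold idea itself is simple. The sketch below, with made-up scores and category names rather than Spawning's actual classifier, shows how a user-adjustable threshold could trade off strictness against coverage:

```python
from typing import Dict, List

# Hypothetical per-image scores from a safety classifier
# (0.0 = clearly benign, 1.0 = certain hit).
images: List[Dict] = [
    {"id": "img_001", "scores": {"nudity": 0.02, "gore": 0.01, "pii": 0.00}},
    {"id": "img_002", "scores": {"nudity": 0.64, "gore": 0.03, "pii": 0.10}},
    {"id": "img_003", "scores": {"nudity": 0.05, "gore": 0.02, "pii": 0.71}},
]

def filter_by_threshold(images: List[Dict], threshold: float) -> List[Dict]:
    """Drop any image whose highest category score meets or exceeds the threshold.
    A lower threshold filters more aggressively; a higher one is more permissive."""
    return [img for img in images if max(img["scores"].values()) < threshold]

strict = filter_by_threshold(images, threshold=0.3)   # keeps only img_001
lenient = filter_by_threshold(images, threshold=0.8)  # keeps all three
```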
"We employ moderators to verify ownership of data," Meyer added, "and we also have built-in remediation capabilities: users can flag potentially violating or infringing works, and we maintain audit trails of how that data has been used."
Compensation
Most of the programs that pay creators for providing generative AI training data haven't been all that successful: Some programs calculate creator payments based on opaque criteria, while others pay out amounts that artists consider unfairly low.
Take Shutterstock, for example. The stock media library, which has signed deals with AI vendors worth tens of millions of dollars, pays into a "contributor fund" for artwork used to train its generative AI models and artwork it licenses to third-party AI developers. But Shutterstock isn't transparent about how much artists can expect to earn, nor does it allow them to set their own prices and terms. One third-party estimate puts earnings at $15 for 2,000 images, which works out to less than a cent per image, hardly an impressive amount.
When Source.Plus leaves beta later this year and expands to non-PD/CC0 datasets, it will take a different approach from other platforms by allowing artists and rights holders to set their own price per download. Spawning will charge a fee, but only a flat rate of “a tenth of a cent,” Meyer says.
Customers can also pay Spawning $10 per month (plus the usual per-image download fees) for Source.Plus Curation, a subscription plan that includes private management of image collections, up to 10,000 dataset downloads per month, and early access to new features such as "premium" collections and data enrichment.
Image credit: Spawning
“While we provide guidance and recommendations based on current industry standards and internal metrics, it is ultimately up to the contributors to our dataset to decide what is valuable to them,” Meyer said. “We intentionally chose this pricing model to give artists a majority of the revenue and allow them to set their own terms of participation. We believe this revenue share is significantly more favorable to artists than the more common percentage revenue share, and will lead to higher payouts and transparency.”
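For a rough sense of the arithmetic, here is a back-of-the-envelope comparison using the third-party Shutterstock estimate cited above and a hypothetical artist-set price. The assumption that Spawning's flat fee is deducted from the artist's price is mine, not something the company has confirmed:

```python
# Figures from the article: a third-party estimate of Shutterstock payouts
# and Spawning's stated flat fee. The artist-set price is hypothetical.
shutterstock_payout_per_image = 15 / 2000   # roughly $0.0075 per image
spawning_flat_fee = 0.001                   # "a tenth of a cent" per download
artist_price_per_download = 0.05            # hypothetical price set by the artist

# Assuming the flat fee comes out of each sale (an assumption, not a
# confirmed detail), the artist keeps the rest -- about a 98% share here.
artist_take = artist_price_per_download - spawning_flat_fee

print(f"Shutterstock estimate: ${shutterstock_payout_per_image:.4f} per image")
print(f"Source.Plus example:   ${artist_take:.3f} per download to the artist")
```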
If Source.Plus proves as popular as Spawning hopes, the company plans to expand it beyond images to other types of media, such as audio and video. Spawning is in talks with unnamed companies to make their data available on Source.Plus. Meyer said Spawning may also use data from the Source.Plus dataset to build its own generative AI models.
“We want to ensure that rights holders who want to participate in the generative AI economy have the opportunity to do so and are fairly compensated,” Meyer said, “and that artists and developers who have felt conflicted about engaging with AI have the opportunity to do so in a way that is respectful of other creators.”
Certainly, Spawning could carve out a niche for itself here, and Source.Plus seems like one of the more promising attempts to involve artists in the generative AI development process, allowing them to share in the profits of their work.
As my colleague Amanda Silberling recently wrote, the emergence of apps like Cara, an art-hosting community that saw a surge in usage after Meta announced it might train its generative AI on Instagram content, including that of artists, signals that the creative community has reached a breaking point: They’re desperate for alternatives to the companies and platforms they see as thieves, and Source.Plus just might be a viable option.
But even if Spawning always acts in the best interest of artists (a big assumption, given that it's a venture-capital-backed company), it's questionable whether Source.Plus will be able to scale the way Meyer envisions. If social media has taught us anything, it's that moderation, especially of millions of pieces of user-generated content, is an intractable problem.
We'll find out soon enough.