We finally have an “official” definition of open source AI.
The Open Source Initiative (OSI), the long-standing organization dedicated to defining and “governing” all things open source, today released version 1.0 of the Open Source AI Definition (OSAID). OSAID is the result of several years of collaboration between academia and industry, and aims to provide a standard by which anyone can determine whether AI is open source.
Like this reporter, you may be wondering why consensus matters for a definition of open source AI. OSI executive director Stefano Maffulli said a big motivation is getting policymakers and AI developers on the same page.
“Regulators are already watching the space,” Maffulli told TechCrunch, noting that bodies like the European Commission are looking to give open source special recognition. “We did explicit outreach to a diverse set of stakeholders and communities, not just the usual suspects in tech. We even tried to reach out to the organizations that most often talk to regulators, in order to get their early feedback.”
Open AI
To be considered open source under OSAID, an AI model must provide enough information about its design that a person could “substantially” recreate it. The model must also disclose any pertinent details about its training data, including the data’s provenance, how it was processed, and how it can be obtained or licensed.
“An open source AI is an AI model that allows you to fully understand how it’s been built,” Maffulli said. “That means you have access to all the components, such as the complete code used for training and data filtering.”
OSAID also spells out the usage rights developers should expect from open source AI, such as the freedom to use the model for any purpose and to modify it without having to ask anyone’s permission. “Most importantly, you should be able to build on top of it,” Maffulli added.
OSI has no enforcement mechanism to speak of; it cannot compel developers to comply with OSAID. But it does intend to flag models that are described as “open source” yet fall short of the definition.
“Our hope is that when someone tries to abuse the term, the AI community will say, ‘We don’t recognize this as open source,’ and it gets corrected,” Maffulli said. Historically, this approach has had mixed results, but it isn’t entirely without effect.
Many startups and large technology companies, most notably Meta, use the term “open source” to describe their AI model release strategies, but few meet OSAID's standards. For example, Meta requires platforms with more than 700 million monthly active users to request a special license to use the Llama model.
Maffulli has publicly criticized Meta’s decision to call its models “open source.” After conversations with OSI, Google and Microsoft agreed to stop using the term for models that are not fully open, but Meta has not, he said.
Stability AI has long advertised its models as “open,” but businesses with more than $1 million in revenue must obtain an enterprise license. French AI startup Mistral’s license likewise bars the use of certain models and outputs in commercial ventures.
A study last August by researchers at the Signal Foundation, the nonprofit AI Now Institute, and Carnegie Mellon found that many “open source” models are essentially open source in name only: the data required to train them is kept secret, the compute needed to run them is beyond the reach of many developers, and the techniques for fine-tuning them are forbiddingly complex.
Instead of democratizing AI, these “open source” projects tend to entrench and expand centralized power, the study’s authors concluded. Indeed, Meta’s Llama models have racked up hundreds of millions of downloads, and Stability claims that its models power up to 80% of all AI-generated imagery.
Opposing opinions
Understandably, Meta disagrees with this assessment, and it takes issue with OSAID as written (despite having participated in the drafting process). A spokesperson defended the company’s Llama license, arguing that its terms and the accompanying acceptable use policy act as guardrails against harmful applications.
Meta also said it is taking a “cautious approach” to sharing model details, including details about training data, as regulations such as California’s training data transparency law evolve.
“While we agree with our partner OSI on many things, we, like others across the industry, disagree with their new definition,” the spokesperson said. “There is no single open source AI definition, and defining one is a challenge because previous open source definitions do not encompass the complexities of today’s rapidly advancing AI models. We make Llama free and openly available, and our license and acceptable use policy help keep people safe by having some restrictions in place. We will continue working with OSI and other industry groups to make AI more accessible and free responsibly, regardless of technical definitions.”
The spokesperson pointed to other efforts to codify “open source” AI, including a definition proposed by the Linux Foundation, the Free Software Foundation’s criteria for “free machine learning applications,” and suggestions from other AI researchers.
Oddly enough, Meta is one of the companies funding OSI's activities, along with major technology companies such as Amazon, Google, Microsoft, Cisco, Intel, and Salesforce. (OSI recently won a grant from the nonprofit Sloan Foundation to reduce its dependence on technology industry patrons.)
Meta's reluctance to release training data may have something to do with how its (and most) AI models are developed.
AI companies scrape vast amounts of images, audio, video, and more from social media and websites, and train their models on this data, commonly referred to as “publicly available data.” In today’s cutthroat market, the way a company assembles and refines its datasets is considered a competitive advantage, and companies cite this as one of the main reasons for keeping their datasets secret.
But training data details can also paint a legal target on a developer’s back. Authors and publishers are suing Meta, claiming the company used copyrighted books for training. Artists have filed suits against Stability for scraping their work and reproducing it without credit, an act they liken to theft.
It is not difficult to see how OSAID could matter to companies seeking to resolve litigation favorably, especially if plaintiffs or judges find the definition persuasive enough to use in court.
Unanswered questions
Some suggest the definition doesn’t go far enough, for example in how it handles the licensing of proprietary training data. Lightning AI CTO Luca Antiga points out that a model can meet all of OSAID’s requirements even though the data used to train it isn’t freely available. Is a model “open” if its users must pay thousands of dollars to inspect the private stores of images its creators licensed?
“To be of practical value, especially for enterprises, any definition of open source AI needs to give reasonable confidence that what is being licensed can be licensed for the way an organization is using it,” Antiga told TechCrunch. “By neglecting to deal with the licensing of training data, OSI leaves a gaping hole that makes its terms less effective in determining whether OSI-licensed AI models can be adopted in real-world situations.”
In OSAID version 1.0, OSI is also silent on copyright as it relates to AI models, including whether granting a copyright license would be sufficient to ensure that a model meets the open source definition. It is not yet clear whether models (or their components) are copyrightable under current intellectual property law. But if the courts decide they are, OSI has suggested that new “legal tools” may be needed to properly open source IP-protected models.
Maffulli agreed that the definition will probably need to be updated sooner rather than later. To this end, OSI has established a committee responsible for monitoring how OSAID is applied and recommending amendments for future versions.
“This is not the work of a lone genius in a basement,” he said. “This is work that is being done openly, involving a wide range of stakeholders and a variety of interest groups.”