The battle between open source and proprietary software is well known, but the tensions that have permeated the software industry for decades have also found their way into the burgeoning field of artificial intelligence, where they've sparked fierce debate.
The New York Times recently published a glowing profile of Meta CEO Mark Zuckerberg, noting that his “open source AI” initiative has rekindled his popularity in Silicon Valley. The problem is that Meta’s Llama-branded large language models are not actually open source.
Or are they?
By most estimates, no. But the episode does highlight how contentious the concept of “open source AI” is likely to become. It’s something the Open Source Initiative (OSI) is trying to address. Led by Executive Director Stefano Maffulli, the organization has been tackling the issue for over two years through a global effort spanning conferences, workshops, panels, webinars, and reports.
AI is not software code
For over 25 years, OSI has been the custodian of the Open Source Definition (OSD), defining how the term “open source” can and should be applied to software. Any license that meets this definition can legitimately be considered “open source,” but a wide range of licenses are permitted, from very permissive to not so permissive.
But applying traditional software licensing and naming conventions to AI is problematic. Joseph Jacks, open source evangelist and founder of venture capital firm OSS Capital, goes so far as to say that “there is no such thing as open source AI,” noting that “open source was invented specifically for software source code.”
“Neural network weights” (NNWs), by contrast, is the term the AI world uses for the parameters or coefficients that a network learns during training, and they can’t be meaningfully compared to software source code.
“Neural net weights are not software source code; they cannot be read or debugged by humans,” Jacks points out. “Furthermore, the fundamental rights of open source do not apply in quite the same way to NNWs.”
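To make that concrete, here is a minimal sketch (our illustration, not anything from Jacks or OSI) of what a weight actually is in practice: a trained layer reduces to an array of floating-point numbers, with nothing a human can read or step through the way they can with source code.

```python
# Illustrative only: a stand-in for one trained layer's weights.
import numpy as np

rng = np.random.default_rng(7)
layer_weights = rng.normal(size=(3, 4))  # real models hold millions or billions of these

print(layer_weights)
# The output is just a grid of floats such as [[ 1.02 -0.57 ...]]:
# opaque coefficients, not logic you can inspect or debug line by line.
```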
This led Jacks and his OSS Capital colleague Heather Meeker to come up with their own definition of sorts, centered on the concept of “open weights.”
So before we even arrive at a meaningful definition of “open source AI,” we can see one inherent tension in trying to get there: how can we agree on a definition if we can’t agree that the “thing” we’re defining exists?
Maffulli agrees.
“You're right,” he told TechCrunch. “One of the early discussions was whether we should even call this open source AI, but everyone was already using that term.”
This reflects a broader challenge in AI, where there is much debate over whether what we call “AI” today really is AI, or just a set of powerful systems taught to find patterns across reams of data. But most detractors accept that the “AI” moniker has stuck, and see little point in fighting it.
Founded in 1998, OSI is a nonprofit public benefit corporation focused on advocacy, education, and a wide range of open source-related activities, with the Open Source Definition at its core. Today the organization relies on sponsors for funding, counting such notable names as Amazon, Google, Microsoft, Cisco, Intel, Salesforce, and Meta among them.
Meta's involvement with OSI is especially notable in relation to the current concept of “open source AI.” While Meta positions its AI as open source, the company does place notable restrictions on how the Llama model can be used. Of course, it is free for research and commercial use, but app developers with more than 700 million monthly users must apply for a special license from Meta, which will be granted at Meta's sole discretion.
Simply put, Meta’s Big Tech brethren can whistle for it if they want in.
Meta's wording around LLM has been somewhat flexible: the company called the Llama 2 model open source, but with the arrival of Llama 3 in April, it has toned down that term a bit in favor of phrases like “openly available” and “openly accessible,” though it still calls the model “open source” in some places.
“Everyone else that participates in the conversation agrees completely that Llama itself cannot be considered open source,” Maffulli said. “People I’ve spoken with who work at Meta know that it’s a bit of a stretch.”
On top of that, one might argue there is a conflict of interest at play here: can companies that have demonstrated a desire to piggyback on the open source brand really be the ones funding the definition’s stewards?
That's one reason OSI is looking to diversify its funding, and recently won a grant from the Sloan Foundation, which is funding OSI's multi-stakeholder, global effort to achieve its definition of open source AI. TechCrunch revealed that the grant was worth about $250,000, and Makhlouri hopes it will change perspectives on its reliance on corporate funding.
“One of the things the Sloan grant makes even clearer is that we could say goodbye to Meta’s money at any time,” Maffulli said. “We could do that even before the Sloan grant is paid out, because we know we’ll be getting donations from others, and Meta knows that very well. They’re not interfering with any of this [process], and neither are Microsoft, GitHub, Amazon, or Google; they all know full well that they cannot interfere.”
A working definition of open source AI
The current draft of the Open Source AI Definition is at version 0.0.8 and consists of three main parts: a “preamble” that lays out the document’s scope, the Open Source AI Definition itself, and a checklist of required components for an open source-compliant AI system.
According to the current draft, open source AI systems must grant the freedom to use the system for any purpose without asking permission, the freedom for others to study how the system works and inspect its components, and the freedom to modify and share the system for any purpose.
But one of the biggest challenges has been around data: can an AI system be classified as “open source” if the company hasn’t made its training datasets available to others? Maffulli says it’s more important to know where the data came from and how the developers labeled, deduplicated, and filtered it, and to have access to the code that was used to assemble the dataset from its various sources.
“Knowing that information is much better than just having the dataset without the rest of the information,” Maffulli said.
While access to the full dataset would be nice (OSI lists it as an “optional” component), Maffulli says that in many cases it isn’t possible or practical, perhaps because the dataset contains confidential or copyrighted information that the developers aren’t permitted to redistribute. Moreover, techniques such as federated learning, differential privacy, and homomorphic encryption make it possible to train machine learning models without the underlying data ever being shared.
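As a rough illustration of one of those techniques, here is a simplified, hypothetical take on a differentially private training step (not OSI’s or any vendor’s implementation): each gradient update is clipped and perturbed with calibrated noise, so the released weights don’t cleanly encode any individual training example.

```python
# Simplified sketch of a differentially private gradient step.
# Constants and names are illustrative, not from a real DP library.
import numpy as np

rng = np.random.default_rng(0)

def dp_gradient_step(weights, grad, clip_norm=1.0, noise_scale=0.5, lr=0.1):
    """Clip the gradient's norm, then add Gaussian noise before updating."""
    grad = grad * min(1.0, clip_norm / np.linalg.norm(grad))            # bound one batch's influence
    grad = grad + rng.normal(0.0, noise_scale * clip_norm, grad.shape)  # calibrated noise
    return weights - lr * grad

weights = np.zeros(3)
weights = dp_gradient_step(weights, np.array([0.4, -1.2, 0.7]))
print(weights)  # the published model reflects noisy, clipped updates, not raw data
```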
And this perfectly highlights the fundamental difference between “open source software” and “open source AI”: they may be similar in intent, but they are not comparable on an equal footing, and it is this difference that the OSI tries to capture in its definition.
In software, source code and binary code are two views of the same artifact: they reflect the same program in different forms. However, a training dataset and the subsequent trained model are different things. Using the same dataset does not necessarily allow you to consistently recreate the same model.
“There's a lot of statistical and random logic that happens during training, so it can't be replicated in the same way as software,” Makhouli added.
Therefore, an open source AI system should be easy to replicate with clear instructions. This is where the checklist aspect of the Open Source AI Definition comes in. It is based on a recently published academic paper titled “The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence.”
The paper proposes the Model Openness Framework (MOF), a classification system that evaluates machine learning models “based on their completeness and openness.” MOF requires that certain components of AI model development, such as details about training methods and model parameters, be “included and released under an appropriate open license.”
Steady state
Stefano Maffulli presents at the Digital Public Goods Alliance (DPGA) Member Summit in Addis Ababa. Image courtesy of OSI
OSI calls its official releases of definitions “stable versions,” much like a company might release an application that has been thoroughly tested and debugged before prime time. OSI deliberately avoids calling them “final releases” because parts of the definitions are likely to evolve.
“We can't expect this definition to last 26 years like the Open Source Definition,” Makhlouri says. “I don't think the first part of the definition, like 'what is an AI system,' will change much. But the part that we refer to in the checklist — the list of components — will depend on the technology. Who knows what the technology will be tomorrow?”
The stable version of the Open Source AI Definition is expected to be approved by the OSI board at the All Things Open conference at the end of October. In the interim, OSI has embarked on a global roadshow spanning five continents, seeking more “diverse input” on how “open source AI” should be defined going forward. But any final changes are likely to be little more than “small tweaks” here and there.
“This is the last stretch,” Maffulli says. “We’ve arrived at a feature-complete version of the definition; all the pieces we need are there. Now we have the checklist, so we’re checking that there are no surprises in there, no systems that should be included or excluded.”