Developers are adopting AI-powered code generators (services like GitHub Copilot and Amazon CodeWhisperer, and open access models like Meta's CodeLlama) at an astonishing rate. However, these tools are far from ideal. Many are not free. Others are, but only under licenses that prohibit their use in general commercial contexts.
Recognizing the demand for alternatives, AI startup Hugging Face partnered several years ago with workflow automation platform ServiceNow to create StarCoder, an open source code generator with a less restrictive license than others. The original was released online early last year, and a sequel, StarCoder 2, has been in development ever since.
StarCoder 2 is not a single code generation model, but a family. Three variants are being released today, the first two of which can run on modern consumer GPUs.
- A 3-billion-parameter (3B) model trained by ServiceNow
- A 7-billion-parameter (7B) model trained by Hugging Face
- A 15-billion-parameter (15B) model trained by Nvidia, the newest supporter of the StarCoder project
(Note that “parameters” are the parts of a model learned from training data that essentially define the model's skill at a problem, in this case generating code.)
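To see why parameter count matters for consumer hardware, here is a rough back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter) and a hypothetical 24 GB consumer GPU; neither figure comes from the StarCoder 2 team, and real memory use is higher once activations and caches are counted.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumptions (not from the StarCoder 2 team): 16-bit weights (2 bytes per
# parameter) and a hypothetical 24 GB consumer GPU. Activations and caches
# need additional memory on top of this.

BYTES_PER_PARAM_FP16 = 2
CONSUMER_GPU_GB = 24  # hypothetical high-end consumer card

MODEL_SIZES = {
    "3B": 3_000_000_000,
    "7B": 7_000_000_000,
    "15B": 15_000_000_000,
}


def weight_memory_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_gpu(num_params: int, gpu_gb: float = CONSUMER_GPU_GB) -> bool:
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_memory_gb(num_params) <= gpu_gb


if __name__ == "__main__":
    for name, params in MODEL_SIZES.items():
        print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights, "
              f"fits in {CONSUMER_GPU_GB} GB: {fits_on_gpu(params)}")
```

Under these assumptions, the weights of the 3B and 7B models fit within 24 GB while those of the 15B model do not. Lower-precision formats would shrink all three figures.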
Like most other code generators, StarCoder 2 can suggest ways to complete unfinished lines of code, as well as summarize and retrieve snippets of code in response to questions in natural language. Trained on four times the data of the original StarCoder, StarCoder 2 offers what Hugging Face, ServiceNow, and Nvidia characterize as “significantly” improved performance at lower operating costs.
StarCoder 2 can be fine-tuned in “hours” on first-party or third-party data, using a GPU like the Nvidia A100, to create apps like chatbots and personal coding assistants. Because StarCoder 2 was trained on a larger and more diverse dataset than the original StarCoder (approximately 619 programming languages), it can, at least hypothetically, make more accurate and context-aware predictions.
“StarCoder 2 was created specifically for developers who need to build applications quickly,” Harm de Vries, head of ServiceNow's StarCoder 2 development team, told TechCrunch in an interview. “StarCoder2 allows developers to use its features to code more efficiently without sacrificing speed or quality.”
Now, I would venture to say that not all developers would agree with de Vries on the points about speed and quality. Code generators promise to streamline certain coding tasks, but they come at a cost.
A recent study from Stanford University found that engineers who use code generation systems are more likely to introduce security vulnerabilities into the apps they develop. Additionally, a poll by cybersecurity firm Sonatype found that a majority of developers are concerned about the lack of insight into how code from code generators is produced, and about “code sprawl” from generators producing unmanageable amounts of code.
The StarCoder 2 license may also be an obstacle for some people.
StarCoder 2 is licensed under Hugging Face's RAIL-M, which is intended to encourage responsible use by imposing “light touch” restrictions on both model licensees and downstream users. Although RAIL-M is less restrictive than many other licenses, it is not truly “open” in the sense that it does not allow developers to use StarCoder 2 for every possible application (for example, apps that provide medical advice are strictly off-limits). Some commentators have said that RAIL-M's requirements may be too vague to comply with in any case, and that RAIL-M may conflict with AI-related regulations such as the EU AI Act.
Putting all this aside for a moment, is StarCoder 2 really better than other code generators, free or paid?
In some benchmarks, it appears to be more efficient than one of CodeLlama's versions, CodeLlama 33B. According to Hugging Face, StarCoder 2 15B matches CodeLlama 33B on a subset of code completion tasks at twice the speed. It's not clear which tasks; Hugging Face didn't specify.
As an open source collection of models, StarCoder 2 also has the advantage of being able to be deployed locally to “learn” a developer's source code or codebase. It's an attractive prospect for developers and companies wary of exposing code to cloud-hosted AI. A 2023 survey by Portal26 and Censuswide found that 85% of companies said they were wary of adopting GenAI such as code generators because of privacy and security risks, such as employees sharing sensitive information or vendors training on proprietary data.
Hugging Face, ServiceNow, and Nvidia also claim that StarCoder 2 is more ethical and has fewer legal issues than its competitors.
All GenAI models regurgitate. In other words, they spit out a mirror copy of the data they were trained on. It doesn't take an active imagination to understand why this might land a developer in trouble. With a code generator that has been trained on copyrighted code, it's entirely possible that, even with filters and additional safeguards in place, the generator will unwittingly recommend copyrighted code and fail to label it as such.
Several vendors, including GitHub, Microsoft (GitHub's parent company), and Amazon, have committed to providing legal protection if their code generator customers are accused of copyright infringement. However, coverage varies by vendor and is typically limited to corporate customers.
In contrast to code generators trained using copyrighted code (particularly GitHub Copilot), StarCoder 2 was trained only on data under license from Software Heritage, a nonprofit organization that provides code archiving services. Prior to StarCoder 2 training, BigCode, the cross-organizational team behind much of the StarCoder 2 roadmap, gave code owners the opportunity to opt out of the training set if they wished.
Like the original StarCoder, StarCoder 2's training data is available for developers to fork, clone, and audit as needed.
Leandro von Werra, a Hugging Face machine learning engineer and co-lead of BigCode, pointed out that while open code generators have proliferated recently, very few have been accompanied by information about the data used to train them or how they were actually trained.
“From a scientific point of view, the problem is that the training is not reproducible, but also, as a data creator (i.e. someone who uploads code to GitHub), you don't know whether and how your data was used,” von Werra said in an interview. “StarCoder 2 addresses this issue by being completely transparent throughout the training pipeline, from scraping pre-training data to training itself.”
However, StarCoder 2 isn't perfect. Like any code generator, it is susceptible to bias. De Vries points out that it can generate code with elements that reflect gender and racial stereotypes. Additionally, because StarCoder 2 was trained primarily on English-language comments and Python and Java code, it performs worse in languages other than English and on “low resource” code such as Fortran and Haskell.
Still, von Werra insists this is a step in the right direction.
“We strongly believe that building trust and accountability in AI models requires transparency and auditability of the complete model pipeline, including training data and training recipes,” he said. “StarCoder 2 [showcases] how fully open models can achieve competitive performance.”
Like this author, you may be wondering what motivates Hugging Face, ServiceNow, and Nvidia to invest in a project like StarCoder 2. They are businesses, after all, and training models doesn't come cheap.
As far as I can tell, it's a tried-and-true strategy: foster goodwill and build paid services on top of open source releases.
ServiceNow has already used StarCoder to create Now LLM, a code generation product that's fine-tuned to ServiceNow's workflow patterns, use cases, and processes. Hugging Face, which offers model implementation consulting plans, offers hosted versions of the StarCoder 2 models on its platform. Nvidia is doing the same, making StarCoder 2 available through an API and web front end.
Developers specifically interested in a free offline experience can download StarCoder 2 (models, source code, etc.) from the project's GitHub page.
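For the curious, a minimal sketch of what running one of the models locally might look like follows. It assumes the `transformers` and `torch` Python packages are installed in recent versions and that the checkpoints are published on the Hugging Face Hub under the `bigcode/starcoder2-*` names used below. Those names, and the helper functions `model_id` and `complete`, are illustrative assumptions rather than details confirmed in this article.

```python
# Sketch of local code completion with a StarCoder 2 model via transformers.
# Assumptions: `transformers` and `torch` are installed, and the checkpoints
# live on the Hugging Face Hub under the "bigcode/starcoder2-*" names below.
# `model_id` and `complete` are hypothetical helpers written for this sketch.

SUPPORTED_SIZES = ("3b", "7b", "15b")


def model_id(size: str) -> str:
    """Map a size label such as "3b" to the assumed Hub repository name."""
    size = size.lower()
    if size not in SUPPORTED_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SUPPORTED_SIZES}")
    return f"bigcode/starcoder2-{size}"


def complete(prompt: str, size: str = "3b", max_new_tokens: int = 48) -> str:
    """Download the model on first use, then continue `prompt` with generated code."""
    # Imported lazily so that model_id() works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = model_id(size)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `complete("def fibonacci(n):")` would download the smallest model on first use and return the prompt followed by the model's suggested continuation. After that initial download, the weights are cached locally, so the code being completed does not need to leave the machine.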