AI training data comes with a hefty price tag, which puts it mostly within reach of deep-pocketed technology companies. In response, Harvard University plans to release a dataset containing 1 million public domain books across genres, languages, and authors, including Dickens, Dante, and Shakespeare, whose works are old enough that they are no longer protected by copyright.
The new dataset is not yet available, and it is not clear when or how it will be released. However, because it contains books derived from Google Books, Google's long-standing book-scanning project, Google will be involved in publishing “this far-reaching treasure trove.”
Harvard University first teased the Institutional Data Initiative (IDI) in March, outlining plans to build a “trusted pipeline of legal data for AI.” But little information was available until today's official announcement, which confirmed that IDI has financial backing from Microsoft and OpenAI.
Greg Leppert, executive director of IDI, said the dataset is “open to everyone, from research institutions to AI startups.” By opening up such a large dataset to anyone who wants to train large language models (LLMs), it is designed to “level the playing field.”