A leaked memo last week, reportedly written by Luke Sernau, a senior engineer at Google, said what many in Silicon Valley have been whispering for weeks: an open-source free-for-all is threatening Big Tech's grip on AI.
Researchers and app developers can study, modify, and build on these new open-source large language models, which are alternatives to Google's Bard and OpenAI's ChatGPT. These smaller, cheaper takes on the best-in-class models built by big firms (almost) match their performance, and they are shared for free.
Sernau writes that companies like Google, which revealed at its annual product showcase this week that it plans to put generative AI into everything from Gmail and Photos to Maps, were too busy watching their rivals to notice the real competition. "While we've been squabbling, a third faction has been quietly eating our lunch," he writes.
In many ways this is a good thing. Wider access to these models helps drive innovation, and it can also help catch their flaws. AI will not thrive if just a few mega-rich corporations get to decide what it can be used for or how it is built.
This open-source boom is precarious, though. Most open-source releases still stand on the shoulders of giant models put out by large firms with deep pockets. If OpenAI and Meta decide to close up shop, a boomtown could become a backwater.
Many of these models, for example, are built on top of LLaMA, a large language model released as open source by Meta AI. Others use the Pile, a massive public data set created by the open-source nonprofit EleutherAI. And EleutherAI exists only because OpenAI was once open enough that a group of coders could reverse-engineer how GPT-3 was made and create their own.
Stella Biderman is EleutherAI's executive director and head of research; she also works for the consulting firm Booz Allen. In his memo, Sernau likewise highlights Meta AI's critical role. (Google confirmed to MIT Technology Review that the memo was written in-house by an employee, but noted that it is not an official strategy document.)
All of that could change. OpenAI has already reversed its open policy because of competition concerns, and Meta may start wanting to curb the risk of upstarts doing unpleasant things with its open-source code. "I feel that it's a good thing to do now," says Joelle Pineau, Meta AI's managing director, of opening the code up to outsiders. "But is this the strategy we will use for the next five years? I don't really know, because AI is evolving so rapidly."
If the trend toward closing down access continues, not only will the open-source crowd be left behind, but the next generation of AI breakthroughs will once again be in the hands of the biggest, richest AI labs in the world.
AI's future is at a crossroads.
Open-source bonanza
Open-source software is nothing new; the internet runs on it. But the high cost of building powerful models kept open-source AI on the sidelines until a few years ago. It has since become a huge success.
Just look at the past few weeks. Hugging Face, an AI startup that champions open and free access to AI, unveiled the first open-source chatbot alternative to ChatGPT, the hit chatbot released by OpenAI last November.
Hugging Face's chatbot, HuggingChat, was built using Open Assistant, a large language model fine-tuned for conversation with help from around 13,000 volunteers and released a month ago. Open Assistant is itself built on Meta's LLaMA.
Stability AI, the company behind the hit text-to-image model Stable Diffusion, released StableLM, an open-source large language model, on April 19. On April 28 it followed up with StableVicuna, a version of StableLM that, like Open Assistant and HuggingChat, is optimized for conversation. In other words, StableVicuna is Stability's answer to ChatGPT.
These are just the latest in a string of open-source releases over the past few months, including Alpaca (from a team at Stanford), Dolly (from software firm Databricks), and Cerebras-GPT (from AI company Cerebras). Most of these models are built on LLaMA or on data sets and models from EleutherAI; Cerebras-GPT follows a template set by DeepMind. More will surely come.
For some, open source is a matter of principle. In a video about Open Assistant, AI researcher and YouTuber Yannic Kilcher describes the project as a community-wide effort to make conversational AI accessible to everyone.
"We will not give up on the fight for open-source AI," Julien Chaumond, cofounder of Hugging Face, tweeted last month.
Others are motivated by profit. Stability AI hopes to repeat with chatbots what it did with images: fuel and then profit from a burst of innovation among developers who use its products, taking the best of those innovations and turning them into customized products for clients. "We ignite innovation and then pick and choose," says Emad Mostaque, Stability AI's CEO. "It's the best business model in the world."
The abundance of free, open large language models puts the technology in the hands of millions of people around the globe, inspiring them to explore and create new tools. "There's more access to the technology now than ever before," says Biderman.
"I think that's a testament to human creativity, which is the entire point of open source," says Amir Ghavi, a lawyer at the firm Fried Frank who represents several generative AI companies, including Stability AI.
Melting GPUs
Training a large language model from scratch, rather than building on or modifying an existing one, is hard. "It's still beyond the reach of most people," says Mostaque. "We melted several GPUs building StableLM."
Stability AI's first release, the text-to-image model Stable Diffusion, worked just as well, if not better, than closed-source alternatives such as Google's Imagen and OpenAI's DALL-E. Not only was it free, but it sparked more open-source development in image-making AI last year than any other model.
This time, however, Mostaque wants to set realistic expectations: StableLM is nowhere near as good as GPT-4. "There's still a lot of work to be done," he says. "It's not like Stable Diffusion, where you get something super useful immediately. Language models are harder to train."
The bigger a model gets, the harder it is to train, and not only because of the cost of computing power. The training process breaks down more often with larger models and needs to be restarted, making those models even more expensive to build.
In practice, Biderman says, there is a ceiling on the number of parameters most groups can afford to train. That's because large models must be trained across multiple GPUs, and wiring all that hardware together is complicated. Successfully training models at that scale, she says, is a very new field of high-performance computing research.
The exact number will shift as the technology advances, but Biderman estimates the ceiling at roughly 6 to 10 billion parameters. For comparison, GPT-3 has 175 billion parameters; LLaMA has 65 billion. Larger models don't always perform better, but as a rule they do.
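A rough back-of-the-envelope calculation makes that ceiling concrete. The sketch below is my own illustration, not from the article; the byte counts are assumptions (fp16 weights at 2 bytes per parameter, plus roughly 8 more bytes per parameter of gradients and optimizer state for Adam-style training, on accelerators with 80 GB of memory each):

```python
# Back-of-envelope sketch of why multi-billion-parameter training runs
# must be spread across many GPUs. All byte counts here are assumptions:
# fp16 weights (2 bytes/param) plus ~8 bytes/param of gradients and
# optimizer state for Adam-style training.

def training_memory_gb(params_billions, bytes_per_param=2, overhead_per_param=8):
    """Very rough GPU memory needed just to hold a training run, in GB."""
    total_bytes = params_billions * 1e9 * (bytes_per_param + overhead_per_param)
    return total_bytes / 1e9

GPU_MEMORY_GB = 80  # one high-end accelerator, assumed

for name, billions in [("6B", 6), ("65B (LLaMA)", 65), ("175B (GPT-3)", 175)]:
    need = training_memory_gb(billions)
    print(f"{name}: ~{need:.0f} GB, at least {need / GPU_MEMORY_GB:.1f} GPUs")
```

The real overheads vary widely with precision, optimizer, and sharding strategy; the point is only that memory scales linearly with parameter count, so training outgrows a single GPU, and then a single machine, well before 175 billion parameters.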
Biderman anticipates that the frenzy of activity around open-source large language models will continue. But it will be focused on extending or adapting a few existing pretrained models rather than pushing the fundamental technology forward. Only a handful of organizations have pretrained these models, she says.
That makes the nonprofit EleutherAI's contribution to open-source technology unique: beyond Meta AI, whose LLaMA underpins many open-source models, few organizations release pretrained models at all. Biderman says she knows of only one comparable group, and it is based in China.
EleutherAI was born thanks to OpenAI. Rewind to 2020, and the San Francisco firm had just released a hot new model. "GPT-3 represented a major shift in the way people thought about AI at large scale," Biderman says. It is often credited with an intellectual paradigm shift in what people expect from these models.
Excited by the potential of the new technology, Biderman and a handful of other researchers wanted to play with the model to better understand how it worked. They decided to replicate it.
OpenAI hadn't released GPT-3 itself, but it had shared enough information for Biderman and her colleagues to figure out how it was constructed. It was the middle of the pandemic, and the team didn't have much else to do. Outside her job and playing board games, Biderman says, it was easy to devote 10 to 20 hours a week to the project.
They began by assembling a huge new data set, containing billions of passages of text, to rival what OpenAI had used to train GPT-3. EleutherAI named its data set the Pile and released it for free at the end of 2020.
EleutherAI then used this data set to train its first open-source models. Its largest model took three and a quarter months to train and was sponsored by a cloud computing firm. Biderman says that had they paid for it out of pocket, it would have cost about $400,000, far too much for a university research group.
Helping hand
Because of those costs, it is far easier to build on an existing model, and Meta AI's LLaMA has quickly become the starting point for many new open-source projects. Meta AI, set up by Yann LeCun a decade ago, has long been heavily focused on open-source software development. Pineau says that mindset is part and parcel of the company's culture: "It's a very free-market approach, where we move fast and build things."
Pineau is clear about the benefits. "It diversifies the people who are able to contribute to the development of the technology," she says. That means governments, as well as researchers and entrepreneurs, can gain access to these models.
Like others in the wider open-source community, Pineau believes transparency should be the norm. One thing she encourages her researchers to do is start projects with open-sourcing in mind from the outset, because doing so sets a much higher bar for what data you use and how you build your model.
But there are also serious risks. Large language models can spew misinformation, prejudice, and hate speech, and they can be used for the mass production of propaganda or to power malware factories. Pineau says there is a trade-off to be made between transparency and safety.
Because of that trade-off, Meta AI may hold some models back. If Pineau's team has trained a model on Facebook user data, for example, it will stay in-house, because the risk of private information leaking out is too high. Otherwise, the team might release a model under a click-through license stipulating that it be used only for research purposes.
This was the approach taken with LLaMA. But just days after its release, someone posted the complete model, along with instructions for running it, on the internet forum 4chan. "I still believe it was the best trade-off for this model," says Pineau. "But I am disappointed that people would do this, because it makes it difficult to do these releases."
"We have always received strong support for this approach from company leadership, all the way up to Mark Zuckerberg," she says. "But it is not easy."
For Meta AI, it is a high-stakes game. Pineau says the risk of doing something crazy is much lower for a small startup than for a large corporation. Right now the team releases its models to thousands of people, but if that becomes more problematic, or the risks seem higher, it will close the circle and release only to academic partners with very strong credentials, under confidentiality or nondisclosure agreements that bar them from using the models for anything other than research.
If that happens, many darlings of the open-source scene could find their license to build on whatever Meta AI makes next revoked. Without LLaMA, open-source models like Alpaca and Open Assistant would be far less capable. And the next generation of open-source innovators won't get the head start the current crop has had.
In the balance
Others, too, are weighing up the risks and benefits of this open-source free-for-all.
Around the time Meta AI released LLaMA, Hugging Face introduced a gating mechanism requiring people to request access, and be approved, before downloading some of the models on its platform. The idea is that Hugging Face grants access only to those who can state a legitimate reason for wanting the model.
"I am not an open-source evangelist," says Margaret Mitchell, chief ethics scientist at Hugging Face. "I can see why being closed is a good thing."
Mitchell points to nonconsensual pornography as an example of what can go wrong when powerful models are made widely available. It is one of the main uses of image-making AI, she says.
Mitchell, who previously worked at Google, where she helped found its Ethical AI team, understands the tensions at play. She favors what she calls "responsible democratization," an approach similar to Meta AI's, in which models are released in a controlled way according to their risk of causing harm or being misused. "I appreciate the open-source ideals, but it is useful to have some sort of accountability mechanism in place," she says.
OpenAI is turning off the faucet too. When the company announced GPT-4, the latest version of the large language model behind ChatGPT, last month, one sentence in the technical report stood out: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
These new restrictions partly reflect the fact that OpenAI is now a profit-driven business competing with Google and others. But they also signal a change of heart. Ilya Sutskever, the company's cofounder and chief scientist, told The Verge that its past openness had been a mistake.
Sandhini Agarwal, a policy researcher at OpenAI, says the company has changed its thinking about what is and isn't safe to make public. Previously, she says, something being open source might have interested only a small group of people; now the whole landscape has changed, and open-source development can accelerate progress and lead to a race to the bottom.
It wasn't always this way. Had OpenAI felt the same three years ago, when it first published information about GPT-3, EleutherAI would not exist.
Today, EleutherAI is a key player in the open-source ecosystem. The Pile has been used to train many open-source projects, including Stability AI's StableLM. (Mostaque sits on EleutherAI's board.)
None of that would have been possible had OpenAI shared less information. Like Meta AI, EleutherAI now underpins a wave of open-source innovation.
But with GPT-4 now locked down, and GPT-5 and 6 presumably to follow, the open-source community could once again be left playing in the shadow of large corporations. Its members may produce wild new variants, some of which might even threaten Google's own products, but they will be stuck working with last-generation models. The real progress, the next big leaps, will happen behind closed doors.
Does it matter? How you feel about Big Tech shutting down access, and about its impact on open source, depends on how you view AI and who should make it.
AI is likely to shape how society organizes itself over the coming decades, Ghavi says. "I believe that having a system of checks and transparency is better than consolidating power in the hands of a few."
Biderman agrees. "I don't believe there is any kind of moral requirement that everyone do open source," she says. But in the end, it matters that the people researching and developing this technology are not all financially invested in its commercial success.
OpenAI, for its part, claims it is simply playing it safe. Dave Willner, head of OpenAI's trust and safety team, says it is less about shutting out competitors than about trying to reconcile safety with transparency. As these technologies become more powerful, tensions that were once theoretical play out in the real world.
Willner says that many of AI's norms were shaped by academic research communities, which value collaboration and transparency so that people can build on one another's work. Maybe, he suggests, that has to change as the technology matures.
————————————————————————————————————————————————————————————
By: Will Douglas Heaven
Title: The open-source AI boom is built on Big Tech’s handouts. How long will it last?
Sourced From: www.technologyreview.com/2023/05/12/1072950/open-source-ai-google-openai-eleuther-meta/
Published Date: Fri, 12 May 2023 08:48:17 +0000