Meta AI and Papers with Code introduce Galactica, an open source scientific language model

On November 15, Meta AI and Papers with Code, an autonomous team within Meta AI Research, unveiled Galactica, an open-source language model with 120 billion parameters, trained on a large scientific corpus and able to store, combine, and reason about scientific knowledge. The goal is to help researchers find useful information in the mass of available literature. The announcement has already sparked controversy within the scientific community.

Galactica was trained on a corpus of over 360 million contextual citations and over 50 million unique references normalized across a diverse set of sources, enabling it to suggest citations and help discover related articles. Among these sources is NatureBook, a new high-quality scientific dataset that allowed the model to be trained on scientific terminology, mathematical and chemical formulas, and source code.

Managing the flood of scientific information

Information overload is a major obstacle to scientific progress. Researchers are buried under such a mass of papers that they struggle to find the information relevant to their work.

Galactica is a large language model (LLM) trained on more than 48 million articles, textbooks, reference papers, compounds, proteins, and other sources of scientific knowledge. Academic researchers can use it to explore the literature, ask scientific questions, write scientific code, and more.
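As a rough illustration of that interface, the sketch below loads one of the released checkpoints through the Hugging Face transformers library and asks for a completion. The facebook/galactica-1.3b model name and the prompt are assumptions made for the example (Meta published checkpoints at several sizes); the 120-billion-parameter model is used the same way but requires far more memory.

```python
# A minimal sketch of querying a Galactica checkpoint via Hugging Face
# transformers. Model name and prompt are illustrative assumptions.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "The Transformer architecture consists of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding is enough for a demo; decoding parameters are a
# choice for the example, not part of the model itself.
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))
```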

The dataset

The dataset was created by tokenizing information from various scientific sources. For the interface, the team used task-specific tokens to support different types of knowledge. Citations are wrapped in a special token, which allows the model to predict a citation from any input context.
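For illustration, this citation interface can be exercised with the same hypothetical setup as in the sketch above. The [START_REF] marker is the citation token named in the Galactica paper; the prompt text is an assumption.

```python
# Sketch: asking the model to suggest a reference. The [START_REF]
# token marks where a normalized citation should be generated.
prompt = "We optimize the network with the Adam optimizer [START_REF]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
```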

Step-by-step reasoning has also been wrapped in a special token, which mimics an internal working memory.
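The paper calls this reasoning token <work>. Below is a sketch of how a prompt might invoke it, again reusing the hypothetical model and tokenizer from above; the question itself is illustrative.

```python
# Sketch: appending the <work> token to request step-by-step
# "working memory" reasoning, per the paper's description.
prompt = "What is the average of 4, 8, 15, and 16?\n\n<work>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=120)
print(tokenizer.decode(outputs[0]))
```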

The results

Galactica has achieved excellent results in many scientific fields.

On probes of technical knowledge such as LaTeX equations, Galactica outperformed the latest GPT-3, scoring 68.2% versus 49.0%. It also showed strong reasoning performance, beating Chinchilla on mathematical MMLU (41.3% vs. 35.7%) and PaLM 540B on MATH (20.4% vs. 8.8%).

It also sets a new state of the art on downstream tasks such as PubMedQA (77.6%) and MedMCQA (52.9%). And although it was not trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench.

For its authors, these results demonstrate the potential of language models as a new interface to science. They have released the model as open source for the benefit of the scientific community.

Controversy

The Galactica website warns that language models offer no guarantee of truthful or reliable output, and that their suggestions should be verified before being followed: “Some of the text generated by Galactica can feel very authentic and highly confident, but it can be subtly wrong in many ways. This is especially true for highly technical content.”

Galactica should be seen as a writing aid, as noted by Yann LeCun on Twitter:

“This tool is to paper writing as driving assistance is to driving. It won’t write papers for you automatically, but it will greatly reduce your cognitive load while you write them.”

Gary Marcus, an AI expert, and Michael Black, director of the Max Planck Institute for Intelligent Systems, however, reacted on Twitter, warning that false information generated by Galactica could find its way into scientific writing and mislead readers.

Meta AI and Papers with Code have not commented yet, but they have disabled the demo feature of the Galactica site.

Article sources:

Galactica: A Large Language Model for Science
arXiv:2211.09085v1, Nov 16, 2022

Authors:
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic.
Meta AI
