ChatGPT, PaLM, LaMDA: AI still has a long way to go
Fabian Suchanek is Professor of Computer Science at Télécom Paris – Institut Mines-Télécom, and Gaël Varoquaux is Director of Research in Artificial Intelligence and Health Applications at the National Research Institute in Digital Sciences and Technologies (Inria Center Paris).
Artificial intelligences learn to speak thanks to “language models”. The simplest models power the autocomplete on your smartphone: they suggest the next word. But the capabilities and progress of more recent language models such as GPT-3, LaMDA, PaLM or ChatGPT are breathtaking: these programs can, for example, write in the style of a given poet, impersonate deceased people, explain jokes, translate languages, and even produce and correct computer code, all of which would have been unthinkable just a few months ago. To do this, the models rely on ever larger and more complex artificial neural networks.
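To make “suggesting the next word” concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the small, publicly available GPT-2 model (not one of the large models discussed in this article); the prompt is our own illustrative example:

```python
# Minimal sketch of next-word prediction with a small public model (GPT-2).
# The larger models discussed in this article work on the same principle,
# at a much bigger scale.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The lawyer visited the"
# Sample a few short continuations: the model proposes plausible next words
# based on patterns seen in its training data.
continuations = generator(prompt, max_new_tokens=3, num_return_sequences=3, do_sample=True)
for c in continuations:
    print(c["generated_text"])
```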
When artificial intelligences talk nonsense
That said, the models are more superficial than these examples suggest. We compared stories generated by language models with stories written by humans, and found the machine-generated stories engaging, but less coherent and less surprising than the human-written ones.
More importantly, we can show that current language models have problems with even simple reasoning tasks. Consider the following example:
“The lawyer visited the doctor. Did the doctor visit the lawyer?”
Simple language models tend to say yes. GPT-3 even replies that the lawyer did not visit the doctor. One possible reason we are exploring is that these language models encode word positions symmetrically, so they do not distinguish “before the verb” from “after the verb”, which makes it harder to tell the subject of a sentence from its object.
Furthermore, the theoretical limitations of “transformer”-based language models mean that they cannot distinguish sequences containing an even number of a given element from those containing an odd number, when that element is interspersed with another. In practice, this means that these models cannot solve a task we call the “pizza task”, a simple puzzle of the form:
“The light is off. I press the light switch. I eat a pizza. I press the light switch. Is the light on?”
Here, an even number of presses of the light switch means that the light is off, but a BERT model fails to learn this. The most powerful current models (GPT-3 and ChatGPT) flatly refuse to conclude that the light is off.
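The puzzle itself is trivial for ordinary code, which only has to track whether the switch has been pressed an even or odd number of times; a minimal sketch (the event wording is our own illustration):

```python
# The "pizza task" boils down to the parity of switch presses:
# an even number of presses leaves the light in its initial state.
def light_is_on(initially_on: bool, events: list[str]) -> bool:
    state = initially_on
    for event in events:
        if event == "press switch":
            state = not state  # each press toggles the light
        # other events ("eat a pizza", ...) leave the light unchanged
    return state

events = ["press switch", "eat a pizza", "press switch"]
print(light_is_on(initially_on=False, events=events))  # False: the light is off
```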
Today’s language models also struggle with negation, and they typically perform poorly on reasoning tasks as soon as these become more complex. Consider, for example, the following riddle from a Chinese civil-service exam:
“David knows Mr. Zhang’s friend Jack, and Jack knows David’s friend Ms. Lin. Everyone who knows Jack has a master’s degree, and everyone who knows Ms. Lin is from Shanghai. Who is from Shanghai and has a master’s degree?”
Current models answer correctly in only 45% of cases, and ChatGPT refuses to answer… while the best human performance is 96%.
The model doesn’t know what it doesn’t know
Most likely, the main drawback of current language models is that they are built to make analogies, not to memorize large sets of items exactly. For example, they cannot remember a list of one million customers or a set of ten thousand proteins. The model will, at its own discretion, forget some items. An even bigger problem is that it can also, again at its own discretion, invent new ones (a process aptly called “hallucination”). This is a fundamental consequence of the machine-learning techniques behind language models, which are optimized for interpolation and generalization, not for memorization.
The problem is that it is impossible to tell from the output of a language model whether an answer was memorized or invented. For example, when GPT-3 is asked where Alan Turing died, the model deadpans “Warrington”, instead of the correct answer (Wilmslow). ChatGPT answers this question correctly, but when asked about lesser-known researchers, such as the authors of this article, it makes up facts. And when asked more complicated questions, it gives elegant but false answers:
“Is the sum of two odd numbers odd?”
The model replies politely:
“Yes, when you add two odd numbers, the result is always an odd number. This is because an odd number can be expressed as 2n+1, where n is an integer. When adding two numbers in this form, the result is (2n+1)+(2m+1) = 2(n+m)+2, also in the form 2p+1, where p is an integer. This means that the result is an odd number.”
In form, the answer is very convincing, but it is false.
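For comparison, the correct derivation takes one line: (2n + 1) + (2m + 1) = 2n + 2m + 2 = 2(n + m + 1), which is twice an integer and therefore even, not odd.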
With these examples, we want to show that it is currently imprudent to rely on a language model to reason or make decisions. Models are getting better over time: they know more, and they are increasingly able to refrain from answering when they lack the information. However, beyond simple questions, a language model can easily concoct an answer, along with an equally contrived and approximate explanation or demonstration.
Other methods excel at reasoning over exact facts
All of this isn’t to say that language models aren’t amazing tools with mind-blowing capabilities. Nor is it to say that language models will never be able to overcome these challenges, or that other deep learning methods will not be developed for this purpose. Rather, it is to say that at the time of writing, in 2022, language models are not the preferred tool for reasoning or for memorizing exact data.
For these functions, the preferred tools currently remain “symbolic representations”: databases, knowledge bases and logic. These representations store data not implicitly, but as sets of entities (such as people, commercial products or proteins) and relationships between those entities (such as who bought what, what contains what, etc.). Logical rules or constraints are then used to reason about these relationships in a way that is provably correct, although usually without taking probabilistic information into account. For example, this kind of reasoning was used as early as 2011 by the Watson computer, during the game show Jeopardy!, to determine which king the following statement referred to:
“A portrait of this Spanish king, painted by Titian, was stolen during an armed robbery from a museum in Argentina in 1987.”
Indeed, the question can be translated into logical rules applied to a knowledge base, and only King Philip II matches. Language models currently do not know how to answer this question, probably because they cannot store and manipulate enough knowledge (links between known entities).
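To give a sense of what reasoning over a knowledge base looks like, here is a deliberately tiny sketch: a handful of facts stored as explicit relations, and a query that intersects them. The mini knowledge base below is our own toy illustration, not Watson’s actual data or machinery:

```python
# A toy knowledge base: explicit facts about entities and relations.
# Real knowledge bases hold millions of entities, but the principle is
# the same: answers come from combining facts with logical conditions.
painted_by = {("Portrait of Philip II", "Titian")}
depicts = {("Portrait of Philip II", "Philip II of Spain")}
stolen_from = {("Portrait of Philip II", "Museum in Argentina, 1987")}
is_spanish_king = {"Philip II of Spain"}

# Query: which Spanish king is depicted in a painting by Titian
# that was stolen from a museum in Argentina in 1987?
answers = {
    king
    for (painting, king) in depicts
    if king in is_spanish_king
    and (painting, "Titian") in painted_by
    and (painting, "Museum in Argentina, 1987") in stolen_from
}
print(answers)  # {'Philip II of Spain'}
```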
It’s probably no coincidence that the same large companies that build some of the most powerful language models (Google, Facebook, IBM) also build some of the largest knowledge bases. These symbolic representations are today often constructed by extracting information from natural-language text: an algorithm tries to build a knowledge base by analyzing press articles or an encyclopedia. The methods used for this extraction are, precisely, language models. Here, language models are not the end goal, but a means of building knowledge bases. They are well suited to this because they are very robust to noise, both in their training data and in their inputs, and can therefore handle the ambiguous or noisy input that is ubiquitous in human language.
Language models and symbolic representations are complementary: language models excel at analyzing and generating natural-language text, while symbolic methods are the tool of choice when it comes to memorizing exact items and reasoning about them. An analogy with the human brain can be instructive: some tasks are easy enough for the brain to perform unconsciously, intuitively, in a matter of milliseconds (reading simple words, or working out “2 + 2”); but abstract operations require painstaking, conscious, logical thinking (for example, memorizing phone numbers, solving equations, or comparing the value for money of two washing machines).
Daniel Kahneman dichotomized this spectrum into “System 1” for subconscious reasoning and “System 2” for effortful reasoning. With current technology, language models appear to solve “System 1” problems, while symbolic representations are well suited to “System 2” problems. At least for the time being, then, both approaches have their raison d’être. Moreover, a whole spectrum between the two remains to be explored: researchers are already studying how to couple language models with databases, and some see the future in merging neural and symbolic models into “neurosymbolic” approaches.
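As a very rough sketch of what such a coupling might look like, the toy program below retrieves an exact fact from a small table and only then hands it to a language model to phrase the answer; the fact table and the model stub are illustrative placeholders, not an existing system:

```python
# Toy sketch of coupling a symbolic store with a language model:
# exact facts come from the database; the language model only phrases the answer.

facts = {
    ("Alan Turing", "place_of_death"): "Wilmslow",
}

def language_model(prompt: str) -> str:
    # Placeholder standing in for a call to an actual language model.
    return f"[a language model would phrase an answer from: {prompt}]"

def answer(person: str) -> str:
    fact = facts.get((person, "place_of_death"))
    if fact is None:
        return "I don't know."  # refuse rather than hallucinate
    return language_model(f"{person} died in {fact}. State this in one sentence.")

print(answer("Alan Turing"))
print(answer("An unknown researcher"))  # -> "I don't know."
```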