“Language Models are Few-Shot Learners” by Brown et al. (link) was the world’s introduction to OpenAI’s GPT-3, the state-of-the-art language model that showed how far such models could be taken. GPT-3 soon became the ‘talk of the net’ for its amazing ability to do few-shot learning despite never having been specifically trained to do so. Language models turn out to have an unexpected natural ability to do few-shot learning, including, in some cases, zero-shot and one-shot learning.
For example, a zero-shot prompt might be:
Translate English to French:
cheese =>
Here the model is expected to understand from the instruction alone that it should translate ‘cheese’ into French. By comparison, a one-shot prompt gives a single example:
Translate English to French:
sea otter => loutre de mer
cheese =>
A few-shot prompt, naturally, provides more than one example, typically between 10 and 100. Since 93% of the words in GPT-3’s training set were English, there was no explicit training for translation. Yet, despite never being trained to translate, the model scores roughly one-third correct when translating English to French with few-shot learning.
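The zero-, one-, and few-shot formats above differ only in how many demonstration pairs precede the query. As a minimal sketch (the function name and `=>` separator follow the paper’s examples; the helper itself is illustrative, not part of any API):

```python
def make_prompt(task, examples, query):
    """Build a zero-, one-, or few-shot prompt.

    task:     an instruction line, e.g. "Translate English to French:"
    examples: list of (input, output) demonstration pairs (empty for zero-shot)
    query:    the input the model is expected to complete
    """
    lines = [task]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # model completes after the arrow
    return "\n".join(lines)

# Zero-shot: instruction only, no demonstrations.
zero = make_prompt("Translate English to French:", [], "cheese")

# One-shot: a single demonstration before the query.
one = make_prompt("Translate English to French:",
                  [("sea otter", "loutre de mer")],
                  "cheese")
```

A few-shot prompt is the same construction with 10–100 pairs in `examples`; nothing about the model changes between settings, only the text it conditions on.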
Some of GPT-3’s successes seem a bit mysterious. It achieved 80.2% accuracy on 3-digit addition and 94.2% on 3-digit subtraction, despite only 19 of the 4,000 test problems appearing in the training set. Moreover, “inspection of the incorrect answers reveals that the model often makes mistakes such as not carrying a ‘1’, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.” (p. 23)
Even more surprising is GPT-3’s ability to learn novel words that don’t appear in its training set – even words that are entirely made up. For example:
Input: To do a “farduddle” means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
Output: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.
Consider also GPT-3’s ability to solve 40.2% of the easier few-shot anagram problems despite their absence from the training set. Even more telling, this accuracy is roughly cut in half in the one-shot setting, suggesting that the model really is learning at test time rather than during training. Since the model works with tokens rather than individual letters, it is particularly surprising that it can learn to pull apart the substructure of words. (p. 24)
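To make the task concrete, here is a sketch of how such anagram problems can be generated, assuming the easier variant discussed in the paper, in which the first and last characters stay in place and only the interior letters are scrambled:

```python
import random

def anagram_task(word, rng=random.Random(0)):
    """Scramble a word's interior letters, keeping the first and
    last characters fixed; the model must recover the original word."""
    if len(word) <= 3:
        return word  # too short to have interior letters to scramble
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

scrambled = anagram_task("inevitably")
```

Solving this requires reasoning about individual characters, which is exactly what the model’s subword tokenization hides from it.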
In other words, GPT-3 shows a remarkable ability to generalize far beyond what it was actually trained to do, which is really just predicting the next word in a string of words. That it seems to ‘learn at test time’ goes against conventional Machine Learning theory.
While it is unable to pass the Turing Test, its results are certainly more impressive than those of any ‘chatbot’ that has come before. Perhaps this is because it isn’t, strictly speaking, intended to be a chatbot at all; rather, chat is one more task it can perform without being trained to, because it contains so much knowledge about language.
The paper raises more questions than it answers, and even (unintentionally) poses a number of interesting epistemological questions that I hope to take up in a future post. Probably the most important is the fact that GPT-3 seems to learn at test time (as a “few-shot learner”) even though its model weights clearly aren’t being updated. What is going on in these circumstances, and how does it do it? The truth is that we don’t really understand how language models work or what they are doing ‘under the hood.’
Known Limitations of GPT-3
The paper claims that “GPT-3 seems to have special difficulty with ‘common sense physics’…. Specifically GPT-3 has difficulty with questions of the type ‘If I put cheese into the fridge, will it melt?’.” (p. 33) In fact, one of the best ways to trick GPT-3 seems to be to pick knowledge so commonplace that it’s unlikely to appear in the corpus GPT-3 was trained on (i.e. the Internet). So, for example, asking it how many eyes a giraffe has will get the correct answer of two, but asking it how many eyes your foot has will get the same incorrect answer of two. Since GPT-3 is really just predicting the next word in a sequence rather than actually answering questions, it can’t handle questions that are unlikely to show up on the internet, and it doesn’t know how to say “that’s a nonsense question.”
However, it seems to be possible to train GPT-3 at test time to recognize nonsense questions. This suggests how these limitations might be fixed: GPT-3 could be trained specifically to understand the concept of nonsense, or it could be trained on a knowledge base designed to teach it common-sense knowledge that is too commonplace to show up in the corpus, because humans would consider it uninteresting to write down.
It also seems possible that this sort of technique could capture knowledge directly instead of indirectly. GPT-3 learns answers to questions by learning the relationships of words to each other. As mentioned, it isn’t really trying to answer questions at all; it’s just trying to predict the next word in a sequence. What we need is a technique for training a model specifically on knowledge rather than language, paired with a way to translate that knowledge into language. Perhaps an architecture built on graph embeddings would be relevant here.
GPT-3 and Biases
GPT-3 shows subtle biases that match known human biases. For example, in some cases it assumes gender based on career (e.g., nurse). Even more concerning, when prompted with “The <race> man was very”, it tended to produce more negative completions for some races than others, reflecting stereotypes. Of course, this makes sense from a purely technical perspective: since GPT-3 was trained on the internet, it reflects the stereotypes of the internet. And one might argue that certain careers really are dominated by one gender over another, and that we’d expect GPT-3 to know that.
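The kind of probe described above can be sketched as follows. This is a toy illustration only: the sentiment lexicon and the stand-in completions are assumptions of mine, not the paper’s method or data (the paper uses a much more careful co-occurrence analysis), and a real probe would sample completions from the model rather than hard-code them.

```python
# Tiny illustrative sentiment lexicon (an assumption, not from the paper).
SENTIMENT = {"brilliant": 1, "kind": 1, "lazy": -1, "hostile": -1}

def mean_sentiment(completions):
    """Average lexicon score over all words in a list of completions."""
    scores = [SENTIMENT.get(word.strip(".,").lower(), 0)
              for text in completions for word in text.split()]
    return sum(scores) / len(scores) if scores else 0.0

template = "The {group} man was very"  # the paper's probe template
completions_by_group = {
    "A": ["brilliant and kind."],   # stand-in model outputs
    "B": ["lazy and hostile."],
}
sentiment_by_group = {g: mean_sentiment(c)
                      for g, c in completions_by_group.items()}
```

A systematic gap between groups in `sentiment_by_group` is the sort of signal the paper reports: the disparity comes from the training corpus, not from any explicit instruction.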
However, it’s not hard to see why this is concerning. We don’t want our ML models to reflect human biases, even when those biases are present in the training set. Doing so leads to offensive situations that can damage a company’s reputation (Google famously had its ML label black people as gorillas, and was so embarrassed that it removed ‘gorilla’ as a possible label). Learning how to remove biases from a model is therefore an area of intense research. The paper suggests that “In order to pave the way for effective bias prevention in general-purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models.” (p. 39) But ultimately, the situation is difficult to deal with because bias can’t be removed easily or directly, and trying to do so tends to create ‘blind spots.’ (p. 39) One can imagine a nightmare scenario in which a GPT-3-based chatbot uses offensive racial stereotypes when addressing users.
Ultimately, the paper recommends engaging bias in a holistic manner rather than treating it merely as a metric-driven problem.