The Unreasonable Effectiveness Of GPT-3
Written by Mike James
Wednesday, 05 August 2020
I suppose you could say that the unreasonable effectiveness of OpenAI's GPT-3 language model is just a special case of the unreasonable effectiveness of deep learning but ... it is so much more shocking.
The most important thing to say is that GPT-3 doesn't implement any great new idea. It's a neural network with lots of layers, an interesting architecture, and lots and lots of parameters. It is bigger, not different. This seems to be the overall lesson of neural networks. Back in the 1980s we thought that neural networks might be the solution, but the computing power that was available limited what could be implemented to a small number of layers and a modest number of neurons. Put simply, things sort of worked but not well enough. Pioneers such as Geoffrey Hinton continued to believe in neural networks and expended a lot of energy on finding ways of making networks easier to train. As it turned out, much of this work wasn't really necessary because, as computing power grew, it became evident that deep neural networks had been working all along. It's just that we needed to implement bigger networks and use lots more data and time to train them. You could sum this up by saying that we were always on the right track; it's just that we didn't, or rather couldn't, think big enough.
This brings us to GPT-3, or Generative Pre-trained Transformer 3. The two big areas of neural networks are vision and language. Vision makes use of convolutional neural networks to detect objects at different places in the visual field. Until recently, language networks have been recurrent, which can be thought of as convolutional within the time domain. Recurrent networks take samples from more than one point in time and feed their output back in as input. They are generally regarded as difficult to work with and difficult to train, but when they do work they work very well indeed.
The reason a recurrent network is needed is that, in language, words are affected by the words that came earlier in the sequence. There are correlations between pairs of words, triples of words and so on, and the degree of recurrence sets the distance over which one word can affect the meaning of another.
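To make the idea of correlations between pairs of words concrete, here is a minimal Python sketch that counts which word follows which in a toy corpus and turns the counts into conditional probabilities. The function name and the corpus are made up for illustration, and this is the simplest possible statistical model of text, not a description of how a recurrent network or GPT-3 works internally.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Estimate P(next word | previous word) from raw pair counts."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    # Count every adjacent pair (previous word, next word).
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        prev: {w: c / sum(followers.values()) for w, c in followers.items()}
        for prev, followers in counts.items()
    }


# Toy corpus, invented for this example.
corpus = "the cat sat on the mat and the cat slept"
probs = bigram_probabilities(corpus)
print(probs["the"])
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the model assigns "cat" a probability of two thirds after "the". Extending the same counting to triples and longer runs is what lets one word influence another at a greater distance.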
That word "meaning" is a dangerous one. You can read more into words and language than is really there. If you step back and try to forget that you are a creature of words, then all you really have is the statistical structure of the symbols that make up the language, and somehow this statistical structure mirrors the structure of the world. You might not like the idea that meaning is just the statistical properties of the symbolic system we call language, but as we shall see GPT-3 seems to suggest that this is true.
The first thing to say about the recent neural language models is that they are based on two new ideas - the transformer and attention. Putting these two things together has allowed simple standard feed-forward networks to be used. Abandoning the recurrent network simplifies things so much that we can train faster and hence use much bigger datasets.
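For readers who want to see what attention amounts to, the following is a rough sketch of scaled dot-product attention, the core operation of the transformer, in plain Python. The toy vectors and function names are assumptions made for illustration. It leaves out the learned projection matrices, the multiple heads and the masking that a real GPT-style model uses, so treat it as an outline of the idea and not as OpenAI's implementation.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy 2-dimensional "word" vectors used as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
for row in out:
    print([round(v, 3) for v in row])
```

Each output row is a blend of all the input vectors, weighted by how strongly that position "attends" to the others. Every position is handled by the same feed-forward arithmetic, with no state carried from one step to the next, which is why no recurrence is needed.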
The GPT series of models uses transformers and attention in a fairly standard way. GPT-2 had 1.5 billion adjustable parameters and it performed well, but not spectacularly well. So OpenAI took a leap of faith and implemented GPT-3 with 175 billion parameters and trained it on a huge dataset. GPT-3 is an order of magnitude bigger than anything before and this is where much of its power seems to come from. To put this into context, just storing its parameters takes more than 300GBytes of memory (175 billion parameters at two bytes each is 350GBytes). The training set was taken mostly from the web, something like 500 billion tokens of text, and training is estimated to have taken 355 GPU-years and cost $4.6 million.
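The storage figure is easy to sanity-check with a few lines of Python. The parameter counts are the ones quoted above; the bytes-per-parameter values are assumptions, as the total depends on whether each weight is stored in 16 or 32 bits.

```python
PARAMS_GPT2 = 1.5e9   # parameters, as quoted in the article
PARAMS_GPT3 = 175e9   # parameters, as quoted in the article


def storage_gb(n_params, bytes_per_param):
    """Storage needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, n in [("GPT-2", PARAMS_GPT2), ("GPT-3", PARAMS_GPT3)]:
    # 2 bytes = 16-bit floats, 4 bytes = 32-bit floats (assumed precisions)
    print(f"{name}: {storage_gb(n, 2):.0f} GB at 16-bit, "
          f"{storage_gb(n, 4):.0f} GB at 32-bit")

ratio = PARAMS_GPT3 / PARAMS_GPT2
print(f"GPT-3 has about {ratio:.0f} times the parameters of GPT-2")
```

At 16 bits per weight this gives 350GBytes for GPT-3, in the same range as the figure quoted above, and twice that at 32 bits. Either way, it is far more than fits in the memory of a single GPU.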
The training was particularly simple. Take a text, hide the next word, and train GPT-3 to predict it from the words that came before, working through the text word by word. At the end of the training what you have is a massive auto-complete program. You can type in a sentence and GPT-3 will offer you completions. This sounds trivial, but it works with a context of up to 2048 tokens. That is, you can type in a complete sentence and GPT-3 will auto-complete with another sentence. This makes it seem to be so much more than an auto-completer. You can type in a question and you will get an answer. Type in a command and you will get a response. This isn't working at the level of "predict the next word to finish a sentence"; it is capturing the statistical structure of much larger chunks of text.
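The shape of the training data can be sketched in a few lines of Python. This hypothetical helper turns a text into (context, target) pairs, where the target is the next word and the context is limited to a fixed window. It splits on whitespace for simplicity, whereas GPT-3 works on sub-word tokens and a 2048-token window, and the function name is invented for illustration.

```python
def next_word_examples(text, context_size):
    """Return (context, target) pairs; context holds at most context_size words."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        # The context is the window of words immediately before position i.
        context = words[max(0, i - context_size):i]
        examples.append((context, words[i]))
    return examples


pairs = next_word_examples("you can type in a question", context_size=3)
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Every position in the text yields one training example, so a single document supplies many predictions to learn from and no human labelling is needed.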
Now this is where things get spooky. You would think that training in this way would produce a good auto-complete type system, but it seems to go much further. For example, if you type in a sum like "what is 234 plus 231?" then GPT-3 tends to get the right answer - not always, but often enough to be surprising, around 80% of the time for three-digit addition. You might suppose that this is possible because the sum exists as an example in the huge training set, but the researchers searched the training data for such sums and found almost none of them. This isn't rote memory recall.
The research paper lists lots of tasks that GPT-3 does well on. The key thing is that there has been no fine-tuning on these tasks. For example, if you give it an example of an English-to-French translation followed by an English phrase, it gives the French translation as a completion. In other language models the system has to be trained again on translation examples. This is fine-tuning, as the weights in the model are adjusted. For GPT-3 no weights were modified to provide an improvement for the different tasks.
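What "giving it a translation and a phrase" looks like in practice is just a piece of text. The sketch below builds such a prompt in Python. The function name, the "=>" separator and the example pairs are assumptions chosen for illustration, and no model is called; the point is only that the task is specified entirely in the input, with the model left to complete the last line.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a plain-text prompt: task description, worked examples, open query."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The final line is left unfinished for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

Switching to a different task means changing the text of the prompt, not retraining the network.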
What is really impressive is the generality of the language ability, which covers everything from translation and grammar correction to news report generation, question-answering, comprehension and so on. It seems that you can capture the generality of language in just its statistical properties, because there certainly is no understanding or reasoning of any kind going on in GPT-3 - it's down to conditional probabilities.
In fact the failures indicate this:
"Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”
You can read more about GPT-3 and find many examples of its language generation capacity on the web. The paper concludes with:
"Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems."
GPT-3 may well have practical importance in the future, but it also serves to put the structure of language to the fore. If you can achieve so much without "understanding", perhaps you don't need understanding. Or perhaps the statistical structure is the understanding.
Last Updated ( Wednesday, 05 August 2020 )