The field of AI periodically has breakthrough moments where gradual advances in AI software, combined with ever more powerful computers, reach a kind of threshold: all the small changes add up to something qualitatively different.
Obviously, that’s ChatGPT at the moment.
It sounds absurd, but under the covers ChatGPT is just an extremely sophisticated statistical language model backed by a mind-bogglingly huge amount of computing power. Even more absurd is that the language model is actually a one-trick pony. It’s very, very good at that one trick, however.
So what’s the trick? The trick is guessing what the next word is in a sentence. More precisely, the trick is guessing what comes next after being presented with some text, including paragraphs, punctuation, etc.
That doesn’t really explain what’s going on though, because ChatGPT seems to be doing so much more than just guessing what comes next in a piece of text. To dig in, we’ll go back to simpler times (at least, simpler for language models).
Cue time travel music. First stop: the 1950s.
In the 50s a guy called Claude Shannon, who was a veritable rockstar of computer science, performed an experiment to find out a little bit about predicting the next letter in a sentence. He might not have been the first person to think about language models, but he was the first to publish a paper and that’s all that matters. Shannon asked volunteers to guess the next letter when given the first part of a sentence. Sometimes the task was easy and sometimes not so easy. If a sentence starts with “Mary had a little la” then you can guess “m” as the next letter pretty confidently. When prompted with “I went to ” it’s certainly harder to guess the next letter. (I’m going to ignore spaces and punctuation for the purposes of this explanation.)
After doing this with many people and many prompts, Shannon calculated that people can guess the next letter about half the time. It’s important to note that the people taking part in the study applied an extremely sophisticated language model to perform these predictions – their brains. This model isn’t just using spelling and grammar, it’s using knowledge of poems and other knowledge that is shared among people using English.
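You can get the flavour of this kind of prediction with a few lines of code. This is just a toy sketch, not Shannon’s actual method – it guesses the next letter by counting which letter most often follows each letter in a tiny made-up snippet of text:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus (a real experiment would use far more text).
corpus = "mary had a little lamb its fleece was white as snow"

# Count which letters follow each letter in the corpus.
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

def guess_next(letter):
    """Guess the letter that most often followed `letter` in the corpus."""
    return follows[letter].most_common(1)[0][0]

print(guess_next("l"))  # → e  ("l" is followed by "e" twice, "i" and "a" once each)
```

A human brain brings vastly more to the task than this, of course – this toy only knows about pairs of letters, while Shannon’s volunteers were using spelling, grammar and general knowledge all at once.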
So if you know anything about nerds, you’ll know what comes next. The nerds wanted to see how well a computer could do at the same task. Of course, at first it wasn’t as well as humans. (I don’t know the numbers, sorry.) Let’s look at some highlights.
Progress was slow at first. Computers were slow and didn’t have much disk space or memory. In the 70s and 80s, computers got to the point where you could, for example, feed in the complete works of Shakespeare and get the computer to output some text that looked a bit like Shakespeare. These language models were very simple. When predicting the next word, they just look at the previous word or perhaps the previous two words and calculate which word has most often followed those words in Shakespeare’s actual writing.
So now we have something that’s starting to look a little bit like a modern language model. We feed in some training data, the computer does some statistics and then we can use the model to predict words. Let’s say our model uses two words to predict the next word. If we prompt our model with some text such as “But soft, what light”, it’ll look at “what light” and say to itself “Well, I’ve seen ‘what light’ twice in Shakespeare. Once it was followed by ‘through’ and once it was followed by ‘is’, so I’ll guess one of those with a 50/50 chance of each”. If we repeat the process, we can get our model to produce some Shakespeare-like text. It’ll be terrible though. This approach of feeding in text and guessing words based on simple statistics can be improved with more training text and by looking at more words in the prompt, but it was never useful for much. Another way to improve the model is to program in some grammatical rules to help with word prediction. Nouns tend to follow adjectives, and so on.
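Here’s a minimal sketch of that two-word lookup. The text below is a tiny hand-typed stand-in for the complete works, chosen just so that “what light” appears twice:

```python
import random
from collections import defaultdict

# A tiny stand-in for the complete works of Shakespeare.
text = (
    "but soft what light through yonder window breaks "
    "o speak again bright angel for thou art "
    "what light is light if silvia be not seen"
)
words = text.split()

# For every pair of consecutive words, record the word that followed them.
model = defaultdict(list)
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    model[(w1, w2)].append(w3)

# "what light" was followed once by "through" and once by "is" ...
print(model[("what", "light")])  # → ['through', 'is']

# ... so the model picks one of those at random, 50/50.
print(random.choice(model[("what", "light")]))
```

Repeating that prediction – feeding each guessed word back in as part of the next prompt – is how the model churns out its terrible Shakespeare.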
Skip forward to the 90s and a type of statistical model called a “neural network” was all the rage. It was okay at recognising handwriting and not half bad at recognising people’s faces or understanding a word spoken by a person. It’s called a “neural network” because it comes from AI research and uses learning principles inspired by actual neurons in actual brains. However, neural networks aren’t really much like actual brains.
The main thing about a neural network is that it’s a bunch of “neurons” that are connected to each other. Some of the neurons are also connected to the input (words, pixels, etc) and some of the neurons are also connected to the output (what word comes next, whether that picture is a dog, etc).
Training a neural network is a process of adjusting the connections between each neuron to improve the accuracy of the predictions made by the network at the output neurons. Connections can be strong or weak – adjusting the strength of the connection changes how much one neuron affects another. These strengths are called “weights”.
Weirdly, neural networks start out with completely random weights. An untrained network is completely useless. Training a neural network is conceptually very simple. The maths is hard but we don’t care about that. To train a network we need samples of input with the corresponding correct outputs. We train a network by giving it sample inputs, looking at the outputs and then adjusting the weights to make the outputs closer to the correct outputs. This turns out to be an excellent way to train lots of models including language models, but it does take a lot of computing power.
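The whole training loop can be sketched with a single “neuron” that has just one weight. Everything here – the target function (doubling the input), the learning rate and the sample data – is made up for illustration:

```python
import random

random.seed(0)
w = random.uniform(-1, 1)  # start with a random weight (useless at first)

# Training samples: inputs with the corresponding correct outputs (here, y = 2x).
samples = [(x, 2 * x) for x in range(1, 6)]

for _ in range(100):                # repeat the whole process many times
    for x, target in samples:
        output = w * x              # the neuron's prediction
        error = output - target     # how far off the correct output it was
        w -= 0.01 * error * x       # nudge the weight to shrink the error

print(round(w, 3))  # → 2.0
```

The weight ends up very close to 2 – the network has “learned” to double numbers. Real networks do exactly this, just with billions of weights instead of one, which is where all that computing power goes.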
Before neural networks, if you asked a developer to write some code that could translate English to French and vice versa, the developer would write some code that does basically what you would do with a language dictionary and a copy of Bescherelle. Translate the nouns, verbs, etc., then do some conjugations, move some adjectives before or after the noun, and so on. It turns out to be a very tall mountain to climb for a very disappointing result. Google decided to collect all the “parallel texts” they could find – texts that have the same meaning in different languages. United Nations meeting transcripts were an excellent resource for this. They then trained a gigantic neural network (yep, just lots of neurons) with one language as inputs and the other as outputs. One thing they had to do was expand the number of words the model works with. Our Shakespeare model was only looking at a few words, but this translation model looks at whole sentences at a time. This is what Google Translate is now. Lots of people were surprised. It was one of those holy cow moments in language models.
You know when you start typing in the search box in Google and it starts suggesting words? Yep, that’s a language model.
Microsoft, Amazon and other large companies started their own large language models to compete with Google.
One company called OpenAI started creating language models and charging a subscription to use the models. Now anyone could use a language model that’s already trained.
Meanwhile, more advances were being made. A technical improvement vastly improved the speed of training, and general improvements meant that the size of input text and output text that language models can handle increased. On top of that, OpenAI spent a couple of years (I think) processing all the English text on the internet that it could find. All of Twitter, all of Reddit, all the text on any website you can think of. Then it released a model called GPT-2 and made it available for people to use. GPT-2 caused a huge stir. It was really good at writing things. Give it a few sentences to get it started and it would write a few pages of text that was amazingly good. The language model was so detailed that it “knew” things. So now we have a language prediction model that “predicts” that what comes after the question “Who wrote A Midsummer Night’s Dream?” is “William Shakespeare wrote A Midsummer Night’s Dream.” (OpenAI’s models tend to speak in complete sentences unless prompted otherwise.)
This stir and attention was fantastic for OpenAI’s business model so the inevitable GPT-3 model came along, which was even better than GPT-2.
Now one of the problems (for OpenAI at least) with GPT-2 and GPT-3 was that you basically needed to be a developer to use the language models. How to expand the market? How to get more subscribers? Create something that can respond to requests made by people using English language prompts!
Well, we’ve seen that GPT-3 can answer questions just by predicting what comes after a question. It learned this by learning from examples on the internet of people asking questions and other people answering them.
It’s answering questions as if it knows that it’s being asked a question and then answering the question but – and this is really important – it’s just producing text that is a prediction of what would come after the input text. The input text is a question so the statistical model says the next bit of text is an answer.
So OpenAI decided to train a new model using GPT-3 as the starting point and train it some more (a lot more) to interact like a chat bot – that is, to produce output text that matches very closely what people would rate as the “best” response. This also included a huge amount of training to make sure it knows that racist/intolerant/etc. answers are not good responses.
They paid lots and lots of people all over the world to “talk” to ChatGPT and then rate ChatGPT’s responses. These ratings went back to OpenAI who were doing some clever things to train ChatGPT with this feedback.
One very important thing that OpenAI did to the language model was to teach it how to talk about itself and answer questions about itself. If you ask it what it is, it will tell you that it’s a language model. It’s important to understand that this is just training that’s been done to the language model. It could equally have been trained to tell people that it is a purple teapot living on Mars. It doesn’t “know” that it is ChatGPT any more than it “knows” who wrote A Midsummer Night’s Dream. But it’s very convincing. Very, very convincing.
So that’s how ChatGPT was created. The main takeaway is that it’s still just a predictive language model – it’s not aware of what it’s saying in any meaningful way.
Jeremy Rixon is a Senior Software Developer at the National Library of Australia. Between 2020 and 2022, he was Senior Developer for the post-launch enhancement phase of the National edeposit service (NED), working with all NSLA libraries in Australia.