We’ve all seen some really cool stuff emerge from the intersection of Artificial Intelligence and Natural Language Processing over the past few years. Less than ten years ago, a machine that could actually hold a conversation with you seemed like science fiction. Now Google Assistant and Alexa are mainstream products. We used to have to dig through articles when we had a question; now, if you google one, the search engine generates a summary of the best answer for you. On top of that, new medicines are being developed because AI models comb through medical journals to see whether published treatments could have merit in other areas of medicine.
More recently, all the hype has been around OpenAI’s GPT-3 model. It has shown remarkable flexibility and performance. In fact, it works so well that the limits almost seem to lie in the imagination of the user rather than in the technical aspects of the model itself. And the great thing is, it’s just a precursor of what’s yet to come.
The main reason GPT-3 works so well is that it uses a recently proposed Deep Learning architecture called the Transformer. In this blog I want to give you an idea of what sets this architecture apart from other commonly used architectures, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and in doing so help you understand why I think Transformers will (soon) do for fields such as Computer Vision, Robotics and conversational AI what GPT-3 has done for Natural Language Processing.
Many complicated AI tasks require a model that can use contextual information. As a simple example, say we want to translate the sentence “A can of tuna” into another language. This is a much more complex task than you’d think. For instance, does the word “can” mean “a tin container” or “to be able to”? And, even sillier, does the word “of” mean that the tuna is inside the can, or that the can is made out of tuna? For you, the reader, this is laughably obvious, but for a computer it is very hard.
Different AI models use context in different ways. CNNs look at overlapping windows of words (e.g. “A can of” and “can of tuna”) and give a word meaning based on the other words in the same window. RNNs, on the other hand, look at each word sequentially and take into account the words they have already seen. Both techniques work well in their own right, but they become very inefficient when you have to consider long pieces of text.
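To make the difference concrete, here is a toy sketch in plain Python. It is not a real network: the windowing and the running “state” merely stand in for what a CNN and an RNN do with learned weights.

```python
# Toy illustration of how CNNs and RNNs consume a sentence.
words = ["A", "can", "of", "tuna"]

# CNN-style: overlapping windows of size 3. Each word is only ever
# interpreted together with its immediate window neighbours.
windows = [words[i:i + 3] for i in range(len(words) - 2)]
print(windows)  # [['A', 'can', 'of'], ['can', 'of', 'tuna']]

# RNN-style: walk left to right, carrying a running summary of
# everything seen so far (a stand-in for the hidden-state update).
state = []
for w in words:
    state = state + [w]
    print(w, "-> seen so far:", state)
```

For a long text, the CNN would need many stacked layers before distant words share a window, and the RNN’s state has to survive every intermediate step — which is exactly the inefficiency mentioned above.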
A convolutional neural network looks at overlapping windows
A recurrent neural network takes into account the words that it has already seen
Transformers work differently. They use a mechanism called “Attention”. The idea of attention was first proposed in 2014 for machine translation, and the Transformer, an architecture built entirely around it, followed in mid-2017. Attention allows the Transformer to look at all words simultaneously and to calculate, for each word, how important every other word is to it.
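That “how important is every other word” computation can be written down in a few lines of NumPy. This is a minimal sketch of scaled dot-product attention, the variant used in the Transformer; the word vectors here are random placeholders rather than learned embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: rows become weights that sum to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every query scores every key,
    softmax turns the scores into weights, and the weights mix the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_words, n_words) importance matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy input: 4 "words", each a 3-dimensional vector (random stand-ins).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Self-attention: the words act as queries, keys and values at once,
# so every word attends to every other word in a single step.
out, w = attention(X, X, X)
print(w.shape)  # (4, 4): one weight for every (word, word) pair
```

Note that the weight matrix covers all pairs of positions at once: word 1 can weight word 4 just as easily as its direct neighbour, with no notion of distance in between.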
Look, for example, at the two sentences “The animal didn’t cross the street because it was too tired.” and “The animal didn’t cross the street because it was too wide.” The word “it” is surrounded by exactly the same words, but in the first sentence it refers to “animal” and in the second to “street”. In the picture below you can see how a Transformer model divides its attention over each sentence: it finds the word “animal” the most important in the first sentence (left) and the word “street” in the second (right).
This is an incredibly important innovation, for many technical reasons that I won’t go into here. For now, think of it this way: attention allows the model to pull relevant information from anywhere in the input, without being limited by the distance to the word it is currently looking at.
Now, I use the word “input” rather than “text” very deliberately here, because the idea of attention is not restricted to words and sentences. You can imagine many different forms of data this could work with. A Computer Vision model could pay attention to different parts of an image, or to different frames of a video. A robot could pay attention to different parts of its environment while learning to walk. A self-driving car could pay attention to different parts of its surroundings, or to the communication between itself and the self-driving cars around it, to prevent accidents and optimize the speed at which everyone can drive. An audio model could pay attention to different voices to isolate one particular voice from background noise. My point is: Transformers are not limited to text. They are applicable in all the fields where AI is now being used, and they are very likely to bring improvements to those fields reminiscent of what GPT-3 did for NLP.
So if there is one thing I want you to take away from this blog, it is this: in the next few years we will see many incredible breakthroughs in AI, and a very large portion of them will be new applications of Transformer models, or of their underlying principle, Attention.