Transformers Comprise the Fourth Pillar of Deep Learning
Modern artificial intelligence (AI) is powered predominantly by three categories of neural networks, each with distinct capabilities serving different markets. Convolutional neural networks (CNNs) handle image recognition, enabling image labeling, video analytics, and autonomous driving. Recurrent neural networks (RNNs) are suited to processing speech, forming the backbone of voice assistants like Siri and Alexa. Finally, multi-layer perceptrons (MLPs) produce accurate rankings and recommendations, powering search and content feeds like Instagram, Netflix, and YouTube. Today, these three categories of deep learning algorithms power almost all of Google’s and Facebook’s AI workloads, as depicted below.
While each of these three categories still is in its early days, a fourth has emerged with the potential to expand AI’s reach to new markets. Pioneered by Google, the Transformer is a new architecture that enables computers to understand language with unprecedented accuracy, as shown below on the right. Unlike prior language models that processed words sequentially, Transformers can discern connections between and among words in a sentence. In the sentence “I arrived at the bank after crossing the river,” for example, Transformers establish the relationship between “river” and “bank,” interpreting “bank” to mean a riverbank instead of a financial institution.
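The “connections between and among words” are computed by an attention mechanism: every word scores its relevance to every other word in the sentence, and those scores weight how information is mixed. The sketch below shows the scaled dot-product attention at the core of the Transformer; the toy embeddings, dimensions, and variable names are illustrative assumptions, not values from a trained model.

```python
# Minimal sketch of scaled dot-product attention (the core Transformer
# operation). Toy dimensions and random embeddings are assumptions for
# illustration only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; the weights then mix the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# A toy "sentence" of 4 tokens, each an 8-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# In self-attention, Q, K, and V are learned linear projections of X;
# we use X directly here for brevity.
output, weights = scaled_dot_product_attention(X, X, X)

print(weights.shape)  # one attention distribution per token: (4, 4)
```

Because every token attends to every other token in a single step, the model can link “bank” to “river” regardless of how far apart they sit, which sequential RNNs struggle to do.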
Transformers have catalyzed natural language understanding, collapsing the time to market. While AI researchers needed five years to match human performance in the ImageNet Large Scale Visual Recognition Challenge, they took little more than a year to develop deep learning models that achieved human performance in the General Language Understanding Evaluation (GLUE) benchmark, as shown above. Just as every ImageNet winner since 2012 has used a common architecture, the convolutional neural network, the leading GLUE submissions standardized on Transformers. They are the building blocks of Google’s BERT and OpenAI’s GPT-2, two breakthrough language models that achieved human-like performance in long-form text understanding and writing, respectively.
When ARK published its Deep Learning: An Artificial Intelligence Revolution white paper in early 2017, complex language understanding was an open problem. At that time, we speculated that future neural networks would expand AI’s capability to language understanding, perhaps by incorporating memory. Instead, Transformers met the goal. When stacked together, Transformers can “remember” text, allowing models like Google’s Meena chatbot to participate in coherent, multi-turn conversations.
From an investor perspective, we believe Transformers will expand the AI addressable market meaningfully. To date, three neural network types (CNNs, RNNs, and MLPs) have evolved in the AI market, serving three broad use cases. Now the fourth, Transformers, could create and expedite language applications heretofore not possible. As was the case with image, voice, and recommendation systems, we believe new services and companies will evolve based on natural language understanding. Among them will be information retrieval, call center automation, document summarization, and document generation.
According to our research, deep learning had added roughly $1 trillion in equity market capitalization to companies like Alphabet, Amazon, Nvidia, and TSMC as of year-end 2019, and perhaps another $250–$500 billion in 2020. Now Transformers, the “fourth pillar,” could create substantial greenfield opportunities, increasing our confidence in the long-term projection we made last year that AI would contribute roughly $30 trillion to global equity market capitalization over the next 20 years.
Our analysis suggests that complex language models will have a near-term impact on the AI chip market. At roughly $4 billion today, the AI chip market has served the training and operation of the three established neural network architectures. Transformer-based language models, a new kind of workload, are roughly 10x more computationally demanding than image models, as shown above. Even as AI compute costs fall, products and services based on language understanding could generate ~$1 billion in incremental AI hardware spending per year.