Opinion
The game changer for AI research is inspired by our brain
“Perceiving the world one mode at a time greatly limits AI’s ability to navigate or understand it. Thus, the future of AI is reliant upon three major components - multimodal models, sparsity, and task-agnostic architecture,” writes Elan Sasson, CEO of Data Science Group
Language is at the heart of how we communicate with each other, but understanding the world and our surroundings also requires information in non-linguistic formats like images, videos, and audio. There is only so much you can know about the world by looking at text alone. Representations of multisensory information are a key feature of our brains, since we live in a multimodal world: humans naturally combine various sensory inputs to form a more complete understanding of their environment.
Perceiving the world one mode at a time greatly limits AI’s ability to navigate or understand it. Thus, the future of AI is reliant upon three major components - multimodal models, sparsity, and task-agnostic architecture.
The question of the lion in the African savanna
Much of the research in the AI and machine learning community has focused on developing a machine with the ability to understand or learn any intellectual task a human being can perform, whether implicitly or explicitly: an artificial general intelligence (AGI). Whether we are actually headed toward such an agent is hotly debated; the jury is still out.
Recent years have seen impressive results from large deep neural networks (DNNs) trained to perform language generation and comprehension across a variety of tasks. For example, GPT-3 first showed that a large language model (LLM) with 175 billion parameters, pre-trained on a large web corpus of text, can be used for few-shot learning, in which the model picks up a new task from just a handful of examples supplied in its prompt, with no additional training. Among GPT-3's capabilities are reading comprehension, translation, question answering, generating realistic human text and even software code, and moderate performance on several tasks that require on-the-fly reasoning.
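To make the few-shot idea concrete, here is a minimal Python sketch of how the "training" examples live inside the prompt rather than in the model's weights. The translation task and the examples are illustrative placeholders, and the final string would be sent to whatever LLM completion API you use:

```python
# Few-shot prompting: pack a handful of worked examples into the prompt.
# No gradient updates happen; the model infers the task from the pattern.
examples = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: house", "maison"),
]
query = "Translate to French: bread"

prompt = "\n".join(f"{q}\n{a}" for q, a in examples) + f"\n{query}\n"
print(prompt)  # this string is what gets sent to the LLM
```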
However, reliable multimodal models are significantly harder to build than good language-only or vision-only models. Multimodal models that encompass visual, auditory, and language understanding simultaneously are able to identify complex concepts and even draw connections between them.
So whether a model is presented with the word "lion", the sound of a lion's roar, or a video of a lion running across the African savanna, the same internal representation is activated: the general concept of a lion. Such a shared multimodal latent representation lets a single model handle a variety of tasks regardless of the type of input data.
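A minimal sketch of what this looks like in practice, assuming CLIP-style encoders trained contrastively so that matching concepts from different modalities land close together in one shared vector space (the vectors below are hand-written stand-ins for real encoder outputs):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical pre-trained encoders would map each modality into the
# same d-dimensional latent space; these vectors fake their outputs.
text_emb  = normalize(np.array([0.90, 0.10, 0.20]))  # the word "lion"
audio_emb = normalize(np.array([0.85, 0.15, 0.25]))  # a lion's roar
video_emb = normalize(np.array([0.88, 0.12, 0.18]))  # a savanna video clip

# All three inputs land near the same point in the latent space, so the
# pairwise cosine similarities are close to 1.0: one shared "lion" concept.
print(text_emb @ audio_emb)
print(text_emb @ video_emb)
```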
Multimodal DNN architectures have recently shown promising results. For example, the DALL-E 2 model is designed to take text as input and produce images as output.
Sparsity, another property of the brain, can serve as a source of inspiration for future DNN architectures. Sparse activation leverages conditional computation: different parts of the model are used to process different types of inputs. In dense models, by contrast, all parameters are used to process any given input, and the whole DNN is activated to accomplish a specific task.
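The standard way this is implemented is a mixture-of-experts-style layer, where a learned router sends each input to only a few "expert" sub-networks. Here is a toy numpy sketch; all sizes and weights are illustrative, and in a real model the experts and router are trained end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 1

# Several small "expert" sub-networks plus a router (random weights here).
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def sparse_layer(x):
    scores = x @ router
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Activate only the top-k experts; the other experts' parameters are
    # never touched for this input, unlike a dense layer.
    chosen = np.argsort(probs)[-top_k:]
    return sum(probs[i] * (x @ experts[i]) for i in chosen)

y = sparse_layer(rng.normal(size=d))
print(y.shape)  # same output shape as a dense layer, a fraction of the compute
```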
Can AI multitask?
Another important aspect of future AI architecture is building a generalist model that performs well on many different tasks. Today's AI models are typically trained to do only one thing. Extending existing models to learn new tasks requires an architecture that can handle many separate tasks while combining its existing skills to learn new tasks faster and more effectively. Imagine if each time humans learned a new skill, like riding a bike, they forgot everything they had learned previously, such as how to drive or how to read, and had to start from scratch. That is how most AI models are trained today: thousands of models are developed for thousands of tasks, each requiring large amounts of task-specific data. This is completely unlike how people approach and learn new tasks. In the push to endow models with more human-level intelligence, AI researchers are increasingly interested in developing architectures that can handle a variety of tasks without becoming too specialized.
Google and its quest for Artificial General Intelligence
Last year the Google Research team announced Pathways, a vision for a single model that could generalize across domains and tasks while being highly efficient. Google's Pathways Language Model (PaLM) is an AI architecture that addresses many of the characteristics mentioned above, containing next-generation elements that will arguably help realize a generalist, task-agnostic AI. PaLM has 540 billion parameters and was trained with the Pathways system, which orchestrates distributed computation across thousands of ultra-fast hardware accelerators.
The current version of PaLM was trained on a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and computer code. It demonstrated remarkable performance on numerous challenging tasks: multilingual NLP benchmarks, natural language understanding and generation, breakthrough results on reasoning tasks that require multi-step arithmetic or common-sense reasoning (for example, via chain-of-thought prompting and explaining jokes), explicit explanations for scenarios that require a complex combination of multi-step logical inference, code generation, and writing.
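Chain-of-thought prompting is simple enough to sketch. Instead of giving the model bare question-answer pairs, the exemplar spells out its intermediate reasoning, which nudges the model to reason step by step on the new question. The exemplar below is adapted from the published chain-of-thought work; the lion question is an illustrative placeholder:

```python
# Chain-of-thought prompting: the exemplar answer shows its working,
# so the model is expected to emit its own reasoning steps in reply.
exemplar = (
    "Q: Roger has 5 balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: He bought 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.\n"
)
question = (
    "Q: A pride has 4 lions. 3 cubs are born and 1 lion leaves. "
    "How many lions are in the pride now?\nA:"
)

prompt = exemplar + question
print(prompt)  # sent to the LLM, which should reason its way to the answer
```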
However, Pathways has yet to reach its ultimate vision: to "enable a single AI system to generalize across thousands or millions of tasks, to understand different types of data, and to do so with remarkable efficiency." We still have a long way to go in the quest for AGI, and that raises a fundamental question about the Scaling Hypothesis, which claims that if we find a scalable architecture, like the brain's, we will achieve AGI. Will human-level intelligence emerge simply from an increase in model size? From an increase in the training data? Or will it require building an array of supercomputers?
In light of those questions, one could argue that we are moving the needle closer to AGI. I personally believe, however, that we are conceptually trapped in the DNN parameter-scaling space and have not yet reached the mechanics of the human mind (if we can ever define them). The LLM "digital parrots" mimic statistical patterns of how language has been used in their immense training datasets, without any conceptual basis for understanding what the word "lion" actually means. In the words attributed to Thucydides, the Greek historian of 2,500 years ago: "Knowledge without understanding is useless."
For a machine to understand, it should be able to say: "Sorry. I don't know." We are not there…yet.
Dr. Elan Sasson is the CEO of Data Science Group and a lecturer in graduate AI courses at Tel Aviv University.