Google DeepMind recently announced its new AI model Gemini to compete with OpenAI's ChatGPT.
While both models are examples of “generative AI” — systems that learn to find patterns in their training data and use them to generate new data — ChatGPT is a large language model (LLM) that learns to generate text.
Just as ChatGPT is a web app for dialogue based on a neural network known as GPT, Google has a conversational web app called Bard, which is based on LaMDA.
But Google is now upgrading Bard on the basis of Gemini. What differentiates Gemini from earlier generative AI models such as LaMDA is that it is a multimodal model.
This means it works seamlessly across multiple modes of input and output: as well as supporting text input and output, it also supports images, audio and video.
Accordingly, a new acronym is emerging: LMM (large multimodal model), not to be confused with LLM. In September, OpenAI announced GPT-4 Vision, a model that can also work with images, audio and text.
In contrast, Google has designed Gemini to be “natively multimodal”. This means that the core model directly handles a range of input types (audio, images, video and text) and can also output them directly.
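To make the idea of native multimodality concrete, a prompt to such a model can be pictured as an ordered list of typed parts rather than a single string of text. The sketch below is a hypothetical illustration only — the `Part` class and `describe_prompt` helper are invented for this example and do not correspond to any real Gemini or OpenAI API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Part:
    """One piece of a multimodal prompt: its modality plus raw payload."""
    modality: str   # "text", "image", "audio" or "video"
    payload: bytes  # encoded content; text is UTF-8 encoded here

def describe_prompt(parts: List[Part]) -> str:
    """Summarise which modalities a mixed prompt contains, in order."""
    return " + ".join(p.modality for p in parts)

# A single request mixing text and an image, as a natively multimodal
# model would accept it in one prompt (placeholder bytes, not a real image):
prompt = [
    Part("text", "What is shown in this picture?".encode("utf-8")),
    Part("image", b"\x89PNG..."),
]
print(describe_prompt(prompt))  # -> text + image
```

The point of the sketch is simply that a natively multimodal model handles such mixed-type prompts in its core, rather than routing each modality to a separate specialised model.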
The difference between these two approaches may seem academic, but it is important.
However, Google's claims about Gemini are difficult to assess, for two reasons. The first is that Google has not yet released Gemini Ultra, the model's most capable version, so its results cannot currently be independently validated.
The second reason is that Google released a somewhat misleading demonstration video, which appears to show the Gemini model commentating interactively, in real time, on a live video stream.
Despite these issues, Gemini and large multimodal models more broadly are an extremely exciting step forward for generative AI, both for the capabilities they promise and for the competition they bring to the landscape of AI tools.