
[Multimodal Explained] Multimodal Large Models: The Core Driving Force of the AGI Era

|What is a modality?

As more and more large models are deployed in vehicles, innovative automotive intelligent products and services keep emerging. When you discover a useful or fun feature and want to share it with friends, you can describe it in text, send them a voice message, or take photos to share. Recording a video of a full feature demo is, of course, also routine.

Whether text, voice, images or video, these are all ways for us to express, transmit and perceive information. Such a source or form of information is called a "modality". Interestingly, a single piece of information can be expressed in many forms or media; in other words, the same information can exist in multiple modalities.

Humans receive and understand external information through senses such as vision, hearing, touch, and smell, and can express, transmit and exchange information through the aforementioned modalities of text, voice, images, and video. However, modalities are not limited to these media. In a broader sense, two different languages, such as Chinese and English, can also be regarded as two different modalities.

Everything in the world has different forms of expression, and these varied modalities make our environment rich and colorful. Humans could not perceive, understand, or even transform the environment without the interplay of information across multiple modalities.

|Why do large models need multimodality?

For most of the history of artificial intelligence, models learned, trained, and reasoned on single-modal data.

Take text as an example. Writing has developed over thousands of years and seems capable of expressing almost anything precisely, which makes it tempting to believe that intelligence can emerge from words alone. One common capability of large models is to fully understand a text, extract its key information, and generate a summary faithful to the original intent: the input is text, and the output is also text. Models trained purely on text can indeed shine in such specific domains and improve productivity. With a separate text-to-speech model, the final product can even turn the model's text output into speech, but this modality conversion may require chaining multiple models or tools together.

Similarly, models trained on a single modality such as images, speech, or video can also perform well in specific professional fields and have produced many achievements, including AlphaGo in Go and AlphaFold in protein structure prediction. Yet these single-modal models still have limitations.

First, the information carried by single-modal data is often not comprehensive or complete enough, and it struggles to reflect the complexity and diversity of the real world. For example, a passage of text may fail to describe every detail of a scene, and a single image alone cannot convey what an object does or how it works.

Second, single-modal models are often isolated and closed, unable to interact and integrate effectively with data and models of other modalities. For example, a model that relies solely on text cannot generate an appropriate description of an image's content.

By comparison, humans understand the world by forming perceptions through the interaction of information from multiple modalities; the brain then integrates and interprets them, transforming these perceptions into higher mental functions such as knowledge, reasoning, emotion, and creativity.

For example, to understand a car we do not rely on a simple text description alone. We also look at exterior and interior photos and videos, and may even take a test drive in person, touching and driving the vehicle, gathering information from multiple modalities to form our understanding of the car.

More importantly, artificial intelligence needs to interact with people, and the modalities through which people transmit information are richer and more diverse than those of most things. The psychologist Albert Mehrabian proposed a formula: a person's message = 7% words + 38% tone of voice + 55% facial expression. In plain terms, beyond the language itself we must also pay attention to pronunciation and intonation, and to both parties' expressions, postures, gestures and other body language, before we can fully understand what others mean to express.

For artificial intelligence to better understand the world and understand people, and to evolve toward artificial general intelligence (AGI) that truly helps humans in work and life, AI must be able to perceive, understand and interact across multiple modalities. Multimodal capability is the core driving force of AI's evolution toward AGI.

|The multimodal large model: the key that opens the door to AGI

Wang Xiaogang, co-founder and chief scientist of SenseTime and president of its Jueying Intelligent Automobile Business Group, said at this year's WAIC conference that SenseTime's native multimodal large model allows everything in the world to be perceived, understood and interacted with.

The multimodal model Wang Xiaogang refers to is one that can process and integrate data and information from multiple modalities. This improves the model's overall perception and comprehension, bringing it closer to the real world and enabling it to handle more complex tasks and scenarios.

Previously, "multimodal" mostly referred to support for the 3V modalities: Verbal (text), Vocal (speech), and Visual (vision). Many classic AI tasks are based on converting between these three modalities: the "image captioning task" of generating text from an image, much like a primary-school picture-writing exercise, and the "image generation task" of producing images from a text description. WeChat's voice-to-text feature can also be regarded as a simple multimodal task. Such multimodal models are also called "cross-modal models".

Wang Xiaogang noted that, in the past, many models processed different modalities by first converting voice input into text, analyzing the text together with images, and producing text as output, then regenerating speech from that text. This cascade leads to a large amount of information loss and high latency.
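The information loss in that cascade can be sketched with a toy pipeline. The stub functions below (`asr`, `llm`, `tts`) are hypothetical placeholders, not any real system: the point is simply that once speech is flattened to text, non-verbal cues such as tone never reach the model, and the synthesized reply comes back with a default voice.

```python
def asr(audio: dict) -> str:
    """Speech-to-text stub: keeps only the words, discarding tone and pauses."""
    return audio["words"]

def llm(text: str) -> str:
    """Text-only model stub: can reason only about what survived ASR."""
    return "Reply to: " + text

def tts(text: str) -> dict:
    """Text-to-speech stub: synthesizes audio with a default, flat tone."""
    return {"words": text, "tone": "neutral"}

user_audio = {"words": "open the sunroof", "tone": "urgent"}
reply_audio = tts(llm(asr(user_audio)))

# The urgency in the user's voice never reached the model, and three
# sequential conversions each add latency on top of the last.
assert reply_audio["tone"] == "neutral"
```

Each extra hop in such a chain also adds its own processing delay, which is why Wang Xiaogang points to latency as well as information loss.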

It was not until OpenAI's GPT-4V that "modality alignment" for vision-language tasks was realized on top of a large language model, allowing different modalities to be smoothly connected, bridged and converted. It is like a musical: the music, choreography and set design must cooperate and coordinate to achieve the best stage effect. SenseTime's "SenseNova 5.0" multimodal model adopts a technical approach similar to GPT-4V; some call this the "joint-modal model".

The new SenseTime multimodal large model released at this year's WAIC conference in July, underpinned by the "SenseNova 5.5" system, takes a more advanced technical approach: text, voice, video and other modalities are fed in together, processed uniformly by the multimodal large model, and the output is produced in the corresponding modality. Compared with past approaches, the technical difficulty of this multimodal fusion increases geometrically.

This approach, like GPT-4o, is "natively fused" multimodality: it uses naturally occurring, interleaved multimodal data to build a natively multimodal base of interleaved text and images, so that data from different modalities are mapped into the same representation space, enabling multimodal information sharing and multi-source knowledge collaboration. Such natively fused multimodal models are also called "hybrid-modal models".
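The idea of mapping different modalities into "the same representation space" can be illustrated with a deliberately toy sketch. Nothing here reflects SenseTime's or OpenAI's actual architecture; the hypothetical encoders below only show the structural point that each modality gets its own encoder, but all encoders emit vectors of the same dimension, so items from different modalities become directly comparable.

```python
import math

DIM = 4  # shared embedding dimension for all modalities

def normalize(v):
    """Scale a vector to unit length so cosine similarity is a dot product."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def encode_text(text: str) -> list:
    """Hypothetical text encoder: fold character codes into a DIM-vector."""
    v = [0.0] * DIM
    for i, ch in enumerate(text):
        v[i % DIM] += ord(ch)
    return normalize(v)

def encode_image(pixels: list) -> list:
    """Hypothetical image encoder: fold pixel intensities into a DIM-vector."""
    v = [0.0] * DIM
    for i, p in enumerate(pixels):
        v[i % DIM] += p
    return normalize(v)

def cosine(a, b):
    """Similarity between two unit vectors in the shared space."""
    return sum(x * y for x, y in zip(a, b))

t = encode_text("a red car")
im = encode_image([200, 30, 30, 180, 40, 25])

# Both embeddings live in the same DIM-dimensional space, so cross-modal
# similarity is well-defined (real training would make it meaningful).
assert len(t) == len(im) == DIM
assert -1.0 <= cosine(t, im) <= 1.0
```

In real systems such as CLIP-style models, the encoders are neural networks trained so that matching text-image pairs land close together in this shared space; the toy hash-style encoders above only demonstrate the shared geometry, not the learned alignment.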

Earlier, the release of GPT-4o showed consumers what "natively fused" multimodal real-time interaction looks like, letting more people appreciate the appeal of multimodal perception and interaction and beginning to open up the imagination for commercial deployment of multimodal large models.

Compared with mobile phones, smart cars are better suited to carrying multimodal large models. The various cameras inside and outside a smart car are always on, so users can interact with the car in real time through multiple modalities. Meanwhile, the growing number of smart cars generates rich end-user feedback and data, allowing the model to keep iterating and improving.

Together, these factors point to an exciting direction for smart cars: from smart car to super-agent, with multimodal large models as the core driving force of that evolution.

Unlike companies such as OpenAI, SenseTime is a core supplier to the smart-car industry, with rich mass-production experience in intelligent driving and intelligent cockpits. It will accelerate "people-oriented" smart-car interaction innovation with multimodal models at its core.

In the future, with the launch of SenseTime's "natively fused" multimodal model, every detail inside and outside the car will be attended to and everyone's needs will be responded to. Information about "people" will no longer be ignored; the model will even break through the limits of space, connecting users in the cabin with the broader physical and digital world and driving the evolution of smart cars into super-agents.