① No pre-existing map is needed: by combining real-time visual images with lidar information, the robot can perceive its environment in real time; ② NaVILA further extends navigation technology from wheeled to legged robots, with researchers conducting tests on the Unitree Go2 robot dog and G1 humanoid robot; ③ The NVILA model has powerful multimodal reasoning capabilities.
The Star Daily reported on December 11 (Editor: Song Ziqiao) that researchers from the University of California and NVIDIA (NVDA.US) recently jointly released a new visual language model called "NaVILA." Its highlight is that NaVILA offers a new approach to robot navigation.


A visual language model (VLM) is a multimodal generative AI model that can reason over text, image, and video prompts. It gives a large language model (LLM) the ability to "see" by pairing it with a visual encoder.
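To make this architecture concrete, below is a minimal, self-contained PyTorch sketch of how a VLM wires a visual encoder to a language model: image patches are encoded, projected into the LLM's embedding space, and prepended to the text tokens. All module names and sizes here are illustrative assumptions, not NVILA's actual implementation.

```python
# Minimal sketch (not NVILA's code) of how a VLM lets an LLM "see":
# visual encoder -> projection into the LLM embedding space -> one fused token sequence.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    """Stand-in for a ViT-style encoder: image -> sequence of patch embeddings."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, image):                      # image: (B, 3, H, W)
        x = self.to_patches(image)                 # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)        # (B, num_patches, dim)

class TinyVLM(nn.Module):
    """Visual encoder + projector + (placeholder) language model."""
    def __init__(self, vocab=32000, lm_dim=512, vis_dim=256):
        super().__init__()
        self.vision = TinyVisionEncoder(dim=vis_dim)
        self.project = nn.Linear(vis_dim, lm_dim)   # align vision tokens with the LLM
        self.embed = nn.Embedding(vocab, lm_dim)
        self.lm = nn.TransformerEncoder(             # placeholder for a real decoder-only LLM
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(lm_dim, vocab)

    def forward(self, image, prompt_ids):
        vis_tokens = self.project(self.vision(image))        # "see"
        txt_tokens = self.embed(prompt_ids)                   # "read"
        fused = torch.cat([vis_tokens, txt_tokens], dim=1)    # one multimodal sequence
        return self.head(self.lm(fused))                      # next-token logits

# Usage: one 224x224 image plus a short text prompt.
model = TinyVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)   # (1, num_visual_tokens + 16, vocab)
```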
Traditional robot navigation often relies on pre-built maps and complex sensor systems. The NaVILA model, by contrast, needs no pre-existing map: the robot only has to "understand" a natural-language instruction from a human and combine it with real-time visual images and lidar data to perceive paths, obstacles, and moving targets in its environment, then navigate autonomously to the designated location.
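The following is only a hypothetical sketch of the kind of map-free, instruction-driven control loop described above, not code from the NaVILA paper; every interface name here (robot.camera, robot.lidar, policy.act) is an assumption made for illustration.

```python
# Hypothetical perception-to-action loop: language + live camera + lidar feed a policy
# that emits motion commands until the instruction is satisfied. No stored map is used.
from dataclasses import dataclass

@dataclass
class MotionCommand:
    forward_m: float      # metres to move forward
    turn_deg: float       # degrees to turn
    done: bool            # policy believes the instruction has been completed

def navigate(instruction: str, robot, policy, max_steps: int = 200) -> bool:
    """Follow a natural-language instruction without a pre-built map."""
    for _ in range(max_steps):
        frame = robot.camera.read()          # real-time RGB image (assumed interface)
        scan = robot.lidar.read()            # real-time lidar scan (assumed interface)
        # The policy (e.g. a VLM fine-tuned for navigation) reasons jointly over the
        # instruction and the current observations.
        cmd: MotionCommand = policy.act(instruction, frame, scan)
        if cmd.done:
            return True                       # reached the described location
        robot.move(forward_m=cmd.forward_m, turn_deg=cmd.turn_deg)
    return False                              # gave up after max_steps

# Usage (with a real robot and a trained policy):
# navigate("walk to the door next to the kitchen", robot, policy)
```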
Beyond eliminating the dependence on maps, NaVILA also extends navigation technology from wheeled to legged robots, aiming to let robots handle more complex scenarios, overcome obstacles, and plan paths adaptively.
In the paper, the University of California researchers ran real-world tests on Unitree's Go2 robot dog and G1 humanoid robot. According to the team's statistics, NaVILA's navigation success rate in real environments such as homes, outdoor spaces, and workplaces reached 88%, and its success rate on complex tasks was 75%.


According to the introduction, the characteristics of the NVILA model are:
Optimized accuracy and efficiency: NVILA cuts training costs by a factor of 4.5 and the memory needed for fine-tuning by a factor of 3.4, while nearly halving pre-filling and decoding latency (figures obtained by comparison with another large vision model, LLaVA-OneVision).
High-resolution input: NVILA does not shrink photos and videos to optimize the input; instead, it uses multiple frames from high-resolution images and videos so that no details are lost.
Compression technology: NVIDIA notes that training visual language models is extremely expensive and that fine-tuning them is memory-intensive, with a 7B-parameter model requiring over 64 GB of GPU memory. NVIDIA therefore uses a technique it describes as "expand first, then compress": the input is reduced by compressing visual information into fewer tokens and grouping pixels so that important information is retained, balancing accuracy and efficiency (a rough sketch of this kind of token grouping appears after this list).
Multimodal reasoning ability: The NVILA model can answer multiple queries based on a single image or video segment, exhibiting strong multimodal reasoning capabilities.
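As referenced in the compression item above, here is a rough sketch of the general idea of grouping neighbouring visual tokens so that the language model has far fewer tokens to process. It illustrates spatial pooling in general; it is not NVIDIA's published "expand first, then compress" implementation, and the grid sizes are assumptions.

```python
# Reduce the number of visual tokens by averaging each small neighbourhood of patch
# embeddings into a single token, keeping coarse spatial information while shrinking
# the sequence the LLM must attend over.
import torch

def compress_visual_tokens(tokens: torch.Tensor, grid: int, group: int = 2) -> torch.Tensor:
    """tokens: (B, grid*grid, dim) patch embeddings laid out on a grid x grid map.
    Merges each `group x group` neighbourhood into one token by averaging."""
    b, n, d = tokens.shape
    assert n == grid * grid and grid % group == 0
    x = tokens.view(b, grid, grid, d)
    x = x.view(b, grid // group, group, grid // group, group, d)
    x = x.mean(dim=(2, 4))                        # average each spatial group
    return x.reshape(b, (grid // group) ** 2, d)  # 4x fewer tokens for group=2

# Example: a 32x32 grid of tokens (1024) is compressed to 16x16 (256).
tokens = torch.randn(1, 32 * 32, 512)
print(compress_visual_tokens(tokens, grid=32).shape)   # torch.Size([1, 256, 512])
```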
In video benchmark tests, NVILA outperformed GPT-4o Mini and also compared favorably with GPT-4o, Sonnet 3.5, and Gemini 1.5 Pro, while achieving a narrow win over Llama 3.2.

NVIDIA stated that the model has not yet been released on the Hugging Face platform, but that it is committed to releasing the code and model soon to promote reproducibility.
Editor/ping