ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation

aBeihang University, 100191, Beijing, P.R.China.
bShandong Hi-speed Group Co., Ltd, 250098, Jinan, P.R.China.
cInstitute of Automation, Chinese Academy of Sciences, Beijing, P.R.China.
First author: by2313310@buaa.edu.cn

*Corresponding author: zhiyongc@buaa.edu.cn

Abstract

Current expressway operations rely on rule-based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general-purpose LLMs cannot effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway domain. This paper therefore constructs a pre-trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. To overcome data scarcity, we construct the industry's first full-stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events. We propose a two-stage LLM pre-training paradigm that combines self-supervised training and unsupervised learning, and introduce a Graph-Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning over expressway incident response strategies, we develop an RL-aligned Chain-of-Thought (RL-CoT) mechanism that enforces consistency between model reasoning and expert problem-solving heuristics for incident handling. Finally, ExpressMind integrates a cross-modal encoder that aligns dynamic feature sequences across the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multimodal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis.

Main Contributions

Our main contributions are as follows:
1. Full-stack expressway dataset: This study constructs the industry's first full-stack expressway dataset spanning text cognition, logical reasoning, and visual perception, including three specialized subsets: traffic knowledge texts, emergency response reasoning, and event video scene understanding.
2. RL-aligned CoT reasoning: We design an RL-based expressway strategy-alignment scheme for LLM training, which significantly enhances the model's logical reasoning and self-correction capabilities.
3. Graph-augmented retrieval: A graph-RAG-based dynamic knowledge base is established for retrieving and indexing critical expressway information.
4. Multimodal alignment mechanism: A Visual-Prior Alignment (VPA) mechanism is designed that aligns and reweights visual tokens to enhance the understanding of visual features.
5. Multimodal benchmark: We release a multimodal benchmark for evaluating LLMs in the expressway domain, encompassing four evaluation subsets: basic knowledge comprehension, video incident detection, safety response generation, and traffic analysis reporting.
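The visual-token reweighting idea behind contribution 4 can be illustrated with a toy sketch. This is a hypothetical illustration only, assuming similarity-based softmax reweighting of visual tokens against a text query embedding; the shapes, the `reweight_visual_tokens` function, and the weighting scheme are illustrative assumptions, not the paper's VPA implementation.

```python
# Toy sketch of similarity-based visual-token reweighting (hypothetical;
# not the paper's actual VPA module).
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reweight_visual_tokens(visual_tokens, text_embed):
    """Softmax over text-visual similarity, then scale each visual token
    by its weight so text-relevant tokens dominate."""
    sims = [cosine(tok, text_embed) for tok in visual_tokens]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [[(e / z) * x for x in tok] for e, tok in zip(exps, visual_tokens)]

# Two visual tokens; the query embedding matches the first one.
tokens = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
out = reweight_visual_tokens(tokens, query)
```

Under this sketch, the token aligned with the query keeps a larger share of the softmax mass than the orthogonal one.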

Framework Overview of the ExpressMind

Framework

This paper introduces ExpressMind, a domain-specific multimodal LLM for expressway operation. We construct the first full-stack expressway dataset and propose a two-stage pre-training paradigm for internalizing expressway-domain knowledge. This study also develops a Reinforcement Learning (RL)-based Chain-of-Thought (CoT) alignment mechanism to strengthen domain reasoning. Furthermore, a visual-enhanced cross-modal encoder is incorporated, and graph-based retrieval-augmented generation (RAG) is proposed to enhance the extraction of key traffic-scene characteristics and dynamic knowledge.
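The graph-based retrieval step can be sketched as a breadth-first expansion over a knowledge graph: entities matched from the query are expanded a few hops, and the traversed triples augment the LLM prompt. The graph contents and the `graph_retrieve` helper below are toy assumptions for illustration; the paper's actual graph schema and retriever are not specified here.

```python
# Minimal sketch of graph-augmented retrieval over a toy expressway
# knowledge graph (illustrative data, not the paper's knowledge base).
from collections import deque

# Entity -> list of (relation, neighbor) edges.
GRAPH = {
    "fog": [("triggers", "low_visibility"), ("requires", "speed_limit_40")],
    "low_visibility": [("requires", "fog_lights_on")],
    "speed_limit_40": [("applies_to", "all_lanes")],
}

def graph_retrieve(seed_entities, hops=2):
    """Breadth-first expansion up to `hops` hops from the seed entities,
    returning the traversed triples used to augment the prompt."""
    triples = []
    frontier = deque((e, 0) for e in seed_entities)
    seen = set(seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= hops:
            continue
        for rel, nbr in GRAPH.get(entity, []):
            triples.append((entity, rel, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return triples

# Example: expand the query entity "fog" into prompt context.
context = graph_retrieve(["fog"])
```

The retrieved triples would then be serialized into the prompt so the model grounds its response in the indexed regulations rather than parametric memory alone.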

Framework

This study introduces ExpressMind, the first domain-specific MLLM designed for expressway scenarios. It is built through multiple technical innovations: a two-stage pretraining paradigm for domain knowledge internalization, a Group Relative Policy Optimization (GRPO)-enhanced RL framework for safety-critical reasoning alignment, a graph-augmented RAG mechanism for real-time spatiotemporal knowledge retrieval, and a Visual-Prior Alignment (VPA) multimodal module for deep video understanding. To support this work, we have open-sourced the first training dataset covering domain knowledge, incident CoT strategy reasoning, and multimodal incident detection VQA. ExpressMind has been applied by top-tier expressway groups, serving as a representative application case of large models in the expressway domain.
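The group-relative scoring at the heart of GRPO can be sketched in a few lines: for each prompt, several responses are sampled and each response's advantage is its reward standardized against the group's mean and standard deviation, so no learned value model is needed. The reward values below are illustrative; the paper's reward design for safety-critical reasoning is not specified here.

```python
# Sketch of GRPO's group-relative advantage: A_i = (r_i - mean) / (std + eps).
# Rewards are illustrative placeholders for scored CoT responses.
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Standardize each sampled response's reward against its group,
    so above-average responses get positive advantage and vice versa."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled incident-handling responses to one prompt, scored in [0, 1].
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the advantages are centered within each group, the policy update pushes probability mass toward the better-scored chains without requiring a separate critic.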

Video

The ExpressMind-VL intelligent operation system has been deployed in practical multi-task scenarios on expressways. As shown in the figure/video, the visualization system includes functions such as releasing warning information, summarizing traffic conditions, describing video events, and generating handling recommendations. We have deployed the system on expressways in Shandong and Zhejiang provinces. In the intelligent management of Shandong expressways, ExpressMind-VL classifies the types and severity levels of incidents in traffic surveillance videos and generates analytical reports along with handling strategies for traffic incidents. For the intelligent management of Guangdong expressways, ExpressMind-VL detects traffic events from real-time video streams and produces structured textual descriptions for comprehension.

BibTeX