MoTIF: An end-to-end Multimodal Road Traffic Scene Understanding Foundation Model

Video-based road intelligent detection constitutes a critical component in modern intelligent transportation systems, serving as a crucial enabler for comprehensive transportation planning and emergency traffic management. Current traffic scene perception methodologies relying on conventional deep learning architectures present inherent limitations, including heavy dependence on extensive manual annotations of traffic elements and predefined rule configurations. These approaches demonstrate constrained semantic representation capacity and limited generalizability across heterogeneous traffic scenarios. To address these challenges, this paper proposes a novel end-to-end Multimodal Foundation Model (MFM) architecture that jointly generates dynamic traffic event detection outcomes and semantic-rich contextual descriptions. Through integration of Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning strategy, we develop the Multimodal Road Traffic Scene Understanding Foundation Model (MoTIF), which establishes cross-modal alignment between visual patterns and textual semantics. This framework demonstrates enhanced capability in extracting salient traffic targets and generating hierarchical scene representations, significantly improving automated detection efficiency in road video analytics. Notably, MoTIF exhibits contextual reasoning capabilities for implicit traffic event interpretation. Extensive evaluations on two real-world datasets encompassing urban road intersection scenarios in Tianjin and highway monitoring systems in Shandong Province reveal that MoTIF achieves superior performance metrics: 65.81 average score on multimodal scene understanding assessment and 84.17\% event detection accuracy, outperforming mainstream benchmarks in both precision and computational efficiency. This research advances multimodal learning paradigms for intelligent transportation systems while providing practical insights for adaptive traffic management applications. The dataset concluding Tianjin road intersection surveillance video and corresponding data annotation is published on https://github.com/wanderhee/MoTIF-Datasets.

Our contributions can be mainly divided into the following parts:
1. Model: The research in this paper focuses on improving the recognition ability of MFMs in the complex environment of traffic roads. The model proposed in this paper fine-tunes the various types of events in road traffic, which ensures the model's ability to accurately recognize and reason in road traffic scenarios.
2. Task: MoTIF outputs structured descriptive text, which enables the detection and analysis of various types of traffic events. The specific inference ability of the multi-modal large model makes it possible to further deduce the traffic state.
3. Dataset: This paper constructs a multimodal dataset for video understanding of road traffic. We propose a set of automated video annotation methods for traffic target object detection and semantic segmentation fusion for highways. We use Q-A question and answer pairs to annotate roadside surveillance videos of highways.
4. Benchmark: This work develops a multimodal video analysis benchmark tailored to diverse tasks. For intersection scene understanding, it employs NLG metrics including BLEU-4, ROUGE-L, CIDEr, and semantic relevance metrics such as BERTScore, alongside structured human evaluation. For traffic event detection, classification metrics — Accuracy, Precision, and F1-Score — are utilized.

This paper proposes a method to perceive the state of the entire road network for MFM-based traffic scenarios. First, a structured semantic annotation approach is employed for video-image labeling, coupled with traffic video datasets to construct a multi-modal road traffic dataset. The videos and corresponding annotations are fed into the model. A video Q-Former extracts spatiotemporal dynamic features from the videos and aligns visual features with textual information. Finally, the aligned joint features, combined with prompt text, are input into a pre-trained LLM, which outputs traffic scene comprehension text. To further adapt the framework to roadside monitoring scenarios, we implement a traffic scenario-refined fine-tuning strategy using Low-Rank Adaptation (LoRA). Through end-to-end multi-modal alignment and lightweight optimization, this framework provides a high-precision, low-latency, and interpretable traffic scene understanding solution for road network state perception.

Here is an example from the dataset our research team has constructed – a sample sourced from The Tianjin Road Intersection Dataset. Complete dataset will be made publicly available upon publication of the paper.

The annotations for the aforementioned example video are structured as follows: The annotation files consist of a list of dictionaries, with each entry adhering to the uniform format below:


[
  {
    "id": 0,
    "video": "datasets/example.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nClarify the contents of the displayed video with great detail."
      },
      {
        "from": "gpt",
        "value": "First, two black cars drove out of the intersection from bottom to top, and one black car drove out of the intersection from top to bottom. Next, the traffic lights turned from green to red, and a white car stopped outside the intersection at the bottom of the video. Then, multiple electric cars and cars drove through the intersection in an orderly manner. Finally, a black car turned left."
      }
    ]
  }
]

MoTIF: An end-to-end Multimodal Road Traffic Scene Understanding Foundation Model

Abstract

Main Contributions

Framework Overview of the MoTIF

Performance Testing

Time-Series-Aligned Description Generation for Urban Intersection Dynamics

Traffic Incident Detection and Description Generation (Congestion)

Traffic Incident Detection and Description Generation (Construction)

Sample Demonstration

BibTeX