
Notes on Language Conditioned Traffic Generation

This is a summary of an important research paper, crafted by humans working with several AIs at a 25:1 time savings over reading the full paper. The goal is to save time and curate good ideas.

Published
5 min read

Link to paper: https://arxiv.org/abs/2307.07947

Paper published on: 2023-07-16

Paper's authors: Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, Philipp Kraehenbuehl

GPT3 API Cost: $0.05

GPT4 API Cost: $0.14

Total Cost To Write This: $0.19

Time Savings: 25:1

The ELI5 TLDR:

Researchers have developed Language Conditioned Traffic Generation (LCTGen), a model that generates realistic traffic scenarios for self-driving development. LCTGen uses language as a source of supervision, combining a large language model with a transformer-based decoder architecture, and it outperforms previous models in realism and fidelity. The model has three main components: an Interpreter, a Generator, and an Encoder, and it pairs a transformer-based language model with a retrieval module that selects real-world map regions. It is trained on a real-world, scenario-only driving dataset and evaluated on scene reconstruction metrics. Beyond generation, LCTGen can modify existing traffic scenarios based on user instructions, which gives it practical applications in instructional traffic scenario editing and self-driving policy evaluation. Its main limitation is the lack of detailed lane information from the map. Overall, LCTGen is a promising tool for generating and modifying traffic scenarios for self-driving technology.

The Deeper Dive:

A New Approach to Traffic Scene Generation

The research paper we're discussing today presents a novel approach to traffic scene generation for self-driving development. The authors introduce a model called Language Conditioned Traffic Generation (LCTGen) that uses language as a source of supervision for dynamic traffic scene generation. This model combines a large language model with a transformer-based decoder architecture to generate traffic scenarios.

What sets LCTGen apart from previous work is that it outperforms prior models in both unconditional and conditional traffic scene generation in terms of realism and fidelity. This is achieved by using a scenario-only dataset and a Large Language Model (LLM) to bridge the absence of a shared representation between language and traffic scenarios.

Understanding LCTGen

LCTGen is composed of three main components: an Interpreter, a Generator, and an Encoder. The Interpreter converts a natural language query into a compact, structured representation and retrieves an appropriate map from a real-world map library. The Generator then uses this structured representation and map to generate realistic traffic scenarios.

The structured representation is a key part of this process. It includes a map-specific component and agent-specific components. The map-specific component includes information about the number of lanes, distance to the nearest intersection, and the ego vehicle's lane ID, while the agent-specific components describe the attributes of each vehicle in the scenario, including their quadrant, distance, orientation, speed, and actions.
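To make the structured representation concrete, here is a minimal sketch of what its map-specific and agent-specific parts might look like in code. The exact field names, units, and types are assumptions for illustration; the paper only describes the attributes at a high level (lane count, distance to intersection, ego lane ID; agent quadrant, distance, orientation, speed, actions).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MapSpec:
    """Map-specific part of the structured representation (illustrative fields)."""
    num_lanes: int
    dist_to_intersection: float  # meters
    ego_lane_id: int

@dataclass
class AgentSpec:
    """Agent-specific part: one entry per vehicle in the scenario."""
    quadrant: int        # position relative to the ego vehicle
    distance: float      # meters from the ego vehicle
    orientation: float   # heading in degrees
    speed: float         # m/s
    actions: List[str]   # e.g. ["keep_lane", "accelerate"] (hypothetical labels)

@dataclass
class StructuredScene:
    map_spec: MapSpec
    agents: List[AgentSpec]

# A tiny example scene: a 4-lane road with one non-ego vehicle behind the ego car.
scene = StructuredScene(
    map_spec=MapSpec(num_lanes=4, dist_to_intersection=30.0, ego_lane_id=1),
    agents=[AgentSpec(quadrant=3, distance=12.0, orientation=180.0,
                      speed=8.0, actions=["keep_lane"])],
)
print(len(scene.agents))  # -> 1
```

The Interpreter's job is to fill a structure like this from a natural language query; the Generator then consumes it together with the retrieved map.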

To achieve the generation of realistic traffic scenes, LCTGen uses a transformer-based language model, GPT-4, for language interpretation and a retrieval module to sample map regions from a map dataset. The Retrieval module samples map regions that align with the center of the map representation.
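One way to picture the retrieval step is nearest-neighbor matching over simple map features. This is a sketch under the assumption that map regions can be scored by feature distance; the library contents and feature choice here are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def retrieve_map_region(query_feats, library_feats):
    """Return the index of the library map region whose features
    (e.g. lane count, distance to intersection) best match the query."""
    dists = np.linalg.norm(library_feats - query_feats, axis=1)
    return int(np.argmin(dists))

# Toy library of (num_lanes, dist_to_intersection) features for three regions.
library = np.array([[2.0, 50.0], [4.0, 30.0], [3.0, 10.0]])
query = np.array([4.0, 28.0])    # the Interpreter asked for a 4-lane road
print(retrieve_map_region(query, library))  # -> 1 (the closest match)
```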

Training LCTGen

LCTGen is trained with a real-world scenario-only driving dataset. The training process involves using a generative transformer to capture interactions between agents and the map. This is done using a map encoder to extract per-lane map features and an agent query generator to convert structured representations of agents into agent queries.

The generative transformer models agent-agent and agent-map interactions using multi-head cross-attention and multi-head self-attention. The scene decoder then decodes the position, attributes, and motion of each agent using a Multi-Layer Perceptron (MLP).
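The core mechanism described above can be sketched as scaled dot-product attention followed by an MLP head. This toy NumPy version shows single-head cross-attention from agent queries to per-lane map features and an MLP decoding a position per agent; dimensions, weights, and the single-head simplification are all illustrative assumptions, not the paper's architecture details.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: agent queries attend to lane features."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
agent_q = rng.normal(size=(5, 16))   # 5 agent queries (from the query generator)
lane_f  = rng.normal(size=(12, 16))  # 12 per-lane map features (from the encoder)

# Agent-map interaction: each agent query is fused with the map.
fused = cross_attention(agent_q, lane_f, lane_f)

# A small MLP head decodes a 2-D position (x, y) for each agent.
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 2)), np.zeros(2)
positions = np.maximum(fused @ W1 + b1, 0) @ W2 + b2  # ReLU MLP
print(positions.shape)  # -> (5, 2): one (x, y) per agent
```

The real model stacks multi-head self-attention (agent-agent) with this cross-attention (agent-map) and decodes attributes and motion as well, but the data flow is the same.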

Evaluating LCTGen

The evaluation of LCTGen is based on scene reconstruction metrics such as maximum mean discrepancy (MMD), mean average distance error (mADE), mean final distance error (mFDE), and scenario collision rate (SCR). The model outperforms existing methods in terms of scene initialization and motion behavior realism, achieving significantly lower MMD values and smaller mADE and mFDE values. Additionally, it achieves a lower scenario collision rate compared to baselines.
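mADE and mFDE are standard trajectory metrics, so they are easy to pin down in code: average displacement over all timesteps versus displacement at the final timestep. The sketch below assumes trajectories shaped `(num_agents, T, 2)`; the toy inputs are made up for illustration.

```python
import numpy as np

def made_mfde(pred, gt):
    """mADE: mean L2 error over all agents and timesteps.
    mFDE: mean L2 error at the final timestep only.
    pred, gt: arrays of shape (num_agents, T, 2)."""
    err = np.linalg.norm(pred - gt, axis=-1)   # (num_agents, T)
    return err.mean(), err[:, -1].mean()

# Toy case: every predicted point is offset by (1, 1), i.e. sqrt(2) meters.
pred = np.zeros((2, 3, 2))
gt = np.ones((2, 3, 2))
made, mfde = made_mfde(pred, gt)
print(round(made, 3), round(mfde, 3))  # -> 1.414 1.414
```

Lower is better for both, which is why "smaller mADE and mFDE values" indicates more realistic motion.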

LCTGen in Practice

Beyond just generating traffic scenarios, LCTGen also excels at modifying existing ones based on user instructions. The user provides a fixed-form traffic scenario description, map description, and a natural language instruction. The model then outputs a modified traffic scenario according to the instruction.

The first step is to identify which part of the scenario should be modified based on the instruction. For example, if the instruction is to move the vehicle behind the ego vehicle to the opposite lane and accelerate, the model identifies the vehicle behind the ego vehicle (V2), moves it to the leftmost lane of the opposite-direction lanes, changes its direction to parallel_opposite, and moves it to the left back of the ego car. V2's speed is also increased to 10 (25 m/s).
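The editing workflow above amounts to a structured update of the scenario representation. Here is a minimal sketch using a plain dictionary; the keys, value vocabulary, and the 2.5 m/s speed buckets (so that speed 10 means 25 m/s) are assumptions based on the example, not the paper's exact schema.

```python
# Hypothetical scene keyed by agent ID, mirroring the V2 example above.
scene = {
    "V2": {"lane": "ego_lane", "direction": "parallel_same",
           "position": "behind_ego", "speed": 4},  # speed in 2.5 m/s buckets
}

def apply_instruction(scene, agent_id, **changes):
    """Apply a structured edit derived from a natural-language instruction."""
    edited = {k: dict(v) for k, v in scene.items()}  # copy before editing
    edited[agent_id].update(changes)
    return edited

# "Move the vehicle behind the ego vehicle to the opposite lane and accelerate."
edited = apply_instruction(
    scene, "V2",
    lane="opposite_leftmost", direction="parallel_opposite",
    position="left_back_of_ego", speed=10,  # 10 * 2.5 m/s = 25 m/s
)
print(edited["V2"]["speed"])  # -> 10
```

In LCTGen the LLM-based Interpreter produces edits like this from the instruction, and the Generator re-renders the modified scenario.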

The Impact of LCTGen

LCTGen is not just a theoretical model; it has practical applications. It can be used for instructional traffic scenario editing and controllable self-driving policy evaluation. It's a tool that can generate traffic scenarios with varying properties for controlled evaluation of self-driving policies.

The research also explores the use of LCTGen to generate scenarios for controllable self-driving policy evaluation. The performance of two self-driving policies (IDM and PPO) was evaluated using different types of generated scenarios. The success rate and collision rate of the policies varied depending on the type of scenario.

Limitations and Future Work

Despite its novel approach and promising results, LCTGen does have its limitations. The primary limitation is the lack of access to detailed lane information from the map. Future work could explore ways to incorporate more detailed map information into the model.

In conclusion, LCTGen presents a new way of generating and modifying traffic scenarios for self-driving development. Its use of language as a source of supervision and its ability to generate realistic traffic scenarios make it a promising tool for future research and development in the field of self-driving technology.