Motion Transformer‐based planning node for Autoware - tier4/new_planning_framework GitHub Wiki

MTR Trajectory Prediction Node (Autoware Compatible)

This node implements a PyTorch-based trajectory predictor for the ego vehicle using the MTR model (Motion Transformer with Global Intention Localization and Local Movement Refinement). It is fully compatible with Autoware's New Planning Framework.


🎯 Project Objectives

The primary goal of this node was to evaluate the practicality of integrating the MTR model into a planning or simulation pipeline. While MTR is primarily designed for trajectory prediction, this project sought to understand how well it could function — or be adapted — for planning-related tasks.

✅ Key Objectives

  • Model Integration Testing
    Integrate the MTR model into a ROS 2-based simulation setup to evaluate its predictions in a closed-loop environment.

  • Training Data Requirements
    Investigate the amount and type of training data required to achieve generalization in simulation, including:

    • Quantity: number of scenarios and data richness.
    • Diversity: straight vs. curved roads, intersections, roundabouts, etc.
    • Structure: need for semantic inputs (e.g., route, traffic light states).
  • Closed-loop Simulation Behavior
    Analyze how MTR behaves when its outputs directly influence the ego vehicle’s motion, especially under edge cases or ambiguous scenes.

  • Model Limitations & Bottlenecks
    Identify practical limitations in training and deployment:

    • Hardware constraints (e.g., batch size, GPU memory).
    • Data preprocessing complexity.
    • Latency or inference time in real-time scenarios.
  • Selector Framework Placeholder
    Provide a trajectory generator that can serve as a stand-in while the selector framework is developed.

  • Baseline for Future Work
    Establish a reproducible foundation for future improvements, such as:

    • Route-conditioned prediction.
    • Selective trajectory sampling.
    • Model fusion with planning heuristics or reinforcement learning.

📌 Summary

This node served as a sandbox for experimentation — a means to quantify MTR’s strengths and weaknesses when stepping beyond its original scope as a predictor. The insights gained here will guide future iterations involving better conditioning, model fusion, and more principled dataset design.

🔧 Overview

The node predicts 6 trajectory modes for the ego vehicle using historical ego and agent data, as well as lanelet2 map information. It runs at 10 Hz, producing predictions that are consumed by downstream selectors and planners within Autoware.

In yellow: MTR node outputs


📥 Inputs

Subscribed ROS 2 Topics

  • /perception/tracked_objects
  • /localization/kinematic_state
  • /map/vector_map (Lanelet2)

Model Inputs

| Key | Description | Shape |
|-----|-------------|-------|
| `obj_trajs` | Past agent trajectories (target-centric, embedded) | `[1, A, 11, 29]` |
| `obj_trajs_mask` | Mask for valid steps | `[1, A, 11]` |
| `map_polylines` | Encoded lanelet2 polyline features | `[1, L, 20, 9]` |
| `map_polylines_mask` | Mask for valid polyline points | `[1, L, 20]` |
| `map_polylines_center` | Center points of each polyline | `[1, L, 3]` |
| `obj_trajs_last_pos` | Final positions of each agent | `[1, A, 3]` |
| `intention_points` | Candidate goals for ego | `[1, 64, 2]` |
| `track_index_to_predict` | Index of ego in the agent list | `[1]` |
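The padding and masking these shapes imply can be sketched as follows. This is an illustrative sketch, not the node's actual preprocessing: `MAX_AGENTS`, `build_model_inputs`, and the assumption that x, y lead the 29-dim feature vector are all hypothetical.

```python
import numpy as np

# Hypothetical sizes; the real node derives A from the scene.
MAX_AGENTS, HIST_STEPS, AGENT_FEATS = 8, 11, 29

def build_model_inputs(agent_histories, ego_index):
    """Pad a variable number of agent histories into fixed-size batched
    tensors plus validity masks, matching the input table's shapes."""
    obj_trajs = np.zeros((1, MAX_AGENTS, HIST_STEPS, AGENT_FEATS), np.float32)
    obj_trajs_mask = np.zeros((1, MAX_AGENTS, HIST_STEPS), bool)
    obj_trajs_last_pos = np.zeros((1, MAX_AGENTS, 3), np.float32)

    for a, hist in enumerate(agent_histories[:MAX_AGENTS]):
        t = hist.shape[0]
        # Right-align the history so the last row is the current step.
        obj_trajs[0, a, HIST_STEPS - t:] = hist
        obj_trajs_mask[0, a, HIST_STEPS - t:] = True
        # Assumes x, y are the first two features (illustrative only).
        obj_trajs_last_pos[0, a, :2] = hist[-1, :2]

    return {
        "obj_trajs": obj_trajs,
        "obj_trajs_mask": obj_trajs_mask,
        "obj_trajs_last_pos": obj_trajs_last_pos,
        "track_index_to_predict": np.array([ego_index], np.int64),
    }
```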

🧠 Prediction Logic

The node generates 6 future trajectory modes for the ego vehicle. Each trajectory consists of:

  • 80 future waypoints (0.1s intervals → 8 seconds)
  • Position (x, y) and velocity (vx, vy)
  • Estimated yaw (computed post-inference)

Mode selection is delegated to downstream nodes.
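Since yaw is computed post-inference, one way to recover it from consecutive (x, y) waypoints is to take the heading of each segment. A minimal sketch, assuming the model outputs positions only (function name and shapes are illustrative):

```python
import numpy as np

def estimate_yaw(trajectories):
    """Estimate a heading for each predicted waypoint from consecutive
    (x, y) positions; trajectories has shape [modes, steps, 2]."""
    deltas = np.diff(trajectories, axis=1)            # [modes, steps-1, 2]
    yaw = np.arctan2(deltas[..., 1], deltas[..., 0])  # heading of each segment
    # Repeat the last segment's heading for the final waypoint.
    return np.concatenate([yaw, yaw[:, -1:]], axis=1)  # [modes, steps]
```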


🧪 Training & Validation

The model was trained using synthetic T4 data, generated by:

  1. Automatically assigning start/goal poses in an urban map.
  2. Engaging Autoware and recording 1-minute rosbags.
  3. Extracting 10-second training scenes from these rosbags.

This process yielded approximately 5,000 scenes; scene duplication then extended the dataset to roughly 60,000 scenes.

A separate training repository processes this data, and extensive testing confirms that this node's pre-processing produces results identical to the training pipeline.
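Step 3 above can be sketched as a sliding window over each bag's samples. The 1-second stride below is an assumption for illustration; the actual pipeline's overlap may differ.

```python
def extract_scene_windows(n_samples, rate_hz=10.0, scene_len_s=10.0, stride_s=1.0):
    """Return (start, end) sample-index pairs for fixed-length training
    scenes cut from one rosbag recorded at rate_hz."""
    win = int(scene_len_s * rate_hz)
    step = int(stride_s * rate_hz)
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]
```

A 1-minute bag at 10 Hz (600 samples) yields 51 overlapping 10-second windows under these assumptions.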

Comparison of node and training framework outputs


🗺️ Map Handling

The node uses a Lanelet2 map, which is converted into a Polyline format compatible with the model. Each polyline consists of:

  • Up to 20 points
  • 9 features per point (e.g., position, direction, type)
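A minimal sketch of converting one centerline into fixed-length, masked polylines. The exact 9-feature layout is model-specific; the chunking, padding, and center computation here are assumptions for illustration.

```python
import numpy as np

PTS_PER_POLYLINE = 20
MAP_FEATS = 9  # e.g. position, direction, type; exact layout is model-specific

def to_fixed_polylines(points):
    """Split one polyline (shape [N, MAP_FEATS]) into fixed-length chunks,
    padding the last chunk and returning a validity mask alongside."""
    chunks, masks = [], []
    for i in range(0, len(points), PTS_PER_POLYLINE):
        chunk = points[i:i + PTS_PER_POLYLINE]
        pad = PTS_PER_POLYLINE - len(chunk)
        chunks.append(np.pad(chunk, ((0, pad), (0, 0))))
        masks.append(np.r_[np.ones(len(chunk), bool), np.zeros(pad, bool)])
    polylines = np.stack(chunks)  # [L, 20, 9]
    mask = np.stack(masks)        # [L, 20]
    # Polyline centers: mean of the valid (x, y, z) points per polyline.
    centers = np.stack([p[m, :3].mean(axis=0) for p, m in zip(polylines, mask)])
    return polylines, mask, centers
```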

⚠️ Important Notes:

  • Map coverage affects prediction quality.
  • Including or excluding sidewalks in the polyline map can significantly alter MTR’s output. For example, if sidewalks are not represented, the model may avoid generating paths that cross those areas, since it has no polylines there to follow.

Polyline representation of the lanelet2 map used by the MTR node


⚡ Performance

  • Inference frequency: 10 Hz
  • GPU latency: ~30–40 ms (on RTX 3090)
  • CPU-only inference is unlikely to sustain the 10 Hz real-time rate
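Latency figures like these can be gathered with a simple timing harness. This is a generic sketch, not the node's actual benchmark: `measure_latency` and its arguments are hypothetical, and with a CUDA model you would also synchronize the device before reading the clock so queued kernels are not timed as free.

```python
import time
import numpy as np

def measure_latency(infer_fn, inputs, warmup=5, runs=50):
    """Time an inference callable and report mean / p99 latency in ms.
    For a CUDA model, call torch.cuda.synchronize() before each
    perf_counter() read so asynchronous kernel launches are included."""
    for _ in range(warmup):       # warm caches / JIT before measuring
        infer_fn(inputs)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(inputs)
        samples.append((time.perf_counter() - t0) * 1e3)
    return {"mean_ms": float(np.mean(samples)),
            "p99_ms": float(np.percentile(samples, 99))}
```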

🧪 Simulation Tests

Although the training process showed promising quantitative results, simulation performance was suboptimal. Most of the predicted trajectory modes were unsafe, often veering outside the map boundaries. While the MTR model appeared to learn general motion patterns from the training data, its lack of high-level guidance (e.g., route planning, lane-following commands, turn intentions) severely limited its real-world applicability in closed-loop simulation.

This aligns with the original design of MTR as a trajectory prediction model — not a planner. Without an explicit route or semantic command input, the ego vehicle lacks the context necessary for reliable decision-making during autonomous navigation.

checkpoint_40.webm

Example simulation output showing trajectory drift and mode dispersion

⚠️ Key Observations

  • Synthetic training data limitations: Training on naively generated synthetic scenarios (e.g., randomized start/goal pairs) introduced challenges in coverage, relevance, and diversity.
  • Long training cycles: Training times were significant, especially given the volume of synthetic data required to achieve generalization.
  • Scenario mining was lacking: No curriculum learning or hard-negative sampling was used, which likely contributed to poor behavior in corner cases.
  • Instability in PSim datasets: Results were inconsistent, particularly in complex scenes such as curves and intersections.
  • Hardware bottlenecks: Training was constrained to a batch size of 6 due to GPU memory limits (RTX 3090), which likely impacted stability and convergence.
  • Checkpoint quality: Despite fine-tuning, the T4-generated model checkpoints did not generalize well and failed to produce consistent, safe trajectories in simulation.

💡 Takeaway

MTR is not a plug-and-play planner. Its predictions are highly sensitive to input signals, map structure, and training diversity. In safety-critical applications like motion planning, additional modules — such as goal conditioning, semantic intention inputs, or downstream selectors — are essential to ensure safe operation.

This aligns with the initial project objectives, which sought to assess whether MTR could be effectively used in a planning context. Through integration testing, closed-loop simulation, and training analysis, it became evident that while MTR captures general motion patterns, it lacks the architectural components and robustness required for direct deployment in a planning stack.

As such, the findings strongly support further exploration of:

  • Route-conditioned or intention-aware variants of MTR,
  • Dataset refinement to include more structured and diverse supervision,
  • Hybrid approaches that combine prediction with rule-based or optimization-based selectors.
