
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

CVPR 2026, Denver
1University of California, Riverside    2University of Michigan

*Corresponding author

Trustworthy Autonomous Systems Laboratory (TASL)

Learns Individual and Distinct Driving Behaviors

A pedestrian occluded by a static roadside object suddenly crosses the roadway.

Driver 2

Maintains low and steady speeds throughout.

Driver 4

Makes quicker maneuvers and adopts a higher cruising speed.

Driver 6

Shows behavior in between, balancing caution and pace.

A few accident cars blocking part of the lane.

Driver 3

Maintains low and steady speeds, approaching cautiously and overtaking only when there is a safe gap.

Driver 8

Makes more agile adjustments while maneuvering around the obstacles.

Driver 10

Accelerates to cut in front of the car on the left with an assertive style.

A parked vehicle exiting a parallel parking bay to cut in front.

Driver 1

Accelerates decisively and closes the gap quickly.

Driver 7

Responds with a harder brake and wider gap acceptance.

Driver 9

Maintains a more cautious and steady style.

Aligns with Preference Instructions

A pedestrian emerging from behind a parked vehicle.

Aggressive

"Avoid big delay, I'd rather keep moving. Only make a quick evasive move if someone clearly steps out."

Conservative

"Stay alert. Someone could walk out from behind the vehicles. I don't want to take any risks."

A slow-moving hazard blocking part of the lane.

Aggressive

"If there's a chance to get around the hazard without waiting too long, feel free to take it."

Conservative

"Let's be patient until it's clearly safe. We can move around it when everything looks settled."

A parked vehicle blocking the lane with the opened door.

Aggressive

"Quickly swerve into the adjacent lane to pass the hazard without braking."

Conservative

"Please wait for a safe chance; I don't like rushing past hazards."

Negotiating with the opposite vehicles without traffic lights.

Aggressive

"I'm okay with taking tighter gaps today. I'm running late."

Conservative

"Better to let others go first, patience will keep us safest here."

The ego vehicle loses control due to bad conditions on the road.

Aggressive

"Regain control quickly and get back up to speed."

Conservative

"The road feels unsteady. Let's keep it easy and smooth."

A vehicle coming from the opposite lane invades the ego's lane.

Aggressive

"Let's move fast and maintain speed. Steer right just enough to slip past."

Conservative

"Another vehicle's moving in, better to hold until it passes."

A parked vehicle blocking the lane.

Aggressive

"I'm in a hurry. Pass the parked car quickly."

Conservative

"Please don't pass until no cars are coming, I'd feel more comfortable."

A vehicle merging into its lane from a highway on-ramp.

Aggressive

"Don't lose speed. Assert our position but avoid collisions."

Conservative

"Let merging car go, give them space, keep the ride calm."

Abstract

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural-language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected from multiple real drivers, and conditions the policy on this embedding during planning, while natural-language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style-instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving.

Why Personalized Driving?

Overview: DMW personalizes driving via long-term user preferences and short-term language intent

Human driving behavior is inherently personal and continuously evolving. Drivers exhibit stable, long-term habits while also adjusting their behavior in the moment based on short-term intent and situational context.

What are the limitations of current systems? Most end-to-end autonomous driving systems optimize a generic objective (e.g., safety/efficiency), effectively targeting an “average driver.” Some provide a small set of predefined modes (e.g., sport, comfort, eco), but these coarse presets cannot represent the nuanced, consistent differences in individual driving styles. In addition, they are not designed to interpret and follow natural-language intent specified at runtime.

Why previous personalized driving systems fall short. Existing personalization-focused driving works mainly fall into two categories. First, data-driven approaches extract predefined driving styles from human demonstrations and learn style-conditioned policies. These methods rely on a fixed, limited set of driving styles and scale poorly to a growing population of users with diverse preferences. Second, language-driven approaches leverage large language models (LLMs) for instruction-based personalization, typically learning from explicit human driver feedback. Such approaches introduce interaction costs and remain constrained to simple, low-interaction settings.

Core idea: DMW aligns the VLA policy with long-term driving patterns and real-time preference instructions to produce behaviors that are safe and effective, yet also recognizable and adaptable.

Personalized Driving Dataset (PDD)

PDD dataset overview

PDD collects real human driving demonstrations across diverse scenarios in CARLA with a steering wheel setup. Each driver is associated with structured profile information.

Download PDD (Coming Soon)

Preference Learning and Alignment

DMW method overview and architecture

Given camera observations and navigation goals, the model fuses them with the driver's long-term preference embedding and real-time user instructions to produce adaptive, personalized actions.
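The conditioning step described above can be sketched as a simple fusion: scene features, the long-term user embedding, and the short-term instruction embedding are concatenated and mapped to a waypoint plan. This is a minimal illustration only; the names, dimensions, and the single linear head standing in for the full VLA policy are assumptions, not DMW's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_plan(obs_feat, goal_feat, user_emb, instr_emb, w):
    """Concatenate scene features with the long-term user embedding and the
    short-term instruction embedding, then map the fused context to a set of
    waypoints. A single linear head stands in for the full policy network."""
    ctx = np.concatenate([obs_feat, goal_feat, user_emb, instr_emb])
    return (w @ ctx).reshape(-1, 2)  # T waypoints in the BEV plane

# Illustrative dimensions: 16-d observation, 4-d goal, 8-d user embedding,
# 8-d instruction embedding, predicting 4 (x, y) waypoints.
w = rng.normal(size=(8, 16 + 4 + 8 + 8)) * 0.1
plan = fuse_and_plan(rng.normal(size=16), rng.normal(size=4),
                     rng.normal(size=8), rng.normal(size=8), w)
```

Because the user embedding enters the same fused context as the observations, changing the driver (or the instruction) shifts the predicted plan without retraining the policy.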

Long-term Preference Encoder

Preference Encoder

The preference encoder learns user embeddings from driver profiles. It aligns structured driver profiles with historical driving behavior using a contrastive objective, ensuring that the learned embedding reflects behavioral patterns and maintains diversity among drivers. The learned embeddings are used to condition the policy during planning.
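A contrastive objective of this kind can be sketched as a symmetric InfoNCE loss: each driver's profile embedding is pulled toward an embedding of that driver's historical behavior, while embeddings of different drivers are pushed apart. This is a generic sketch under assumed shapes, not the paper's exact formulation.

```python
import numpy as np

def contrastive_alignment_loss(profile_emb, behavior_emb, temperature=0.07):
    """Symmetric InfoNCE over N drivers.

    profile_emb, behavior_emb: (N, D) arrays, one row per driver.
    Matching rows are positive pairs; all other drivers serve as negatives,
    which both aligns profiles with behavior and keeps drivers distinct.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    p, b = normalize(profile_emb), normalize(behavior_emb)
    logits = p @ b.T / temperature  # (N, N) cosine-similarity matrix

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # positives sit on the diagonal

    # Average both directions: profile -> behavior and behavior -> profile.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
emb = rng.normal(size=(8, 16))
aligned = contrastive_alignment_loss(emb, emb)           # perfectly matched pairs
mismatched = contrastive_alignment_loss(emb, emb[::-1].copy())  # wrong pairing
```

Well-aligned profile/behavior pairs produce a much lower loss than mismatched ones, which is exactly the gradient signal that shapes the user embedding.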

GRPO-based Preference Alignment

GRPO

DMW employs Group Relative Policy Optimization (GRPO) with style-aware rewards to align the policy. A residual decoder encourages the policy to sample diverse yet reasonable driving behaviors, and reward adaptation bridges the gap between subtle preferences expressed in language and dynamic style rewards.
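The group-relative part of GRPO can be illustrated in a few lines: for a group of trajectories sampled from the same state and instruction, each trajectory's style-aware reward is standardized against the group's mean and standard deviation, so no learned value function is needed. The reward values below are illustrative, not from the paper.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize each sampled trajectory's
    reward against its group. Trajectories that better match the commanded
    style score above the group mean and receive positive advantage."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: style-aware rewards for 4 trajectories sampled under the same
# "aggressive" instruction; trajectory 1 best matches the commanded style.
adv = grpo_advantages([0.2, 0.9, 0.5, 0.4])
```

The resulting advantages are zero-mean within each group, so the policy gradient only reinforces behaviors that are better than the group's other samples at matching the instructed style.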

How well does the policy adapt to various styles of real-time commands under different scenarios?

Instruction-style adaptation results

DMW shows substantially larger metric changes than the baseline when the instruction style varies, highlighting its capability for real-time preference adaptation.

Can our policy align with specific driving behavior when conditioned on learned user embeddings?


When conditioned on a given profile, DMW exhibits consistent motion patterns across different scenarios and instruction styles. Style instructions further provide short-term adaptation, shifting driving behavior toward the intended preference.

User Studies

User study: instruction-alignment ratings

Five evaluators (E1-E5) provide user study ratings (0-10) to assess the alignment between generated trajectories and intended instructions. (The figure shows the results averaged across different style instructions.) Across evaluators, DMW consistently exhibits higher speeds, shorter headways, and more decisive accelerations under aggressive instructions, while conservative instructions result in smoother control behaviors and increased safety margins.

User study: driver-similarity ratings

Ten evaluators rate the similarity between each driver's driving patterns and the model-generated behavior on a 1-10 scale. (D1, D2) are seen drivers and (D3, D4) are unseen drivers. For each scenario type, three test routes are selected. Evaluators consistently recognize driving behaviors that reflect the corresponding drivers' habits, even for unseen drivers, suggesting that the user embedding captures generalizable style information from driver profiles.

BibTeX