Manuscript · 2026

Human-like autonomy emerges from self-play and a pinch of human data

Daphne Cornelisse¹ · Julian Hunt² · Zixu Zhang³ · Waël Doulazmi⁴,⁵ · Kevin Joseph² · Jaime Fernández Fisac³ · Eugene Vinitsky¹

¹NYU Tandon School of Engineering · ²NYU Courant · ³Princeton University · ⁴Centre for Robotics, Mines Paris · ⁵Valeo

Spiced self-play combines over 60 years of simulated self-play with 30 minutes of human driving data as a behavioral anchor, improving coordination with logged human trajectories without reward engineering or domain randomization.

Code forthcoming
99.4% safe task completion with human-replay proxies
2,500× less human driving data than imitation-learning baselines
20B self-play transitions, roughly 63 years of driving
15 hrs end-to-end training on one consumer GPU

Rollouts

Qualitative rollouts illustrate policy behavior under replay.

Each rollout is a held-out human-replay scenario: only the blue ego vehicle is controlled by the learned policy, while green vehicles replay recorded human trajectories. The agent is goal-conditioned on the target destination; when the ego vehicle collides, it is colored red.

Spiced policy slowing and waiting for a pedestrian.

Spiced · human-replay

Waiting for a pedestrian

The spiced policy completes the scenario while producing an interaction that other road users can more readily anticipate.

Spiced · rollout

Coordinated interaction

Spiced self-play remains task-effective while staying close to behavior induced by the human-data anchor.

Unregularized self-play policy moving quickly through traffic.

Unregularized · atypical

Atypical successful rollout

An unregularized policy can complete the scenario using maneuvers that are less consistent with logged human behavior.

Unregularized policy failure example.

Unregularized · transfer

Transfer failure under replay

Without the human-data anchor, self-play may converge to behaviors that transfer poorly to human-replay evaluation.

Spiced policy failure example.

Spiced · failure

Remaining failure case

Human-replay remains an imperfect proxy for deployment, and some interactions remain unresolved.

Demonstration replay

Demonstration data

Expert SDC replays illustrate the demonstrations used to fit the behavioral anchor.

Summary

Spiced self-play uses a pinch of human data to steer scalable self-play toward human-compatible driving.

Self-play RL can scale through simulated experience, but policies trained only against themselves may adopt effective driving conventions that do not coordinate with human drivers. Spiced self-play keeps self-play as the main training engine and adds a small behavioral cloning anchor from human driving data.

With 30 minutes of Waymo human driving data and 20B self-play transitions, spiced policies reach 0.994 safe task completion under human replay, outperforming both unregularized self-play and SMART-tiny CLSFT while avoiding reward engineering and domain randomization.
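The headline figures can be sanity-checked with a little arithmetic. The sketch below assumes each transition covers 0.1 s of simulated driving (a 10 Hz step), which this page does not state; under that assumption, 20B transitions come to roughly 63 years, and a 2,500× reduction from 30 minutes implies baselines trained on about 1,250 hours of human driving. The function names are illustrative only.

```python
# Back-of-the-envelope check of the headline figures.
# ASSUMPTION: one transition = 0.1 s of simulated time (10 Hz step).
# The step length is not stated on this page.
SECONDS_PER_TRANSITION = 0.1
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transitions_to_years(n_transitions, dt=SECONDS_PER_TRANSITION):
    """Convert a transition count into years of simulated driving."""
    return n_transitions * dt / SECONDS_PER_YEAR


def implied_baseline_hours(human_minutes, reduction_factor):
    """Hours of human data implied for a baseline using reduction_factor times more."""
    return human_minutes * reduction_factor / 60.0


self_play_years = transitions_to_years(20e9)
baseline_hours = implied_baseline_hours(30, 2500)

print(f"Self-play experience: {self_play_years:.1f} years")
print(f"Implied baseline human data: {baseline_hours:.0f} hours")
```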

Spiced objective

PPO training uses a minimal safe-goal-reaching reward, while the behavioral anchor keeps updates close to conventions observed in a small human dataset.
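One way to picture the objective is as a PPO loss plus a weighted behavioral cloning term. The sketch below is a minimal NumPy illustration under that assumption, not the manuscript's implementation: the additive form, the `anchor_weight` hyperparameter, and all function names are hypothetical, and value and entropy terms are omitted.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def bc_anchor_loss(logp_human_actions):
    """Negative log-likelihood of logged human actions under the current policy."""
    return -np.mean(logp_human_actions)


def spiced_loss(logp_new, logp_old, advantages, logp_human_actions,
                anchor_weight=0.1):
    # ASSUMPTION: the anchor enters as an additive term with a fixed weight.
    # anchor_weight is a hypothetical hyperparameter, not a value from the paper.
    rl_term = ppo_clip_loss(logp_new, logp_old, advantages)
    bc_term = bc_anchor_loss(logp_human_actions)
    return rl_term + anchor_weight * bc_term


# Toy batch: the policy is unchanged since data collection (ratio = 1) and
# assigns probability 0.5 to each logged human action.
logp_old = np.log(np.array([0.4, 0.6]))
logp_new = logp_old.copy()
advantages = np.array([1.0, -1.0])
logp_human = np.log(np.array([0.5, 0.5]))

total = spiced_loss(logp_new, logp_old, advantages, logp_human)
print(total)
```

In this toy batch the advantages cancel, so the total reduces to the anchor term alone, which makes the pull toward the human data easy to see in isolation.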

Human replay

Held-out scenes replay logged human agents while the learned policy controls the ego vehicle, exposing coordination failures that self-play evaluation can miss.

Data scaling

The experiments vary map diversity and the amount of human driving data, from a few minutes up to full-dataset reference points, to measure when coordination emerges.

Evaluation

Human compatibility is measured by safe task completion under replay.

Diagram comparing self-play evaluation, human-replay collision events, and human-replay at-fault collision events.
Safe task completion combines task completion with the absence of at-fault collisions, distinguishing failures caused by the learned policy from collisions initiated by replayed human agents.
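As a rough illustration of how such a metric could be computed, the sketch below counts an episode as a safe completion when the ego vehicle reaches its goal and is not at fault in any collision. This is an assumed reading of the metric: the `Episode` fields and the at-fault label are hypothetical, and the rule used to attribute fault is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    reached_goal: bool
    ego_collided: bool
    ego_at_fault: bool  # hypothetical label; the attribution rule is an assumption


def is_safe_completion(ep):
    """Goal reached and no collision attributed to the learned policy."""
    at_fault_collision = ep.ego_collided and ep.ego_at_fault
    return ep.reached_goal and not at_fault_collision


def safe_task_completion(episodes):
    """Fraction of episodes that are safe completions."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(is_safe_completion(ep) for ep in episodes) / len(episodes)


episodes = [
    Episode(reached_goal=True, ego_collided=False, ego_at_fault=False),   # clean success
    Episode(reached_goal=True, ego_collided=True, ego_at_fault=False),    # hit by a replayed agent
    Episode(reached_goal=True, ego_collided=True, ego_at_fault=True),     # ego caused the collision
    Episode(reached_goal=False, ego_collided=False, ego_at_fault=False),  # goal not reached
]

score = safe_task_completion(episodes)
print(score)
```

The second episode still counts as safe because the collision is attributed to a replayed human agent, which cannot react to the ego vehicle.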

Results

Spiced self-play improves safe coordination with a fraction of the human data.

Plots comparing safe task completion, total training transitions, and trajectory behavior for spiced self-play, self-play, and imitation-learning baselines.
Headline result from the current manuscript. Spiced self-play reaches 0.994 safe task completion with 20B self-play transitions and roughly 30 minutes of human driving data, improving over unregularized self-play while using 2,500× less human data than imitation-learning baselines.
Charts comparing collision severity distributions for unregularized and spiced self-play policies.
Spiced policies also change the character of failures: their collisions are less severe on average and concentrate at lower per-event speed changes than those of the unregularized self-play baseline.

Abstract

Spiced self-play adds a small behavioral anchor to scalable self-play.

Self-play reinforcement learning can substitute cheap, large-scale simulation for large human driving datasets, but policies trained only through self-play can converge to effective yet incompatible driving conventions. Prior work often addresses this through reward engineering and domain randomization.

We introduce spiced self-play: policies trained with a minimal safe-goal-reaching reward, over 60 years of self-play simulation, and 30 minutes of human driving data as a behavioral anchor. The resulting policies coordinate with logged human trajectories using 2,500× less human data than imitation-learning baselines, and the full pipeline runs in 15 hours on a single consumer-grade GPU.

Citation

BibTeX

@unpublished{cornelisse2026humanlikeautonomy,
  title  = {Human-like autonomy emerges from self-play and a pinch of human data},
  author = {Cornelisse, Daphne and Hunt, Julian and Zhang, Zixu and Doulazmi, Wa{\"e}l and Joseph, Kevin and {Fern{\'a}ndez Fisac}, Jaime and Vinitsky, Eugene},
  year   = {2026},
  month  = may,
  note   = {Manuscript}
}