Manuscript · 2026

Human-like autonomy emerges from self-play and a pinch of human data

Daphne Cornelisse¹ · Julian Hunt² · Zixu Zhang³ · Waël Doulazmi⁴,⁵ · Kevin Joseph² · Jaime Fernández Fisac³ · Eugene Vinitsky¹

¹NYU Tandon School of Engineering · ²NYU Courant · ³Princeton University · ⁴Centre for Robotics, Mines Paris · ⁵Valeo

Spiced self-play combines over 60 years of simulated self-play with 30 minutes of human driving data as a behavioral anchor, improving coordination with logged human trajectories without reward engineering or domain randomization.

99.4% safe task completion with human-replay proxies

2,500× less human driving data than imitation-learning baselines

20B self-play transitions, roughly 63 years of driving

15 hrs end-to-end training on one consumer GPU

Abstract

Spiced self-play adds a small behavioral anchor to scalable self-play.

Self-play reinforcement learning can substitute cheap, large-scale simulation for large human driving datasets, but policies trained only through self-play can converge to effective yet incompatible driving conventions. Prior work often addresses this through reward engineering and domain randomization.

We introduce spiced self-play: policies trained with a minimal safe-goal-reaching reward, over 60 years of self-play simulation, and 30 minutes of human driving data as a behavioral anchor. The resulting policies coordinate with logged human trajectories using 2,500× less human data than imitation-learning baselines, and the full pipeline runs in 15 hours on a single consumer-grade GPU.

Rollouts

These videos pair held-out human-replay scenarios so that the regularized and unregularized policies can be compared in a shared camera view. In each rollout the learned ego agent is goal-conditioned: the blue vehicle is driven by the regularized policy, the yellow vehicle by the unregularized policy, the green circle marks the goal, and the light green vehicles replay recorded human trajectories. Both policies are trained with the same minimal reward: +1 when the goal is reached (the goal indicator I[goal] is true) and -1 for an off-road or collision event.
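
The minimal reward described above can be sketched in a few lines. This is only an illustration, not the released implementation: the `StepEvents` container and its field names are hypothetical, and the sketch assumes that a step with both an off-road and a collision event is penalized once and that the goal bonus and the penalty simply add.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """Per-step event flags for the ego agent (hypothetical container)."""
    reached_goal: bool
    off_road: bool
    collided: bool


def minimal_reward(events: StepEvents) -> float:
    """Return +1 for reaching the goal and -1 for an off-road or collision event.

    Assumption: off-road and collision share a single -1 penalty per step,
    and the goal bonus and the penalty are summed when both occur.
    """
    reward = 0.0
    if events.reached_goal:
        reward += 1.0
    if events.off_road or events.collided:
        reward -= 1.0
    return reward
```

Nothing in this reward refers to lane keeping, comfort, or distance to other vehicles, which is why human-like conventions have to come from somewhere else.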

Regularized

Unregularized

Paired rollout

Negotiating cross traffic

Regularized self-play keeps a more legible path through the interaction, while the unregularized policy presses more directly toward the goal.

Dense queue interaction

The regularized policy stays with the flow of replayed vehicles; the unregularized policy is less patient in the same scene.

Waiting in traffic

With the human-data anchor, the ego vehicle is more willing to wait instead of exploiting every narrow opening.

Awkward behavior avoided

The unregularized rollout can avoid impact while still behaving awkwardly; regularized self-play produces a smoother interaction.

Cautious merge timing

The regularized policy leaves more time around nearby replayed vehicles, while the unregularized rollout takes a tighter interaction.

Social pacing

The paired replay shows a more measured regularized trajectory compared with the unregularized policy's more direct progress.

Social spacing

The paired replay highlights the behavioral difference behind the aggregate social-driving metrics.

Gap selection

The regularized policy leaves more room around replayed traffic, while the unregularized rollout takes a tighter, more direct line.

Patient interaction

The paired replay shows the regularized agent preserving a more cautious interaction pattern around nearby human-driven vehicles.

Summary

Spiced self-play uses a pinch of human data to steer scalable self-play toward human-compatible driving.

Self-play RL can scale through simulated experience, but policies trained only against themselves may adopt effective driving conventions that do not coordinate with human drivers. Spiced self-play keeps self-play as the main training engine and adds a small behavioral cloning anchor from human driving data.

With 30 minutes of Waymo human driving data and 20B self-play transitions, spiced policies reach 0.994 safe task completion under human replay, outperforming both unregularized self-play and SMART-tiny CLSFT while avoiding reward engineering and domain randomization.
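
As a sanity check on the scale, the conversion from transitions to simulated driving time can be worked through directly. The step duration is not stated here; the sketch below assumes 0.1 s of simulated time per transition, chosen because it reproduces the "roughly 63 years" figure quoted for 20B transitions. The function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, in seconds


def transitions_to_years(num_transitions: float, step_seconds: float = 0.1) -> float:
    """Convert a transition count to simulated years.

    Assumption: every transition advances the simulation by `step_seconds`
    (0.1 s by default, which is not confirmed by the source).
    """
    return num_transitions * step_seconds / SECONDS_PER_YEAR


years = transitions_to_years(20e9)
print(f"{years:.1f} simulated years")  # prints "63.4 simulated years"
```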

Spiced objective

PPO training uses a minimal safe-goal-reaching reward, while the behavioral anchor keeps updates close to conventions observed in a small human dataset.
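
One plausible way to combine the two terms is to add a weighted behavioral-cloning loss to the PPO surrogate. The sketch below is written under that assumption and is not taken from the paper's code: the additive form, the `anchor_weight` value, and all function names are hypothetical, and the value and entropy terms of a full PPO loss are omitted for brevity.

```python
import math


def ppo_clipped_loss(ratios, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so that lower is better."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def behavioral_cloning_loss(human_action_probs):
    """Mean negative log-likelihood the policy assigns to logged human actions."""
    return -sum(math.log(p) for p in human_action_probs) / len(human_action_probs)


def spiced_loss(ratios, advantages, human_action_probs, anchor_weight=0.1):
    """PPO loss on self-play data plus a weighted anchor on human data.

    Assumption: the anchor enters as a simple additive term; `anchor_weight`
    is an illustrative value, not one reported by the authors.
    """
    rl_term = ppo_clipped_loss(ratios, advantages)
    anchor_term = behavioral_cloning_loss(human_action_probs)
    return rl_term + anchor_weight * anchor_term
```

In this form the self-play batch and the human batch can be sampled independently, so a small human dataset can be reused at every update while the self-play data keeps growing.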

Human replay

Held-out scenes replay logged human agents while the learned policy controls the ego vehicle, exposing coordination failures that self-play evaluation can miss.

Data scaling

The experiments vary map diversity and the amount of human driving data, from a few minutes up to full-dataset reference points, to measure when coordination emerges.

Evaluation

Human compatibility is measured by safe task completion under replay.
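
The metric can be sketched as a simple fraction over replay episodes. The exact definition is not spelled out here, so the code assumes that an episode counts as safely completed when the ego vehicle reaches its goal without any collision or off-road event; the `EpisodeOutcome` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Summary of one human-replay episode for the ego agent (hypothetical)."""
    reached_goal: bool
    collided: bool
    went_off_road: bool


def safe_task_completion(outcomes):
    """Fraction of episodes that reach the goal with no collision or off-road event."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    safe = sum(
        1
        for o in outcomes
        if o.reached_goal and not o.collided and not o.went_off_road
    )
    return safe / len(outcomes)
```

Under this reading, a score of 0.994 would mean that 994 of every 1,000 held-out replay episodes end at the goal with no safety event.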

Results

Spiced self-play improves safe coordination with a fraction of the human data.

Figure: plots comparing safe task completion, total training transitions, and trajectory behavior for spiced self-play, self-play, and imitation-learning baselines.

Headline result from the current manuscript: spiced self-play reaches 0.994 safe task completion with 20B self-play transitions and roughly 30 minutes of human driving data, improving over unregularized self-play while using 2,500× less human data than imitation-learning baselines.

Regularized

Unregularized

Failure comparison

Different collisions in the same scene

Both policies fail in replay, but the paired rollout shows that the regularized agent behaves more cautiously around other agents.

Representative replay failure

This pair illustrates a remaining failure mode under human replay, including cases where the controlled vehicle is contacted from behind.

Citation

BibTeX


@misc{cornelisse2026humanlikeautonomy,
  title         = {Human-like autonomy emerges from self-play and a pinch of human data},
  author        = {Cornelisse, Daphne and Hunt, Julian and Zhang, Zixu and Doulazmi, Wa{\"e}l and Joseph, Kevin and {Fern{\'a}ndez Fisac}, Jaime and Vinitsky, Eugene},
  year          = {2026},
  eprint        = {2606.19370},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2606.19370}
}