Self-Supervised 3D Representations for 3D Diffusion Policy

UC San Diego, ECE 285: Deep Generative Models · Spring Quarter 2026



Overview

3D Diffusion Policy (DP3) achieves strong visuomotor manipulation performance using a deliberately simple point cloud encoder: a per-point MLP with global max pooling, trained end-to-end from demonstrations. Prior work shows that richer supervised 3D encoders (PointNet++, PointNeXt) consistently hurt DP3 performance in the low-data regime. This project investigates whether a self-supervised 3D encoder can do better.
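As a rough illustration of what "a per-point MLP with global max pooling" means, the sketch below applies one shared MLP to every point and then takes the maximum over the point axis. It is written in NumPy with random weights for clarity; the layer widths `[3, 64, 128, 64]` and the helper names `init_mlp` and `encode` are assumptions made for this example, not the DP3 implementation.

```python
import numpy as np


def init_mlp(sizes, seed=0):
    """Random weights for a small MLP. The sizes are illustrative, not DP3's."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def encode(points, params):
    """Apply the same MLP to every point, then max-pool over the point axis.

    points: (N, 3) array of xyz coordinates.
    Returns a (d_out,) global feature vector.
    """
    h = points
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU between layers
    return h.max(axis=0)  # global max pooling over the N points


params = init_mlp([3, 64, 128, 64])  # assumed widths; output dimension 64
cloud = np.random.default_rng(1).standard_normal((512, 3))  # stand-in for a 512-point cloud
feat = encode(cloud, params)

# Shuffling the points leaves the pooled feature unchanged (up to float error).
shuffled = cloud[np.random.default_rng(2).permutation(512)]
```

Because the only interaction between points is the max, the encoder is invariant to point ordering and has no notion of local neighborhoods, which is the simplicity the paragraph above refers to.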

The contribution is purely an integration experiment. Point-JEPA is a joint-embedding predictive architecture pretrained on ShapeNet-55; it predicts abstract latent representations of masked spatial regions, which encourages it to capture global geometric structure. We take the pretrained Point-JEPA checkpoint as-is, freeze it, and drop it into the DP3 pipeline as an encoder replacement. The only additions are a new PointJEPAExtractor module, a Hydra config, and Python 3.8 compatibility patches; neither the Point-JEPA nor the DP3 codebase was otherwise modified. We evaluate the frozen encoder at three projection dimensionalities (64, 128, 256) on three Adroit dexterous manipulation tasks with 10 expert demonstrations each.
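The frozen-encoder pattern described above can be sketched in PyTorch as follows. This is not the project's PointJEPAExtractor: the class name `FrozenEncoderExtractor`, the `ToyBackbone` placeholder (which stands in for the real Point-JEPA network), the feature width of 384, and the choice of a single trainable linear projection are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FrozenEncoderExtractor(nn.Module):
    """Hypothetical wrapper: frozen backbone followed by a projection to d_out."""

    def __init__(self, backbone: nn.Module, feat_dim: int, d_out: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.backbone.eval()
        # Assumption: the projection is the only part that receives gradients.
        self.proj = nn.Linear(feat_dim, d_out)

    def train(self, mode: bool = True):
        super().train(mode)
        self.backbone.eval()  # keep the frozen backbone in eval mode
        return self

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(points)  # (B, feat_dim)
        return self.proj(feats)  # (B, d_out)


class ToyBackbone(nn.Module):
    """Placeholder backbone (NOT Point-JEPA): per-point linear layer + max pool."""

    def __init__(self, feat_dim: int = 384):
        super().__init__()
        self.fc = nn.Linear(3, feat_dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # (B, N, 3)
        return self.fc(points).max(dim=1).values


# d_out would be 64, 128, or 256 for the three variants evaluated here.
extractor = FrozenEncoderExtractor(ToyBackbone(384), feat_dim=384, d_out=64)
out = extractor(torch.randn(2, 512, 3))  # batch of two 512-point clouds
```

Overriding `train()` is one way to stop a later `model.train()` call from switching normalization or dropout layers inside the frozen backbone back to training mode.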


Tasks

Figure: RGB views and point cloud views of the three tasks (Hammer, Door, Pen).

The policy observes only the 512-point xyz point cloud and 24-dim proprioceptive state, not the RGB image. The point clouds capture hand and object geometry but lack the visual detail of RGB.


Policy Rollouts

DP3 Baseline

Hammer (score 0.95)

Door (score 0.65)

Pen (score 0.80)

Point-JEPA (best variant per task)

Hammer, JEPA-128 (score 0.80)

Door, JEPA-64 (score 0.70)

Pen, JEPA-256 (score 0.40)


Results

Model            d_out   Hammer   Pen     Door
DP3 (baseline)   64      0.950    0.800   0.650
JEPA-64          64      0.450    0.250   0.700
JEPA-128         128     0.800    0.375   0.600
JEPA-256         256     0.500    0.400   0.650

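Peak test_mean_score (fraction of 20 rollouts succeeding) across 3000 training epochs. JEPA-64 on door is the only case, across all tasks and variants, that strictly exceeds the baseline's peak (0.700 vs. 0.650, +5 percentage points). On door, JEPA-256 ties the baseline and JEPA-128 trails it; on hammer and pen, every JEPA variant falls short of the baseline.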
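The comparison against the baseline can be reproduced from the table with a few lines of Python. The numbers are copied from the table above; the dictionary layout and the helper name `deltas_vs_baseline` are only illustrative.

```python
# Peak test_mean_score per model and task, copied from the results table.
peaks = {
    "DP3 (baseline)": {"hammer": 0.950, "pen": 0.800, "door": 0.650},
    "JEPA-64": {"hammer": 0.450, "pen": 0.250, "door": 0.700},
    "JEPA-128": {"hammer": 0.800, "pen": 0.375, "door": 0.600},
    "JEPA-256": {"hammer": 0.500, "pen": 0.400, "door": 0.650},
}


def deltas_vs_baseline(peaks, baseline_name="DP3 (baseline)"):
    """Return peak-score differences (model minus baseline) per task."""
    base = peaks[baseline_name]
    return {
        model: {task: round(score - base[task], 3) for task, score in scores.items()}
        for model, scores in peaks.items()
        if model != baseline_name
    }


deltas = deltas_vs_baseline(peaks)
# (model, task) pairs whose peak is strictly above the baseline's peak.
wins = [(m, t) for m, d in deltas.items() for t, v in d.items() if v > 0]

for model, d in deltas.items():
    print(model, d)
print("strictly above baseline:", wins)
```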


Training Dynamics

Hammer

Figures: test success rate over training; diffusion training loss (log scale).

All JEPA variants converge to near-zero diffusion loss within 20-30 epochs, versus hundreds of epochs for the baseline, yet the baseline achieves task success far earlier. JEPA-128 eventually reaches 0.80, still short of the baseline's 0.95.

Door

Figures: test success rate over training; diffusion training loss (log scale).

JEPA-64 eventually reaches 0.70 at epoch 2200, surpassing the baseline's peak of 0.65. All four runs converge to nearly identical diffusion loss, so loss is an unreliable proxy for task performance here.

Pen

Figures: test success rate over training; diffusion training loss (log scale).

All JEPA variants substantially underperform (peak 0.25-0.40 vs. 0.80 for the baseline). The diffusion loss curves are identical while the success rates differ sharply, which again indicates that loss is not a reliable indicator of policy quality when encoder alignment is the bottleneck.


Why It Works on Door but Not Pen

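Point-JEPA's authors explicitly note its emphasis on global over local features as a limitation. Our results map onto this: relative to the baseline, the JEPA variants do best on door (0.70 vs. 0.65), fall moderately short on hammer (0.80 vs. 0.95), and fall furthest behind on pen (0.40 vs. 0.80), an ordering that tracks how much each task depends on local rather than global geometric reasoning.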

The ShapeNet domain gap is also a factor: Point-JEPA was pretrained on clean, normalized CAD objects, while Adroit point clouds are in robot workspace coordinates and include the robot hand alongside the object.
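To make the coordinate-frame part of this gap concrete, the sketch below shows one common normalization convention for object point clouds: centering at the centroid and scaling to fit the unit sphere. Whether Point-JEPA's pretraining used exactly this convention, and whether any such normalization was applied to the Adroit observations in these experiments, is not stated here; the function name and the sample coordinates are hypothetical.

```python
import math


def normalize_unit_sphere(points):
    """Center a point cloud at its centroid and scale it to fit the unit sphere."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered  # degenerate cloud: all points coincide
    return [tuple(c / radius for c in p) for p in centered]


# Hypothetical workspace-frame points, offset from the origin.
workspace_cloud = [
    (0.40, -0.10, 0.25),
    (0.45, -0.05, 0.30),
    (0.50, 0.00, 0.20),
    (0.42, 0.05, 0.35),
]
normalized = normalize_unit_sphere(workspace_cloud)
```

Note that normalization of this kind addresses only scale and offset; it does not remove the robot hand from the cloud, which is the other half of the gap described above.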

