Ellipse: Evidential Learning for Robust Waypoints and Uncertainties

Anonymous Authors
Anonymous Affiliations
Under Review for IROS 2026

Abstract

Robust waypoint prediction is crucial for mobile robots operating in open-world, safety-critical settings. While Imitation Learning (IL) methods have demonstrated great success in practice, they are susceptible to distribution shift: the policy can become dangerously overconfident in unfamiliar states. In this paper, we present ELLIPSE, a method building on multivariate deep evidential regression that outputs waypoints and multivariate Student-$t$ predictive distributions in a single forward pass. To reduce covariate-shift-induced overconfidence under viewpoint and pose perturbations near expert trajectories, we introduce a lightweight domain augmentation procedure that synthesizes plausible viewpoint/pose variations without collecting additional demonstrations. To improve uncertainty reliability under environment/domain shift (e.g., unseen staircases), we apply post-hoc isotonic recalibration on probability integral transform (PIT) values so that prediction sets remain plausible during deployment. We ground the discussion and experiments in staircase waypoint prediction, where obtaining robust waypoints and uncertainties is pivotal. Extensive real-world evaluations show that ELLIPSE improves both task success rate and uncertainty coverage compared to baselines.

Introduction

Trajectory or waypoint planning in open-world environments is a crucial capability for mobile robots, particularly in safety-critical domains such as construction, defense, and autonomous driving. Recent imitation learning (IL) approaches have demonstrated strong performance in predicting waypoint sequences from expert demonstrations. However, learned waypoint predictors come with limited safety guarantees, and can lead to catastrophic failures when deployed under distribution shift. Uncertainty quantification (UQ) offers a principled mechanism to mitigate this risk by enabling a policy to recognize unreliable predictions and trigger conservative fallbacks (e.g., stopping and requesting expert assistance). In an ideal setting, higher uncertainty correlates with larger errors. In robotics, however, limited demonstration data makes uncertainty estimates vulnerable to covariate shift: the distribution of observations encountered during deployment can differ substantially from the training distribution, causing the model to remain overconfident when the prediction is wrong.

As shown above, stair navigation is a crucial capability for robots to safely explore multi-floor structures (e.g., construction sites), and it is a canonical scenario where accurate uncertainty estimation matters. First, staircase geometry---narrow passages, turns at landings, and elevation changes---restricts visibility and induces partial observability; this motivates our choice of a LiDAR-based method, owing to LiDAR's wider field of view. Second, the margin for error is small: slight waypoint deviations can lead to severe consequences. Finally, cascading errors can easily drive the robot into viewpoints or poses off the (sparse) demonstration manifold, where the policy is wrong yet confident. Beyond this learner-induced shift, deployment in novel staircases (e.g., different step geometry, materials, and sensing conditions) further induces environment/domain shift that degrades both waypoint and uncertainty reliability.

To improve the robustness of the predicted waypoints and the coverage of their associated distributions, we propose Ellipse, a point-cloud-based model for predicting uncertainty-aware waypoint sequences from expert demonstrations:

  • Our backbone is multivariate deep evidential regression, which produces both waypoints and multivariate Student-$t$ predictive distributions in a single forward pass.
  • To mitigate covariate-shift-induced overconfidence when the robot deviates from the demonstration manifold, we augment the training data by synthesizing plausible viewpoint and pose perturbations around each expert trajectory.
  • Furthermore, we apply a lightweight post-hoc recalibration: we fit an isotonic regression map on probability integral transform (PIT) values so that the resulting prediction set sizes more faithfully adapt to the residual/error magnitudes during deployment.
  • Finally, we integrate the predicted uncertainty into an MPPI planner, which relaxes constraints on uncertain waypoints and encourages the plans to stay close to (past) confident waypoints, thus mitigating the impact of occasional poor predictions.
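Concretely, one common way to realize the evidential backbone in the first bullet is a network head that emits Normal-Inverse-Wishart (NIW) hyperparameters and converts them into the multivariate Student-$t$ predictive in closed form. The sketch below assumes this standard NIW parameterization; the names (`log_lam`, `log_nu_raw`, `L_raw`) and activation choices are illustrative assumptions, not the paper's code:

```python
import numpy as np

def niw_to_student_t(mu, log_lam, log_nu_raw, L_raw, d=2):
    """Map raw head outputs to a multivariate Student-t predictive.

    Assumed (hypothetical) parameterization: the head emits the NIW mean
    `mu`, a log precision-scale `log_lam` (so lambda > 0), `log_nu_raw`
    (so nu > d + 1, keeping the predictive well-defined), and a raw
    lower-triangular factor `L_raw` of the scale matrix Psi.
    """
    lam = np.exp(log_lam)                 # lambda > 0
    nu = np.exp(log_nu_raw) + d + 1.0     # nu > d + 1
    L = np.tril(L_raw, k=-1)              # strictly lower part as-is
    L[np.diag_indices(d)] = np.exp(np.diag(L_raw))  # positive diagonal
    Psi = L @ L.T                         # SPD scale matrix
    # Standard NIW posterior-predictive: Student-t with these parameters
    dof = nu - d + 1.0
    scale = Psi * (lam + 1.0) / (lam * dof)
    return mu, dof, scale
```

The scale matrix of the resulting Student-$t$ grows when the evidence parameters $\lambda$ and $\nu$ are small, which is what lets a single forward pass report both a waypoint (the mean) and its uncertainty.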

Hardware Experiments

ELLIPSE takes as input a deskewed LiDAR point cloud, which is further gravity-aligned, clipped to the axis-aligned box spanning $(-10,-10,-4)$ to $(10,10,4)$ meters, and randomly subsampled to 20{,}000 points. For training, we collect demonstrations on 25 diverse staircases, of which 21 are used for training and 4 for testing. We refer to the four test staircases as EES (7 floors, right-turning), RWS (10 floors, left-turning), RES (10 floors, left-turning), and CLF (7 floors, right-turning). The ground-truth targets are $T=5$ equally spaced waypoints with stride $d=0.5$ m. Each training instance is augmented into 8 additional poses, and the safety margin is set to $\boldsymbol{\epsilon}=[\Delta_x,\Delta_y,\Delta_z,\Delta_{roll},\Delta_{pitch},\Delta_{yaw}]=[0,0.2,0.05,10,10,30]$, with translational components in meters and rotational components in degrees. We do not perturb along the robot $x$-axis, since this corresponds to perturbations along the demonstration trajectory and can introduce inconsistent point clouds that degrade training. The model is trained for 50 epochs using a one-cycle scheduler with cosine annealing. On our edge compute platform, the resulting model runs in real time at over 10 Hz.
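The augmentation step can be sketched as sampling a pose perturbation inside the margin $\boldsymbol{\epsilon}$ and re-expressing both the point cloud and the waypoint targets in the perturbed robot frame, keeping the pair geometrically consistent. The function names and the exact frame convention below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

# [dx, dy, dz] in meters, [roll, pitch, yaw] margins in degrees (dx = 0:
# no perturbation along the demonstration trajectory)
EPS = np.array([0.0, 0.2, 0.05, 10.0, 10.0, 30.0])

def _rpy_to_R(roll, pitch, yaw):
    """Rotation matrix from roll-pitch-yaw (Z-Y-X convention, radians)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def augment(cloud, waypoints, rng):
    """Re-express one training instance in a randomly perturbed frame.

    cloud: (N, 3) gravity-aligned points; waypoints: (T, 3) targets.
    """
    d = rng.uniform(-EPS, EPS)
    t = d[:3]
    R = _rpy_to_R(*np.deg2rad(d[3:]))
    # A point p seen from the perturbed pose: p' = R^T (p - t),
    # written row-wise as (p - t) @ R
    return (cloud - t) @ R, (waypoints - t) @ R
```

Because points and waypoints are transformed by the same rigid motion, the augmented labels remain valid; only the apparent viewpoint changes.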

Reliability of Nominal Waypoints and Effectiveness of Domain Augmentation

We first evaluate whether the nominal waypoints are reliable when tracked directly by MPPI, and whether domain augmentation improves the task success rate. ELLIPSE significantly outperforms all baselines we considered. BEVFusion fails frequently despite using three additional onboard RGB-D cameras; we hypothesize this is largely due to its inference latency. ELLIPSE-Uni performs worse than ELLIPSE overall because it predicts the waypoint $x$ and $y$ coordinates independently, limiting the model's ability to capture their correlations. Most importantly, domain augmentation substantially improves both ELLIPSE and ELLIPSE-Uni.

The above figures show timelapses and trajectories of the Spot traversing CLF using different variants of ELLIPSE. Without domain augmentation (left 1, 2), both variants crash into handrails due to compounding error. With domain augmentation (right 1, 2), both variants complete the run without help and stay closer to the stair center.

Empirical Coverage of the Predictive Distributions

To further evaluate the uncertainty predicted by ELLIPSE, we compare the empirical coverage at a 90% target coverage level and the sharpness of the prediction sets (in m$^2$). Ideally, the empirical coverage should be close to 90% while keeping the prediction sets as small as possible. Specifically, we further split the 4 test sequences into calibration (EES & RWS) and deployment (RES & CLF). The variants of ELLIPSE are calibrated on either the augmented dataset or clean expert demonstrations, and are tested on both Adversarial (teleoperation with aggressive turning and zig-zagging) and Deployment (recorded data from running ELLIPSE and/or ELLIPSE-no-Aug on RES & CLF) sequences. ELLIPSE (trained and calibrated on the augmented dataset) achieves strong empirical coverage with compact prediction sets. For ELLIPSE-no-Aug, calibrating on augmented data substantially improves coverage, but at the cost of much larger prediction sets. Although ELLIPSE-no-Aug-MVP yields coverage closest to the 90% target, it relies on privileged online conformity feedback.
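The calibration step and the coverage metric above can be sketched as follows. Because the empirical CDF of the calibration PIT values is already nondecreasing, the isotonic-regression fit reduces to a monotone piecewise-linear map; this numpy-only sketch (function names are ours, not the paper's) illustrates both pieces:

```python
import numpy as np

def fit_pit_recalibrator(pit_cal):
    """Monotone map from nominal PIT values to empirical frequencies.

    pit_cal: PIT values u_i = F_i(y_i) on the calibration split. A
    perfectly calibrated model yields Uniform(0, 1) PITs; the map below
    corrects systematic deviations. Since the empirical CDF is already
    nondecreasing, the isotonic fit coincides with this interpolation.
    """
    u = np.sort(np.asarray(pit_cal, dtype=float))
    ecdf = np.arange(1, len(u) + 1) / len(u)
    return lambda q: np.interp(q, u, ecdf, left=0.0, right=1.0)

def empirical_coverage(pit_test, level=0.9):
    """Fraction of test PITs inside the central interval at `level`."""
    lo, hi = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    u = np.asarray(pit_test, dtype=float)
    return float(np.mean((u >= lo) & (u <= hi)))
```

At deployment, a target coverage level is first pushed through the (inverse of the) fitted map before being converted into a prediction set, so that set sizes track the calibration-split error statistics.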

Qualitative Comparison of MPPI Variants

While the proposed Mahalanobis-distance-based MPPI with history $\tau = 5$ keeps the path close to confident waypoints (blue ellipses), other variants can exhibit aggressive behavior (e.g., turning too close to handrails) due to sudden bad predictions or uncertain waypoints (red ellipses).
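A minimal sketch of such a Mahalanobis-distance waypoint cost: each rollout position is penalized by its squared Mahalanobis distance under the predicted scale matrix, so uncertain waypoints (large covariance) act as soft constraints, while a history term pulls plans toward recent confident waypoints. The cost form, names, and the history penalty are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def waypoint_tracking_cost(rollout_xy, wp_means, wp_covs, conf_history,
                           w_hist=1.0):
    """Mahalanobis waypoint cost for one MPPI rollout (a sketch).

    rollout_xy: (T, 2) rollout positions aligned with T predicted
    waypoints. wp_means / wp_covs: (T, 2) means and (T, 2, 2) scale
    matrices from the predictive distributions. conf_history: (H, 2)
    recent high-confidence waypoints (H = tau = 5 in the paper).
    """
    cost = 0.0
    for x, mu, S in zip(rollout_xy, wp_means, wp_covs):
        r = x - mu
        # Squared Mahalanobis distance: large S (uncertain waypoint)
        # automatically down-weights the tracking error
        cost += float(r @ np.linalg.solve(S, r))
    if len(conf_history) > 0:
        # Keep the start of the plan near the closest recent confident
        # waypoint, damping the effect of a sudden bad prediction
        d = np.linalg.norm(rollout_xy[:1] - conf_history, axis=1)
        cost += w_hist * float(d.min())
    return cost
```

The covariance weighting is what realizes the "relaxed constraints on uncertain waypoints" behavior: a waypoint with a wide predicted ellipse contributes little cost even at a moderate offset.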