Temporally Consistent Online Depth Estimation in Dynamic Scenes
WACV 2023

We propose Consistent Online Dynamic Depth (CODD), a framework for temporally consistent depth estimation in dynamic environments (camera motion, object motion, and object deformation) and online settings (no future information, real-time inference). Temporally consistent depth estimation is crucial for 3D visualization in mixed reality applications; without it, jitter in depth significantly degrades perceptual quality. Yet temporal consistency is largely overlooked by existing methods. CODD addresses this notoriously difficult problem by integrating a novel motion network and a fusion network with per-frame depth estimation solutions. CODD significantly improves temporal consistency while producing accurate estimates, for both the static background and dynamic foreground objects.

Overview Video


Move your mouse over the results to compare them. Switch back and forth quickly for a better comparison.

Indoor scene

Outdoor scene

Medical scene


CODD builds on top of a per-frame stereo depth estimation network with two additions:

  • Motion network -- predicts per-pixel SE3 motion to align past estimates to the current frame,
  • Fusion network -- fuses the aligned estimates for temporal consistency.
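The online loop implied by these two additions can be sketched as follows. This is a minimal illustration only; the function names and interfaces are assumptions for exposition, not the authors' API:

```python
import numpy as np

def codd_step(left_img, right_img, prev_state, stereo_net, motion_net, fusion_net):
    """One online step of a CODD-style pipeline (illustrative sketch).

    `stereo_net`, `motion_net`, and `fusion_net` are placeholder callables,
    not the actual networks from the paper.
    """
    # 1. Per-frame stereo depth estimate for the current frame.
    depth_cur, feat_cur = stereo_net(left_img, right_img)
    if prev_state is None:
        # First frame: nothing to align or fuse yet.
        return depth_cur, (depth_cur, feat_cur)
    depth_prev, feat_prev = prev_state
    # 2. Motion network: per-pixel SE3 motion aligns the past estimate to now.
    depth_aligned = motion_net(depth_prev, feat_prev, depth_cur, feat_cur)
    # 3. Fusion network: blend the aligned past with the current estimate.
    depth_fused = fusion_net(depth_aligned, depth_cur, feat_cur)
    return depth_fused, (depth_fused, feat_cur)
```

Because only the previous state is carried forward, the loop needs no future frames, matching the online setting.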

Motion network

The motion network takes depth and features from the previous and current time points and regresses per-pixel SE3 motion. The SE3 motion is regularized by a piecewise-consistency assumption to account for dynamics such as camera motion, object motion, and object deformation. After recovering the SE3 motion, the motion network aligns the past observations to the current frame via differentiable rendering.
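The geometric step can be sketched in a few lines: back-project each pixel with its depth, apply the per-pixel rigid motion, and reproject into the current frame. Here nearest-neighbor splatting stands in for the differentiable rendering used in the paper, and the intrinsics `K` are assumed to be a standard pinhole matrix:

```python
import numpy as np

def align_with_se3(depth_prev, R, t, K):
    """Forward-warp a previous depth map into the current frame using
    per-pixel SE3 motion (R: HxWx3x3 rotations, t: HxWx3 translations).

    Simplified sketch: nearest-neighbor splatting, no occlusion handling.
    """
    H, W = depth_prev.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.mgrid[0:H, 0:W]
    # Back-project each pixel to a 3D point in the previous frame.
    X = (u - cx) / fx * depth_prev
    Y = (v - cy) / fy * depth_prev
    P = np.stack([X, Y, depth_prev], axis=-1)              # H x W x 3
    # Apply the per-pixel rigid motion: P' = R @ P + t.
    P2 = np.einsum('hwij,hwj->hwi', R, P) + t
    # Reproject into the current frame and splat the transformed depths.
    z = P2[..., 2]
    u2 = np.round(P2[..., 0] / z * fx + cx).astype(int)
    v2 = np.round(P2[..., 1] / z * fy + cy).astype(int)
    aligned = np.full((H, W), np.nan)                      # NaN = no observation
    valid = (z > 0) & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    aligned[v2[valid], u2[valid]] = z[valid]
    return aligned
```

With identity rotations and zero translations, every pixel maps back to itself and the aligned map equals the input, which is a useful sanity check.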

Fusion network

The fusion network regresses per-pixel weights between the aligned observations. A collection of explicit cues, including self-similarity and cross-similarity, guides the weight regression (see the paper for details). The temporally consistent depth is generated by fusing the aligned estimates.
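The fusion step amounts to a per-pixel convex combination of the aligned past depth and the current estimate. The sketch below replaces the learned weighting with a single hand-crafted cross-similarity cue (the `sigma` scale is an assumption, not a value from the paper): the closer the two depths agree, the more the past estimate is trusted.

```python
import numpy as np

def fuse_depths(depth_aligned, depth_cur, sigma=0.1):
    """Per-pixel weighted fusion of aligned past depth and current depth.

    Illustrative stand-in for the learned weight regression: the weight
    for the past estimate decays with the cross-frame depth difference.
    """
    diff = np.abs(depth_aligned - depth_cur)
    w = 0.5 * np.exp(-diff / sigma)                 # weight for the past estimate
    w = np.where(np.isnan(depth_aligned), 0.0, w)   # no history -> use current only
    fused = w * np.nan_to_num(depth_aligned) + (1.0 - w) * depth_cur
    return fused
```

Where past and current estimates agree, the blend suppresses frame-to-frame jitter; where they disagree (e.g. at disocclusions), the weight collapses toward the current estimate, so accuracy is not sacrificed for consistency.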


@article{li2023temporally,
     title={Temporally Consistent Online Depth Estimation in Dynamic Scenes},
     author={Li, Zhaoshuo and Ye, Wei and Wang, Dilin and Creighton, Francis X and Taylor, Russell H and Venkatesh, Ganesh and Unberath, Mathias},
     journal={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
     year={2023}
}



We thank Johannes Kopf and Michael Zollhoefer for helpful discussions, and Vladimir Tankovich for the HITNet implementation. We thank the anonymous reviewers and meta-reviewers for their feedback on the paper. This work was done during an internship at Reality Labs, Meta Inc., and funded in part by NIDCD K08 Grant DC019708.