Temporally Consistent Online Depth Estimation in Dynamic Scenes
WACV 2023
- Zhaoshuo Li¹
- Wei Ye²
- Dilin Wang²
- Francis X. Creighton¹
- Russell H. Taylor¹
- Ganesh Venkatesh²
- Mathias Unberath¹
- ¹ Johns Hopkins University
- ² Reality Labs, Meta Inc.
We propose Consistent Online Dynamic Depth (CODD), a framework for temporally consistent depth estimation in dynamic environments (camera motion, object motion, and object deformation) and online settings (no future information and real-time inference). Temporally consistent depth estimation is crucial for 3D visualization in mixed reality applications; without it, jitter in the depth estimates significantly degrades perceptual quality. Despite this, temporal consistency has been largely overlooked. CODD addresses this notoriously difficult problem by integrating a novel motion network and a fusion network with per-frame depth estimation solutions. CODD significantly improves the temporal consistency of depth while producing accurate estimates, for both the static background and dynamic foreground objects.
Overview Video
Demo
Indoor scene
Outdoor scene
Medical scene
Methodology
                    
CODD builds on top of a per-frame stereo depth estimation network with two additions:
                    
- Motion network -- predicts per-pixel SE3 motion to align past estimates to the current frame,
- Fusion network -- fuses the aligned estimates for temporal consistency.
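The two additions above can be read as an online loop around the per-frame estimator. The following is only a minimal sketch of that control flow, not the released implementation: `run_online`, `estimate_depth`, `align`, and `fuse` are hypothetical names, and the three callables stand in for the stereo network, the motion network, and the fusion network respectively.

```python
from typing import Any, Callable, Iterable, Iterator


def run_online(
    frames: Iterable[Any],
    estimate_depth: Callable[[Any], Any],
    align: Callable[[Any, Any], Any],
    fuse: Callable[[Any, Any], Any],
) -> Iterator[Any]:
    """Yield one depth estimate per frame, using only past information.

    estimate_depth(frame)       -> per-frame depth (stand-in for the stereo network)
    align(past_depth, depth)    -> past depth aligned to the current frame
                                   (stand-in for the motion network)
    fuse(depth, aligned_past)   -> temporally fused depth
                                   (stand-in for the fusion network)
    """
    memory = None  # fused estimate from the previous time point
    for frame in frames:
        depth = estimate_depth(frame)
        if memory is not None:
            aligned_past = align(memory, depth)
            depth = fuse(depth, aligned_past)
        memory = depth  # only the latest estimate is kept; no future frames are used
        yield depth
```

Because the loop is a generator that keeps a single past estimate, it reflects the online setting described above: each output depends only on the current frame and what came before it.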
Motion network
The motion network uses the depth and features from the previous and current time points to regress per-pixel SE3 motion. The SE3 motion is regularized by a piece-wise consistent assumption to account for dynamics such as camera motion, object motion, and object deformation. After recovering the SE3 motion, the motion network aligns the past observations to the current frame using differentiable rendering.
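To make "per-pixel SE3 motion" concrete, the sketch below applies a separate rigid transform to every pixel of a depth map. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and it assumes the motion is given as a rotation matrix and translation vector per pixel; all function names are hypothetical. The motion regression itself and the differentiable rendering step are not shown.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H, W, 3) camera-space points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def apply_per_pixel_se3(points, rotations, translations):
    """Apply one rigid transform per pixel.

    points:       (H, W, 3) 3D points at the previous time point
    rotations:    (H, W, 3, 3) per-pixel rotation matrices
    translations: (H, W, 3) per-pixel translation vectors
    """
    return np.einsum("hwij,hwj->hwi", rotations, points) + translations


def project(points, fx, fy, cx, cy):
    """Project (H, W, 3) points to pixel coordinates and depth in the current frame."""
    z = points[..., 2]
    u = fx * points[..., 0] / z + cx
    v = fy * points[..., 1] / z + cy
    return u, v, z
```

With identity rotations and zero translations, `project` returns the original pixel grid and depth. A non-zero motion moves each point to a new pixel location with a new depth, which is why a rendering step is needed to resample the transformed points onto the current image grid.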
Fusion network
The fusion network regresses per-pixel weights between the aligned observations. A collection of explicit cues, including self-similarity and cross-similarity, is extracted to guide the weight regression (see the paper for details). The temporally consistent depth is generated by fusing the aligned estimates with these weights.
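As an illustration of the final fusion step only, the sketch below blends the current estimate with the aligned past estimate using a per-pixel weight. It assumes the fusion is a convex combination of two depth maps and that the weight map is already given; in CODD the weights are regressed by the fusion network, which is not reproduced here. The function name `fuse_depth` and the `valid` mask are hypothetical.

```python
import numpy as np


def fuse_depth(current, aligned_past, weight, valid=None):
    """Blend two (H, W) depth maps with a per-pixel weight on the past estimate.

    weight: (H, W) values, clipped to [0, 1]; 1 trusts the aligned past estimate
            fully, 0 keeps the current per-frame estimate.
    valid:  optional (H, W) boolean mask; pixels without an aligned past
            observation fall back to the current estimate.
    """
    weight = np.clip(weight, 0.0, 1.0)
    if valid is not None:
        weight = np.where(valid, weight, 0.0)
    # Replace unused past values so NaNs in invalid pixels cannot leak into the result.
    past = np.where(weight > 0, aligned_past, current)
    return weight * past + (1.0 - weight) * current
```

The fallback to the current estimate matters for pixels that were occluded or out of view at the previous time point, where no aligned observation exists to fuse.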
Citation
title={Temporally Consistent Online Depth Estimation in Dynamic Scenes},
author={Li, Zhaoshuo and Ye, Wei and Wang, Dilin and Creighton, Francis X and Taylor, Russell H and Venkatesh, Ganesh and Unberath, Mathias},
journal={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2023}
}
Poster
Acknowledgements
We thank Johannes Kopf and Michael Zollhoefer for helpful discussions, and Vladimir Tankovich for the HITNet implementation. We thank the anonymous reviewers and meta-reviewers for their feedback on the paper. This work was done during an internship at Reality Labs, Meta Inc., and was funded in part by NIDCD K08 Grant DC019708.