Reading Notes - OpenStereo

Some reading notes on the paper titled OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline.

Tags: stereo, reading notes

Author: Ferdinand Schenck

Published: June 5, 2024

These are my notes on reading OpenStereo: A Comprehensive Benchmark for Stereo Matching and Strong Baseline by Xianda Guo, Juntao Lu, Chenming Zhang, Yiqi Wang, Yiqun Duan, Tian Yang, Zheng Zhu, and Long Chen.

It’s currently a pre-print, so the details might change a bit once it is published.

Overview

In the paper they make the (in my experience well-founded) claim that the evaluation methods for stereo matching algorithms are all over the place, making it difficult to do apples-to-apples comparisons and to judge the generalization ability of a given stereo matching method. Due to differences in training regimes and augmentation strategies, as well as inconsistent ablation practices, it can be hard to disentangle whether an improvement is due to architectural changes or to training methodology.

The paper introduces the OpenStereo framework for comparing different techniques, as well as a model architecture called StereoBase. StereoBase is effectively a model assembled from the pieces investigated in the study; at the time of writing it ranks first on the KITTI2015 leaderboard and reportedly achieves a new SOTA on the SceneFlow test set.

OpenStereo

The paper introduces the OpenStereo framework, which consists of three main parts:

  1. A Data module which loads and preprocesses the datasets.
  2. A Modeling module which defines a BaseModel which can be used to construct the specific network architectures.
  3. An Evaluation module which standardizes the evaluation methodology.

Currently six datasets are supported: SceneFlow, KITTI2012, KITTI2015, ETH3D, Middlebury, and DrivingStereo.

With the above tools, a large number of models can be put together and compared against each other.
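As a concrete illustration of what the Evaluation module has to standardize: the two headline metrics in stereo papers, EPE and D1, are simple to state but easy to implement inconsistently (valid-pixel masking, thresholds). A minimal PyTorch sketch of the standard definitions (my own illustration, not OpenStereo’s actual API):

```python
import torch

def epe(disp_pred, disp_gt, mask):
    """End-point error: mean absolute disparity error over valid pixels."""
    return (disp_pred[mask] - disp_gt[mask]).abs().mean()

def d1(disp_pred, disp_gt, mask):
    """KITTI D1: fraction of valid pixels whose error exceeds both
    3 px and 5% of the ground-truth disparity."""
    err = (disp_pred[mask] - disp_gt[mask]).abs()
    bad = (err > 3.0) & (err > 0.05 * disp_gt[mask])
    return bad.float().mean()

# Typical validity mask: pixels with ground truth inside the disparity range.
# mask = (disp_gt > 0) & (disp_gt < 192)
```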

Revisit Deep Stereo Matching

Eleven models are reconstructed using the OpenStereo framework, and the OpenStereo implementations are found to outperform the reference implementations when compared on the SceneFlow dataset. On the KITTI2015 dataset, one metric falls short of the reference implementation.

An ablation study is performed and it is found that:

  • Data augmentation:
    • Most (standard) data augmentation techniques do more harm than good.
    • However, random crop, color augmentation, and random erase deliver significant improvements.
  • Feature extraction:
    • Larger backbones do better.
    • Using a pre-trained backbone leads to very significant improvements.
  • Cost construction:
    • Different cost-volume construction strategies are compared.
    • 4D methods (Height × Width × Disparity × Channels) perform best.
    • More channels correlate with better performance at a greater computational cost, but with diminishing returns.
    • The optimal quality-compute tradeoff seems to be the combined G8-G16 volume (from the Group-wise Correlation Stereo Network, GwcNet); see the sketch after this list.
  • Disparity regression and refinement: see the soft-argmax sketch after this list.
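For reference, the G-volumes mentioned above come from GwcNet-style group-wise correlation: the feature channels are split into groups, and each group contributes one correlation value per pixel per disparity hypothesis, yielding a 4D volume. A minimal sketch of the published technique as I understand it (not OpenStereo’s actual code):

```python
import torch

def groupwise_correlation_volume(feat_l, feat_r, max_disp, num_groups):
    """GwcNet-style 4D cost volume of shape (B, num_groups, max_disp, H, W),
    built from left/right feature maps of shape (B, C, H, W)."""
    b, c, h, w = feat_l.shape
    assert c % num_groups == 0, "channels must split evenly into groups"
    cpg = c // num_groups  # channels per group
    volume = feat_l.new_zeros(b, num_groups, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            prod = feat_l * feat_r
        else:
            # A left pixel at x matches the right pixel at x - d.
            prod = feat_l[:, :, :, d:] * feat_r[:, :, :, :-d]
        # Average the per-pixel products within each channel group.
        corr = prod.view(b, num_groups, cpg, h, prod.shape[-1]).mean(dim=2)
        volume[:, :, d, :, d:] = corr
    return volume
```

I read G8-G16 as combining volumes built with 8 and 16 groups; either way, the group count is what trades correlation expressiveness against memory and compute.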

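On disparity regression: most of the compared models turn the aggregated cost volume into a disparity map with the differentiable soft-argmax (soft-argmin over matching costs) introduced by GC-Net, which yields sub-pixel estimates. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def soft_argmax_disparity(similarity_volume):
    """Differentiable disparity regression from a volume of shape
    (B, max_disp, H, W), where larger values mean a better match.

    Softmax over the disparity axis gives a probability per hypothesis;
    the expected disparity index is the sub-pixel prediction."""
    b, max_disp, h, w = similarity_volume.shape
    prob = F.softmax(similarity_volume, dim=1)
    disps = torch.arange(max_disp, device=prob.device, dtype=prob.dtype)
    return (prob * disps.view(1, max_disp, 1, 1)).sum(dim=1)
```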
A Strong Pipeline: StereoBase

By combining the learnings from the above ablation study, a new baseline architecture called StereoBase was created. It achieves SOTA results on KITTI2012, KITTI2015, and SceneFlow, with competitive results on DrivingStereo.

When trained on only SceneFlow (a synthetic dataset), it shows strong generalization to the KITTI2012, KITTI2015, Middlebury, and ETH3D datasets, beating the other implemented architectures.

Interestingly, the architecture they come up with is very similar to IGEV-Stereo.

Questions

  • How will stereo methods that do not conform to the ED-Conv2D or CVM-Conv3D categories be accommodated in OpenStereo?
  • I think some of the augmentations might need to be re-thought:
    • As the authors mention, pixel alignment must be preserved.
    • When flipping horizontally, you can’t just naively flip the target; it also needs to be inverted, which is a non-trivial operation (see the sketch after this list).
  • Does it make sense to try even larger pre-trained backbones?
    • The authors only test four different backbones.
    • The largest is MobilenetV2 120d at 5.21M parameters.
    • Does going even bigger do better?
  • What are the key differences between StereoBase and IGEV-Stereo? Just a larger pre-trained backbone?
  • Given this framework, could a neural architecture search be performed over an even larger set of backbones and components to find an even better architecture?
  • Does it make sense to also create a public leaderboard?
    • Might be hard with the KITTI datasets, which have their own leaderboard.
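On the flipping point above: a horizontal flip mirrors the epipolar geometry, so the two views swap roles, and the new left-view disparity target is the flipped right-view disparity map. A sketch of one correct recipe, assuming rectified pairs and that a right-view ground-truth disparity is available (many datasets only ship left-view ground truth, which is part of what makes this non-trivial):

```python
import torch

def hflip_stereo(left, right, disp_right):
    """Horizontally flip a rectified stereo pair without breaking geometry.

    Flipping both images swaps the camera roles: the flipped right image
    becomes the new left view (and vice versa), and the flipped right-view
    disparity becomes the new left-view target. Disparity magnitudes are
    unchanged.

    left, right: (C, H, W) images; disp_right: (H, W) right-view disparity.
    """
    new_left = torch.flip(right, dims=[-1])
    new_right = torch.flip(left, dims=[-1])
    new_disp_left = torch.flip(disp_right, dims=[-1])
    return new_left, new_right, new_disp_left
```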

Key Takeaways

The authors do a great job of categorizing recent developments in deep stereo matching and putting them on an equal footing.

The OpenStereo framework is ambitious, and should really help ensure that future developments are rigorously tested and compared to what came before.

The baseline meta-architecture they put together (StereoBase) shows that a lot can be achieved just by doing a comprehensive ablation study, and it should serve as a solid benchmark for future studies.