The recently released research paper "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?" presents a comprehensive study of visual representations for embodied AI.
The authors curated CortexBench, which includes 17 tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation across a range of environments, agents, and learning conditions.
Existing pre-trained visual representations (PVRs) were evaluated on CortexBench; none was universally dominant, and an artificial visual cortex does not yet exist.
The authors combined over 4,000 hours of egocentric video from 7 different sources with ImageNet, and trained vision transformers of different sizes using Masked Auto-Encoding (MAE) on slices of this data.
Contrary to inferences from prior work, scaling dataset size and diversity does not improve performance universally, although it does so on average.
The team's strongest model, VC-1 (adapted), was competitive with or outperformed the best prior results on all benchmarks in CortexBench.
This is the largest and most comprehensive empirical study to date of visual representations for embodied AI, requiring over 10,000 GPU-hours of training and evaluation.
VC-1 is open-sourced: https://github.com/facebookresearch/eai-vc/.
Relevant links:
Website: https://eai-vc.github.io/
Paper: https://ai.facebook.com/research/publications/where-are-we-i...
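
As a rough illustration of the Masked Auto-Encoding (MAE) objective mentioned in the summary above, here is a minimal sketch in plain Python. It is not the authors' training code. The function names `random_patch_mask` and `masked_mse` are hypothetical, and the 75% mask ratio and 196-patch grid (a 224x224 image cut into 16x16 patches) are assumptions taken from the common MAE setup, not figures stated in this summary. The idea is that a random subset of image patches is hidden, the encoder sees only the visible patches, and the reconstruction loss is computed on the hidden ones.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into (visible, masked) lists.

    mask_ratio=0.75 is an assumed default, not a value from the summary.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random permutation of patch positions
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error computed over the masked patches only.

    target and prediction are lists of patches, each a list of floats.
    """
    if not masked:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # 196 patches: an assumed 14x14 grid for a 224x224 image.
    visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
    print(len(visible), len(masked))  # prints "49 147"
```

In a real pipeline the visible patches would be fed to a vision transformer encoder and a lightweight decoder would produce the predictions; both are omitted here to keep the sketch self-contained.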