Abstract: Camera and IMU are widely used in robotics to achieve accurate and robust pose estimation. However, this fusion relies heavily on sufficient visual feature observations and precise inertial ...
Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of ...