The ability to accurately estimate distances from RGB image input is just at the **frontier of current AI model capabilities**.
Nonetheless, distance estimation is **critical for perception and planning in embodied AI applications like robotics**, which must navigate around our 3D world.
By making an **open-weight** model **small** and **fast** enough to run **on-device**, using **open-source code** and **data**, we aim to democratize embodied AI.
I've updated the comparison between closed APIs with SOTA performance on quantitative spatial reasoning tasks, like distance/size estimation from RGB inputs, and our 3B open-weight model: SpaceThinker.
On the QSpatial++ split of Q-Spatial-Bench, the 3B SpaceThinker's distance-estimation performance lies between gpt-4o and gemini-2.5-pro.
Evaluation Results: https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B#qspatial-comparison-table-42525
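For a quick local check outside the benchmark harness, the model loads with the standard transformers Qwen2.5-VL classes. A minimal sketch, assuming a recent transformers release with Qwen2.5-VL support (v4.49+); the image path and question are placeholders, not Q-Spatial-Bench items:

```python
# Minimal sketch: one distance-estimation query to SpaceThinker.
# Placeholder image/question; not the benchmark's exact prompt template.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "remyxai/SpaceThinker-Qwen2.5VL-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("scene.jpg")  # any RGB photo of a scene
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How far apart are the chair and the table, in meters?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```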
Interesting finding: by switching the model name in this Colab to the non-reasoning variant SpaceQwen, you'll find that the step-by-step reasoning prompt actually hurts performance, challenging the convention that non-reasoning models benefit from complex instructions in a way reasoning models don't.
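To replicate that A/B outside the Colab, a sketch like the one below sends the same question with and without a step-by-step instruction. The SpaceQwen repo id, question, and prompt wording here are assumptions, not the Colab's exact setup:

```python
# Hedged sketch of the prompt A/B: same image and question, with and
# without an explicit step-by-step instruction, on the non-reasoning model.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "remyxai/SpaceQwen2.5-VL-3B-Instruct"  # assumed SpaceQwen repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("scene.jpg")  # placeholder image

question = "How far is the plant from the sofa, in meters?"
variants = {
    "direct": question,
    "step-by-step": question + " Reason step by step before giving a number.",
}

for name, text in variants.items():
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": text}]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"[{name}] {answer}")
```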
By modifying the above Colab, you can also compare SpaceThinker to its base model to assess the performance impact of LoRA SFT on the SpaceThinker dataset: https://huggingface.co/datasets/remyxai/SpaceThinker
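To inspect what the LoRA SFT trained on, the dataset loads directly with the datasets library. A minimal sketch; the split name is an assumption, so check the loaded dataset's features rather than relying on my comments:

```python
# Peek at the SpaceThinker SFT data (dataset id from the link above).
from datasets import load_dataset

ds = load_dataset("remyxai/SpaceThinker", split="train")  # assumed split name
print(ds)     # row count and column names
print(ds[0])  # one sample, to see the image + reasoning-trace format
```

For the base-model side of the comparison, swapping the presumed base checkpoint Qwen/Qwen2.5-VL-3B-Instruct into the inference sketch above gives the before/after contrast.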
💬 Discussion r/LocalLLaMA (35 points, 8 comments) 🔗 Source