<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</li><li>Link: <a href="https://arxiv.org/abs/2412.14171">https://arxiv.org/abs/2412.14171</a> </li><li>arXiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1736762085/253B49F5-2F61-4BE0-B98A-5A4866C955FB.png" style="background-color:initial;max-width:min(100%,1974px);max-height:min(1332px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1736762085/253B49F5-2F61-4BE0-B98A-5A4866C955FB.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1974" height="1332"> The paper proposes VSI-Bench, a benchmark for evaluating the visual-spatial intelligence of multimodal large language models, comprising over 5k QA pairs. The authors find that spatial reasoning is the primary bottleneck to higher performance, and that popular linguistic reasoning techniques (e.g., CoT) do not improve results; however, explicitly generating a "cognitive map" during question answering does improve MLLMs' spatial distance ability. <ul class="dashed" data-apple-notes-indent-amount="0"><li>Data: the work itself is a benchmark</li><li>Open source: <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/">https://vision-x-nyu.github.io/thinking-in-space.github.io/</a> </li></ul>