<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</li><li>Link: <a href="https://arxiv.org/abs/2412.14171">https://arxiv.org/abs/2412.14171</a> </li><li>arXiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1736762085/253B49F5-2F61-4BE0-B98A-5A4866C955FB.png" style="background-color:initial;max-width:min(100%,1974px);max-height:min(1332px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1736762085/253B49F5-2F61-4BE0-B98A-5A4866C955FB.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1974" height="1332"> The paper proposes VSI-Bench, a benchmark for evaluating the visual-spatial intelligence of multimodal large language models, comprising over 5k QA pairs. The authors find that spatial reasoning is the primary bottleneck to higher performance, and that popular linguistic reasoning techniques (e.g., CoT) do not improve results; however, explicitly generating a "cognitive map" during question answering does improve MLLMs' spatial distance ability. <ul class="dashed" data-apple-notes-indent-amount="0"><li>Data: the work itself is a benchmark</li><li>Open source: <a href="https://vision-x-nyu.github.io/thinking-in-space.github.io/">https://vision-x-nyu.github.io/thinking-in-space.github.io/</a> </li></ul>