<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">Title: </span>An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM</li><li><span style="font-family: '.PingFangSC-Regular'">Paper URL: </span><a href="https://arxiv.org/abs/2403.18406">https://arxiv.org/abs/2403.18406</a> </li><li>arXiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1736327219/25270F1E-DDCC-4A16-90EA-85C408630F39.png" style="background-color:initial;max-width:min(100%,2480px);max-height:min(2004px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1736327219/25270F1E-DDCC-4A16-90EA-85C408630F39.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2480" height="2004"> The method is very simple: the image fed to the VLM is a single picture stitched together from 6 frames sampled from the video sequence, and a few prompts are added to guide the model. I find the paper fairly thought-provoking. It may be the first work to use a VLM for video understanding in a training-free way? <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1736327402/C056166D-1CC2-4AC9-AE76-BB3AD99C8903.png" style="background-color:initial;max-width:min(100%,2468px);max-height:min(2174px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1736327402/C056166D-1CC2-4AC9-AE76-BB3AD99C8903.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2468" height="2174"> <ul class="dashed" data-apple-notes-indent-amount="0"><li>Data: none (no training required)</li><li>Metrics: open-ended QA; text generation; multiple-choice QA</li><li>Hardware: not mentioned</li><li>Code: <a href="https://github.com/imagegridworth/IG-VLM">https://github.com/imagegridworth/IG-VLM</a> </li></ul>
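The frame-stitching idea in this note can be sketched in a few lines of Python. This is only a rough illustration, not the code from the IG-VLM repository: uniform sampling, the 3-column layout, the toy list-of-rows frame representation, and the wording of the prompt are all assumptions, and the names `sample_frame_indices`, `make_grid`, and `build_prompt` are hypothetical.

```python
from typing import List, Sequence

# Toy stand-in for an image: a list of pixel rows. A real pipeline would
# use image arrays; plain lists keep this sketch dependency-free.
Frame = List[List[int]]


def sample_frame_indices(num_frames: int, num_samples: int = 6) -> List[int]:
    """Pick num_samples indices spread uniformly over the video (assumed strategy)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Take the middle frame of each of num_samples equal-length segments.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def make_grid(frames: Sequence[Frame], cols: int = 3) -> Frame:
    """Stitch equally sized frames into one image, row-major (assumed layout)."""
    if cols <= 0 or not frames or len(frames) % cols != 0:
        raise ValueError("frame count must be a positive multiple of cols")
    grid: Frame = []
    for start in range(0, len(frames), cols):
        row_frames = frames[start:start + cols]
        # Concatenate the y-th pixel row of every frame in this grid row.
        for y in range(len(row_frames[0])):
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])
            grid.append(line)
    return grid


def build_prompt(question: str) -> str:
    """Hypothetical guiding prompt; the paper's actual prompts may differ."""
    return (
        "The image is a grid of 6 video frames in temporal order, "
        "left to right and top to bottom. "
        f"Question: {question}"
    )


if __name__ == "__main__":
    indices = sample_frame_indices(num_frames=60)
    # Fake 2x2 frames whose pixel value is the sampled frame index.
    frames = [[[i, i], [i, i]] for i in indices]
    for row in make_grid(frames, cols=3):
        print(row)
    print(build_prompt("What is the person doing?"))
```

The resulting grid image and prompt would then be passed to an off-the-shelf VLM as an ordinary single-image query, which is why no training data is needed.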