<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: SLOWFAST-LLAVA: A STRONG TRAINING-FREE BASELINE FOR VIDEO LARGE LANGUAGE MODELS</li><li>Link: <a href="https://arxiv.org/abs/2407.15841">https://arxiv.org/abs/2407.15841</a> </li><li>arXiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1735812287/9AB39C37-B9C5-4E16-B713-302226EC5681.png" style="background-color:initial;max-width:min(100%,1922px);max-height:min(950px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1735812287/9AB39C37-B9C5-4E16-B713-302226EC5681.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1922" height="950"> The paper uses LLaVA-NeXT (an open-source large language model extended with an image modality) to perform video tasks without any training. As the title states, the goal is to establish a training-free baseline for video LLMs. The idea is simple: the video features are split into two pathways, Slow and Fast, each with its own sampling rate and feature resolution. The Slow pathway samples frames at a low rate but keeps a high spatial resolution; the Fast pathway does the opposite. The rationale is that the Slow pathway captures spatial detail and semantic features, while the Fast pathway captures motion cues. Extracting both kinds of features lets LLaVA understand the video better. The video features are then fed into LLaVA together with the text question. <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1735812692/5AC2202C-805D-4664-ACE0-BF3A4063200F.png" style="background-color:initial;max-width:min(100%,1918px);max-height:min(1194px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1735812692/5AC2202C-805D-4664-ACE0-BF3A4063200F.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1918" height="1194"> <ul class="dashed" data-apple-notes-indent-amount="0"><li>Data: training-free</li><li>Metrics: 8 video benchmarks</li><li>Hardware: 8× A100</li><li>Open source: <a href="https://github.com/apple/ml-slowfast-llava">https://github.com/apple/ml-slowfast-llava</a> </li></ul>
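<p>The two-pathway aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name <code>slowfast_aggregate</code>, the frame-feature layout <code>(T, H, W, C)</code>, and the pooling factor are assumptions chosen for clarity. The Slow pathway keeps few frames at full spatial resolution; the Fast pathway keeps all frames but average-pools the spatial grid, and the two token streams are concatenated for the LLM input.</p>

```python
import numpy as np

def slowfast_aggregate(frame_features, slow_frames=8, fast_pool=4):
    """Aggregate per-frame visual features into SlowFast video tokens.

    frame_features: (T, H, W, C) array, one spatial grid of features per frame.
    slow_frames:    frames kept by the Slow pathway (low rate, full resolution).
    fast_pool:      spatial pooling factor for the Fast pathway (all frames,
                    low resolution).
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: uniformly subsample frames, keep every spatial token.
    idx = np.linspace(0, T - 1, slow_frames).round().astype(int)
    slow = frame_features[idx].reshape(slow_frames * H * W, C)

    # Fast pathway: keep every frame, average-pool the spatial grid so each
    # frame contributes only a few tokens (motion cues over a long horizon).
    h, w = H // fast_pool, W // fast_pool
    fast = frame_features.reshape(T, h, fast_pool, w, fast_pool, C).mean(axis=(2, 4))
    fast = fast.reshape(T * h * w, C)

    # Concatenate both token streams as the visual input to the LLM.
    return np.concatenate([slow, fast], axis=0)
```

<p>For example, with 16 frames of an 8×8 patch grid and a pooling factor of 4, the Slow pathway contributes 8×64 = 512 tokens and the Fast pathway 16×4 = 64 tokens, showing how the Fast pathway covers the full clip at a fraction of the token cost.</p>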