<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: BINDWEAVE: SUBJECT-CONSISTENT VIDEO GENERATION VIA CROSS-MODAL INTEGRATION</li><li>Link: <a href="https://arxiv.org/abs/2510.00438">https://arxiv.org/abs/2510.00438</a></li><li>arXiv 2025 (under review at ICLR 2026, submission 4466)</li></ul>
<img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_92627681-84f5-4350-86d5-4b8716333ad4/public" width="2484" height="1196" style="max-width:100%;height:auto;display:block;">
The model framework is shown above. The main innovation is the left half: the so-called cross-modal fused information is produced by using an MLLM to fuse the inputs. The resulting hidden states are passed through a trainable MLP to form the fused representation, which is concatenated with the original text embeddings to obtain a joint conditioning signal; this joint signal is fed into the base model's text cross-attention. In addition, to inject fine-grained reference-image information, the reference images are encoded with the VAE and supplied as conditions in a manner similar to Wan2.1-I2V, as shown in the figure below (a code sketch of the conditioning path follows at the end of this note):
<img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_47089282-8f53-4b73-98d0-228f37baa400/public" width="3204" height="1226" style="max-width:100%;height:auto;display:block;">
Additionally, as in Wan2.1-I2V, the CLIP features of the reference images are injected into the model via cross-attention. Open question: how are multiple reference images handled at this point?
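To make the conditioning path concrete, below is a minimal PyTorch sketch of the concatenation scheme described above. It only illustrates the step "MLLM hidden states → trainable MLP → concatenate with the original text embeddings → context for the text cross-attention"; the class name <code>CrossModalConditioner</code>, the dimensions, and the two-layer MLP shape are assumptions for illustration, not the paper's actual implementation.
<pre><code>import torch
import torch.nn as nn


class CrossModalConditioner(nn.Module):
    """Sketch of the cross-modal conditioning path (assumed details).

    The note only states that MLLM hidden states pass through a trainable
    MLP and are concatenated with the original text embeddings before the
    DiT's text cross-attention; everything else here is illustrative.
    """

    def __init__(self, mllm_dim: int = 4096, text_dim: int = 4096):
        super().__init__()
        # Trainable MLP projecting MLLM hidden states into the text-embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(mllm_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, mllm_hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (B, L_mllm, mllm_dim) hidden states from the (frozen) MLLM
        # text_emb:    (B, L_text, text_dim) original text-encoder embeddings
        fused = self.mlp(mllm_hidden)                # project into the text space
        joint = torch.cat([text_emb, fused], dim=1)  # concatenate along the token axis
        return joint                                 # used as key/value context in the text cross-attention


# Example usage with dummy tensors (shapes are placeholders).
cond = CrossModalConditioner()
joint_ctx = cond(torch.randn(1, 256, 4096), torch.randn(1, 77, 4096))
print(joint_ctx.shape)  # torch.Size([1, 333, 4096])
</code></pre>
The fine-grained VAE-latent path and the CLIP cross-attention injection mentioned above are separate conditioning routes (following Wan2.1-I2V) and are not covered by this sketch.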