<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">Paper title: </span>Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment</li><li><span style="font-family: '.PingFangSC-Regular'">Paper link: </span><a href="https://arxiv.org/abs/2502.11079">https://arxiv.org/abs/2502.11079</a> </li><li>ICCV 2025</li></ul> <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_04284474-deb5-4092-8d78-e52c406cacca/public" style="background-color:initial;max-width:min(100%,2356px);max-height:min(966px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_04284474-deb5-4092-8d78-e52c406cacca/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2356" height="966"> The paper tackles subject-to-video generation with two contributions: a data-construction pipeline and a model framework. The data-construction pipeline is shown below: <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_c7bf7269-5ad6-4f7b-b908-eef0af2762e5/public" style="background-color:initial;max-width:min(100%,2354px);max-height:min(824px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_c7bf7269-5ad6-4f7b-b908-eef0af2762e5/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2354" height="824"> As for the model architecture, the reference image is encoded by both a VAE and CLIP. The VAE tokens are concatenated with the video tokens, and the CLIP features are concatenated with the text tokens, in both cases along the sequence-length dimension. Because the model uses window self-attention, some modifications are made to dynamically bring the reference features into the attention computation: <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_bd1aae91-ec65-4410-a28f-17068df55d68/public" style="background-color:initial;max-width:min(100%,1018px);max-height:min(1810px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_bd1aae91-ec65-4410-a28f-17068df55d68/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1018" height="1810">
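The two concatenation paths described above can be sketched as follows. This is a minimal PyTorch illustration of sequence-dimension concatenation only, not the paper's implementation; all tensor shapes (sequence lengths, hidden size) are hypothetical placeholders.

```python
import torch

# Hypothetical sizes for illustration only (not from the paper):
B = 2        # batch size
L_vid = 1024 # video latent tokens
L_ref = 256  # reference-image VAE tokens
L_txt = 77   # text tokens
L_clip = 16  # CLIP tokens of the reference image
D = 64       # hidden dimension

video_tokens = torch.randn(B, L_vid, D)   # VAE latents of the video
ref_vae_tokens = torch.randn(B, L_ref, D) # VAE latents of the reference image
text_tokens = torch.randn(B, L_txt, D)    # text-encoder tokens
ref_clip_tokens = torch.randn(B, L_clip, D)  # CLIP features of the reference image

# Both concatenations happen along the sequence-length dimension (dim=1):
vision_seq = torch.cat([video_tokens, ref_vae_tokens], dim=1)   # (B, L_vid + L_ref, D)
context_seq = torch.cat([text_tokens, ref_clip_tokens], dim=1)  # (B, L_txt + L_clip, D)

print(vision_seq.shape)   # torch.Size([2, 1280, 64])
print(context_seq.shape)  # torch.Size([2, 93, 64])
```

The extended `vision_seq` would then enter the DiT's self-attention and `context_seq` its text-conditioning path; the paper's window-attention modification additionally controls how the reference tokens participate in each attention window.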