<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">文章标题:</span>Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</li><li><span style="font-family: '.PingFangSC-Regular'">文章地址:</span><a href="https://arxiv.org/abs/2508.07901">https://arxiv.org/abs/2508.07901</a> </li><li>arxiv 2025</li></ul> <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_e1687db5-85ab-4202-b3ea-d02b48aacd3a/public" style="background-color:initial;max-width:min(100%,1042px);max-height:min(1072px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_e1687db5-85ab-4202-b3ea-d02b48aacd3a/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1042" height="1072"> 方法提出了一个轻量的reference-to-video方法Stand-In,其核心就是将参考图像通过VAE编码后与视频token共同输入到模型当中,随后修改Self-Attention层如下: <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_ddc869af-3a8b-418a-9e7b-489060446615/public" style="background-color:initial;max-width:min(100%,2160px);max-height:min(760px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_ddc869af-3a8b-418a-9e7b-489060446615/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2160" height="760"> 如此一来,只需要训一个自注意力的LoRA就能够完成条件的注入,值得注意的是他这里的位置编码的设计。该方法只支持单参考图像。