<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">Title: </span>SkyReels-A2: Compose Anything in Video Diffusion Transformers</li><li><span style="font-family: '.PingFangSC-Regular'">Link: </span><a href="https://arxiv.org/abs/2504.02436">https://arxiv.org/abs/2504.02436</a> </li><li>arXiv 2025</li></ul> <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_3f5c430f-7969-4375-83f6-4f462fa817e2/public" style="background-color:initial;max-width:min(100%,1898px);max-height:min(1298px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_3f5c430f-7969-4375-83f6-4f462fa817e2/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1898" height="1298"> A subject-to-video work. Its method of injecting image conditions is fairly straightforward: CLIP image features are concatenated with the text features and fused into the video features via cross-attention, while VAE features (obtained by padding the reference images to the target number of frames and encoding them with the 3D VAE) are concatenated with the video latents along the channel dimension (similar to Wan-I2V).
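The two conditioning pathways described above can be sketched with toy tensors. This is a minimal illustration, not the paper's implementation: all shapes, token counts, and embedding dimensions below are assumed for demonstration, and the projection of CLIP features to the text-embedding dimension is taken as already done.

```python
import torch

# Hypothetical dimensions -- the actual SkyReels-A2 / Wan-I2V sizes differ.
B, C, T, H, W = 1, 16, 4, 8, 8            # video latent: batch, channels, frames, height, width
text_len, img_tokens, d = 77, 257, 1024   # assumed token counts and embedding dim

text_emb = torch.randn(B, text_len, d)    # text-encoder output
clip_emb = torch.randn(B, img_tokens, d)  # CLIP image features, projected to dim d

# Pathway 1 (semantic): concatenate along the token axis, so cross-attention
# in the diffusion transformer attends over text and image tokens jointly.
context = torch.cat([text_emb, clip_emb], dim=1)          # (B, 77+257, d)

# Pathway 2 (spatial): the reference image is padded to T frames and encoded
# by the 3D VAE; the resulting latent is concatenated with the noisy video
# latent along the channel dimension (as in Wan-I2V).
video_latent = torch.randn(B, C, T, H, W)
ref_latent = torch.randn(B, C, T, H, W)   # stands in for the padded+encoded reference
dit_input = torch.cat([video_latent, ref_latent], dim=1)  # (B, 2C, T, H, W)
```

The channel-wise concatenation doubles the input channels of the DiT's patch-embedding layer, while the token-wise concatenation leaves the cross-attention layers unchanged apart from a longer context sequence.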