<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: VACE: All-in-One Video Creation and Editing</li><li>Link: <a href="https://arxiv.org/abs/2503.07598">https://arxiv.org/abs/2503.07598</a></li><li>ICCV 2025</li></ul>

<img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_66f36b21-b2ba-4f7b-8fc0-63c56bd4c14a/public" width="2334" height="894">

The paper proposes the first framework built on a video DiT that handles a wide range of video generation and editing tasks. The authors group the tasks into four categories: T2V, R2V, V2V, and MV2V. Supporting all of them requires conditioning inputs from multiple modalities, so the paper introduces a unified conditioning interface, the Video Condition Unit (VCU), defined as [T, F, M]: T is the text prompt, F is the conditioning video frames (the video being edited, depth-map videos, and so on), and M is a mask marking the region to edit. A single VCU can therefore express many different tasks:

<img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_e8d780be-582b-48b6-9fd7-81ffa1781b80/public" width="1174" height="886">

The model design itself is not particularly novel; it reads more like a general recipe.

<img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_741db301-7e91-4233-849f-3c304c3ee6bb/public" width="2372" height="854">

The VCU is passed through a tokenizer and an embedder to produce context tokens, which are injected in one of two ways: (a) full fine-tuning, where the context tokens are added to the video tokens before being fed into the DiT; or (b) a ControlNet-style variant, where additional trainable blocks take the context tokens as input and supply the control signal. Notably, for the R2V task, tokens matching the number of reference images are prepended to the video tokens at inference and denoised jointly with them.
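
To make the two injection strategies and the R2V token prepending concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: ContextEmbedder, the single transformer layer standing in for a DiT block, and all tensor shapes are hypothetical assumptions chosen for illustration.

<pre><code>
# Minimal PyTorch sketch (NOT the authors' code) of the two injection
# strategies described above. ContextEmbedder, the toy DiT block, and
# all shapes are illustrative assumptions, not VACE's actual architecture.
import torch
import torch.nn as nn

class ContextEmbedder(nn.Module):
    """Projects concatenated frame (F) and mask (M) latents into context tokens."""
    def __init__(self, in_dim: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames, masks: (B, N, dim) latent token sequences
        return self.proj(torch.cat([frames, masks], dim=-1))

dim = 64
# A single transformer layer stands in for one DiT block.
dit_block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
ctx_embed = ContextEmbedder(in_dim=2 * dim, dim=dim)

video_tokens = torch.randn(1, 16, dim)   # noisy video latents being denoised
frame_tokens = torch.randn(1, 16, dim)   # F: conditioning frames (e.g. depth maps)
mask_tokens  = torch.randn(1, 16, dim)   # M: mask marking the editable region
context_tokens = ctx_embed(frame_tokens, mask_tokens)

# (a) Full fine-tuning: add context tokens to video tokens, run the DiT as usual.
out_a = dit_block(video_tokens + context_tokens)

# (b) ControlNet-style: keep the main block frozen and train an extra copy that
# consumes the context tokens; its output is added back as a residual control.
control_block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
out_b = dit_block(video_tokens) + control_block(context_tokens)

# R2V: prepend tokens for the reference images and denoise them jointly.
ref_tokens = torch.randn(1, 2, dim)      # e.g. tokens from 2 reference images
out_r2v = dit_block(torch.cat([ref_tokens, video_tokens + context_tokens], dim=1))

print(out_a.shape, out_b.shape, out_r2v.shape)  # (1,16,64) (1,16,64) (1,18,64)
</code></pre>

In variant (b), only the control blocks are trained while the base DiT stays frozen, which is cheaper than full fine-tuning; the residual addition shown here is one common way such control signals are merged, assumed for the sketch.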