<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation</li><li>Paper: <a href="https://arxiv.org/abs/2602.12160">https://arxiv.org/abs/2602.12160</a> </li><li>ICML 2026</li></ul> <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_6c4d8ee9-f278-4362-adad-88e3404451f6/public" style="background-color:initial;max-width:min(100%,3398px);max-height:min(2170px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_6c4d8ee9-f278-4362-adad-88e3404451f6/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="3398" height="2170"> The authors note that existing methods treat R2AV, RV2AV, and RA2V as separate tasks, even though all three share one goal: mapping a static identity anchor (a reference image or reference audio) onto dynamic audio-video output. Unifying these tasks raises three questions: (1) how to design a single model framework; (2) how to avoid identity confusion in multi-person scenes; (3) how to design a training strategy that prevents conflicts between tasks. To address these questions, the authors propose DreamID-Omni, whose framework is shown in the figure above. The reference conditions (image and audio) are injected by concatenating them with the target tokens along the sequence dimension, while the source video and the driving audio are injected by element-wise addition. Because audio and video have different numbers of tokens along the time axis, the positional encoding of the audio tokens is rescaled so that audio and video positions line up in time and form a diagonal correspondence (following the Ovi paper; see the figure below). <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_02d2dce3-4dae-43b2-95e2-d662c5242421/public" style="background-color:initial;max-width:min(100%,2404px);max-height:min(798px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_02d2dce3-4dae-43b2-95e2-d662c5242421/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2404" height="798"> Next, to distinguish reference tokens and to tell different references apart, fixed boundaries are set so that all tokens of one reference fall in the same temporal window. To avoid confusion on the text side, the text prompt is given a structured format. Training proceeds in three stages: (1) in-pair R2AV; (2) cross-pair R2AV; (3) joint training. Each stage is harder than the one before it. Data construction pipeline: <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_10515eef-2211-4fa5-82ec-50f9518deeec/public" style="background-color:initial;max-width:min(100%,3320px);max-height:min(1696px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_10515eef-2211-4fa5-82ec-50f9518deeec/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="3320" height="1696">
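Returning to the positional-encoding step described earlier in these notes: the sketch below illustrates one way audio token positions could be rescaled onto the video time axis so that the two modalities align along a diagonal. It is only a rough illustration, not the paper's implementation. The scale factor (video steps divided by audio tokens), the token counts, and the function names `scaled_audio_positions` and `rope_angles` are all assumptions made for this example.

```python
import numpy as np


def scaled_audio_positions(num_video_steps: int, num_audio_tokens: int) -> np.ndarray:
    """Map audio token indices onto the video time axis.

    Assumption: a plain linear rescale, so the audio positions cover the
    same range [0, num_video_steps) as the video positions.
    """
    scale = num_video_steps / num_audio_tokens
    return np.arange(num_audio_tokens) * scale


def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard rotary-embedding angles for (possibly fractional) positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)


# Hypothetical sizes: 8 video time steps and 32 audio tokens for the same clip.
video_pos = np.arange(8, dtype=float)
audio_pos = scaled_audio_positions(num_video_steps=8, num_audio_tokens=32)

video_angles = rope_angles(video_pos, dim=16)
audio_angles = rope_angles(audio_pos, dim=16)

# Audio token 4 lands on the same temporal position as video step 1.
print(audio_pos[4], video_pos[1])
```

With these sizes, every fourth audio token shares a position (and therefore the same rotary angles) with one video step, which is the diagonal alignment the notes refer to.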