<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">Title: </span>Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation</li><li><span style="font-family: '.PingFangSC-Regular'">Paper URL: </span><a href="https://arxiv.org/abs/2605.17488">https://arxiv.org/abs/2605.17488</a> </li><li>arXiv 2026</li></ul> <img src="https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_919a81aa-effd-482f-8f5a-83d7a718b14c/public" style="background-color:initial;max-width:min(100%,2098px);max-height:min(1486px);;background-image:url(https://imagedelivery.net/phxEHgsq3j8gSnfNAJVJSQ/node3_919a81aa-effd-482f-8f5a-83d7a718b14c/public);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2098" height="1486"> For multimodal customization of joint audio-video generation, the method injects conditions through in-context conditioning, but its main contribution is on the text side. Specifically, it adds a module for information fusion that enriches the information carried by the text branch and marks which parts of the text are spoken lines, which addresses the problem of garbled pronunciation.
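To make the idea of "marking which text is spoken" concrete, here is a minimal sketch. It is not the paper's module, whose internals this note does not describe. It only assumes that spoken lines appear in the prompt inside double quotes, and the function name `mark_dialogue_tokens` is hypothetical.

```python
def mark_dialogue_tokens(prompt: str) -> list[tuple[str, bool]]:
    """Split a prompt into whitespace tokens and flag the ones inside double quotes.

    Assumption (not from the paper): spoken lines are delimited by plain
    double quotes. An unmatched opening quote marks the rest of the prompt
    as dialogue.
    """
    flagged = []
    # Splitting on the quote character alternates narration and speech:
    # even-indexed segments are outside quotes, odd-indexed ones are inside.
    for index, segment in enumerate(prompt.split('"')):
        is_dialogue = index % 2 == 1
        for token in segment.split():
            flagged.append((token, is_dialogue))
    return flagged


if __name__ == "__main__":
    example = 'A woman smiles and says "nice to meet you" to the camera.'
    for token, is_dialogue in mark_dialogue_tokens(example):
        print(f"{token:10s} {'DIALOGUE' if is_dialogue else 'narration'}")
```

A per-token flag like this could in principle be given to a text encoder as an extra signal, so the audio branch knows which words must be pronounced. How Omni-Customizer actually encodes and fuses this signal would need to be checked against the paper.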