<ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: SHOW-O: ONE SINGLE TRANSFORMER TO UNIFY MULTIMODAL UNDERSTANDING AND GENERATION</li><li>Paper: <a href="https://arxiv.org/abs/2408.12528">https://arxiv.org/abs/2408.12528</a> </li><li>ICLR 2025</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1746977092/7E75AADB-0C69-4782-B8F4-EAE574868E93.png" style="background-color:initial;max-width:min(100%,1894px);max-height:min(1294px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1746977092/7E75AADB-0C69-4782-B8F4-EAE574868E93.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1894" height="1294"> This is another work that unifies multimodal understanding and generation. The model uses a transformer as its core architecture and discretizes images with MAGVIT-v2, jointly modeling text and image tokens. Text tokens are trained with a next-token-prediction (NTP) objective, while image tokens follow a MaskGIT-style objective: masked tokens are predicted and progressively unmasked over iterative (discrete-diffusion-like) decoding steps. <ul class="dashed" data-apple-notes-indent-amount="0"><li>Open source: <a href="https://showlab.github.io/Show-o/">https://showlab.github.io/Show-o/</a> </li></ul>
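The mixed training objective can be sketched as follows. This is a minimal illustrative sketch, not Show-o's actual implementation: `MASK_ID`, the helper names, and the target layout (NTP targets on text positions, loss only on masked image positions) are assumptions for illustration; the cosine masking schedule is the one MaskGIT describes.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] image token


def maskgit_mask(image_tokens, t, rng=random):
    """Corrupt image tokens with a MaskGIT-style cosine schedule.

    t in [0, 1]: t=0 masks (almost) everything, t=1 masks almost nothing.
    Returns the corrupted sequence and the set of masked positions.
    """
    n = len(image_tokens)
    n_mask = max(1, math.ceil(n * math.cos(math.pi / 2 * t)))
    masked = set(rng.sample(range(n), n_mask))
    corrupted = [MASK_ID if i in masked else tok
                 for i, tok in enumerate(image_tokens)]
    return corrupted, masked


def training_targets(text_tokens, image_tokens, t, rng=random):
    """Unified per-position targets for one text+image training sequence.

    Text positions get next-token-prediction targets; image positions get
    the original token as target only where it was masked (None = no loss).
    """
    corrupted, masked = maskgit_mask(image_tokens, t, rng)
    seq = text_tokens + corrupted
    targets = []
    # Text: autoregressive NTP (no target assumed for the last text token here).
    for i in range(len(text_tokens)):
        targets.append(text_tokens[i + 1] if i + 1 < len(text_tokens) else None)
    # Image: reconstruct the original token at masked positions only.
    for i, tok in enumerate(image_tokens):
        targets.append(tok if i in masked else None)
    return seq, targets
```

At inference time the same masking machinery runs in reverse: generation starts from all-[MASK] image tokens and, over a few iterations, the most confident predictions replace their masks until none remain.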