<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">Title: </span>Emu3: Next-Token Prediction is All You Need</li><li><span style="font-family: '.PingFangSC-Regular'">Link: </span><a href="https://arxiv.org/abs/2409.18869">https://arxiv.org/abs/2409.18869</a> </li><li>arxiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1746628881/A5A1888F-C831-41D1-B0A6-7145D10BAF0F.png" style="background-color:initial;max-width:min(100%,1582px);max-height:min(928px);background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1746628881/A5A1888F-C831-41D1-B0A6-7145D10BAF0F.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1582" height="928"> The paper investigates representing every modality as tokens and modeling them jointly in a single model, pre-trained with a unified next-token-prediction objective. This dispenses with composite pipelines built from diffusion models and CLIP models: all modalities are represented, modeled, and generated uniformly as tokens, and the experiments demonstrate that this approach is feasible. <span style="font-family: '.PingFangUITextSC-Regular'">Concretely, the method uses </span>SBER-MoVQGAN as the image and video tokenizer, converting between images/videos and discrete tokens in both directions. <ul class="dashed" data-apple-notes-indent-amount="0"><li>Open source: <a href="https://github.com/baaivision/Emu3">https://github.com/baaivision/Emu3</a> </li></ul>
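To make the "everything is a token" idea concrete, here is a minimal sketch of how text tokens and vision-tokenizer codebook indices might be flattened into one shared-vocabulary sequence for next-token-prediction training. The vocabulary sizes, special-token names (<code>BOS</code>, <code>BOV</code>, <code>EOV</code>), and vocabulary layout below are illustrative assumptions, not Emu3's actual values.

```python
# Assumed sizes: a small text vocabulary plus a MoVQGAN-style codebook.
TEXT_VOCAB_SIZE = 1000
VISION_VOCAB_SIZE = 32768

# Special tokens placed after both vocabularies (layout is an assumption).
BOS = TEXT_VOCAB_SIZE + VISION_VOCAB_SIZE  # begin-of-sequence
BOV = BOS + 1                              # begin-of-vision
EOV = BOS + 2                              # end-of-vision

def vision_id(code: int) -> int:
    """Offset a vision-tokenizer codebook index into the shared vocabulary."""
    return TEXT_VOCAB_SIZE + code

def build_sequence(text_tokens, vision_codes):
    """Interleave text and image tokens into one stream for NTP training."""
    seq = [BOS] + list(text_tokens) + [BOV]
    seq += [vision_id(c) for c in vision_codes]
    seq.append(EOV)
    return seq

def ntp_pairs(seq):
    """Next-token prediction: the token at position t is the input,
    the token at position t+1 is the target."""
    return list(zip(seq[:-1], seq[1:]))

seq = build_sequence([5, 17, 42], [0, 123, 4095])
pairs = ntp_pairs(seq)
```

With a single sequence like this, one autoregressive transformer and one cross-entropy loss cover both text and image generation, which is the simplification the paper argues for.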