<img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/B4334463-B1DD-4AD0-A540-D66407D9C5CF.png" style="background-color:initial;max-width:min(100%,2324px);max-height:min(1016px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/B4334463-B1DD-4AD0-A540-D66407D9C5CF.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2324" height="1016"> <ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangSC-Regular'">Title: </span>SoundStorm: Efficient Parallel Audio Generation</li><li><span style="font-family: '.PingFangSC-Regular'">Link: </span><a href="https://arxiv.org/abs/2305.09636">https://arxiv.org/abs/2305.09636</a> </li></ul> The paper proposes SoundStorm, an efficient non-autoregressive model for audio generation. It takes semantic tokens as input and fills in the acoustic tokens non-autoregressively, which is what makes generation efficient. The SoundStorm architecture is shown in the figure at the top; during training, the mask is computed as follows: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/D484E7F8-A86A-4E7A-B775-B162CAC1E7CF.png" style="background-color:initial;max-width:min(100%,1090px);max-height:min(170px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/D484E7F8-A86A-4E7A-B775-B162CAC1E7CF.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1090" height="170"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/080D3C0B-D36E-422E-B343-DB6A1F473E8A.png" style="background-color:initial;max-width:min(100%,1110px);max-height:min(452px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/080D3C0B-D36E-422E-B343-DB6A1F473E8A.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1110" height="452"> At inference time, the model runs multiple forward passes and obtains the acoustic tokens level by level, which yields the complete audio token representation. The paper's experiments demonstrate the model's performance: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/FE8B9888-5F0F-46F2-AB4A-0DAF07DCE9C9.png" 
style="background-color:initial;max-width:min(100%,2298px);max-height:min(900px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/FE8B9888-5F0F-46F2-AB4A-0DAF07DCE9C9.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2298" height="900"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/C13D895B-2FD8-478E-A8A9-4E4E8B327793.png" style="background-color:initial;max-width:min(100%,1170px);max-height:min(812px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/C13D895B-2FD8-478E-A8A9-4E4E8B327793.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1170" height="812"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/9186D835-0A34-46C8-9D88-C204C5CDC1B2.png" style="background-color:initial;max-width:min(100%,1138px);max-height:min(1454px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1721057177/9186D835-0A34-46C8-9D88-C204C5CDC1B2.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1138" height="1454">
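As a rough illustration of the training-time masking mentioned earlier, here is a minimal sketch in plain Python. It follows my reading of the paper's procedure: sample a prompt delimiter t, sample a current RVQ level q, draw a masking ratio from a cosine schedule, mask the selected non-prompt tokens at level q, and mask all non-prompt tokens at finer levels. The function name `sample_training_mask`, the 0-based level indexing, and the list-of-lists mask layout are assumptions made for this example, not the authors' code.

```python
import math
import random


def sample_training_mask(T, Q, rng=random):
    """Sample one training mask over T frames and Q RVQ levels.

    Returns (t, q, mask), where mask[i][j] is True when the acoustic
    token at frame i and RVQ level j (0-based) is masked out.
    """
    t = rng.randrange(T)               # prompt delimiter, uniform over {0, ..., T-1}
    q = rng.randrange(Q)               # current RVQ level (0-based here)
    u = rng.uniform(0.0, math.pi / 2)  # cosine schedule: ratio p = cos(u)
    p = math.cos(u)

    mask = [[False] * Q for _ in range(T)]
    for i in range(t + 1, T):          # only non-prompt frames (i > t) can be masked
        if rng.random() < p:           # Bernoulli(p) draw for the current level
            mask[i][q] = True
        for j in range(q + 1, Q):      # all finer levels are always masked
            mask[i][j] = True
    return t, q, mask
```

Levels coarser than q and all prompt frames stay unmasked, so the model is always conditioned on them when predicting the masked tokens at level q.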