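<img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/F9863908-B22F-4AE7-BBB0-86C3DFEC5B50.png" style="background-color:initial;max-width:min(100%,1898px);max-height:min(1344px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/F9863908-B22F-4AE7-BBB0-86C3DFEC5B50.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1898" height="1344"> <ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangSC-Regular'">Paper title: </span>SPOKEN QUESTION ANSWERING AND SPEECH CONTINUATION USING SPECTROGRAM-POWERED LLM</li><li><span style="font-family: '.PingFangSC-Regular'">Paper link: </span><a href="https://arxiv.org/abs/2305.15255">https://arxiv.org/abs/2305.15255</a> </li><li>ICLR 2024</li></ul> The paper proposes Spectron, a new model that uses a pre-trained LLM for spoken question answering and speech continuation. Adding a pre-trained speech encoder to the LLM gives the model the ability to take speech as input and produce speech as output. The whole model is trained end to end and operates directly on spectrograms, which simplifies the architecture. The key to the method is that speech recognition, text continuation, and speech generation are trained as joint objectives using only paired speech-text data, which gives the model cross-modal chain-of-thought ability within a single decoding pass. The architecture is shown in the main figure. It uses a pre-trained speech encoder and a pre-trained LM. The encoder encodes a speech segment, and the result is fed to the LM as a prefix (note that the LM in this model is a prefix-decoder architecture). The LM then predicts tokens, first transcribing and continuing the text and afterwards predicting the speech content. The details are described below. During training, an utterance is split into two segments: the first is the encoder input and the second is the ground truth (GT) for the model's prediction. The corresponding text is split in two as well. The speech encoder encodes the first segment, projects it to the LM's dimension, and feeds it to the LM as a prefix, and the LM decodes the text. Finally, to give the LM the ability to consume and emit speech, the authors add a Pre-net and a Post-net that process the input and output spectrograms; with these two networks the model can predict speech content. The loss function has two parts. The first covers speech recognition and text generation, and mainly trains the model's speech recognition and text-based question answering ability. The second covers speech generation, and mainly trains the model's ability to answer in speech. The formulas are as follows: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/36BF1054-030F-4CEB-95EA-F4CFE19AEDAC.png" style="background-color:initial;max-width:min(100%,1124px);max-height:min(82px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/36BF1054-030F-4CEB-95EA-F4CFE19AEDAC.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1124" height="82"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977100/2068E6E4-F9FD-4A72-957E-C4A9C8843AFA.png" 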
style="background-color:initial;max-width:min(100%,1208px);max-height:min(156px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977100/2068E6E4-F9FD-4A72-957E-C4A9C8843AFA.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1208" height="156"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/BE1AD835-A0D1-40BC-B605-38B95347D246.png" style="background-color:initial;max-width:min(100%,892px);max-height:min(318px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/BE1AD835-A0D1-40BC-B605-38B95347D246.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="892" height="318"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/6C5FA2C2-957D-4409-97FB-9E4E1123D763.png" style="background-color:initial;max-width:min(100%,1070px);max-height:min(84px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/6C5FA2C2-957D-4409-97FB-9E4E1123D763.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1070" height="84"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/81F785BC-E826-4FD1-B08E-271E0C4CA18D.png" style="background-color:initial;max-width:min(100%,872px);max-height:min(90px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/81F785BC-E826-4FD1-B08E-271E0C4CA18D.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="872" height="90"> During inference, an input utterance is fed to the speech encoder. After encoding and dimension projection it enters the LM, which autoregressively generates the transcript, the text continuation, and the speech spectrogram in that order. Finally, a vocoder converts the spectrogram into output audio. The authors run comparison and ablation experiments to demonstrate the model's effectiveness in terms of semantic quality, speech quality, and answer accuracy. <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/3F183534-6BB3-47E4-A2E1-BB0E0755A8A2.png" 
style="background-color:initial;max-width:min(100%,1886px);max-height:min(630px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/3F183534-6BB3-47E4-A2E1-BB0E0755A8A2.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1886" height="630"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/8681DB64-9877-43AA-AC65-18046D7C835F.png" style="background-color:initial;max-width:min(100%,1802px);max-height:min(770px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/8681DB64-9877-43AA-AC65-18046D7C835F.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1802" height="770"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/97C8EE81-2B7A-4AF1-B10C-D969108FB902.png" style="background-color:initial;max-width:min(100%,1622px);max-height:min(598px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/97C8EE81-2B7A-4AF1-B10C-D969108FB902.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1622" height="598"> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/CBBAB566-42E6-457B-9553-92139805FA80.png" style="background-color:initial;max-width:min(100%,1218px);max-height:min(506px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720977101/CBBAB566-42E6-457B-9553-92139805FA80.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1218" height="506"> Finally, the paper also points out the model's limitations. First, generating spectrograms is computationally expensive, which makes long utterances hard to handle. Second, text and speech are not generated simultaneously, which introduces some latency.
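The training setup described earlier (split each utterance into a prompt and a continuation, then combine a text loss with a speech loss) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the L1 + L2 form chosen for the spectrogram term, and the `recon_weight` factor are all hypothetical, and spectrogram frames are represented as plain lists of floats.

```python
import math


def split_utterance(frames, tokens, frame_cut, token_cut):
    """Split one paired example into (prompt, continuation) halves.

    frames: spectrogram frames of the utterance; tokens: its transcript tokens.
    The cut points are assumed to be given (hypothetical parameters).
    """
    speech = (frames[:frame_cut], frames[frame_cut:])
    text = (tokens[:token_cut], tokens[token_cut:])
    return speech, text


def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target tokens.

    probs[i] is the predicted distribution over the vocabulary at step i.
    Covers the first loss part: transcript plus text continuation.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def regression_loss(pred_frames, target_frames):
    """Mean L1 + mean L2 error between predicted and ground-truth frames.

    Covers the second loss part: speech generation (assumed form).
    """
    diffs = [p - t
             for pf, tf in zip(pred_frames, target_frames)
             for p, t in zip(pf, tf)]
    n = len(diffs)
    return sum(abs(d) for d in diffs) / n + sum(d * d for d in diffs) / n


def total_loss(token_probs, token_targets, pred_frames, target_frames,
               recon_weight=1.0):
    """Text loss plus weighted speech loss (recon_weight is an assumption)."""
    text_part = cross_entropy(token_probs, token_targets)
    speech_part = regression_loss(pred_frames, target_frames)
    return text_part + recon_weight * speech_part
```

In this sketch the prompt frames would go to the speech encoder, the full token sequence (prompt transcript followed by the text continuation) is the target of the cross-entropy term, and the continuation frames are the target of the regression term.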