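<img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/C7F0FAAB-EE2E-4327-9B89-ACB86609BAB4.png" style="background-color:initial;max-width:min(100%,900px);max-height:min(734px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/C7F0FAAB-EE2E-4327-9B89-ACB86609BAB4.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="900" height="734"> <ul class="dashed" data-apple-notes-indent-amount="0"><li>Title: Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition</li><li>Link: <a href="https://arxiv.org/abs/2206.08317">https://arxiv.org/abs/2206.08317</a> </li><li>Accepted by INTERSPEECH 2022</li></ul> I am currently interning at SenseTime and working on the speech modality, so most of the papers I have read recently are about speech. This paper comes from Alibaba's speech lab and proposes Paraformer, an end-to-end non-autoregressive speech recognition model. The paper notes that Transformers dominate current speech recognition, but the Transformer decoder generates tokens autoregressively (AR), which greatly increases inference time. To address this, many non-autoregressive (NAR) methods have been proposed to generate tokens in parallel, single-step NAR among them. However, most single-step NAR models assume that the output tokens are independent of one another, so they generally underperform AR models. The authors argue that improving single-step NAR comes down to two difficulties: <b>first, accurately predicting the number of tokens and extracting the hidden variables; second, better modeling the interdependence between output tokens.</b> They therefore propose Paraformer. Experiments show that, at comparable accuracy, Paraformer generates 10 times faster than the AR approach. The overall structure of Paraformer is shown in the figure at the top. During training, the audio is first converted to fbank features (acoustic features of the audio) and fed to the encoder to obtain the hidden representation H. The predictor takes H, predicts the number of tokens N′, and produces the acoustic embeddings Ea. The decoder takes Ea and H as input and outputs Y′ (a first-pass prediction), which goes to the sampler. The sampler compares Y′ with the ground-truth labels Y and randomly replaces some positions of Ea with Ec (the embeddings of the ground-truth tokens) to form the semantic embeddings Es; the number of replaced positions is determined by the Hamming distance between Y′ and Y (the number of positions where the labels differ). Es and H are then fed to the decoder to predict Y″ (the final prediction). At inference time the sampler is not used: Ea and H go straight into the decoder to predict the text labels. The predictor consists of two convolutional layers and outputs a sequence of weights α, each in the range 0 to 1. The authors use the sum of the α values to predict N, trained with an MAE loss: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/52900D75-18D5-4B4C-9DB2-7B75538947F8.png" 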
style="background-color:initial;max-width:min(100%,514px);max-height:min(84px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/52900D75-18D5-4B4C-9DB2-7B75538947F8.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="514" height="84"> The predictor also produces the acoustic embeddings from a weighted sum of H with weights α. The α values are accumulated frame by frame, and once the accumulated α reaches the threshold β, the weighted sum so far is emitted as one acoustic embedding in Ea. (During training, α is rescaled according to the target length so that Ea has the same length as Ec; no rescaling is applied at inference.) This mechanism is known as CIF (Continuous Integrate-and-Fire). The paper uses a dynamic threshold: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/0DDB1CC6-8999-47F7-95EF-D6DFE3979FCA.png" style="background-color:initial;max-width:min(100%,340px);max-height:min(136px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/0DDB1CC6-8999-47F7-95EF-D6DFE3979FCA.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="340" height="136"> The figure below shows an example: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/909AC608-AEEF-4CE4-BEFC-DA4519D495DF.png" style="background-color:initial;max-width:min(100%,894px);max-height:min(214px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/909AC608-AEEF-4CE4-BEFC-DA4519D495DF.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="894" height="214"> The sampler mainly borrows the idea of GLM (glancing language model) to model the interdependence between output tokens. The GLM loss is: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/06F7C5E1-6D07-41C0-8327-5E449591B8E0.png" style="background-color:initial;max-width:min(100%,1014px);max-height:min(168px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/06F7C5E1-6D07-41C0-8327-5E449591B8E0.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1014" height="168"> Here the GLM set is the result of the sampler's selection between Ea and Ec, which is what lets the model capture the dependencies between tokens. The sampling is defined as: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/11764977-F8C2-4765-A757-1576E8C2C3B3.png" 
style="background-color:initial;max-width:min(100%,982px);max-height:min(88px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/11764977-F8C2-4765-A757-1576E8C2C3B3.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="982" height="88"> where d is the Hamming distance. The sampler randomly replaces λd positions in Ea with the corresponding embeddings from Ec, then uses the context provided by the GLM set to predict the target tokens that are not in the GLM set, thereby modeling the interdependence between output tokens. The overall loss function of the model is: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/3ABE8BB2-6108-4B80-B917-F16AF43F48C5.png" style="background-color:initial;max-width:min(100%,762px);max-height:min(64px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/3ABE8BB2-6108-4B80-B917-F16AF43F48C5.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="762" height="64"> Finally, the paper compares Paraformer with other models in terms of accuracy and generation speed. Standard benchmark datasets: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/7F056E4A-8CEA-410C-9A7B-3BD9311FE9CD.png" style="background-color:initial;max-width:min(100%,1148px);max-height:min(1130px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/7F056E4A-8CEA-410C-9A7B-3BD9311FE9CD.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1148" height="1130"> Industrial-scale dataset: <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/E8651BF5-DAF9-4A54-AC08-EAC82D084081.png" style="background-color:initial;max-width:min(100%,2054px);max-height:min(412px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1720976476/E8651BF5-DAF9-4A54-AC08-EAC82D084081.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2054" height="412">
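To make the two mechanisms in these notes more concrete, here is a minimal pure-Python sketch of the CIF firing rule and the sampler's replacement step. It is my own illustration under stated assumptions, not the authors' implementation. The function names (`cif`, `scale_alphas`, `glancing_sample`) are hypothetical. Embeddings are plain lists of floats or placeholder strings. The threshold β is passed in as a fixed number rather than computed with the paper's dynamic-threshold formula. Leftover weight that never reaches β is simply dropped. Rounding λd up with `ceil` is an assumption.

```python
import math
import random


def scale_alphas(alphas, target_len):
    """Training-time rescaling: make the weights sum to the target length."""
    scale = target_len / sum(alphas)
    return [a * scale for a in alphas]


def cif(alphas, hidden, beta=1.0):
    """Integrate the weights over encoder frames and fire one embedding
    each time the accumulated weight reaches beta.

    Assumes each alpha is at most beta, so a frame fires at most once.
    """
    dim = len(hidden[0])
    acc = 0.0              # weight accumulated since the last firing
    frame = [0.0] * dim    # weighted sum of hidden states since the last firing
    fired = []
    for a, h in zip(alphas, hidden):
        if acc + a < beta:
            # Not enough weight yet: keep integrating.
            acc += a
            frame = [f + a * x for f, x in zip(frame, h)]
        else:
            # Spend only the part of alpha needed to reach beta, then fire.
            used = beta - acc
            fired.append([f + used * x for f, x in zip(frame, h)])
            # The remainder of alpha starts the next embedding.
            acc = a - used
            frame = [acc * x for x in h]
    return fired


def hamming(pred, target):
    """Number of positions where two equal-length label sequences differ."""
    return sum(p != t for p, t in zip(pred, target))


def glancing_sample(acoustic, char, pred, target, lam=0.5, rng=None):
    """Replace ceil(lam * d) random positions of the acoustic embeddings
    with ground-truth token embeddings, where d = hamming(pred, target)."""
    rng = rng or random.Random(0)
    n = math.ceil(lam * hamming(pred, target))
    picked = set(rng.sample(range(len(target)), n))
    mixed = [char[i] if i in picked else acoustic[i] for i in range(len(target))]
    return mixed, picked


# Four 1-dimensional encoder frames whose weights sum to 2, so two embeddings fire.
print(cif([0.5, 0.75, 0.25, 0.5], [[1.0], [2.0], [3.0], [4.0]]))  # [[1.5], [3.25]]
```

In this toy run the second frame's weight 0.75 is split: 0.5 of it completes the first embedding and the remaining 0.25 starts the second. The closer the first-pass prediction is to the labels, the smaller d becomes, so fewer ground-truth embeddings are mixed in.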