<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangUITextSC-Regular'">文章标题:</span>Flamingo: a Visual Language Model for Few-Shot Learning</li><li><span style="font-family: '.PingFangSC-Regular'">文章地址:</span><a href="https://arxiv.org/abs/2204.14198">https://arxiv.org/abs/2204.14198</a> </li><li>NIPS 2022</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1746681059/05F365E1-06E2-41C3-BB31-B6524EBBBE9E.png" style="background-color:initial;max-width:min(100%,1600px);max-height:min(848px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1746681059/05F365E1-06E2-41C3-BB31-B6524EBBBE9E.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="1600" height="848"> 文章对于视觉模态先利用编码器提取特征,然后经过感知重采样对齐维度,随后经过门控交叉注意力层与LLM进行特征融合,完成多模态的交织。另外,为了提高模型few-shot的能力,仅通过训练图像文本对是不够的,因此文章构造了一个图像文本交织数据集M3W。