<ul class="dashed" data-apple-notes-indent-amount="0"><li><span style="font-family: '.PingFangSC-Regular'">Title: </span>IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models</li><li><span style="font-family: '.PingFangSC-Regular'">Link: </span><a href="https://arxiv.org/abs/2308.06721">https://arxiv.org/abs/2308.06721</a> </li><li>arXiv</li></ul> <img src="https://res.cloudinary.com/montaigne-io/image/upload/v1723775822/A61F3E17-9FF1-45F2-A3ED-779598B84D20.png" style="background-color:initial;max-width:min(100%,2286px);max-height:min(1212px);;background-image:url(https://res.cloudinary.com/montaigne-io/image/upload/v1723775822/A61F3E17-9FF1-45F2-A3ED-779598B84D20.png);height:auto;width:100%;object-fit:cover;background-size:cover;display:block;" width="2286" height="1212"> The motivation of the paper is very simple: bring an image in as a prompt for a text-to-image model. The implementation is just as simple. The image is encoded with CLIP to obtain a feature, which is then projected into N feature tokens. A new cross-attention layer then fuses these image features into the U-Net: the image features and the text features each go through cross-attention with the latents, and the two outputs are combined by a weighted sum.
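The fusion step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: it uses a single attention head, random weights, tiny made-up dimensions, and skips the normalization a real projection layer would likely include. The names `project_image_embedding`, `decoupled_cross_attention`, and `scale` are hypothetical, and the shared query projection for the text and image branches is assumed here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption: toy values)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def project_image_embedding(embedding, w_proj, n_tokens):
    """Map one CLIP image embedding to N tokens: linear projection, then reshape."""
    flat = matmul([embedding], w_proj)[0]  # length n_tokens * d
    d = len(flat) // n_tokens
    return [flat[i * d:(i + 1) * d] for i in range(n_tokens)]


def decoupled_cross_attention(latents, text_tokens, image_tokens, w, scale=1.0):
    """Text and image branches share the query; outputs are summed with a weight."""
    q = matmul(latents, w["q"])
    text_out = attention(q, matmul(text_tokens, w["k_text"]),
                         matmul(text_tokens, w["v_text"]))
    image_out = attention(q, matmul(image_tokens, w["k_image"]),
                          matmul(image_tokens, w["v_image"]))
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]


rng = random.Random(0)
d, clip_dim, n_tokens = 8, 16, 4  # toy sizes, chosen only for illustration
weights = {name: rand_matrix(d, d, rng)
           for name in ("q", "k_text", "v_text", "k_image", "v_image")}
w_proj = rand_matrix(clip_dim, n_tokens * d, rng)

latents = rand_matrix(6, d, rng)      # 6 latent positions
text_tokens = rand_matrix(5, d, rng)  # 5 text tokens
clip_embedding = [rng.gauss(0.0, 1.0) for _ in range(clip_dim)]

image_tokens = project_image_embedding(clip_embedding, w_proj, n_tokens)
out = decoupled_cross_attention(latents, text_tokens, image_tokens, weights, scale=0.5)
print(len(out), len(out[0]))  # 6 8
```

Setting `scale=0.0` in this sketch removes the image branch and leaves plain text cross-attention, which suggests how such a weight could trade off the image prompt against the text prompt at inference time.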