Zhu S, Li S, Lei Y, et al. PEIT: Bridging the Modality Gap with Pre-trained Models for End-to-End Image Translation. ACL 2023. (Top-tier international conference)
Abstract: This paper introduces PEIT (Pre-trained models for End-to-end Image Translation), an end-to-end image translation framework designed to bridge the modality gap between visual text inputs in images and the textual inputs/outputs of machine translation (MT). PEIT consists of four core components: a visual encoder, a shared encoder-decoder backbone, a vision-text representation aligner attached to the shared encoder, and a cross-modal regularizer stacked on the shared decoder; the aligner and regularizer both serve to reduce the gap between modalities. PEIT is trained with a two-stage pre-training strategy combined with an auxiliary MT task: the model is first pre-trained on MT training data, and then further pre-trained, with the aligner and regularizer active, on a synthesized dataset of images rendered from the text of the MT training data. Furthermore, to facilitate the evaluation of PEIT and promote research on image translation, the authors create ECOIT, a large-scale image translation corpus containing 480K image-translation pairs obtained via crowdsourcing and manual post-editing from real-world images in the e-commerce domain. Experiments on the curated ECOIT benchmark show that PEIT significantly outperforms both cascaded image translation systems (OCR+MT) and previous strong end-to-end image translation models, while using fewer parameters and decoding faster.
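To make the four-component design concrete, below is a minimal PyTorch sketch of how a shared backbone could serve both a text path (MT) and an image path, with an aligner on the encoder and a regularizer on the decoder. This is an illustrative reconstruction from the abstract only: the module names, the MSE aligner, the KL regularizer, and all dimensions are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the PEIT design described in the abstract.
# All names, losses, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEITSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # Visual encoder: maps a rendered-text image to a feature sequence
        # (a CNN/ViT backbone in practice; a single conv stem here for brevity).
        self.visual_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Shared encoder-decoder backbone, reused by the auxiliary MT task
        # (text -> text) and the image translation task (image -> text).
        # Positional encodings are omitted for brevity.
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 6)
        self.shared_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode_image(self, images):
        feats = self.visual_encoder(images)       # (B, D, H', W')
        return feats.flatten(2).transpose(1, 2)   # (B, H'*W', D)

    def forward(self, images, src_tokens, tgt_tokens):
        # Two input paths through the *shared* encoder.
        img_enc = self.shared_encoder(self.encode_image(images))
        txt_enc = self.shared_encoder(self.text_embed(src_tokens))

        # Vision-text representation aligner: pull pooled image and text
        # encoder states together (a simple MSE stand-in for the paper's aligner).
        align_loss = F.mse_loss(img_enc.mean(1), txt_enc.mean(1))

        tgt = self.text_embed(tgt_tokens)
        img_logits = self.lm_head(self.shared_decoder(tgt, img_enc))
        txt_logits = self.lm_head(self.shared_decoder(tgt, txt_enc))

        # Cross-modal regularizer: keep the image-conditioned output
        # distribution close to the text-conditioned one (a KL stand-in).
        reg_loss = F.kl_div(
            F.log_softmax(img_logits, dim=-1),
            F.softmax(txt_logits.detach(), dim=-1),
            reduction="batchmean")
        return img_logits, align_loss, reg_loss
```

Under this reading, the two-stage strategy maps onto the sketch as follows: stage one trains only the text path (txt_enc, txt_logits) and the shared backbone on MT data; stage two adds the image path and optimizes the translation loss together with align_loss and reg_loss on the synthesized rendered-image dataset.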