Jia Chen, Qin Jin, et al.
SIGIR 2014
Improving image captioning for low-resource languages by leveraging English caption datasets has attracted growing research interest in recent years. Existing work falls mainly into two categories: translation-based and alignment-based approaches. In this paper, we propose to combine the merits of both in one unified architecture. Specifically, we use a pre-trained English caption model to generate high-quality English captions, and then feed both the image and the generated English captions to a second model that generates captions in the low-resource language. We improve captioning performance by imposing a cycle consistency constraint over image regions, English words, and low-resource language words. Moreover, the architecture is flexible enough to benefit from large monolingual English caption datasets. Experimental results demonstrate that our approach outperforms state-of-the-art methods on common evaluation metrics. Attention visualizations also show that the proposed approach improves the fine-grained alignment between words and image regions.
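The abstract does not spell out how the cycle consistency constraint is formulated. As a rough illustration only, one common way to express such a constraint is to require that attention from low-resource language words to image regions agree with the attention obtained by routing through the English words. The sketch below assumes three hypothetical row-stochastic attention matrices (`tgt_to_en`, `en_to_region`, `tgt_to_region`) and a mean-squared penalty; these names and the choice of penalty are assumptions for illustration, not the paper's actual loss.

```python
import math


def softmax_rows(scores):
    """Turn each row of raw scores into a probability distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cycle_consistency_penalty(tgt_to_en, en_to_region, tgt_to_region):
    """Mean squared gap between direct and composed attention.

    Assumed shapes (all hypothetical):
      tgt_to_en:     target words  x English words
      en_to_region:  English words x image regions
      tgt_to_region: target words  x image regions
    """
    composed = matmul(tgt_to_en, en_to_region)
    count = len(composed) * len(composed[0])
    total = sum(
        (c - d) ** 2
        for composed_row, direct_row in zip(composed, tgt_to_region)
        for c, d in zip(composed_row, direct_row)
    )
    return total / count


# Toy example: 2 target words, 2 English words, 3 image regions.
tgt_to_en = softmax_rows([[2.0, 0.0], [0.0, 2.0]])
en_to_region = softmax_rows([[3.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
direct = softmax_rows([[3.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
print(cycle_consistency_penalty(tgt_to_en, en_to_region, direct))
```

In a trained model, a term like this would be added to the captioning loss so that the three alignments are pushed toward mutual agreement; the penalty is zero exactly when the direct attention equals the composed one.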