Caption Generation
❏ Image Captioning that Reflects the Intent of the Explainer based on Tracing with a Pen
In recent years, research on image caption generation has moved beyond generating captions solely from preprocessed image features: by supplying additional viewpoint information, called control signals, alongside the image features, captions can be generated that reflect the user's interest in the image. In this paper, we propose a new method to generate captions based on the user's interests. When people explain an image, they often trace the object they want to describe with a finger. In this study, we treat such traces over the image as a control signal, and we propose an interactive image caption generation method that better matches the explainer's intent by reflecting the meaning of the traces.
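One simple way to turn a pen trace into a control signal is to rasterize the traced coordinates into a binary attention mask over the image. The sketch below is a minimal illustration of that idea, not the method used in the paper; the function name, the point format, and the fixed neighborhood radius are all assumptions for this toy example.

```python
import numpy as np

def trace_to_control_mask(trace_points, image_size, radius=2):
    """Rasterize pen-trace coordinates into a binary control mask.

    trace_points: list of (x, y) pixel coordinates sampled along the stroke.
    image_size:   (height, width) of the image.
    Each traced point marks a small square neighborhood in the mask; the
    mask can then accompany the image features as a control signal telling
    the captioner which region the explainer is pointing at.
    """
    h, w = image_size
    mask = np.zeros((h, w), dtype=np.float32)
    for x, y in trace_points:
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        mask[y0:y1, x0:x1] = 1.0
    return mask

# A short diagonal stroke on a 10x10 image.
mask = trace_to_control_mask([(2, 2), (3, 3), (4, 4)], (10, 10), radius=1)
print(int(mask.sum()))  # number of highlighted pixels
```

A real system would additionally need to interpret the *shape* of the trace (circling vs. underlining vs. pointing), which is where the "meaning of the traces" in the abstract comes in.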
Sayako Watanabe
Sayako Watanabe and Ichiro Kobayashi, "Image Caption Generation Reflecting the Explainer's Intent from Pen-tip Traces," 36th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI), Kyoto International Conference Center, Kyoto, June 2022. (in Japanese) JSAI Annual Conference Award

❏ VQA-based image captioning system
Research summary: Humans can explain image content however they like, but most image processing models are intention-agnostic and cannot proactively generate different explanations according to the intent of different users. To address this, we develop a caption generation method that produces different captions and answers when different questions are asked about the same image. Research method: The question text is fed to DistilBERT, and the hidden states of all layers are collected. The image is preprocessed and fed to a Vision Transformer, whose final hidden layer serves as the visual representation. The two streams are then fused by three cross-attention layers, yielding two outputs: a linguistic output and a visual output. The visual output is used to predict a bounding box on the image; the hidden states of the linguistic output are passed to a decoder to generate the caption; and the first token of the linguistic output is classified with a softmax to produce the answer.
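The fusion step above can be sketched with toy tensors. This is a minimal single-head numpy illustration of stacked bidirectional cross-attention plus a softmax answer head, not the actual model: the DistilBERT and ViT features are replaced by random arrays of assumed shapes, and the shared width, layer count, and 4-class answer head are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """Single-head scaled dot-product cross-attention:
    each query row attends over all keys_values rows."""
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

# Toy stand-ins: 5 text-token states (DistilBERT-like) and 9 image-patch
# states (ViT-like), both assumed projected to a shared width d = 16.
d = 16
text = rng.normal(size=(5, d))
image = rng.normal(size=(9, d))

# Three stacked cross-attention layers, as in the summary: the linguistic
# stream queries the visual stream and vice versa, with residual connections.
for _ in range(3):
    text, image = (text + cross_attention(text, image, d),
                   image + cross_attention(image, text, d))

# First token of the linguistic output -> softmax over a toy answer
# vocabulary (4 hypothetical classes) to produce the VQA answer.
W_cls = rng.normal(size=(d, 4))
answer_probs = softmax(text[0] @ W_cls)
print(answer_probs.shape)  # (4,)
```

In the full system the remaining linguistic hidden states would feed an autoregressive decoder for the caption, and the visual output a bounding-box regression head; both are omitted here.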

Jingyi Du