Xuemeng Song, Liqiang Nie, Hengtao Shen, Qi Tian, Hua Huang
2023, 13(2):139-142. DOI: 10.21655/ijsi.1673-7288.00311
Abstract: Preface
Tianyi Liu, Zuxuan Wu, Jingjing Chen, Yugang Jiang
2023, 13(2):143-155. DOI: 10.21655/ijsi.1673-7288.00315
Abstract: Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like loss functions (masked language modeling and image-text matching) during pre-training. Although they perform well on downstream understanding tasks such as visual question answering, image-text retrieval, and visual entailment, these methods cannot handle generation tasks. To tackle this problem, this study proposes Unified multimodal pre-training for Vision-Language understanding and generation (UniVL), which can handle both understanding and generation tasks. It extends existing pre-training paradigms by using random masks and causal masks simultaneously; the causal masks are triangular masks that hide future tokens, which endows the pre-trained model with autoregressive generation ability. Moreover, several vision-language understanding tasks are reformulated as text generation tasks, and a prompt-based method is employed to fine-tune the model on different downstream tasks. The experiments show that there is a trade-off between understanding and generation tasks when the same model is used, and a feasible way to improve both is to use more data. The proposed UniVL framework attains performance comparable to that of recent vision-language pre-training methods on both understanding and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
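The two masking schemes described in this abstract can be made concrete with a short sketch. The PyTorch snippet below (function names, mask probability, and mask id are illustrative assumptions, not taken from the paper) contrasts the triangular causal mask that enables autoregressive generation with BERT-style random token masking used for masked language modeling.

```python
# Minimal sketch of the two masking schemes, assuming a standard Transformer setup.
import torch

def make_causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular attention mask: position i may attend only to positions <= i,
    # which is what gives the pre-trained model autoregressive generation ability.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def make_random_token_mask(token_ids: torch.Tensor, mask_prob: float = 0.15,
                           mask_id: int = 103):
    # BERT-style random masking for masked language modeling: a fraction of tokens
    # is replaced by the [MASK] id and predicted from bidirectional context.
    masked = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_prob
    masked[is_masked] = mask_id
    return masked, is_masked

# Usage: the same Transformer can be trained under both objectives by switching masks.
tokens = torch.randint(1000, 2000, (1, 8))
causal_mask = make_causal_mask(tokens.size(1))             # generation objective
corrupted, mlm_positions = make_random_token_mask(tokens)  # understanding (MLM) objective
```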
Chengji Wang, Jiawei Su, Zhiming Luo, Donglin Cao, Yaojin Lin, Shaozi Li
2023, 13(2):157-176. DOI: 10.21655/ijsi.1673-7288.00312
Abstract: Text-based person search aims to find, in a person database, the images that match a given text description, and it has attracted considerable attention from academia and industry in recent years. The task faces two challenges at once: fine-grained retrieval and the heterogeneous gap between images and text. Some methods use supervised attribute learning to extract attribute-related features and thereby correlate images and text at a fine-grained level. However, attribute labels are difficult to obtain, which leads to poor performance of such approaches in practice. How to extract attribute-related features without attribute annotations and establish fine-grained cross-modal semantic associations has therefore become a critical problem. To solve it, this study proposes a text-based person search method based on virtual attribute learning that incorporates pre-training techniques and establishes fine-grained cross-modal semantic associations through unsupervised attribute learning. First, a semantic-guided attribute decoupling method is proposed based on the invariance of person attributes and cross-modal semantic consistency; it uses person identity labels as the supervision signal to guide the model in decoupling attribute-related features. Second, a feature learning module based on semantic inference is proposed, which constructs a semantic graph from the associations between attributes; by exchanging information among attributes through the graph model, the module enhances the cross-modal discrimination ability of the features. Experiments are conducted on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the proposed method is compared with existing methods. The experimental results demonstrate the effectiveness of the proposed method.
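As a rough illustration of the two ideas in this abstract, the PyTorch sketch below decouples a global image or text feature into several "virtual attribute" features and lets them exchange information over a learnable semantic graph. The number of attributes, the module names, and the message-passing form are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: (1) decouple a global feature into K virtual-attribute features,
# (2) propagate information among attributes over a learned semantic graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VirtualAttributeDecoupler(nn.Module):
    def __init__(self, feat_dim: int = 512, num_attrs: int = 8):
        super().__init__()
        # One projection head per virtual attribute; no attribute labels are needed,
        # since identity labels supervise the shared embedding space downstream.
        self.heads = nn.ModuleList(nn.Linear(feat_dim, feat_dim) for _ in range(num_attrs))
        # Learnable adjacency over attributes, acting as the semantic graph.
        self.adj = nn.Parameter(torch.eye(num_attrs))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, feat_dim) global image or text feature
        attrs = torch.stack([h(feat) for h in self.heads], dim=1)  # (batch, K, feat_dim)
        # Graph message passing: each attribute aggregates the others via softmax(adj).
        weights = F.softmax(self.adj, dim=-1)                      # (K, K)
        refined = torch.einsum('kj,bjd->bkd', weights, attrs)
        return F.normalize(refined, dim=-1)

# Image and text features pass through the same kind of module, and matching is done
# on the refined attribute features, e.g., with an identity-supervised ranking loss.
```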
Zelong Sun, Guoxing Yang, Jingyuan Wen, Nanyi Fei, Zhiwu Lu, Jirong Wen
2023, 13(2):197-219. DOI: 10.21655/ijsi.1673-7288.00314
Abstract: With the development of generative adversarial networks, synthesizing images from text descriptions has become an active research area. However, the text descriptions used for image generation are often in English, and the generated objects are mostly faces, flowers, birds, and so on; few studies have addressed the generation of Chinese paintings from Chinese descriptions. Text-to-image generation usually requires a large number of labeled image-text pairs, which are expensive and laborious to collect. The advance of vision-language pre-training makes it possible to guide the image generation process in an optimization-based manner, which significantly reduces the demand for annotated data and computational resources. In this paper, a multi-domain VQGAN model is proposed to generate Chinese paintings in multiple domains. Further, the vision-language pre-training model WenLan is used to compute a distance loss between the generated images and the text descriptions, and semantic consistency between image and text is achieved by optimizing the latent variables that serve as the input of the multi-domain VQGAN. An ablation study compares different variants of the multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study is conducted to further show the effectiveness of the proposed model. The extensive results demonstrate that the proposed multi-domain VQGAN model outperforms all competitors in terms of image quality and text-image semantic consistency.
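The optimization-based generation loop described here can be sketched as follows. `vqgan_decoder` and `itm_model` (standing in for WenLan's image-text encoders) are assumed interfaces for illustration, not the actual APIs released with the paper, and the latent shape, step count, and learning rate are placeholders.

```python
# Minimal sketch of latent optimization guided by an image-text matching model.
import torch

def generate(vqgan_decoder, itm_model, text_emb: torch.Tensor,
             latent_shape=(1, 256, 16, 16), steps: int = 200, lr: float = 0.1):
    # Optimize the VQGAN latent so the decoded image moves closer to the text
    # in the joint embedding space (the "distance loss" of the abstract).
    z = torch.randn(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = vqgan_decoder(z)                 # decode latent to an image
        img_emb = itm_model.encode_image(image)  # embed the generated image
        # Cosine-distance loss between image and text embeddings.
        loss = 1.0 - torch.cosine_similarity(img_emb, text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqgan_decoder(z).detach()
```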
Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao
2023, 13(2):221-241. DOI: 10.21655/ijsi.1673-7288.00316
Abstract: To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, prompting a shift toward a fully end-to-end paradigm for multimodal downstream tasks such as image captioning and enabling better performance and faster inference. However, the grid features extracted by such pre-trained models lack regional visual information, which leads to inaccurate descriptions of object content. Thus, the applicability of pre-trained models to image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, a visual region aggregation module aggregates grid features with similar semantics into a compact visual region representation. Next, a dual-level collaboration module uses the cross-attention mechanism to learn more representative semantic information from the two kinds of visual features, which in turn yields more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed VRADC significantly improves the quality of image captioning and achieves state-of-the-art performance.
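To make the two components concrete, the sketch below aggregates grid features into a small set of region features by soft assignment to learnable region queries, and a decoder layer then cross-attends to both feature levels. Module names, dimensions, and the specific aggregation form are assumptions for illustration rather than the VRADC implementation.

```python
# Illustrative sketch: region aggregation over grid features plus dual-level cross-attention.
import torch
import torch.nn as nn

class RegionAggregator(nn.Module):
    # Aggregates grid features into K region features by soft assignment, so that
    # grids with similar semantics contribute to the same region slot.
    def __init__(self, dim: int = 512, num_regions: int = 16):
        super().__init__()
        self.region_queries = nn.Parameter(torch.randn(num_regions, dim))

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (batch, num_grids, dim)
        sim = torch.einsum('kd,bnd->bkn', self.region_queries, grid) / grid.size(-1) ** 0.5
        assign = sim.softmax(dim=-1)                 # soft assignment of grids to regions
        return torch.einsum('bkn,bnd->bkd', assign, grid)

class DualLevelDecoderLayer(nn.Module):
    # Caption decoder layer that cross-attends to grid and region features in turn.
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_grid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_region = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, grid, regions):
        words = words + self.attn_grid(words, grid, grid)[0]
        words = words + self.attn_region(words, regions, regions)[0]
        return words
```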