• Volume 13, Issue 2, 2023 Table of Contents
    • Preface to the Special Issue on Multimodal Learning Integrated with Pre-training Techniques

      2023, 13(2):139-142. DOI: 10.21655/ijsi.1673-7288.00311

      Abstract: Preface

    • Multimodal Pre-training Method for Vision-language Understanding and Generation

      2023, 13(2):143-155. DOI: 10.21655/ijsi.1673-7288.00315

      Abstract: Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like loss functions (masked language modeling and image-text matching) during pre-training. Although they perform well on downstream understanding tasks, such as visual question answering, image-text retrieval, and visual entailment, these methods cannot handle generation tasks. To tackle this problem, this study proposes Unified multimodal pre-training for Vision-Language understanding and generation (UniVL), which handles both understanding tasks and generation tasks. It expands existing pre-training paradigms by using random masks and causal masks simultaneously, where causal masks are triangular masks that mask future tokens, so that the pre-trained model gains autoregressive generation ability. Moreover, several vision-language understanding tasks are turned into text generation tasks according to specifications, and a prompt-based method is employed for fine-tuning on different downstream tasks. The experiments show that there is a trade-off between understanding tasks and generation tasks when the same model is used, and a feasible way to improve both is to use more data. The proposed UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding tasks and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
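
      The two mask types named in this abstract can be illustrated with a minimal sketch. This is not the UniVL implementation: the function names, the `mask_prob` parameter, and the use of NumPy are assumptions made for illustration only. It shows a triangular (causal) mask, in which each position sees only itself and earlier tokens, next to a random mask of the kind used for masked language modeling.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular boolean mask.

    Entry [i, j] is True when position i may attend to position j,
    i.e. when j <= i. Future tokens (j > i) are masked out.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def random_mask(seq_len: int, mask_prob: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean vector marking which tokens are hidden for masked language modeling.

    mask_prob is a hypothetical parameter; the abstract does not state a value.
    """
    return rng.random(seq_len) < mask_prob


if __name__ == "__main__":
    print(causal_mask(4).astype(int))
    print(random_mask(8, mask_prob=0.25, rng=np.random.default_rng(0)))
```

      With a causal mask, token i is predicted from tokens 0..i only, which is what allows left-to-right (autoregressive) decoding at inference time.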

    • Text-based Person Search via Virtual Attribute Learning

      2023, 13(2):157-176. DOI: 10.21655/ijsi.1673-7288.00312

      Abstract: Text-based person search aims to find person images that match a specific text description from person databases, and it has received a lot of attention from academia and industry in recent years. However, this task faces two challenges simultaneously: fine-grained retrieval and the heterogeneous gap between images and text. Some methods use supervised attribute learning to extract attribute-related features, so as to correlate images and text at a fine-grained level. However, attribute labels are difficult to obtain, which leads to poor performance of such approaches in practice. How to extract attribute-related features without attribute labels and establish fine-grained cross-modal semantic associations has become a critical problem. To solve this problem, this study proposes a text-based person search method based on virtual attribute learning that incorporates pre-training techniques, so as to establish fine-grained cross-modal semantic associations through unsupervised attribute learning. First, the study proposes a semantic-guided attribute decoupling method based on the invariance of person attributes and cross-modal semantic consistency; the method uses the identity labels of persons as a supervision signal to guide the model to decouple attribute-related features. Second, a semantic inference-based feature learning module is proposed, which constructs a semantic graph from the associations between attributes and enhances the cross-modal recognition capability of features by exchanging information between attributes through graph models. Experiments are conducted on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the proposed method is compared with existing methods. The experimental results demonstrate the effectiveness of the proposed method.

    • Text-driven Face Image Generation and Manipulation via Multi-level Residual Mapper

      2023, 13(2):177-196. DOI: 10.21655/ijsi.1673-7288.00313

      Abstract: Although Generative Adversarial Networks (GANs) have achieved great success in face image generation and manipulation, discovering meaningful directions in the latent space of GANs to manipulate semantic attributes remains a difficult but important challenge in computer vision. Meeting this challenge typically requires large amounts of labeled data and several hours of network fine-tuning. However, obtaining an annotated collection of images for each desired manipulation is usually very expensive and time-consuming. Recent works aim to overcome this limitation by leveraging pre-trained models. While they are promising, the accuracy of the manipulation and the authenticity of the results cannot meet the needs of real face editing scenarios. To address these problems, this paper encodes the image and text description into a shared embedding space and proposes a unified image generation and manipulation framework by leveraging the powerful joint representation capability of Contrastive Language-Image Pre-training (CLIP). With carefully designed network structures and loss functions, the proposed framework learns a latent residual mapper network that maps the input conditions into corresponding latent code residuals. This scheme enables the proposed method to perform high-quality face image generation and manipulation by leveraging the generative power of the pre-trained StyleGAN2 model. Extensive experiments demonstrate the superiority of the proposed approach in terms of manipulation accuracy, visual realism, and irrelevant attribute preservation.

    • Text-to-Chinese-painting Method Based on Multi-domain VQGAN

      2023, 13(2):197-219. DOI: 10.21655/ijsi.1673-7288.00314

      Abstract: With the development of generative adversarial networks, synthesizing images from text descriptions has become an active research area. However, text descriptions used for image generation are often in English, and the generated objects are mostly faces, flowers, birds, etc. Few studies have been conducted on the generation of Chinese paintings with Chinese descriptions. The text-to-image task often requires a large number of labeled image-text pairs, which are expensive and tedious to collect. Advances in vision-language pre-training enable an optimization-guided image generation process, which significantly reduces the demand for annotated datasets and computational resources. In this paper, a multi-domain VQGAN model is proposed to generate Chinese paintings in multiple domains. Further, a vision-language pre-training model, WenLan, is used to calculate the distance loss between the generated images and the text descriptions. Semantic consistency between images and text is achieved by optimizing the latent variables that serve as the input of the multi-domain VQGAN. An ablation study is conducted to compare different variants of our multi-domain VQGAN in terms of the FID and R-precision metrics. We also conduct a user study to further show the effectiveness of our proposed model. The extensive results demonstrate that our proposed multi-domain VQGAN model outperforms all the competitors in terms of image quality and text-image semantic consistency.
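
      The idea of optimizing only the latent input against a text-image distance loss can be sketched with a toy example. Everything here is a stand-in chosen for illustration, not the paper's models: the matrix `W` plays the role of a frozen generator followed by an image encoder, `text_emb` plays the role of a text embedding, and squared Euclidean distance is an assumed choice of loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not VQGAN or WenLan):
# W maps a latent vector to an "image embedding" and stays frozen;
# text_emb is the fixed embedding of the text description.
W = rng.normal(size=(8, 4))
text_emb = rng.normal(size=8)


def distance_loss(z: np.ndarray) -> float:
    """Squared Euclidean distance between image and text embeddings."""
    diff = W @ z - text_emb
    return float(diff @ diff)


def optimize_latent(z0: np.ndarray, steps: int = 200) -> np.ndarray:
    """Gradient descent on the latent vector only; W is never updated."""
    # Step size 1/L, where L = 2 * sigma_max(W)^2 bounds the gradient's
    # Lipschitz constant, so the loss cannot increase between steps.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    z = z0.copy()
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ z - text_emb)
        z = z - lr * grad
    return z


z0 = rng.normal(size=4)
z_opt = optimize_latent(z0)

if __name__ == "__main__":
    print("loss before:", distance_loss(z0))
    print("loss after: ", distance_loss(z_opt))
```

      Because only the latent vector is updated, no paired training data is needed at generation time; the pre-trained components supply both the image prior and the text-image similarity signal.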

    • End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration

      2023, 13(2):221-241. DOI: 10.21655/ijsi.1673-7288.00316

      Abstract: To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of object content by the model. Thus, the applicability of pre-trained models to image captioning remains largely unexplored. To address this, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, significantly improves the quality of image captioning and achieves state-of-the-art performance.