• Volume 14, Issue 2, 2024 Table of Contents
    • Preface to the Special Issue on Multimodal Collaborative Perception and Fusion Technology

      2024, 14(2):119-122. DOI: 10.21655/ijsi.1673-7288.00324

      Abstract: Preface

    • Few-shot Incremental Learning with Textual Knowledge Embedding by Visual-language Model

      2024, 14(2):123-144. DOI: 10.21655/ijsi.1673-7288.00325

      Abstract: Real-world scenarios often involve data scarcity and dynamically changing data. Few-shot incremental learning aims to infer knowledge from a small amount of data while reducing the model's catastrophic forgetting of old knowledge. Existing few-shot incremental learning algorithms (CEC, FACT, etc.) mainly use visual features to adjust the feature encoder or classifier so that the model transfers to new data without forgetting old data. However, the visual features of a small amount of data often cannot model the complete feature distribution of a category, resulting in weak generalization ability. The text features of image category descriptions have better generalization and anti-forgetting abilities than visual features. Therefore, on the basis of the Visual-Language Model (VLM), we propose the textual knowledge embedding mode to embed text features with anti-forgetting ability in visual features, thus achieving effective learning of new and old category data in few-shot incremental learning. Specifically, in the basic learning stage, we use the VLM to extract the pre-trained visual features and category text descriptions. We then use the text encoder to project the pre-trained visual features to text space. Next, we use the visual encoder to fuse the learned text features and pre-trained visual features to obtain enhanced visual features with high discrimination ability. In the incremental learning stage, we use the category space encoding of old data and new data features to fine-tune the visual encoder and text encoder, learning new data knowledge while reviewing old knowledge. We verified the effectiveness of the algorithm on four datasets (CIFAR-100, CUB-200, Stanford Cars, and miniImagenet), showing that textual knowledge embedding based on a large-scale VLM can further improve the robustness of few-shot incremental learning on the basis of visual features.

    • Auxiliary Diagnosis for Parkinson's Disease Using Multimodal Feature Analysis

      2024, 14(2):145-163. DOI: 10.21655/ijsi.1673-7288.00328

      Abstract: Parkinson's disease is a widespread neurodegenerative disease that slowly impairs motor and certain cognitive skills. It is insidious and incurable, and it places a significant burden on sufferers and their families. However, clinical diagnosis of Parkinson's disease typically relies on subjective rating scales, which can be influenced by the examinee's recall bias and assessor subjectivity. Numerous studies have used diverse methods to investigate the physiological aspects of Parkinson's disease and have provided objective, quantifiable tools for auxiliary diagnosis. However, given the diversity of neurodegenerative illnesses and the similarities in their effects, unimodal methodologies built upon the representations of Parkinson's disease still struggle to identify the disease uniquely. To this end, we develop a multimodal diagnostic tool comprising paradigms that evoke potential Parkinson's aberrant behaviors. First, parametric tests of the identifying features are performed based on the results of the normal distribution test, and statistically significant feature sets are constructed ($p < 0.05$). Second, multimodal data are collected from 38 cases in a clinical setting using the MDS-UPDRS scale. Finally, the significance of different feature combinations for the assessment of Parkinson's disease is analyzed based on gait and eye movement modalities; the high-immersion triggered task paradigm and the multimodal Parkinson's disease diagnostic tool are validated in virtual reality scenarios. Notably, the combination of gait and eye movement modalities needs only 2 to 4 tasks to obtain an average AUC of 0.97 and an accuracy of 0.92.

    • AU-aware Algorithm for 3D Facial Reconstruction

      2024, 14(2):165-183. DOI: 10.21655/ijsi.1673-7288.00329

      Abstract: As a critical task in computer vision and animation, 3D face reconstruction can provide 3D model structures and rich semantic information for multi-modal facial applications. However, monocular 2D facial images lack depth information, and the predicted parameters of the facial model are unreliable, which leads to poor reconstruction results. We propose to employ facial Action Units (AUs) and facial key points, which are highly correlated with the model parameters, as a bridge to guide the regression of model-related parameters and thus address the ill-posed problem of monocular face reconstruction. Based on existing face reconstruction datasets, we provide a complete semi-automatic labeling scheme for facial AUs and construct a 300W-LP-AU dataset. Furthermore, an AU-aware 3D face reconstruction algorithm is put forward to realize end-to-end multi-task learning and reduce the overall training difficulty. Experimental results show that the algorithm improves face reconstruction performance and that the rebuilt facial model has high fidelity.

    • Text-to-image Generation Based on Conditional Semantic Augmentation

      2024, 14(2):185-204. DOI: 10.21655/ijsi.1673-7288.00326

      Abstract: Text-to-image generation achieves visually excellent results but suffers from insufficient detail representation. We propose Conditional Semantic Augmentation (CSA) Generative Adversarial Networks (CSA-GAN). The model first encodes the text and processes it using CSA. It then extracts the intermediate features of the generator for up-sampling and generates the image mask through a two-layer Convolutional Neural Network (CNN). Finally, the text encoding is sent to two perceptrons for processing and fused with the mask, so as to fully integrate the image spatial features and text semantic features and improve the detail representation. To verify the quality of the images generated by this model, quantitative and qualitative analyses are conducted on different datasets. This paper employs the Inception Score (IS) and Fréchet Inception Distance (FID) metrics to quantitatively evaluate the clarity, diversity, and natural realism of the generated images. The qualitative analyses include the visualization of the generated images and the analysis of specific modules in the ablation experiment. The results show that the proposed model is superior to state-of-the-art works of recent years, which verifies that the proposed method performs better and can improve the expression of the main feature details in the image generation process.

    • 3D Shape Recognition Based on Multimodal Relation Modeling

      2024, 14(2):205-220. DOI: 10.21655/ijsi.1673-7288.00327

      Abstract: To make full use of the local spatial relation between point cloud and multi-view data and further improve the accuracy of three-dimensional (3D) shape recognition, a 3D shape recognition network based on multimodal relation is proposed. First, a Multimodal Relation Module (MRM) is designed, which extracts the relation information between the local features of any point cloud and those of any view to obtain the corresponding relation features. Then, a cascade pooling consisting of maximum pooling and generalized mean pooling is applied to process the relation feature tensor and obtain the global relation feature. There are two types of multimodal relation modules, which output the point-view relation feature and the view-point relation feature, respectively. The proposed gating module adopts a self-attention mechanism to find the relation information within the features so that the aggregated global features can be weighted to suppress redundant information. Extensive experiments show that the multimodal relation module gives the network stronger representational ability, and that the gating module makes the final global feature more discriminative and boosts the performance of the retrieval task. The proposed network achieves classification accuracy of 93.8% and 95.0%, as well as average retrieval precision of 90.5% and 93.4%, on two standard 3D shape recognition datasets (ModelNet40 and ModelNet10), respectively, outperforming existing works.
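
The cascade pooling mentioned in the last abstract (maximum pooling followed by generalized mean pooling) can be illustrated with a minimal sketch. This is not the paper's implementation: the tensor layout (`relation[i][j]` as the feature vector for local point feature `i` and view `j`), the pooling order and axes, the exponent `p`, and all function names are assumptions made only for illustration.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(channel) for channel in zip(*vectors)]


def gem_pool(vectors, p=3.0, eps=1e-6):
    """Generalized mean pooling per channel: (mean(x ** p)) ** (1 / p).

    Values are clamped to eps, so non-negative activations are assumed.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    pooled = []
    for channel in zip(*vectors):
        mean_pow = sum(max(x, eps) ** p for x in channel) / len(channel)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def cascade_pool(relation, p=3.0):
    """Hypothetical cascade: max over views, then GeM over points."""
    # relation[i][j] is a feature vector (assumed layout, see lead-in).
    per_point = [max_pool(row) for row in relation]
    return gem_pool(per_point, p=p)


# Toy relation tensor: 2 points x 2 views x 2 channels.
relation = [
    [[1.0, 2.0], [3.0, 0.5]],
    [[2.0, 2.0], [1.0, 4.0]],
]
global_feature = cascade_pool(relation, p=3.0)
```

With `p` between 1 and infinity, each pooled channel lies between the average and the maximum of the per-point values, which is why generalized mean pooling is often described as a tunable middle ground between average and max pooling.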