
Lifeng Sun, Xinhang Song, Shuqiang Jiang, Lili Wang, Hengtao Shen
2024, 14(2):119-122. DOI: 10.21655/ijsi.1673-7288.00324
Abstract:Preface
Hantao Yao, Lu Yu, Changsheng Xu
2024, 14(2):123-144. DOI: 10.21655/ijsi.1673-7288.00325
Abstract:Real scenarios often face the problems of data scarcity and dynamic data changes. The purpose of few-shot incremental learning is to use a small amount of data to infer data knowledge and reduce the model's catastrophic forgetting of old knowledge. Existing few-shot incremental learning algorithms (CEC, FACT, etc.) mainly use visual features to adjust the feature encoder or classifier, so that the model transfers to new data without forgetting old data. However, the visual features of a small amount of data often cannot model the complete feature distribution of a category, resulting in weak generalization ability. The text features of image category descriptions have better generalization and anti-forgetting abilities than visual features. Therefore, on the basis of the Visual-Language Model (VLM), we propose the textual knowledge embedding mode to embed text features with anti-forgetting ability in visual features, thus achieving effective learning of new and old category data in few-shot incremental learning. Specifically, in the basic learning stage, we use the VLM to extract the pre-trained visual features and category text descriptions. Furthermore, we use the text encoder to project the pre-trained visual features to text space. Next, we use the visual encoder to fuse the learned text features and pre-trained visual features to obtain enhanced visual features with high discrimination ability. In the incremental learning stage, we use the category space encoding of old data and new data features to fine-tune the visual encoder and text encoder, thereby learning new data knowledge while reviewing old knowledge. We verified the effectiveness of the algorithm on four datasets (CIFAR-100, CUB-200, Stanford Cars, and miniImageNet), showing that textual knowledge embedding based on large-scale VLM can further improve the robustness of few-shot incremental learning on the basis of visual features.
Wei Qiang, Yu Du, Xinjin Li, Xiangmin Fan, Wen Su, Haibo Chen, Wei Sun, Feng Tian
2024, 14(2):145-163. DOI: 10.21655/ijsi.1673-7288.00328
Abstract:Parkinson's disease is a widespread neurodegenerative disease that slowly impairs motor and certain cognitive skills. It is insidious and incurable, and it places a significant burden on sufferers and their families. However, clinical diagnosis of Parkinson's disease typically relies on subjective rating scales, which can be influenced by the examinee's recall bias and assessor subjectivity. Numerous studies have used diverse methods to investigate the physiological aspects of Parkinson's disease and have provided objective, quantifiable tools for auxiliary diagnosis. However, given the diversity of neurodegenerative illnesses and the similarities in their effects, unimodal methods built upon the representations of Parkinson's disease still struggle to identify the disease uniquely. To this end, we develop a multimodal diagnostic tool comprising the paradigms that evoke potential Parkinson's aberrant behaviors. First, parametric tests of the identifying features are performed based on the results of the normal distribution test, and statistically significant feature sets are constructed ($p < 0.05$). Second, multimodal data are collected from 38 cases in a clinical setting using the MDS-UPDRS scale. Finally, the significance of different feature combinations for the assessment of Parkinson's disease is analyzed based on gait and eye movement modalities; the high immersion triggered task paradigm and the multimodal Parkinson's disease diagnostic tool are validated in virtual reality scenarios. Notably, the combination of gait and eye movement modalities needs only 2 to 4 tasks to obtain an average AUC of 0.97 and an accuracy of 0.92.
Yi Zhang, Jiayi Lü, Xing Lan, Jian Xue
2024, 14(2):165-183. DOI: 10.21655/ijsi.1673-7288.00329
Abstract:As a critical task in computer vision and animation, 3D face reconstruction can provide 3D model structures and rich semantic information for multi-modal facial applications. However, monocular 2D facial images lack depth information and the parameters of the predicted facial model are not reliable, which causes poor reconstruction results. We propose to employ facial Action Units (AUs) and facial key points, which are highly correlated with model parameters, as a bridge to guide the regression of model-related parameters and thus address the ill-posed monocular face reconstruction problem. Based on existing face reconstruction datasets, we provide a complete semi-automatic labeling scheme for facial AUs and construct a 300W-LP-AU dataset. Furthermore, a 3D face reconstruction algorithm based on AU awareness is put forward to realize end-to-end multi-task learning and reduce the overall training difficulty. Experimental results show that the algorithm can improve the face reconstruction performance, with high fidelity of the rebuilt facial model.
Kai Yu, Yi Bin, Ziqiang Zheng, Yang Yang
2024, 14(2):185-204. DOI: 10.21655/ijsi.1673-7288.00326
Abstract:Text-to-image generation achieves visually excellent results but suffers from insufficient detail representation. We propose Conditional Semantic Augmentation (CSA) Generative Adversarial Networks (CSA-GAN). The model first encodes the text and processes it using CSA. The proposed method then extracts the intermediate features of the generator for up-sampling and generates the image mask through a two-layer Convolutional Neural Network (CNN). Finally, the text encoding is sent to two perceptrons for processing and fused with the mask, so as to fully integrate the image spatial features and text semantic features and improve the detail representation. To verify the quality of the images generated by this model, quantitative and qualitative analyses are conducted on different datasets. This paper employs the Inception Score (IS) and Fréchet Inception Distance (FID) metrics to quantitatively evaluate the clarity, diversity, and natural realism of the generated images. The qualitative analyses include the visualization of the generated images and the analysis of specific modules in the ablation experiment. The results show that the proposed model outperforms recent state-of-the-art methods, which indicates that the proposed method can better express the main feature details in the image generation process.
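For readers unfamiliar with the two metrics named in this abstract, their standard textbook definitions are reproduced below. The notation is an assumption of this note and is not taken from the paper: $p_g$ is the distribution of generated images, $p(y \mid x)$ is the class posterior of a pre-trained Inception network, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, respectively.

```latex
% Inception Score: higher is better (confident per-image predictions, diverse marginal)
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\Vert\, p(y) \big) \Big),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\, p(y \mid x)

% Frechet Inception Distance: lower is better (generated statistics close to real ones)
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g - 2\,\big( \Sigma_r \Sigma_g \big)^{1/2} \Big)
```

A higher IS and a lower FID both indicate better generated images, so the two numbers should be read in opposite directions when comparing models.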
Haonan Chen, Yingying Zhu, Junqi Zhao, Qi Tian
2024, 14(2):205-220. DOI: 10.21655/ijsi.1673-7288.00327
Abstract:To make full use of the local spatial relation between point cloud and multi-view data and further improve the accuracy of three-dimensional (3D) shape recognition, a 3D shape recognition network based on multimodal relation is proposed. First, a Multimodal Relation Module (MRM) is designed, which can extract the relation information between the local features of any point cloud and those of any multi-view data to obtain the corresponding relation features. Then, a cascade pooling consisting of maximum pooling and generalized mean pooling is applied to process the relation feature tensor and obtain the global relation feature. There are two types of multimodal relation modules, which output the point-view relation feature and the view-point relation feature, respectively. The proposed gating module adopts a self-attention mechanism to find the relation information within the features so that the aggregated global features can be weighted to suppress redundant information. Extensive experiments show that the multimodal relation module gives the network stronger representational ability, while the gating module makes the final global feature more discriminative and boosts the performance of the retrieval task. The proposed network achieves classification accuracy of 93.8% and 95.0%, as well as average retrieval precision of 90.5% and 93.4%, on two standard 3D shape recognition datasets (ModelNet40 and ModelNet10), respectively, which outperforms existing works.
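The cascade pooling mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the tensor layout, the order of the two pooling steps, or the exponent, so the following are assumptions made purely for illustration: the relation tensor is indexed as `relation[point][view][channel]`, maximum pooling runs over points first, generalized mean (GeM) pooling then runs over views, and features are non-negative. The names `gem_pool` and `cascade_pool` are hypothetical and are not the authors' code.

```python
from typing import List


def gem_pool(values: List[float], p: float = 3.0, eps: float = 1e-6) -> float:
    """Generalized mean pooling: (mean(v^p))^(1/p).

    p = 1 gives average pooling; large p approaches max pooling.
    Values are clipped at eps because GeM assumes non-negative inputs.
    """
    clipped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clipped) / len(clipped)) ** (1.0 / p)


def cascade_pool(relation: List[List[List[float]]], p: float = 3.0) -> List[float]:
    """Pool a relation tensor relation[point][view][channel] to one vector.

    Assumed order (not stated in the abstract): max over points,
    then GeM over views, independently for each channel.
    """
    num_points = len(relation)
    num_views = len(relation[0])
    num_channels = len(relation[0][0])
    pooled = []
    for c in range(num_channels):
        # Step 1: maximum pooling over the point axis, one value per view.
        per_view_max = [
            max(relation[i][j][c] for i in range(num_points))
            for j in range(num_views)
        ]
        # Step 2: generalized mean pooling over the view axis.
        pooled.append(gem_pool(per_view_max, p))
    return pooled
```

With `p = 1.0` the second step reduces to plain averaging, which makes the sketch easy to check by hand; in practice the exponent is often treated as a tunable or learnable parameter.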
