• Volume 2, Issue 2, 2008 Table of Contents
    • Editorial

      2008, 2(2):89-93.



    • Co-Training by Committee: A Generalized Framework for Semi-Supervised Learning with Committees

      2008, 2(2):95-124.


      Abstract: Many data mining applications have a large amount of data, but labeling data is often difficult, expensive, or time consuming, as it requires human experts for annotation. Semi-supervised learning addresses this problem by using unlabeled data together with labeled data to improve performance. Co-Training is a popular semi-supervised learning algorithm that assumes each example is represented by two or more redundantly sufficient sets of features (views) and, additionally, that these views are independent given the class. However, these assumptions are not satisfied in many real-world application domains. In this paper, a framework called Co-Training by Committee (CoBC) is proposed, in which an ensemble of diverse classifiers is used for semi-supervised learning that requires neither redundant and independent views nor different base learning algorithms. The framework is a general single-view semi-supervised learner that can be applied to any ensemble learner to build diverse committees. Experimental results of CoBC using Bagging, AdaBoost and the Random Subspace Method (RSM) as ensemble learners demonstrate that error diversity among classifiers leads to an effective Co-Training style algorithm that maintains the diversity of the underlying ensemble.
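The CoBC idea can be illustrated with a toy implementation. The sketch below is our own minimal reading of the abstract, not the paper's algorithm: a bagged committee of 1-NN base learners over one-dimensional data repeatedly self-labels the pool examples it agrees on most strongly. All names and parameter choices here are ours.

```python
import random

def majority_vote(preds):
    # most frequent label among the committee's predictions
    return max(set(preds), key=preds.count)

def train_1nn(sample):
    # "training" a 1-NN base classifier just stores its bootstrap sample
    return sample

def predict_1nn(model, x):
    # label of the nearest stored example
    return min(model, key=lambda ex: abs(ex[0] - x))[1]

def cobc(labeled, unlabeled, n_members=5, rounds=3, per_round=2):
    """Co-Training by Committee, toy version: a bagged committee labels the
    unlabeled pool; the most confidently (most unanimously) predicted
    examples are moved into the labeled set and the committee is retrained."""
    random.seed(0)
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        # build a diverse committee via bootstrap resampling (Bagging)
        committee = [train_1nn([random.choice(labeled) for _ in labeled])
                     for _ in range(n_members)]
        scored = []
        for x in pool:
            votes = [predict_1nn(m, x) for m in committee]
            label = majority_vote(votes)
            # committee agreement serves as a confidence proxy
            scored.append((votes.count(label) / n_members, x, label))
        scored.sort(reverse=True)
        # self-label the most confidently predicted examples
        for _, x, label in scored[:per_round]:
            labeled.append((x, label))
            pool.remove(x)
        if not pool:
            break
    return labeled
```

On a toy problem, `cobc([(0.0, 'a'), (1.0, 'a'), (9.0, 'b'), (10.0, 'b')], [0.5, 9.5, 5.2])` grows the labeled set from four to seven examples over the self-labeling rounds.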

    • Attribute Selection for Numerical Databases that Contain Correlations

      2008, 2(2):125-139.


      Abstract: There are many correlated attributes in a database. Conventional attribute selection methods are not able to handle such correlations and tend to eliminate important rules that exist in correlated attributes. In this paper, we propose an attribute selection method that preserves important rules on correlated attributes. We first compute a ranking of attributes by using conventional attribute selection methods. In addition, we compute two-dimensional rules for each pair of attributes and evaluate their importance for predicting a target attribute. Then, we evaluate the shapes of important two-dimensional rules to pick up hidden important attributes that are underestimated by conventional attribute selection methods. After the shape evaluation, we re-calculate the ranking so that we can preserve the important correlations. Extensive experiments show that the proposed method can select important correlated attributes that are eliminated by conventional methods.
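As a rough illustration of why pairwise ("two-dimensional") evaluation can rescue attributes that per-attribute ranking discards, the sketch below ranks discrete attributes by mutual information with the target, then promotes attributes whose pairs are far more informative than either member alone. The scoring and promotion rules are our assumptions, not the paper's method:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_with_pairs(columns, target):
    """Rank attributes by individual relevance, then promote those that are
    only informative jointly with a correlated partner."""
    names = list(columns)
    adjusted = {a: mutual_information(columns[a], target) for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # two-dimensional "rule": treat the attribute pair as one feature
            joint = mutual_information(list(zip(columns[a], columns[b])), target)
            # credit the pair's predictive power back to both attributes
            adjusted[a] = max(adjusted[a], joint)
            adjusted[b] = max(adjusted[b], joint)
    return sorted(names, key=lambda a: adjusted[a], reverse=True)
```

With two attributes that are individually uninformative about an XOR target but jointly determine it, the pair step lifts them above a weakly informative third attribute that a per-attribute ranking would prefer.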

    • Efficient Mining of Heterogeneous Star-Structured Data

      2008, 2(2):141-161.


      Abstract: Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph-theoretical framework for addressing star-structured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star structure. Although the graph partitioning approach has been adopted before to address star-structured heterogeneous clustering problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very fast, as it requires only a simple solution to a sparse overdetermined system of linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.
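The computational core highlighted above, solving a sparse overdetermined linear system in the least-squares sense, can be illustrated with NumPy. The matrix below is a made-up stand-in; the paper's actual system, assembled from the star-structured graph, is not reproduced here:

```python
import numpy as np

# Hypothetical overdetermined system: more constraints (rows) than unknowns
# (columns), as when several peripheral data types each place constraints
# on the central type's cluster indicator.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 1.0, 3.0])

# Least-squares solution minimizing ||A x - b||_2. For genuinely sparse
# systems an iterative solver such as scipy.sparse.linalg.lsqr would be
# used instead, avoiding dense factorizations.
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
```

This toy system happens to be consistent, so the least-squares solution reproduces `b` exactly; in general `residuals` reports the leftover squared error.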

    • Managing the Acronym/Expansion Identification Process for Text-Mining Applications

      2008, 2(2):163-179.


      Abstract: This paper deals with an approach for extracting acronyms and their definitions (or expansions) from textual data (corpora), and with the disambiguation of these definitions. Both steps of our global process of acquisition and management of acronyms are described in detail. The first step consists in using markers such as brackets to identify expansion candidates; the alignment of the letters then allows us to select the acronym/definition pairs. The second step is to determine the relevant expansion of an acronym in a given context. Our method is based on statistical measures (Mutual Information, Cubic Mutual Information, Dice measure) and on the results provided by search engines. This paper presents an evaluation of the global process on real data (general and specialized domains).

    • Global and Local (Glocal) Bagging Approach for Classifying Noisy Dataset

      2008, 2(2):181-197.


      Abstract: Learning from noisy data is a challenging task for data mining research. In this paper, we argue that for noisy data both the global bagging strategy and the local bagging strategy suffer from their own inherent disadvantages and thus cannot form accurate prediction models. Consequently, we present a Global and Local Bagging (Glocal Bagging: GB) approach to tackle this problem. GB assigns weight values to the base classifiers under the consideration that: (1) for each test instance Ix, GB prefers bags close to Ix, which is the nature of the local learning strategy; (2) among base classifiers, GB assigns larger weight values to the ones with higher accuracy on the out-of-bag samples, which is the nature of the global learning strategy. Combining (1) and (2), GB assigns large weight values to the classifiers that are close to the current test instance Ix and have high out-of-bag accuracy. The diversity/accuracy analysis on synthetic datasets shows that GB improves the classifier ensemble's performance by increasing its base classifiers' accuracy. Moreover, the bias/variance analysis shows that GB's accuracy improvement comes mainly from the reduction of the bias error. Experimental results on 25 UCI benchmark datasets show that when the datasets are noisy, GB is superior to previously proposed bagging methods such as classical bagging, bragging, nice bagging, trimmed bagging and lazy bagging.
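The weighting idea in (1) and (2) can be sketched as follows; the distance kernel, the centroid summary of a bag, and the linear blend controlled by alpha are our assumptions, not the paper's exact formulas:

```python
import math

def glocal_weights(bags, oob_accuracy, x, alpha=0.5):
    """Blend a local term (how close each classifier's training bag is to the
    test instance x) with a global term (its out-of-bag accuracy) into
    normalized voting weights for the ensemble."""
    weights = []
    for bag, acc in zip(bags, oob_accuracy):
        centroid = sum(bag) / len(bag)          # crude summary of the bag
        local = math.exp(-abs(x - centroid))    # closeness to the test point
        weights.append(alpha * local + (1 - alpha) * acc)
    total = sum(weights)
    return [w / total for w in weights]
```

With `alpha=0` the scheme reduces to purely global (out-of-bag) weighting; with `alpha=1` it is purely local.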

    • AWSum - Combining Classification with Knowledge Acquisition

      2008, 2(2):199-214.


      Abstract: Many classifiers achieve high levels of accuracy but have limited applicability in real-world situations because they do not lead to a greater understanding of, or insight into, the way features influence the classification. In areas such as health informatics, a classifier that clearly identifies the influences on classification can be used to direct research and formulate interventions. This research investigates the practical applications of Automated Weighted Sum (AWSum), a classifier that provides accuracy comparable to other techniques whilst providing insight into the data. This is achieved by calculating a weight for each feature value that represents its influence on the class value. The merits of this approach in classification and insight are evaluated on Cystic Fibrosis and Diabetes datasets with positive results.
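A toy version of the weighted-sum idea: each (feature, value) pair gets a weight reflecting its pull toward the positive class, and classification sums those weights. The particular influence formula, P(pos | value) - P(neg | value), is our stand-in and not necessarily AWSum's exact weighting scheme:

```python
from collections import defaultdict

def train_awsum(examples):
    """Assign each (feature, value) a weight in [-1, 1] measuring how
    strongly it pulls toward the positive class (label 1)."""
    counts = defaultdict(lambda: [0, 0])  # (feature, value) -> [neg, pos]
    for features, label in examples:
        for f, v in features.items():
            counts[(f, v)][label] += 1
    return {fv: (pos - neg) / (pos + neg)
            for fv, (neg, pos) in counts.items()}

def predict_awsum(weights, features):
    # classify by the sign of the summed influence weights
    score = sum(weights.get((f, v), 0.0) for f, v in features.items())
    return 1 if score >= 0 else 0
```

The weights themselves are the interpretable artifact: a weight near +1 or -1 marks a feature value with a strong influence on the class, while one near 0 marks an uninformative value.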

    • Mining Gene Expression Data using Domain Knowledge

      2008, 2(2):215-231.


      Abstract: Biology is now an information-intensive science, and various research areas, like molecular biology, evolutionary biology or environmental biology, heavily depend on the availability and the efficient use of information. Data mining, which groups together several techniques for analyzing very large datasets, is used to solve problems in an increasing number of biological applications. This article focuses on the analysis of the transcriptome, which reflects gene activity in a given cell population at a given time. We describe research themes in transcriptomics related to domain knowledge in biology. We are particularly interested in the way this knowledge can be efficiently combined and used during the various phases of a data mining process, in the most acknowledged applications in transcriptomics.

    • Towards Knowledge Acquisition from Semi-Structured Content

      2008, 2(2):233-248.


      Abstract: A rich family of generic Information Extraction (IE) techniques has been developed by researchers. This paper proposes WebKER, a system for automatically extracting knowledge from semi-structured content on Web pages based on wrappers and domain ontologies. In the extraction process, wrappers are learned through suffix arrays. Domain ontologies then automatically align the raw data extracted by the wrappers, and knowledge is generated by describing the data with Resource Description Framework (RDF) statements. After the merging process, newly generated knowledge is finally added to the Knowledge Base (KB) for users to query, regardless of the resources' origin. A prototype of WebKER is implemented. This paper also gives a performance evaluation of the system and a comparison between querying information in the KB and querying information in a traditional database, indicating the superiority of our system. In addition, an evaluation of the best-performing wrapper and of the method for merging knowledge is also presented.
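Suffix arrays, which WebKER uses for wrapper learning, can be sketched naively. Finding the longest repeated substring via adjacent suffixes hints at how repeated markup fragments expose a page template; the wrapper-learning procedure itself is not reproduced here.

```python
def suffix_array(s):
    """Naive O(n^2 log n) suffix array: start positions of all suffixes of s,
    sorted lexicographically. Real systems use O(n log n) or O(n) builders."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_repeat(s):
    """Longest substring occurring at least twice, found as the longest
    common prefix of adjacent suffixes in the suffix array. In a Web page,
    such repeats often correspond to the template markup around records."""
    sa = suffix_array(s)
    best = ""
    for a, b in zip(sa, sa[1:]):
        x, y = s[a:], s[b:]
        n = 0
        while n < min(len(x), len(y)) and x[n] == y[n]:
            n += 1
        if n > len(best):
            best = x[:n]
    return best
```

On a fragment of list markup, the longest repeat is the tag boundary between records, which is exactly the kind of delimiter a wrapper induction step can latch onto.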