Multi-valued classification of text data based on an ECOC approach using a ternary orthogonal table

Leona Suzuki, Kenta Mikawa, Masayuki Goto

    Research output: Contribution to journal › Article

    1 Citation (Scopus)

    Abstract

    Owing to advances in information technology, a large amount of document data has accumulated in various databases, and automatic multi-valued classification has become highly relevant. This paper focuses on a multi-valued classification technique based on Error-Correcting Output Codes (ECOC), which combines several binary classifiers. When predicting the category of a new document, the outputs of the binary classifiers are combined to produce a predicted value. It is a known problem that the prediction accuracy of a binary classifier is low when its two category sets have imbalanced amounts of training data. To address this problem, a previous study proposed employing Reed-Muller (RM) codes within an ECOC approach to resolve the imbalance in the cardinality of the training data sets. However, RM codes can equalize the amount of training data between the two category sets only for specific numbers of categories. We aim to provide a method that can be employed for multi-valued classification with an arbitrary number of categories. In this paper, we propose a new configuration method for combining binary classifiers in which some categories are not used for classification by a given classifier. This method reduces the amount of training data for each binary classifier while improving the balance of the training data between its two category sets. As a result, the computational complexity can be decreased. We verify the effectiveness of the proposed method through a document classification experiment.
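
As an illustration of the idea summarized in the abstract, the following is a minimal sketch of ECOC classification with a ternary code table. The 5-category table and the names `CODE_TABLE`, `category_sets`, and `decode` are hypothetical and chosen only for this example; they are not the orthogonal table or the algorithm from the paper. Decoding by signed agreement is likewise an assumption, picked because it is a common ECOC decoding rule.

```python
# Hypothetical ternary code table (NOT the table from the paper).
# Rows are categories, columns are binary classifiers.
# +1 / -1 place a category in one of the classifier's two category sets;
#  0 means the category is not used to train that classifier.
CODE_TABLE = {
    "c0": [0, -1, -1, +1, +1],
    "c1": [+1, 0, -1, -1, +1],
    "c2": [+1, +1, 0, -1, -1],
    "c3": [-1, +1, +1, 0, -1],
    "c4": [-1, -1, +1, +1, 0],
}


def category_sets(table, column):
    """Return (positive, negative, unused) category lists for one classifier."""
    positive = [c for c, code in table.items() if code[column] == +1]
    negative = [c for c, code in table.items() if code[column] == -1]
    unused = [c for c, code in table.items() if code[column] == 0]
    return positive, negative, unused


def decode(table, outputs):
    """Pick the category whose code word agrees best with the classifier outputs.

    `outputs` holds one +1/-1 prediction per binary classifier. A zero entry
    in a code word contributes nothing to the score, so a classifier is
    ignored for the categories it was not trained on.
    """
    def score(category):
        return sum(c * o for c, o in zip(table[category], outputs))

    return max(table, key=score)


if __name__ == "__main__":
    for j in range(5):
        print(j, category_sets(CODE_TABLE, j))
    print(decode(CODE_TABLE, [+1, -1, -1, +1, +1]))
```

In this illustrative table every classifier is trained on four of the five categories, two on each side, so categories of equal size yield balanced category sets, and each classifier sees less training data than it would under a binary table that uses all categories.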

    Original language: English
    Pages (from-to): 155-164
    Number of pages: 10
    Journal: Industrial Engineering and Management Systems
    Volume: 16
    Issue number: 2
    DOIs
    Publication status: Published - 2017 Jun 1

    Keywords

    • Error-correcting output codes
    • Multi-valued classification
    • Ternary code table
    • Text data

    ASJC Scopus subject areas

    • Social Sciences (all)
    • Economics, Econometrics and Finance (all)
