A note on document classification with small training data

Yasunari Maeda, Hideki Yoshida, Masakiyo Suzuki, Toshiyasu Matsushima

    Research output: Contribution to journal › Article

    Abstract

    Document classification is an important topic in natural language processing (NLP). Previous research proposed a document classification method that minimizes the error rate with respect to the Bayes criterion. However, when the training data contains only a small number of documents, the accuracy of that method is low. In this research we therefore use estimating data to estimate the prior distributions. When the training data is small, the accuracy obtained with estimating data is higher than that of the previous method; when the training data is large, it is lower. We therefore also propose another technique whose accuracy is higher than that of the previous method when the training data is small and almost the same when the training data is large.
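The abstract does not give the model details, so the following is only a minimal sketch of the general idea, under stated assumptions: a bag-of-words model with one Dirichlet prior per class, a 0-1 loss (so the Bayes-optimal decision is the class with the highest posterior probability), and a hypothetical rule that sets the Dirichlet parameters from word frequencies in the estimating data. The function names and the `strength` and `floor` parameters are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def estimate_prior(estimating_docs, vocab, strength=10.0, floor=0.01):
    """Assumed rule: Dirichlet parameters proportional to word frequencies
    in the estimating data, plus a small floor so every word stays possible."""
    counts = Counter(w for doc in estimating_docs for w in doc if w in vocab)
    total = sum(counts.values())
    return {
        w: floor + (strength * counts[w] / total if total else 0.0)
        for w in vocab
    }


def log_predictive(doc_counts, class_counts, alpha):
    """Log probability of a word sequence under the Dirichlet-multinomial
    predictive distribution, given the class's training word counts."""
    a_total = sum(alpha.values()) + sum(class_counts.values())
    n = sum(doc_counts.values())
    lp = math.lgamma(a_total) - math.lgamma(a_total + n)
    for w, c in doc_counts.items():
        a_w = alpha[w] + class_counts.get(w, 0)
        lp += math.lgamma(a_w + c) - math.lgamma(a_w)
    return lp


def classify(doc, train_docs, alphas, vocab):
    """Return the class with the highest posterior score for `doc`."""
    doc_counts = Counter(w for w in doc if w in vocab)
    n_docs = sum(len(docs) for docs in train_docs.values())
    scores = {}
    for label, docs in train_docs.items():
        class_counts = Counter(w for d in docs for w in d if w in vocab)
        # Class prior with add-one smoothing over the number of classes.
        log_class_prior = math.log((len(docs) + 1) / (n_docs + len(train_docs)))
        scores[label] = log_class_prior + log_predictive(
            doc_counts, class_counts, alphas[label]
        )
    return max(scores, key=scores.get)


# Toy data (invented for illustration): one training document per class.
train = {"sports": [["ball", "goal"]], "tech": [["code", "bug"]]}
estimating = {
    "sports": [["ball", "team", "goal", "team"]],
    "tech": [["code", "chip", "bug", "chip"]],
}
vocab = {
    w
    for docs in list(train.values()) + list(estimating.values())
    for d in docs
    for w in d
}
alphas = {label: estimate_prior(estimating[label], vocab) for label in train}
print(classify(["team", "team"], train, alphas, vocab))
```

In this toy setup the word "team" never appears in the training data, so only the prior estimated from the estimating data can separate the classes. That is the small-training-data situation the abstract describes. As the training counts grow they dominate the fixed prior parameters, which is one plausible reading of why a prior of this kind matters less with large training data.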

    Original language: English
    Pages (from-to): 1459-1466
    Number of pages: 8
    Journal: IEEJ Transactions on Electronics, Information and Systems
    Volume: 131
    Issue number: 8
    DOIs: 10.1541/ieejeiss.131.1459
    Publication status: Published - 2011

    Keywords

    • Document classification
    • Posterior distribution
    • Prior distribution
    • Training data

    ASJC Scopus subject areas

    • Electrical and Electronic Engineering

    Cite this

    A note on document classification with small training data. / Maeda, Yasunari; Yoshida, Hideki; Suzuki, Masakiyo; Matsushima, Toshiyasu.

    In: IEEJ Transactions on Electronics, Information and Systems, Vol. 131, No. 8, 2011, p. 1459-1466.

    Maeda, Yasunari ; Yoshida, Hideki ; Suzuki, Masakiyo ; Matsushima, Toshiyasu. / A note on document classification with small training data. In: IEEJ Transactions on Electronics, Information and Systems. 2011 ; Vol. 131, No. 8. pp. 1459-1466.
    @article{2a40dd7c0aab4187b0c5014afdf4cf45,
    title = "A note on document classification with small training data",
    abstract = "Document classification is an important topic in natural language processing (NLP). Previous research proposed a document classification method that minimizes the error rate with respect to the Bayes criterion. However, when the training data contains only a small number of documents, the accuracy of that method is low. In this research we therefore use estimating data to estimate the prior distributions. When the training data is small, the accuracy obtained with estimating data is higher than that of the previous method; when the training data is large, it is lower. We therefore also propose another technique whose accuracy is higher than that of the previous method when the training data is small and almost the same when the training data is large.",
    keywords = "Document classification, Posterior distribution, Prior distribution, Training data",
    author = "Yasunari Maeda and Hideki Yoshida and Masakiyo Suzuki and Toshiyasu Matsushima",
    year = "2011",
    doi = "10.1541/ieejeiss.131.1459",
    language = "English",
    volume = "131",
    pages = "1459--1466",
    journal = "IEEJ Transactions on Electronics, Information and Systems",
    issn = "0385-4221",
    publisher = "The Institute of Electrical Engineers of Japan",
    number = "8",

    }

    TY - JOUR

    T1 - A note on document classification with small training data

    AU - Maeda, Yasunari

    AU - Yoshida, Hideki

    AU - Suzuki, Masakiyo

    AU - Matsushima, Toshiyasu

    PY - 2011

    Y1 - 2011

    AB - Document classification is an important topic in natural language processing (NLP). Previous research proposed a document classification method that minimizes the error rate with respect to the Bayes criterion. However, when the training data contains only a small number of documents, the accuracy of that method is low. In this research we therefore use estimating data to estimate the prior distributions. When the training data is small, the accuracy obtained with estimating data is higher than that of the previous method; when the training data is large, it is lower. We therefore also propose another technique whose accuracy is higher than that of the previous method when the training data is small and almost the same when the training data is large.

    KW - Document classification

    KW - Posterior distribution

    KW - Prior distribution

    KW - Training data

    UR - http://www.scopus.com/inward/record.url?scp=80052706462&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=80052706462&partnerID=8YFLogxK

    U2 - 10.1541/ieejeiss.131.1459

    DO - 10.1541/ieejeiss.131.1459

    M3 - Article

    AN - SCOPUS:80052706462

    VL - 131

    SP - 1459

    EP - 1466

    JO - IEEJ Transactions on Electronics, Information and Systems

    JF - IEEJ Transactions on Electronics, Information and Systems

    SN - 0385-4221

    IS - 8

    ER -