Regularized distance metric learning for document classification and its application

Kenta Mikawa, Masayuki Goto

    Research output: Contribution to journal, Article

    5 Citations (Scopus)

    Abstract

    Advances in information technology have led to a huge amount of text data being posted on the Internet. In this study, we focus on distance metric learning, a class of machine learning methods. Distance metric learning estimates the metric matrix of the squared Mahalanobis distance from training data under an appropriate constraint. Mochihashi et al. proposed a method that derives the optimal metric matrix analytically. However, the vector space for document data is normally very high-dimensional and sparse. Therefore, when this method is applied directly to document data, over-fitting may occur because the number of estimated parameters is proportional to the square of the input dimensionality. To avoid over-fitting, this study introduces a regularization term. The purpose of this study is to formulate a regularized estimation of the metric matrix in which the optimal metric matrix can still be derived analytically. To verify the effectiveness of the proposed method, document classification experiments are conducted on Japanese newspaper articles.
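
The ideas in the abstract can be illustrated with a minimal sketch. It is not the estimator from the paper or from Mochihashi et al.: it assumes, purely for illustration, that the metric matrix is the inverse of the within-class scatter matrix plus a ridge term `lam * I`. The names `mahalanobis_sq`, `fit_metric`, and `lam`, and the random toy data, are all hypothetical. The sketch shows the squared Mahalanobis distance, a closed-form metric estimate, and why some regularization is needed when there are fewer documents than dimensions.

```python
import numpy as np


def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for metric matrix M."""
    diff = x - y
    return float(diff @ M @ diff)


def fit_metric(X, labels, lam=1.0):
    """Closed-form illustrative metric: inverse of (within-class scatter + lam * I).

    The ridge term lam * I is an assumed regularizer chosen for this sketch,
    not the formulation used in the paper.
    """
    d = X.shape[1]
    S = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        centered = Xc - Xc.mean(axis=0)  # subtract the class centroid
        S += centered.T @ centered       # accumulate within-class scatter
    # Without lam * I, S is singular whenever documents are fewer than dimensions.
    return np.linalg.inv(S + lam * np.eye(d))


rng = np.random.default_rng(0)
X = rng.random((6, 10))                  # toy data: 6 documents, 10 terms
labels = np.array([0, 0, 0, 1, 1, 1])
M = fit_metric(X, labels, lam=0.5)

# A symmetric d x d metric matrix has d * (d + 1) / 2 free entries,
# so the parameter count grows with the square of the dimensionality.
n_params = X.shape[1] * (X.shape[1] + 1) // 2
```

With six documents in ten dimensions the scatter matrix has rank at most four, so the unregularized inverse does not exist; the added diagonal term keeps the estimate positive definite while preserving the closed form.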

    Original language: English
    Pages (from-to): 190-203
    Number of pages: 14
    Journal: Journal of Japan Industrial Management Association
    Volume: 66
    Issue number: 2E
    Publication status: Published - 2015

    Keywords

    • Distance metric learning
    • Document classification
    • Regularization
    • Vector space model

    ASJC Scopus subject areas

    • Industrial and Manufacturing Engineering
    • Applied Mathematics
    • Management Science and Operations Research
    • Strategy and Management

    Cite this

    Regularized distance metric learning for document classification and its application. / Mikawa, Kenta; Goto, Masayuki.

    In: Journal of Japan Industrial Management Association, Vol. 66, No. 2E, 2015, p. 190-203.

    @article{2768bd193d7b49b592a86e9f34ff8c92,
    title = "Regularized distance metric learning for document classification and its application",
    keywords = "Distance metric learning, Document classification, Regularization, Vector space model",
    author = "Kenta Mikawa and Masayuki Goto",
    year = "2015",
    language = "English",
    volume = "66",
    pages = "190--203",
    journal = "Journal of Japan Industrial Management Association",
    issn = "0386-4812",
    publisher = "Nihon Keikei Kogakkai",
    number = "2E",

    }

    TY - JOUR

    T1 - Regularized distance metric learning for document classification and its application

    AU - Mikawa, Kenta

    AU - Goto, Masayuki

    PY - 2015

    Y1 - 2015

    KW - Distance metric learning

    KW - Document classification

    KW - Regularization

    KW - Vector space model

    UR - http://www.scopus.com/inward/record.url?scp=84940978828&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84940978828&partnerID=8YFLogxK

    M3 - Article

    AN - SCOPUS:84940978828

    VL - 66

    SP - 190

    EP - 203

    JO - Journal of Japan Industrial Management Association

    JF - Journal of Japan Industrial Management Association

    SN - 0386-4812

    IS - 2E

    ER -