Proposal of a semiautomatic classification method for systematization of large-scale text data based on machine learning

Ryo Shimomura, Kenta Mikawa, Masayuki Goto

    Research output: Contribution to journal (Article)

    Abstract

    Many companies now store enormous amounts of operational text data in digital form, because computers are used in every process in every department. Such data contains much information that is useful to company workers, and using it effectively matters for a company's development. However, the volume is often too large for the data to be used directly: even when analysts spend a great deal of time on it, analyzing all of the text may be impossible. Clustering (grouping items by similarity) and naming each group to provide category information are generally effective ways to grasp the overall tendency of the data and to systematize many ideas. However, analysis by hand is feasible only for small volumes of data, where analysts can see every data item or idea, so manual clustering is difficult to apply to the large-scale text data stored in companies. If clustering and naming by analysts could be applied to such data, it would help extract valuable information. In this study, we propose a new method that combines clustering by hand with text classification in order to analyze large-scale digital data stored in a company effectively. In the first step, analysts assign category information by hand to sample data selected at random from all the text data. In the next step, classifiers are estimated by learning from this sample data, and the rest of the data is then classified automatically using these classifiers. With the proposed method, enormous amounts of text data can be systematized as long as only a small sample set is analyzed by hand. To verify its effectiveness, the proposed method is applied in a case study to large-scale text data that was stored in a company.
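
The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the classifier, so a small multinomial naive Bayes with add-one smoothing is swapped in as a stand-in. The function names (`train_naive_bayes`, `classify`, `semiautomatic_classify`) and the `label_by_hand` callable, which stands in for a human analyst assigning categories, are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization is enough for this sketch.
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Estimate a multinomial naive Bayes model from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):  # sorted for deterministic tie-breaking
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def semiautomatic_classify(documents, label_by_hand, sample_size, seed=0):
    """Label a random sample by hand, then classify the rest automatically.

    label_by_hand is a hypothetical callable standing in for the analyst.
    Returns (labels for every document, indices that were labeled by hand).
    """
    rng = random.Random(seed)
    sample_ids = rng.sample(range(len(documents)), sample_size)
    # Step 1: category information is provided by hand for the random sample.
    hand_labels = {i: label_by_hand(documents[i]) for i in sample_ids}
    # Step 2: a classifier is learned from the sample and applied to the rest.
    model = train_naive_bayes([(documents[i], hand_labels[i]) for i in sample_ids])
    labels = []
    for i, doc in enumerate(documents):
        if i in hand_labels:
            labels.append(hand_labels[i])
        else:
            labels.append(classify(model, doc))
    return labels, sorted(hand_labels)
```

In this sketch, a call such as `semiautomatic_classify(docs, ask_analyst, sample_size=200)` would return one category per document, with only the 200 sampled items requiring manual work. A category that happens to be absent from the random sample cannot be predicted for the remaining documents, so the sample has to be large enough to cover the categories of interest.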

    Original language: English
    Pages (from-to): 51-60
    Number of pages: 10
    Journal: Journal of Japan Industrial Management Association
    Volume: 65
    Issue number: 2
    Publication status: Published - 2014

    Fingerprint

    Learning systems
    Machine learning
    Industry
    Classifiers
    Text
    Clustering
    Text classification
    Small sample
    Grouping

    Keywords

    • Big data
    • Clustering
    • Self-training
    • Template matching
    • Text data

    ASJC Scopus subject areas

    • Industrial and Manufacturing Engineering
    • Applied Mathematics
    • Management Science and Operations Research
    • Strategy and Management

    Cite this

    @article{7e962d1acbe749a08e7ee8ad0a9136ad,
    title = "Proposal of a semiautomatic classification method for systematization of large-scale text data based on machine learning",
    keywords = "Big data, Clustering, Self-training, Template matching, Text data",
    author = "Ryo Shimomura and Kenta Mikawa and Masayuki Goto",
    year = "2014",
    language = "English",
    volume = "65",
    pages = "51--60",
    journal = "Journal of Japan Industrial Management Association",
    issn = "0386-4812",
    publisher = "Nihon Keiei Kogakkai",
    number = "2",

    }

    TY - JOUR

    T1 - Proposal of a semiautomatic classification method for systematization of large-scale text data based on machine learning

    AU - Shimomura, Ryo

    AU - Mikawa, Kenta

    AU - Goto, Masayuki

    PY - 2014

    Y1 - 2014

    KW - Big data

    KW - Clustering

    KW - Self-training

    KW - Template matching

    KW - Text data

    UR - http://www.scopus.com/inward/record.url?scp=84923228887&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84923228887&partnerID=8YFLogxK

    M3 - Article

    VL - 65

    SP - 51

    EP - 60

    JO - Journal of Japan Industrial Management Association

    JF - Journal of Japan Industrial Management Association

    SN - 0386-4812

    IS - 2

    ER -