Proposal of a semiautomatic classification method for systematization of large-scale text data based on machine learning

Ryo Shimomura, Kenta Mikawa, Masayuki Goto

Research output: Contribution to journal › Article

Abstract

Many companies now store enormous amounts of operational text data in digital form, because computers are used in every process in every department. Such data contains a great deal of information that is useful to employees, and using it effectively is important for a company's development. However, the volume is often too large for the data to be used directly: even if analysts spend a great deal of time on it, analyzing all of it may be impossible. Generally, clustering items by similarity and naming each group to provide category information are effective ways to grasp the tendency of the whole data set and to systematize many ideas. However, manual analysis is feasible only for small volumes of data in which analysts can examine every item or idea, so manual clustering is difficult to apply to the large-scale text data stored in companies. If clustering and naming by analysts could be extended to enormous amounts of text data, it would be useful for extracting valuable information. In this study, we propose a new method that combines manual clustering with text classification in order to analyze the large-scale digital data stored in a company effectively. In the first step, analysts manually assign category information to sample data selected at random from all the text data. In the next step, classifiers are trained on this sample data and then used to classify the rest of the data automatically. With the proposed method, enormous amounts of text data can be systematized even though only a small sample set is analyzed by hand. To verify the effectiveness of the proposed method, we apply it in a case study to large-scale text data stored in a company.
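
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a classifier, so a small multinomial Naive Bayes model is assumed here, and the whitespace tokenizer, the function names, and the `label_by_hand` callback (standing in for the analyst's manual categorization) are all hypothetical. The self-training and template matching listed in the keywords are not covered.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the abstract names none)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in labeled_docs:
        class_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the category with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(class_counts):
        score = math.log(class_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in the labeled sample
                score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def semiautomatic_classify(docs, label_by_hand, sample_size, seed=0):
    """Hand-label a random sample, then classify the remaining documents."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(docs)), sample_size)
    # Step 1: the analyst assigns categories to the random sample only.
    labels = {i: label_by_hand(docs[i]) for i in sample_idx}
    # Step 2: train on the sample and label everything else automatically.
    model = train_naive_bayes([(docs[i], labels[i]) for i in sample_idx])
    for i, doc in enumerate(docs):
        if i not in labels:
            labels[i] = classify(model, doc)
    return labels
```

Calling `semiautomatic_classify(docs, label_by_hand, sample_size=200)` would return a dictionary mapping each document index to a category, with only 200 documents labeled manually. How well the automatic labels match an analyst's judgment depends on the sample size and on the classifier, neither of which the abstract specifies.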

Original language: English
Pages (from-to): 51-60
Number of pages: 10
Journal: Journal of Japan Industrial Management Association
Volume: 65
Issue number: 2
Publication status: Published - 2014 Jan 1

Keywords

  • Big data
  • Clustering
  • Self-training
  • Template matching
  • Text data

ASJC Scopus subject areas

  • Strategy and Management
  • Management Science and Operations Research
  • Industrial and Manufacturing Engineering
  • Applied Mathematics
