Binary document classification based on fast flux discriminant with similarity measure on word set

Keisuke Okubo, Gendo Kumoi, Masayuki Goto

Research output: Contribution to journal › Article

Abstract

Fast Flux Discriminant (FFD) is a high-performance nonlinear binary classifier that can build a classification model accounting for interactions between variables. To capture these interactions, FFD applies histogram-based kernel smoothing on subspaces formed from combinations of variables. However, the original FFD must construct subspaces covering all variables, including combinations of variables that interact only weakly, so the amount of computation grows exponentially with the number of dimensions. In this study, we measure the similarity between variables using KL divergence and partition the variables so that each subspace contains similar variables. By restricting the model to combinations of variables that are likely to interact strongly, the proposed method aims to reduce computation while maintaining classification accuracy. Simulation experiments on Japanese newspaper articles demonstrate the effectiveness of the proposed method.
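
The abstract does not say which distributions are compared, how the KL divergence is turned into a similarity, or how the variables are divided into subspaces. The following is therefore only a minimal sketch of the general idea, not the authors' procedure. It assumes that each word is described by its smoothed distribution of counts over documents, that a symmetrized KL divergence is used as the dissimilarity, and that words are grouped greedily against a threshold. All function names, the toy count matrix, and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def word_distributions(counts):
    """Turn each word's column of per-document counts into a distribution.

    Assumption: add-one smoothing, so every probability is strictly positive
    and the KL divergence below is always finite.
    """
    n_docs, n_words = len(counts), len(counts[0])
    dists = []
    for j in range(n_words):
        column = [counts[i][j] + 1.0 for i in range(n_docs)]
        total = sum(column)
        dists.append([c / total for c in column])
    return dists


def kl(p, q):
    """KL divergence KL(p || q) between two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrized KL divergence (an assumption; plain KL is not symmetric)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def group_similar_words(dists, threshold):
    """Greedy grouping: a word joins the first group whose first member is
    within `threshold`; otherwise it starts a new group."""
    groups = []
    for j, d in enumerate(dists):
        for group in groups:
            if symmetric_kl(dists[group[0]], d) <= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    return groups


def candidate_pairs(groups):
    """Word pairs that would be considered: only pairs inside one group."""
    return [pair for group in groups for pair in combinations(group, 2)]


# Toy document-term counts (rows: documents, columns: words); hypothetical data.
counts = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 6, 5],
    [0, 1, 5, 6],
]

dists = word_distributions(counts)
groups = group_similar_words(dists, threshold=0.2)
pairs = candidate_pairs(groups)

print("groups:", groups)
print(len(pairs), "within-group pairs versus",
      math.comb(len(dists), 2), "unrestricted pairs")
```

Restricting combinations to words inside the same group is what shrinks the search space in this sketch: the number of candidate pairs depends on the group sizes rather than on the full vocabulary size. The threshold controls that trade-off and would need tuning on real data.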

Original language: English
Pages (from-to): 245-251
Number of pages: 7
Journal: Industrial Engineering and Management Systems
Volume: 18
Issue number: 2
DOI: 10.7232/iems.2019.18.2.245
Publication status: Published - 2019 Jan 1

Keywords

  • Binary classification
  • Interaction
  • KL divergence
  • Similarity
  • Text data

ASJC Scopus subject areas

  • Social Sciences(all)
  • Economics, Econometrics and Finance(all)

Cite this

Binary document classification based on fast flux discriminant with similarity measure on word set. / Okubo, Keisuke; Kumoi, Gendo; Goto, Masayuki.

In: Industrial Engineering and Management Systems, Vol. 18, No. 2, 01.01.2019, p. 245-251.

@article{a8aecddb975b429e9c9961097b0babf0,
title = "Binary document classification based on fast flux discriminant with similarity measure on word set",
abstract = "Fast Flux Discriminant (FFD) is known as one of the high-performance nonlinear binary classifiers, and it is possible to construct a classification model considering the interaction between variables. In order to take account of the interaction between variables, FFD introduces the histogram-based kernel smoothing using subspaces including variable combinations. However, when creating a subspace, the original FFD should cover all variables including combinations of variables with low interaction. Therefore, the disadvantage is that the calculation amount increases exponentially as the dimension increases. In this study, we calculate the similarity between variables by using KL divergence. Then, among the obtained similarities, divisions are performed for each subspace with similar variables. Through this method, we try to reduce the amount of calculation while maintaining classification accuracy by using only combinations of variables that are likely to take high interaction. Through the simulation experiments with Japanese newspaper articles, the effectiveness of our proposed method is clarified.",
keywords = "Binary classification, Interaction, KL divergence, Similarity, Text data",
author = "Keisuke Okubo and Gendo Kumoi and Masayuki Goto",
year = "2019",
month = "1",
day = "1",
doi = "10.7232/iems.2019.18.2.245",
language = "English",
volume = "18",
pages = "245--251",
journal = "Industrial Engineering and Management Systems",
issn = "1598-7248",
publisher = "Korean Institute of Industrial Engineers",
number = "2",
}

TY - JOUR
T1 - Binary document classification based on fast flux discriminant with similarity measure on word set
AU - Okubo, Keisuke
AU - Kumoi, Gendo
AU - Goto, Masayuki
PY - 2019/1/1
Y1 - 2019/1/1

N2 - Fast Flux Discriminant (FFD) is known as one of the high-performance nonlinear binary classifiers, and it is possible to construct a classification model considering the interaction between variables. In order to take account of the interaction between variables, FFD introduces the histogram-based kernel smoothing using subspaces including variable combinations. However, when creating a subspace, the original FFD should cover all variables including combinations of variables with low interaction. Therefore, the disadvantage is that the calculation amount increases exponentially as the dimension increases. In this study, we calculate the similarity between variables by using KL divergence. Then, among the obtained similarities, divisions are performed for each subspace with similar variables. Through this method, we try to reduce the amount of calculation while maintaining classification accuracy by using only combinations of variables that are likely to take high interaction. Through the simulation experiments with Japanese newspaper articles, the effectiveness of our proposed method is clarified.

KW - Binary classification
KW - Interaction
KW - KL divergence
KW - Similarity
KW - Text data
UR - http://www.scopus.com/inward/record.url?scp=85069958648&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069958648&partnerID=8YFLogxK
U2 - 10.7232/iems.2019.18.2.245
DO - 10.7232/iems.2019.18.2.245
M3 - Article
AN - SCOPUS:85069958648
VL - 18
SP - 245
EP - 251
JO - Industrial Engineering and Management Systems
JF - Industrial Engineering and Management Systems
SN - 1598-7248
IS - 2
ER -