Exploration into gray area: Efficient labeling for malicious domain name detection

Naoki Fukushi, Daiki Chiba, Mitsuaki Akiyama, Masato Uchida

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents a method to reduce the labeling cost when acquiring training data for a system that detects malicious domain names by supervised machine learning. The conventional system requires large quantities of both benign and malicious domain names to be prepared as training data to obtain a classifier with high classification accuracy. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by only using approximately 2.5% of the training data used by the conventional system. An additional disadvantage of the conventional system is that, if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved.
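The two ideas in the abstract, labeling data near the decision boundary and integrating multiple classifiers, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the query strategy or the integration rule, so the sketch assumes plain uncertainty sampling (pick the scores closest to a 0.5 boundary) and soft voting (average the scores). The function names, the domain names, and the score values are all hypothetical.

```python
from statistics import mean


def ensemble_score(scores_per_classifier):
    """Average the malicious-probability scores of several classifiers.

    Assumption: soft voting; each inner list holds one classifier's
    scores for the same pool of domains, in the same order.
    """
    return [mean(column) for column in zip(*scores_per_classifier)]


def select_gray_area(domains, scores, budget, boundary=0.5):
    """Return the `budget` domains whose scores lie closest to the boundary.

    Assumption: uncertainty sampling with a fixed decision boundary.
    """
    ranked = sorted(zip(domains, scores), key=lambda pair: abs(pair[1] - boundary))
    return [domain for domain, _ in ranked[:budget]]


# Hypothetical unlabeled pool and scores from three hypothetical classifiers.
pool = ["a.example", "b.example", "c.example", "d.example"]
scores = [
    [0.95, 0.40, 0.10, 0.60],
    [0.90, 0.55, 0.05, 0.70],
    [0.85, 0.55, 0.15, 0.80],
]

combined = ensemble_score(scores)
to_label = select_gray_area(pool, combined, budget=2)
print(to_label)
```

In this toy pool the clearly malicious and clearly benign domains are skipped, and the labeling budget is spent on the two domains whose averaged score is least decisive. In an active-learning loop, those domains would be labeled, added to the training data, and the classifiers retrained before the next selection round.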

Original language: English
Title of host publication: Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019
Editors: Vladimir Getov, Jean-Luc Gaudiot, Nariyoshi Yamai, Stelvio Cimato, Morris Chang, Yuuichi Teranishi, Ji-Jiang Yang, Hong Va Leong, Hossian Shahriar, Michiharu Takemoto, Dave Towey, Hiroki Takakura, Atilla Elci, Susumu Takeuchi, Satish Puri
Publisher: IEEE Computer Society
Pages: 770-775
Number of pages: 6
ISBN (Electronic): 9781728126074
DOI: https://doi.org/10.1109/COMPSAC.2019.00114
Publication status: Published - 2019 Jul
Event: 43rd IEEE Annual Computer Software and Applications Conference, COMPSAC 2019 - Milwaukee, United States
Duration: 2019 Jul 15 to 2019 Jul 19

Publication series

Name: Proceedings - International Computer Software and Applications Conference
Volume: 1
ISSN (Print): 0730-3157

Conference

Conference: 43rd IEEE Annual Computer Software and Applications Conference, COMPSAC 2019
Country: United States
City: Milwaukee
Period: 19/7/15 to 19/7/19

Fingerprint

Labeling
Classifiers
Learning systems
Labels
Costs

Keywords

  • Active learning
  • Data labeling
  • Ensemble learning
  • Malicious domain name

ASJC Scopus subject areas

  • Software
  • Computer Science Applications

Cite this

Fukushi, N., Chiba, D., Akiyama, M., & Uchida, M. (2019). Exploration into gray area: Efficient labeling for malicious domain name detection. In V. Getov, J-L. Gaudiot, N. Yamai, S. Cimato, M. Chang, Y. Teranishi, J-J. Yang, H. V. Leong, H. Shahriar, M. Takemoto, D. Towey, H. Takakura, A. Elci, S. Takeuchi, ... S. Puri (Eds.), Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019 (pp. 770-775). [8754122] (Proceedings - International Computer Software and Applications Conference; Vol. 1). IEEE Computer Society. https://doi.org/10.1109/COMPSAC.2019.00114

@inproceedings{effa0d5dd1824f509813bb4aa3e24fed,
title = "Exploration into gray area: Efficient labeling for malicious domain name detection",
abstract = "This paper presents a method to reduce the labeling cost when acquiring training data for a system that detects malicious domain names by supervised machine learning. The conventional system requires large quantities of both benign and malicious domain names to be prepared as training data to obtain a classifier with high classification accuracy. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by only using approximately 2.5{\%} of the training data used by the conventional system. An additional disadvantage of the conventional system is that, if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved.",
keywords = "Active learning, Data labeling, Ensemble learning, Malicious domain name",
author = "Naoki Fukushi and Daiki Chiba and Mitsuaki Akiyama and Masato Uchida",
year = "2019",
month = "7",
doi = "10.1109/COMPSAC.2019.00114",
language = "English",
series = "Proceedings - International Computer Software and Applications Conference",
publisher = "IEEE Computer Society",
pages = "770--775",
editor = "Vladimir Getov and Jean-Luc Gaudiot and Nariyoshi Yamai and Stelvio Cimato and Morris Chang and Yuuichi Teranishi and Ji-Jiang Yang and Leong, {Hong Va} and Hossian Shahriar and Michiharu Takemoto and Dave Towey and Hiroki Takakura and Atilla Elci and Susumu Takeuchi and Satish Puri",
booktitle = "Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019",

}

TY - GEN

T1 - Exploration into gray area

T2 - Efficient labeling for malicious domain name detection

AU - Fukushi, Naoki

AU - Chiba, Daiki

AU - Akiyama, Mitsuaki

AU - Uchida, Masato

PY - 2019/7

Y1 - 2019/7

N2 - This paper presents a method to reduce the labeling cost when acquiring training data for a system that detects malicious domain names by supervised machine learning. The conventional system requires large quantities of both benign and malicious domain names to be prepared as training data to obtain a classifier with high classification accuracy. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by only using approximately 2.5% of the training data used by the conventional system. An additional disadvantage of the conventional system is that, if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved.

KW - Active learning

KW - Data labeling

KW - Ensemble learning

KW - Malicious domain name

UR - http://www.scopus.com/inward/record.url?scp=85072699478&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072699478&partnerID=8YFLogxK

U2 - 10.1109/COMPSAC.2019.00114

DO - 10.1109/COMPSAC.2019.00114

M3 - Conference contribution

AN - SCOPUS:85072699478

T3 - Proceedings - International Computer Software and Applications Conference

SP - 770

EP - 775

BT - Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019

A2 - Getov, Vladimir

A2 - Gaudiot, Jean-Luc

A2 - Yamai, Nariyoshi

A2 - Cimato, Stelvio

A2 - Chang, Morris

A2 - Teranishi, Yuuichi

A2 - Yang, Ji-Jiang

A2 - Leong, Hong Va

A2 - Shahriar, Hossian

A2 - Takemoto, Michiharu

A2 - Towey, Dave

A2 - Takakura, Hiroki

A2 - Elci, Atilla

A2 - Takeuchi, Susumu

A2 - Puri, Satish

PB - IEEE Computer Society

ER -