TY - GEN
T1 - Exploration into gray area
T2 - 43rd IEEE Annual Computer Software and Applications Conference, COMPSAC 2019
AU - Fukushi, Naoki
AU - Chiba, Daiki
AU - Akiyama, Mitsuaki
AU - Uchida, Masato
N1 - Funding Information: This work was supported in part by the Japan Society for the Promotion of Science through Grants-in-Aid for Scientific Research (C) (17K00135). Publisher Copyright: © 2019 IEEE.
PY - 2019/7
Y1 - 2019/7
AB - This paper presents a method to reduce the labeling cost when acquiring training data for a system that detects malicious domain names by supervised machine learning. The conventional system requires large quantities of both benign and malicious domain names to be prepared as training data to obtain a classifier with high classification accuracy. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by only using approximately 2.5% of the training data used by the conventional system. An additional disadvantage of the conventional system is that, if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved.
KW - Active learning
KW - Data labeling
KW - Ensemble learning
KW - Malicious domain name
UR - http://www.scopus.com/inward/record.url?scp=85072699478&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072699478&partnerID=8YFLogxK
U2 - 10.1109/COMPSAC.2019.00114
DO - 10.1109/COMPSAC.2019.00114
M3 - Conference contribution
AN - SCOPUS:85072699478
T3 - Proceedings - International Computer Software and Applications Conference
SP - 770
EP - 775
BT - Proceedings - 2019 IEEE 43rd Annual Computer Software and Applications Conference, COMPSAC 2019
A2 - Getov, Vladimir
A2 - Gaudiot, Jean-Luc
A2 - Yamai, Nariyoshi
A2 - Cimato, Stelvio
A2 - Chang, Morris
A2 - Teranishi, Yuuichi
A2 - Yang, Ji-Jiang
A2 - Leong, Hong Va
A2 - Shahriar, Hossain
A2 - Takemoto, Michiharu
A2 - Towey, Dave
A2 - Takakura, Hiroki
A2 - Elci, Atilla
A2 - Takeuchi, Susumu
A2 - Puri, Satish
PB - IEEE Computer Society
Y2 - 15 July 2019 through 19 July 2019
ER -