Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection

Keisuke Ishibashi, Tatsuya Mori, Ryoichi Kawahara, Yutaka Hirokawa, Atsushi Kobayashi, Kimihiro Yamamoto, Hitoaki Sakamoto, Shoichiro Asano

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

We propose an algorithm for finding heavy hitters in terms of cardinality (the number of distinct items in a set) in massive traffic data using a small amount of memory. Examples of such cardinality heavy hitters are hosts that send large numbers of flows or hosts that communicate with large numbers of other hosts. Finding these hosts is crucial to providing good communication quality because they significantly affect the communications of other hosts, either through malicious activities such as worm scans, spam distribution, or botnet control, or through normal activities such as participating in a flash crowd or performing peer-to-peer (P2P) communication. To determine the cardinality of a host precisely, we need a table of previously seen items for each host (e.g., a flow table for every host), and this may be infeasible in a high-speed environment with a massive amount of traffic. In this paper, we use a cardinality estimation algorithm that does not require these tables and needs only a small amount of information called the cardinality summary. This is made possible by relaxing the goal from exact counting to estimation of cardinality. In addition, we propose an algorithm that does not need to maintain a cardinality summary for each host, but only for the partitioned addresses of a host. As a result, the required number of tables can be significantly decreased. We evaluated our algorithm using actual backbone traffic data to find the heavy hitters in terms of the number of flows and to estimate the number of these flows. We found that while the accuracy degraded when estimating for hosts with few flows, the algorithm could accurately find the top-100 hosts in terms of the number of flows using a limited amount of memory. In addition, we found that the number of tables required to achieve a predefined accuracy increased logarithmically with the total number of hosts, which indicates that our method is applicable to large traffic data with a very large number of hosts.
We also introduce an application of our algorithm to anomaly detection. With actual traffic data, our method successfully detected a sudden network scan.
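
The idea of replacing per-host flow tables with a small cardinality summary can be illustrated with a minimal sketch. The abstract does not name the estimator, so the code below assumes a linear-counting bitmap as a stand-in; the class name LinearCounter, the function top_k_hosts, the bitmap size m, and the synthetic traffic are all hypothetical. It keeps one summary per host and does not reproduce the paper's partitioned-address scheme.

```python
import hashlib
import math
from collections import defaultdict


class LinearCounter:
    """Fixed-size bitmap that estimates how many distinct items were added."""

    def __init__(self, m: int = 4096) -> None:
        self.m = m      # number of bits in the summary (assumed size)
        self.bits = 0   # Python int used as an m-bit bitmap

    def add(self, item: str) -> None:
        # Hash the item to one bit position; duplicates set the same bit.
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        self.bits |= 1 << (int.from_bytes(digest, "big") % self.m)

    def estimate(self) -> float:
        zeros = self.m - bin(self.bits).count("1")
        if zeros == 0:
            # Bitmap saturated: report the largest value the estimator can express.
            return self.m * math.log(self.m)
        # Linear-counting estimate: n ~ -m * ln(fraction of zero bits).
        return -self.m * math.log(zeros / self.m)


def top_k_hosts(flows, k, m=4096):
    """Return the k sources with the largest estimated number of distinct destinations."""
    summaries = defaultdict(lambda: LinearCounter(m))
    for src, dst in flows:
        summaries[src].add(dst)
    ranked = sorted(summaries.items(), key=lambda kv: kv[1].estimate(), reverse=True)
    return [(src, s.estimate()) for src, s in ranked[:k]]


# Synthetic traffic (made up for illustration): one scanner, twenty ordinary hosts.
flows = [("10.0.0.99", f"192.168.{i // 250}.{i % 250}") for i in range(500)]
flows += [(f"10.0.0.{h}", f"172.16.0.{d}") for h in range(20) for d in range(3)]
flows += flows[:100]  # repeated flows must not inflate the estimate

top = top_k_hosts(flows, k=1)
print(top)
```

Each summary here occupies a fixed m bits regardless of how many flows a host sends, which is the memory trade-off the abstract describes: exact counts are given up in exchange for a bounded, table-free estimate.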

Original language: English
Pages (from-to): 1331-1339
Number of pages: 9
Journal: IEICE Transactions on Communications
Volume: E91-B
Issue number: 5
DOIs: https://doi.org/10.1093/ietcom/e91-b.5.1331
Publication status: Published - 2008
Externally published: Yes

Fingerprint

Communication
Data storage equipment
Botnet

Keywords

  • Anomaly detection
  • Cardinality
  • Data stream

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Networks and Communications
  • Software

Cite this

Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection. / Ishibashi, Keisuke; Mori, Tatsuya; Kawahara, Ryoichi; Hirokawa, Yutaka; Kobayashi, Atsushi; Yamamoto, Kimihiro; Sakamoto, Hitoaki; Asano, Shoichiro.

In: IEICE Transactions on Communications, Vol. E91-B, No. 5, 2008, p. 1331-1339.

Ishibashi, K, Mori, T, Kawahara, R, Hirokawa, Y, Kobayashi, A, Yamamoto, K, Sakamoto, H & Asano, S 2008, 'Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection', IEICE Transactions on Communications, vol. E91-B, no. 5, pp. 1331-1339. https://doi.org/10.1093/ietcom/e91-b.5.1331
Ishibashi, Keisuke ; Mori, Tatsuya ; Kawahara, Ryoichi ; Hirokawa, Yutaka ; Kobayashi, Atsushi ; Yamamoto, Kimihiro ; Sakamoto, Hitoaki ; Asano, Shoichiro. / Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection. In: IEICE Transactions on Communications. 2008 ; Vol. E91-B, No. 5. pp. 1331-1339.
@article{bcf6f8235c254ac7afbc8848d89d0b7e,
title = "Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection",
keywords = "Anomaly detection, Cardinality, Data stream",
author = "Keisuke Ishibashi and Tatsuya Mori and Ryoichi Kawahara and Yutaka Hirokawa and Atsushi Kobayashi and Kimihiro Yamamoto and Hitoaki Sakamoto and Shoichiro Asano",
year = "2008",
doi = "10.1093/ietcom/e91-b.5.1331",
language = "English",
volume = "E91-B",
pages = "1331--1339",
journal = "IEICE Transactions on Communications",
issn = "0916-8516",
publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
number = "5",

}

TY - JOUR
T1 - Finding cardinality heavy-hitters in massive traffic data and its application to anomaly detection
AU - Ishibashi, Keisuke
AU - Mori, Tatsuya
AU - Kawahara, Ryoichi
AU - Hirokawa, Yutaka
AU - Kobayashi, Atsushi
AU - Yamamoto, Kimihiro
AU - Sakamoto, Hitoaki
AU - Asano, Shoichiro
PY - 2008
Y1 - 2008
KW - Anomaly detection
KW - Cardinality
KW - Data stream
UR - http://www.scopus.com/inward/record.url?scp=68449093969&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=68449093969&partnerID=8YFLogxK
U2 - 10.1093/ietcom/e91-b.5.1331
DO - 10.1093/ietcom/e91-b.5.1331
M3 - Article
AN - SCOPUS:68449093969
VL - E91-B
SP - 1331
EP - 1339
JO - IEICE Transactions on Communications
JF - IEICE Transactions on Communications
SN - 0916-8516
IS - 5
ER -