### Abstract

Finding the rare information hidden in big data is an important and meaningful task for outlier detection. However, finding such rare information is extremely difficult in high dimensional space, where the notorious curse of dimensionality takes effect. Most existing methods fail to obtain good results because the Euclidean distance does not work well in high dimensional space. In this paper, we first perform a grid division of the data on each attribute and compare the density ratio for every point in each dimension. We then project the points of the same grid area onto the other dimensions and calculate their dispersion extent with a defined cluster density value. Finally, we sum up the weight values of each point from the two calculation steps; the points with the largest weights are reported as outliers. The experimental results show that the proposed algorithm achieves high precision and recall on synthetic datasets with the dimensionality varying from 100 to 10,000.
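The per-attribute scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration of the first step only: divide each attribute into equal-width grid cells and accumulate a density-ratio weight per dimension, so that points falling in sparse cells across many dimensions receive the largest outlier scores. The bin count, the ratio definition, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def grid_density_scores(X, n_bins=10):
    """Illustrative sketch of per-attribute grid division with
    density-ratio weights summed over all dimensions."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        # Divide this attribute's range into n_bins equal-width cells.
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        cell = np.digitize(col, edges[1:-1])  # cell index 0..n_bins-1
        counts = np.bincount(cell, minlength=n_bins)
        # Points in cells that are sparse relative to the average cell
        # accumulate more outlier weight in this dimension.
        avg = n / n_bins
        scores += avg / np.maximum(counts[cell], 1)
    return scores  # largest scores -> outlier candidates

# Usage: 200 inliers near the origin plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 50)), np.full((1, 50), 8.0)])
s = grid_density_scores(X)
print(int(np.argmax(s)))  # the appended far point scores highest
```

Because the weight is computed independently per attribute, the score never relies on full-dimensional Euclidean distances, which is the failure mode the abstract attributes to most existing methods.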

Original language | English |
---|---|

Title of host publication | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |

Pages | 307-318 |

Number of pages | 12 |

Volume | 7867 LNAI |

DOIs | https://doi.org/10.1007/978-3-642-40319-4_27 |

Publication status | Published - 2013 |

Event | 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013 - Gold Coast, QLD. Duration: 14 Apr 2013 → 17 Apr 2013 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|

Volume | 7867 LNAI |

ISSN (Print) | 0302-9743 |

ISSN (Electronic) | 1611-3349 |

ISBN (Print) | 9783642403187 |

### Other

Other | 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013 |
---|---|

City | Gold Coast, QLD |

Period | 14 Apr 2013 → 17 Apr 2013 |

### Keywords

- Dimensional projection
- High dimension
- Outlier score

### ASJC Scopus subject areas

- Computer Science (all)
- Theoretical Computer Science

### Cite this

Bao, Z., & Kameyama, W. (2013). A novel proposal for outlier detection in high dimensional space. In *Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)* (Vol. 7867 LNAI, pp. 307-318). 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013, Gold Coast, QLD. https://doi.org/10.1007/978-3-642-40319-4_27

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

