Adaptive Focused Website Segment Crawler

Tanaphol Suebchua, Arnon Rungsawang, Hayato Yamana

Research output

5 Citations (Scopus)

Abstract

Focused web crawlers have become indispensable for vertical search engines that provide a search service for specialized datasets. These vertical search engines have to collect specific web pages in the web space, whereas search engines such as Google and Bing gather web pages from all over the world. The problem in focused crawling research is how to collect specific web pages with minimal computing resources. We previously addressed this problem by proposing a focused crawling strategy that utilizes an ensemble machine learning classifier to find groups of relevant web pages, referred to as relevant website segments. In this paper, we enhance the proposed crawler as follows: 1) We increase the accuracy of predicting website segments by preparing two predictors: one trained on features extracted from relevant source website segments and another trained on features from irrelevant ones. The idea is that these two types of source website segments may have different characteristics. 2) We also propose a noisy data elimination method for updating the predictor incrementally during the crawling process. A preliminary experiment shows that our enhanced crawler outperforms a crawler equipped with neither of these approaches by up to around 12%.
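The two enhancements named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the features, the ensemble classifier, or the noise-elimination rule, so every name here (`TokenVotePredictor`, `DualPredictorModel`, `noise_margin`) is hypothetical, the token-count predictor is a toy stand-in for the ensemble classifier, and the rule "skip an observation that contradicts a confident prediction" is an assumed filter chosen only to show where such a step would sit.

```python
from collections import defaultdict


class TokenVotePredictor:
    """Toy stand-in for the paper's ensemble classifier (an assumption)."""

    def __init__(self):
        # counts[token][label] = number of training examples seen with that token and label
        self.counts = defaultdict(lambda: {True: 0, False: 0})

    def update(self, tokens, label):
        for token in set(tokens):
            self.counts[token][label] += 1

    def score(self, tokens):
        """Estimate, in [0, 1], how likely the target website segment is relevant."""
        pos = neg = 1  # add-one smoothing, so an untrained predictor returns 0.5
        for token in set(tokens):
            if token in self.counts:
                pos += self.counts[token][True]
                neg += self.counts[token][False]
        return pos / (pos + neg)


class DualPredictorModel:
    """Keeps one predictor per type of source segment, as the abstract describes."""

    def __init__(self, noise_margin=0.4):
        self.from_relevant = TokenVotePredictor()    # links found in relevant segments
        self.from_irrelevant = TokenVotePredictor()  # links found in irrelevant segments
        self.noise_margin = noise_margin             # hypothetical confidence margin

    def _pick(self, source_is_relevant):
        return self.from_relevant if source_is_relevant else self.from_irrelevant

    def predict(self, source_is_relevant, link_tokens):
        return self._pick(source_is_relevant).score(link_tokens)

    def update(self, source_is_relevant, link_tokens, observed_label):
        """Incremental update during crawling; returns False if the example was dropped."""
        predictor = self._pick(source_is_relevant)
        p = predictor.score(link_tokens)
        # Assumed noise filter: drop observations that contradict a confident prediction.
        contradicts_relevant = p >= 0.5 + self.noise_margin and not observed_label
        contradicts_irrelevant = p <= 0.5 - self.noise_margin and observed_label
        if contradicts_relevant or contradicts_irrelevant:
            return False
        predictor.update(link_tokens, observed_label)
        return True
```

In a crawl loop, `predict` would rank outgoing links by the score of the predictor matching the source segment's type, and `update` would be called once the fetched segment's true relevance is known.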

Original language: English
Title of host publication: NBiS 2016 - 19th International Conference on Network-Based Information Systems
Editors: Fatos Xhafa, Tomoya Enokido, Leonard Barolli, Makoto Takizawa, Vaclav Snasel
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 181-187
Number of pages: 7
ISBN (Electronic): 9781509009794
DOI
Publication status: Published - 16 Dec 2016
Event: 19th International Conference on Network-Based Information Systems, NBiS 2016 - Ostrava, Czech Republic
Duration: 7 Sep 2016 to 9 Sep 2016

Publication series

Name: NBiS 2016 - 19th International Conference on Network-Based Information Systems

Other

Other: 19th International Conference on Network-Based Information Systems, NBiS 2016
Country/Territory: Czech Republic
City: Ostrava
Period: 7 Sep 2016 to 9 Sep 2016

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
