History-enhanced focused website segment crawler

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana

Research output: Conference contribution

1 citation (Scopus)

Abstract

The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and has achieved high efficiency so far. One of the limitations of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5%.
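The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the history feature is computed or how it enters the prediction model, so everything here is an assumption made for illustration: the sliding window size, the `CrawlHistory` class, the choice of "fraction of recent pages that were relevant" as the feature, and the multiplicative `priority` rule are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque

WINDOW = 5  # assumed: number of recent pages remembered per website


class CrawlHistory:
    """Hypothetical helper that tracks recent relevance results per website."""

    def __init__(self, window=WINDOW):
        # One bounded queue per website; old results fall out automatically.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, site, relevant):
        """Store whether the page just downloaded from `site` was relevant."""
        self._recent[site].append(bool(relevant))

    def history_feature(self, site):
        """Fraction of recently downloaded pages from `site` that were relevant.

        Returns 1.0 for a website with no history, so unseen sites are not
        penalised (an assumption of this sketch).
        """
        seen = self._recent.get(site)
        if not seen:
            return 1.0
        return sum(seen) / len(seen)


def priority(base_score, history, site):
    """Assumed combination rule: scale the link-based score by the feature."""
    return base_score * history.history_feature(site)
```

With this rule, a website that has just returned a run of irrelevant pages sees the priority of its remaining unvisited segments drop toward zero, while a website with no crawl history keeps its original score. In the paper the history feature is described as an input to the prediction step rather than a fixed multiplier, so the multiplication above only stands in for that step.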

Original language: English
Title of host publication: 32nd International Conference on Information Networking, ICOIN 2018
Publisher: IEEE Computer Society
Pages: 80-85
Number of pages: 6
ISBN (electronic): 9781538622896
DOI
Publication status: Published - 19 Apr 2018
Event: 32nd International Conference on Information Networking, ICOIN 2018 - Chiang Mai, Thailand
Duration: 10 Jan 2018 to 12 Jan 2018

Publication series

Name: International Conference on Information Networking
2018-January
ISSN (print): 1976-7684

Other

Other: 32nd International Conference on Information Networking, ICOIN 2018
Country/Territory: Thailand
City: Chiang Mai
Period: 10 Jan 2018 to 12 Jan 2018

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
