Efficient Topical Focused Crawling Through Neighborhood Feature

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana

    Research output: Contribution to journal (Article)

    Abstract

    A focused web crawler is an essential tool for gathering domain-specific data used by national web corpora, vertical search engines, and similar applications, since it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling research is prioritizing the unvisited web pages in the crawling frontier so that they can be crawled in order of priority. The most common feature used to prioritize an unvisited web page, adopted in many focused crawling studies, is the relevancy of its source web pages, i.e., the pages that link to it. However, this feature is limited: when an unvisited page has few source pages, its relevancy cannot be estimated correctly. To solve this problem and improve the efficiency of focused web crawlers, we propose a new feature called the “neighborhood feature”. It allows additional, already-downloaded web pages to be used when estimating the priority of a target web page. These additional pages are those located in the same directory as the target web page and those whose directory paths are similar to that of the target web page. Our experimental results show that focused crawlers enhanced with the neighborhood feature outperform both crawlers that do not use it and state-of-the-art focused crawlers, including the HMM crawler.
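
    The idea of supplementing in-link relevancy with neighboring pages can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how directory-path similarity is measured or how the scores are combined, so the shared-prefix similarity, the `min_similarity` threshold, the `neighbor_weight` blend, and the function names below are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def directory_segments(url):
    """Return the host followed by the directory segments of a URL.

    The last path segment is treated as a file name and dropped unless
    the path ends with a slash.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parts and not parsed.path.endswith("/"):
        parts = parts[:-1]
    return [parsed.netloc] + parts


def path_similarity(url_a, url_b):
    """Assumed similarity: shared leading segments over the longer path.

    Pages in the same directory score 1.0; pages on different hosts score 0.0.
    """
    seg_a, seg_b = directory_segments(url_a), directory_segments(url_b)
    shared = 0
    for a, b in zip(seg_a, seg_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(seg_a), len(seg_b))


def estimate_priority(target_url, inlink_scores, downloaded,
                      min_similarity=0.5, neighbor_weight=0.5):
    """Blend in-link relevancy with the relevancy of neighboring pages.

    inlink_scores: relevancy scores of pages that link to the target.
    downloaded: dict mapping already-downloaded URLs to relevancy scores.
    min_similarity and neighbor_weight are hypothetical tuning parameters.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for url, score in downloaded.items():
        sim = path_similarity(target_url, url)
        if sim >= min_similarity:
            weighted_sum += sim * score
            weight_total += sim

    inlink_avg = sum(inlink_scores) / len(inlink_scores) if inlink_scores else None
    neighbor_avg = weighted_sum / weight_total if weight_total else None

    if inlink_avg is None and neighbor_avg is None:
        return 0.0  # no evidence at all
    if inlink_avg is None:
        return neighbor_avg  # fall back to the neighborhood alone
    if neighbor_avg is None:
        return inlink_avg
    return (1 - neighbor_weight) * inlink_avg + neighbor_weight * neighbor_avg


# Hypothetical example data (URLs and scores are made up).
downloaded_pages = {
    "http://example.org/sports/football/a.html": 0.9,
    "http://example.org/sports/tennis/b.html": 0.6,
    "http://example.org/finance/c.html": 0.1,
}
target = "http://example.org/sports/football/new.html"
priority_without_inlinks = estimate_priority(target, [], downloaded_pages)
priority_with_inlink = estimate_priority(target, [0.2], downloaded_pages)
```

    In this sketch, a target page with no known source pages still receives an estimate from pages in the same or a similar directory, which is the situation the neighborhood feature is meant to address.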

    Original language: English
    Pages (from-to): 95-118
    Number of pages: 24
    Journal: New Generation Computing
    Volume: 36
    Issue number: 2
    Publication status: Published - 2018 Apr 1

    Keywords

    • Domain-specific dataset
    • Focused crawler
    • Vertical search engine
    • Web archive

    ASJC Scopus subject areas

    • Software
    • Theoretical Computer Science
    • Hardware and Architecture
    • Computer Networks and Communications
