Efficient Topical Focused Crawling Through Neighborhood Feature

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana

    Research output: Contribution to journal (Article)

    Abstract

    A focused web crawler is an essential tool for gathering domain-specific data used by national web corpora, vertical search engines, and similar applications, since it is more efficient than general breadth-first or depth-first crawlers. The central problem in focused crawling research is prioritizing the unvisited web pages in the crawling frontier so that they can be crawled in order of priority. The most common feature used in focused crawling studies to prioritize an unvisited web page is the relevancy of its source web pages, i.e., the pages that link to it. However, this feature is limited: the relevancy of an unvisited web page cannot be estimated correctly when only a few source web pages are available. To solve this problem and improve the efficiency of focused web crawlers, we propose a new feature called the “neighborhood feature”. It allows additional already-downloaded web pages to be used when estimating the priority of a target web page. These additional pages consist of web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that the enhanced focused crawlers outperform both crawlers that do not use the neighborhood feature and state-of-the-art focused crawlers, including the HMM crawler.
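
    The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how directory paths are compared or how scores are combined, so the shared-prefix similarity, the `min_similarity` threshold, the equal-weight averaging, and the names `directory_of`, `path_similarity`, and `estimate_priority` are all assumptions made for illustration.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return (host, directory segments) for a URL.

    Hypothetical helper: "http://a.org/x/y/page.html" -> ("a.org", ("x", "y")).
    """
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path or "/")
    segments = tuple(s for s in directory.split("/") if s)
    return parts.netloc.lower(), segments


def path_similarity(url_a, url_b):
    """Assumed similarity: shared directory prefix length over the longer depth.

    Pages on different hosts score 0.0; pages in the same directory score 1.0.
    """
    host_a, segs_a = directory_of(url_a)
    host_b, segs_b = directory_of(url_b)
    if host_a != host_b:
        return 0.0
    longest = max(len(segs_a), len(segs_b))
    if longest == 0:
        return 1.0  # both pages sit in the site root
    shared = 0
    for seg_a, seg_b in zip(segs_a, segs_b):
        if seg_a != seg_b:
            break
        shared += 1
    return shared / longest


def estimate_priority(target_url, source_scores, downloaded, min_similarity=0.5):
    """Estimate the priority of an unvisited page.

    source_scores: relevancy scores of pages linking to the target.
    downloaded: dict mapping already-downloaded URLs to relevancy scores.
    The neighborhood estimate is a similarity-weighted mean over downloaded
    pages whose directory path is close enough to the target's (assumption).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for url, score in downloaded.items():
        similarity = path_similarity(target_url, url)
        if similarity >= min_similarity:
            weighted_sum += similarity * score
            weight_total += similarity

    estimates = []
    if source_scores:
        estimates.append(sum(source_scores) / len(source_scores))
    if weight_total > 0:
        estimates.append(weighted_sum / weight_total)
    # Equal-weight combination of the two estimates (assumption).
    return sum(estimates) / len(estimates) if estimates else 0.0


target = "http://example.org/sports/football/news1.html"
downloaded_pages = {
    "http://example.org/sports/football/news0.html": 1.0,  # same directory
    "http://example.org/sports/tennis/a.html": 0.5,        # similar path
    "http://example.org/about/team.html": 0.0,             # unrelated path
    "http://other.org/sports/football/x.html": 1.0,        # different host
}
with_neighbors = estimate_priority(target, [0.0], downloaded_pages)
without_neighbors = estimate_priority(target, [0.0], {})
```

    In this toy example the target has a single, irrelevant source page, so a source-only estimate gives it the lowest priority, while the relevant pages in the same and a similar directory raise the estimate. This is the situation the abstract describes, where few source pages are available.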

    Original language: English
    Pages (from-to): 95-118
    Number of pages: 24
    Journal: New Generation Computing
    Volume: 36
    Issue number: 2
    DOI: 10.1007/s00354-017-0029-8
    Publication status: Published - 2018 Apr 1

    Fingerprint

    Websites
    Target
    Search engines
    Prioritization
    Breadth
    World Wide Web
    Search Engine
    Estimate
    Vertical

    Keywords

    • Domain-specific dataset
    • Focused crawler
    • Vertical search engine
    • Web archive

    ASJC Scopus subject areas

    • Software
    • Theoretical Computer Science
    • Hardware and Architecture
    • Computer Networks and Communications

    Cite this

    Efficient Topical Focused Crawling Through Neighborhood Feature. / Suebchua, Tanaphol; Manaskasemsak, Bundit; Rungsawang, Arnon; Yamana, Hayato.

    In: New Generation Computing, Vol. 36, No. 2, 01.04.2018, p. 95-118.

    Suebchua, Tanaphol ; Manaskasemsak, Bundit ; Rungsawang, Arnon ; Yamana, Hayato. / Efficient Topical Focused Crawling Through Neighborhood Feature. In: New Generation Computing. 2018 ; Vol. 36, No. 2. pp. 95-118.
    @article{a3a526c733be4064983006d796ecefb9,
    title = "Efficient Topical Focused Crawling Through Neighborhood Feature",
    keywords = "Domain-specific dataset, Focused crawler, Vertical search engine, Web archive",
    author = "Tanaphol Suebchua and Bundit Manaskasemsak and Arnon Rungsawang and Hayato Yamana",
    year = "2018",
    month = "4",
    day = "1",
    doi = "10.1007/s00354-017-0029-8",
    language = "English",
    volume = "36",
    pages = "95--118",
    journal = "New Generation Computing",
    issn = "0288-3635",
    publisher = "Springer Japan",
    number = "2",

    }

    TY - JOUR

    T1 - Efficient Topical Focused Crawling Through Neighborhood Feature

    AU - Suebchua, Tanaphol

    AU - Manaskasemsak, Bundit

    AU - Rungsawang, Arnon

    AU - Yamana, Hayato

    PY - 2018/4/1

    Y1 - 2018/4/1

    KW - Domain-specific dataset

    KW - Focused crawler

    KW - Vertical search engine

    KW - Web archive

    UR - http://www.scopus.com/inward/record.url?scp=85038086044&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85038086044&partnerID=8YFLogxK

    U2 - 10.1007/s00354-017-0029-8

    DO - 10.1007/s00354-017-0029-8

    M3 - Article

    VL - 36

    SP - 95

    EP - 118

    JO - New Generation Computing

    JF - New Generation Computing

    SN - 0288-3635

    IS - 2

    ER -