History-enhanced focused website segment crawler

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and has achieved high efficiency so far. One of the limitations of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5%.
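The idea in the abstract, lowering the priority of an unvisited website segment when the crawler has recently downloaded many irrelevant pages from the same website, can be illustrated with a small sketch. This is not the authors' implementation. The abstract says only that the history feature is extracted from recent crawling results and used as a prediction feature. The window size, the "fraction of irrelevant pages" definition, the streak counter, the multiplicative penalty, and all names (`CrawlHistory`, `history_feature`, `prioritise`, `weight`) are assumptions made here for illustration.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit

WINDOW = 5  # hypothetical: number of recent downloads remembered per website


class CrawlHistory:
    """Keeps the most recent relevance judgements for each website (host)."""

    def __init__(self, window=WINDOW):
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, url, relevant):
        """Store whether a downloaded page was judged relevant."""
        self._recent[urlsplit(url).netloc].append(bool(relevant))

    def history_feature(self, url):
        """Assumed definition: fraction of irrelevant pages among the recent
        downloads from the URL's website; 0.0 when nothing is known yet."""
        recent = self._recent.get(urlsplit(url).netloc)
        if not recent:
            return 0.0
        return recent.count(False) / len(recent)

    def irrelevant_streak(self, url):
        """Number of consecutive irrelevant pages at the end of the window."""
        streak = 0
        for relevant in reversed(self._recent.get(urlsplit(url).netloc, ())):
            if relevant:
                break
            streak += 1
        return streak


def prioritise(segments, history, weight=0.8):
    """Order (segment_url, base_score) pairs, best first.

    The base score stands in for the classifier's prediction; here it is
    simply scaled down by the history feature (an assumed combination rule).
    """
    heap = [
        (-base * (1.0 - weight * history.history_feature(url)), url)
        for url, base in segments
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    history = CrawlHistory(window=4)
    for page in ("p1", "p2", "p3", "p4"):
        history.record(f"http://a.example/old/{page}", relevant=False)
    history.record("http://b.example/docs/intro", relevant=True)

    frontier = [("http://a.example/new/", 0.9), ("http://b.example/more/", 0.6)]
    print(prioritise(frontier, history))
```

In this sketch the segment on `a.example` starts with the higher base score but is visited later, because every recent download from that website was irrelevant. In the paper the feature is fed to the learned predictor rather than applied as a fixed multiplier.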

    Original language: English
    Title of host publication: 32nd International Conference on Information Networking, ICOIN 2018
    Publisher: IEEE Computer Society
    Pages: 80-85
    Number of pages: 6
    Volume: 2018-January
    ISBN (Electronic): 9781538622896
    DOIs: https://doi.org/10.1109/ICOIN.2018.8343090
    Publication status: Published - 2018 Apr 19
    Event: 32nd International Conference on Information Networking, ICOIN 2018 - Chiang Mai, Thailand
    Duration: 2018 Jan 10 to 2018 Jan 12

    Other

    Other: 32nd International Conference on Information Networking, ICOIN 2018
    Country: Thailand
    City: Chiang Mai
    Period: 18/1/10 to 18/1/12

    Fingerprint

    Websites
    Web crawler
    Learning systems
    Bandwidth

    Keywords

    • Focused crawler
    • Machine learning
    • Topic-specific web crawler
    • Vertical search engine

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems

    Cite this

    Suebchua, T., Manaskasemsak, B., Rungsawang, A., & Yamana, H. (2018). History-enhanced focused website segment crawler. In 32nd International Conference on Information Networking, ICOIN 2018 (Vol. 2018-January, pp. 80-85). IEEE Computer Society. https://doi.org/10.1109/ICOIN.2018.8343090

    @inproceedings{93f7dd78d2ea4a5e8d21cc33b71fe646,
    title = "History-enhanced focused website segment crawler",
    abstract = "The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and has achieved high efficiency so far. One of the limitations of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5{\%}.",
    keywords = "Focused crawler, Machine learning, Topic-specific web crawler, Vertical search engine",
    author = "Tanaphol Suebchua and Bundit Manaskasemsak and Arnon Rungsawang and Hayato Yamana",
    year = "2018",
    month = "4",
    day = "19",
    doi = "10.1109/ICOIN.2018.8343090",
    language = "English",
    volume = "2018-January",
    pages = "80--85",
    booktitle = "32nd International Conference on Information Networking, ICOIN 2018",
    publisher = "IEEE Computer Society",

    }

    TY - GEN

    T1 - History-enhanced focused website segment crawler

    AU - Suebchua, Tanaphol

    AU - Manaskasemsak, Bundit

    AU - Rungsawang, Arnon

    AU - Yamana, Hayato

    PY - 2018/4/19

    Y1 - 2018/4/19

    AB - The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and has achieved high efficiency so far. One of the limitations of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5%.

    KW - Focused crawler

    KW - Machine learning

    KW - Topic-specific web crawler

    KW - Vertical search engine

    UR - http://www.scopus.com/inward/record.url?scp=85046998816&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85046998816&partnerID=8YFLogxK

    U2 - 10.1109/ICOIN.2018.8343090

    DO - 10.1109/ICOIN.2018.8343090

    M3 - Conference contribution

    AN - SCOPUS:85046998816

    VL - 2018-January

    SP - 80

    EP - 85

    BT - 32nd International Conference on Information Networking, ICOIN 2018

    PB - IEEE Computer Society

    ER -