Adaptive Focused Website Segment Crawler

Tanaphol Suebchua, Arnon Rungsawang, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    4 Citations (Scopus)

    Abstract

    Focused web crawlers have become indispensable for vertical search engines, which provide search services over specialized datasets. A vertical search engine must collect only specific web pages from the web, whereas general-purpose search engines such as Google and Bing gather web pages from all over the world. The central problem in focused crawling research is how to collect these specific pages with minimal computing resources. We previously addressed this problem by proposing a focused crawling strategy that uses an ensemble machine learning classifier to find groups of relevant web pages, referred to as relevant website segments. In this paper, we enhance that crawler in two ways: 1) We increase the accuracy of website segment prediction by preparing two predictors: one trained on features extracted from relevant source website segments and another trained on features from irrelevant ones. The idea is that these two types of source website segments may have different characteristics. 2) We propose a noisy data elimination method that is applied when the predictor is updated incrementally during the crawling process. A preliminary experiment shows that our enhanced crawler outperforms a crawler equipped with neither of these approaches by up to around 12%.
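
As a rough illustration of the two enhancements described in the abstract, the Python sketch below routes each candidate website segment to one of two predictors depending on whether its source segment was relevant, and filters training examples before an incremental update. Everything in it is an assumption made for illustration: the names `score_segment`, `drop_noisy`, and `Frontier`, the `topic_term_ratio` feature, the toy predictors, and the 0.8 error threshold do not come from the paper. The abstract does not state which features, classifiers, or noise criterion the authors use, so the threshold filter is only a stand-in for their noisy data elimination method.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Predictor = Callable[[Features], float]  # estimated relevance in [0, 1]
Example = Tuple[Features, int]           # (features, label), label is 0 or 1


def score_segment(features: Features, source_is_relevant: bool,
                  from_relevant: Predictor, from_irrelevant: Predictor) -> float:
    """Score a candidate segment with the predictor matching its source type."""
    predictor = from_relevant if source_is_relevant else from_irrelevant
    return predictor(features)


def drop_noisy(examples: List[Example], predictor: Predictor,
               max_error: float = 0.8) -> List[Example]:
    """Assumed noise filter: keep examples the current predictor does not
    contradict by more than max_error. The paper's criterion may differ."""
    return [(f, y) for f, y in examples if abs(predictor(f) - y) <= max_error]


class Frontier:
    """Priority queue of segment URLs, highest predicted relevance first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self) -> Tuple[str, float]:
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Hypothetical toy predictors standing in for the trained classifiers.
def toy_from_relevant(f: Features) -> float:
    return min(1.0, 0.5 + 0.5 * f.get("topic_term_ratio", 0.0))


def toy_from_irrelevant(f: Features) -> float:
    return 0.5 * f.get("topic_term_ratio", 0.0)
```

With these toy predictors, the same candidate features receive a higher score when the link comes from a relevant source segment than from an irrelevant one, which is the kind of difference in characteristics that the two-predictor design is meant to capture.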

    Original language: English
    Title of host publication: NBiS 2016 - 19th International Conference on Network-Based Information Systems
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 181-187
    Number of pages: 7
    ISBN (Electronic): 9781509009794
    DOI: https://doi.org/10.1109/NBiS.2016.5
    Publication status: Published - 2016 Dec 16
    Event: 19th International Conference on Network-Based Information Systems, NBiS 2016 - Ostrava, Czech Republic
    Duration: 2016 Sep 7 to 2016 Sep 9

    Other

    Other: 19th International Conference on Network-Based Information Systems, NBiS 2016
    Country: Czech Republic
    City: Ostrava
    Period: 16/9/7 to 16/9/9

    Fingerprint

    Websites
    Search engines
    Web crawler
    World Wide Web
    Learning systems
    Classifiers
    Experiments

    Keywords

    • classifier ensemble
    • focused crawler
    • machine learning
    • noise reduction
    • topic specific web crawler
    • website segment crawler

    ASJC Scopus subject areas

    • Information Systems
    • Computer Networks and Communications

    Cite this

    Suebchua, T., Rungsawang, A., & Yamana, H. (2016). Adaptive Focused Website Segment Crawler. In NBiS 2016 - 19th International Conference on Network-Based Information Systems (pp. 181-187). [7789756] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/NBiS.2016.5

    @inproceedings{c371e9e92579459284ef4e3a606b6973,
    title = "Adaptive Focused Website Segment Crawler",
    keywords = "classifier ensemble, focused crawler, machine learning, noise reduction, topic specific web crawler, website segment crawler",
    author = "Tanaphol Suebchua and Arnon Rungsawang and Hayato Yamana",
    year = "2016",
    month = "12",
    day = "16",
    doi = "10.1109/NBiS.2016.5",
    language = "English",
    pages = "181--187",
    booktitle = "NBiS 2016 - 19th International Conference on Network-Based Information Systems",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    address = "United States",

    }

    TY - GEN
    T1 - Adaptive Focused Website Segment Crawler
    AU - Suebchua, Tanaphol
    AU - Rungsawang, Arnon
    AU - Yamana, Hayato
    PY - 2016/12/16
    Y1 - 2016/12/16
    KW - classifier ensemble
    KW - focused crawler
    KW - machine learning
    KW - noise reduction
    KW - topic specific web crawler
    KW - website segment crawler
    UR - http://www.scopus.com/inward/record.url?scp=85011051603&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=85011051603&partnerID=8YFLogxK
    U2 - 10.1109/NBiS.2016.5
    DO - 10.1109/NBiS.2016.5
    M3 - Conference contribution
    AN - SCOPUS:85011051603
    SP - 181
    EP - 187
    BT - NBiS 2016 - 19th International Conference on Network-Based Information Systems
    PB - Institute of Electrical and Electronics Engineers Inc.
    ER -