EPCI: Extracting potentially copyright infringement texts from the web

Takashi Tashiro, Takanori Ueda, Taisuke Hori, Yu Hirate, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    3 Citations (Scopus)

    Abstract

    In this paper, we propose EPCI, a new system that extracts texts potentially infringing copyright from the Web. EPCI works as follows: (1) it generates a set of queries from a given copyrighted seed-text, (2) it submits every query to a search engine API, (3) it gathers the resulting Web pages in rank order, starting from the top, until the similarity between the seed-text and a result page falls below a given threshold, and (4) it merges all the gathered pages and re-ranks them in order of their similarity to the seed-text. Our experiment with 40 seed-texts shows that EPCI extracts, on average, 132 potentially copyright-infringing Web pages per copyrighted seed-text with 94% precision.
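The four steps in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how queries are generated or how similarity is measured, so the fixed-size word windows in `generate_queries`, the word-trigram containment score in `similarity`, the per-query stopping rule, and the `search` callable standing in for a search engine API are all assumptions, and every function name here is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(seed: str, page: str, n: int = 3) -> float:
    """Assumed measure: fraction of the seed's n-grams that occur in the page."""
    seed_set, page_set = shingles(seed, n), shingles(page, n)
    return len(seed_set & page_set) / len(seed_set) if seed_set else 0.0


def generate_queries(seed: str, window: int = 5, step: int = 5) -> List[str]:
    """Step 1 (assumed strategy): cut the seed-text into word-window queries."""
    words = seed.split()
    if len(words) <= window:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + window])
            for i in range(0, len(words) - window + 1, step)]


def epci_sketch(
    seed: str,
    search: Callable[[str], Iterable[Tuple[str, str]]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Run steps 1-4; `search` yields (url, page_text) pairs in rank order."""
    scores: Dict[str, float] = {}
    for query in generate_queries(seed):           # step 1
        for url, text in search(query):            # step 2
            score = similarity(seed, text)
            if score < threshold:                  # step 3: stop at first miss
                break
            scores[url] = max(score, scores.get(url, 0.0))
    # Step 4: merge results from all queries and re-rank by similarity.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    seed_text = ("the quick brown fox jumps over the lazy dog "
                 "near the quiet river bank")
    # Toy in-memory corpus standing in for the Web (hypothetical data).
    corpus = [
        ("http://example.com/copy", seed_text),
        ("http://example.com/partial",
         "the quick brown fox jumps over the lazy dog and runs away"),
        ("http://example.com/other",
         "completely unrelated text about cooking pasta"),
    ]

    def fake_search(query: str) -> List[Tuple[str, str]]:
        """Return corpus pages that contain the query phrase, in corpus order."""
        return [(u, t) for u, t in corpus if query.lower() in t.lower()]

    for url, score in epci_sketch(seed_text, fake_search):
        print(f"{score:.2f}  {url}")
```

In a real setting, `search` would wrap whichever search engine API is available and fetch the text of each result page; the threshold trades recall against the number of pages fetched per query.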

    Original language: English
    Title of host publication: 16th International World Wide Web Conference, WWW2007
    Pages: 1151-1152
    Number of pages: 2
    DOIs: https://doi.org/10.1145/1242572.1242740
    Publication status: Published - 2007
    Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB
    Duration: 2007 May 8 to 2007 May 12

    Other

    Other: 16th International World Wide Web Conference, WWW2007
    City: Banff, AB
    Period: 07/5/8 to 07/5/12

    Keywords

    • Copy detection
    • Information retrieval

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software

    Cite this

    Tashiro, T., Ueda, T., Hori, T., Hirate, Y., & Yamana, H. (2007). EPCI: Extracting potentially copyright infringement texts from the web. In 16th International World Wide Web Conference, WWW2007 (pp. 1151-1152). https://doi.org/10.1145/1242572.1242740