EPCI: Extracting potentially copyright infringement texts from the web

Takashi Tashiro, Takanori Ueda, Taisuke Hori, Yu Hirate, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    3 Citations (Scopus)

    Abstract

    In this paper, we propose EPCI, a new system that extracts potentially copyright-infringing texts from the Web. EPCI works in four steps: (1) generating a set of queries from a given copyrighted seed-text, (2) submitting every query to a search engine API, (3) gathering the search result Web pages in rank order, starting from the top, until the similarity between the seed-text and the search result pages falls below a given threshold, and (4) merging all the gathered pages and re-ranking them in order of their similarity. Our experiment using 40 seed-texts shows that EPCI extracts, on average, 132 potentially copyright-infringing Web pages per seed-text with 94% precision.
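The four steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how queries are generated or how similarity is measured, so the fixed-size word windows in `generate_queries`, the word-trigram containment score in `similarity`, the default threshold, and the injected `search` callable (a stand-in for a real search engine API) are all assumptions made for illustration. Stopping at the first result below the threshold is one reading of step (3).

```python
from __future__ import annotations

from typing import Callable, Iterable


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (assumed representation)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(seed: str, page: str, n: int = 3) -> float:
    """Assumed measure: fraction of the seed's n-grams that occur in the page."""
    seed_grams, page_grams = shingles(seed, n), shingles(page, n)
    return len(seed_grams & page_grams) / len(seed_grams) if seed_grams else 0.0


def generate_queries(seed: str, words_per_query: int = 8) -> list[str]:
    """Step 1 (assumed strategy): cut the seed-text into fixed-size word windows."""
    words = seed.split()
    return [
        " ".join(words[i:i + words_per_query])
        for i in range(0, len(words), words_per_query)
    ]


def extract_candidates(
    seed: str,
    search: Callable[[str], Iterable[tuple[str, str]]],
    threshold: float = 0.3,
) -> list[tuple[str, float]]:
    """Steps 2-4. `search` is a hypothetical callable that yields
    (url, page_text) pairs in rank order for a query."""
    best: dict[str, float] = {}
    for query in generate_queries(seed):
        for url, page_text in search(query):          # step 2: issue the query
            score = similarity(seed, page_text)
            if score < threshold:                     # step 3: stop descending the ranking
                break
            best[url] = max(score, best.get(url, 0.0))  # step 4: merge across queries
    # Step 4: re-rank the merged pages by similarity, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Passing `search` as a parameter keeps the sketch independent of any particular search engine API; a real system would wrap an HTTP client and page fetcher behind the same interface.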

    Original language: English
    Title of host publication: 16th International World Wide Web Conference, WWW2007
    Pages: 1151-1152
    Number of pages: 2
    DOI: https://doi.org/10.1145/1242572.1242740
    Publication status: Published - 2007
    Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB
    Duration: 2007 May 8 to 2007 May 12

    Fingerprint

    Seed
    Websites
    Search engines
    Application programming interfaces (API)
    Merging

    Keywords

    • Copy detection
    • Information retrieval

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software

    Cite this

    Tashiro, T., Ueda, T., Hori, T., Hirate, Y., & Yamana, H. (2007). EPCI: Extracting potentially copyright infringement texts from the web. In 16th International World Wide Web Conference, WWW2007 (pp. 1151-1152) https://doi.org/10.1145/1242572.1242740

    @inproceedings{6fd8606ad61c40eeb424a89a69179179,
    title = "EPCI: Extracting potentially copyright infringement texts from the web",
    keywords = "Copy detection, Information retrieval",
    author = "Takashi Tashiro and Takanori Ueda and Taisuke Hori and Yu Hirate and Hayato Yamana",
    year = "2007",
    doi = "10.1145/1242572.1242740",
    language = "English",
    isbn = "1595936548",
    pages = "1151--1152",
    booktitle = "16th International World Wide Web Conference, WWW2007",

    }

    TY  - GEN
    T1  - EPCI
    T2  - Extracting potentially copyright infringement texts from the web
    AU  - Tashiro, Takashi
    AU  - Ueda, Takanori
    AU  - Hori, Taisuke
    AU  - Hirate, Yu
    AU  - Yamana, Hayato
    PY  - 2007
    Y1  - 2007
    KW  - Copy detection
    KW  - Information retrieval
    UR  - http://www.scopus.com/inward/record.url?scp=35348850182&partnerID=8YFLogxK
    UR  - http://www.scopus.com/inward/citedby.url?scp=35348850182&partnerID=8YFLogxK
    U2  - 10.1145/1242572.1242740
    DO  - 10.1145/1242572.1242740
    M3  - Conference contribution
    AN  - SCOPUS:35348850182
    SN  - 1595936548
    SN  - 9781595936547
    SP  - 1151
    EP  - 1152
    BT  - 16th International World Wide Web Conference, WWW2007
    ER  -