Dynamic author name disambiguation for growing digital libraries

Yanan Qian, Qinghua Zheng, Tetsuya Sakai, Junting Ye, Jun Liu

    Research output: Contribution to journal › Article

    17 Citations (Scopus)

    Abstract

    When a digital library user searches for publications by an author name, she often sees a mixture of publications by different authors who share that name. With the growth of digital libraries and the involvement of more authors, this author ambiguity problem is becoming critical. Author disambiguation (AD) typically tries to solve this problem by leveraging metadata such as coauthors, research topics, publication venues and citation information, since more personal information such as contact details is often restricted or missing. In this paper, we study how to efficiently disambiguate author names given an incessant stream of published papers. To this end, we propose a “BatchAD+IncAD” framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record refers to a paper with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record should be assigned to an existing cluster or to a new cluster not present in the previous data. Based on the new data, IncAD also tries to correct previous AD results. Our main contributions are: (1) We demonstrate with real data that a small number of new papers often have overlapping author names with a large portion of existing papers, so it is challenging for IncAD to effectively leverage previous AD results. (2) We propose a novel IncAD model which aggregates metadata from a cluster of records to estimate the author’s profile, such as her coauthor distributions and keyword distributions, in order to predict how likely it is that a new record is “produced” by the author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while ensuring high accuracy.
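
The incremental step described in the abstract can be illustrated with a small Python sketch. It is not the authors' model: the `Record` and `Cluster` classes, the `mean_probability` score, the equal weighting of coauthors and keywords, and the `threshold` value are all assumptions made for illustration. The sketch only shows the general idea of aggregating a cluster's records into coauthor and keyword distributions and then deciding whether a new record joins an existing cluster or starts a new one. It does not attempt the correction of earlier AD results that the abstract also mentions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Record:
    """One (paper, author name) pair, following the abstract's notion of a record."""
    name: str
    coauthors: list
    keywords: list


@dataclass
class Cluster:
    """Records assumed to belong to one real-world author (hypothetical structure)."""
    records: list = field(default_factory=list)

    def profile(self):
        # Aggregate the cluster's metadata into coauthor and keyword counts.
        coauthors = Counter(c for r in self.records for c in r.coauthors)
        keywords = Counter(k for r in self.records for k in r.keywords)
        return coauthors, keywords


def mean_probability(items, counts):
    """Average empirical probability of `items` under `counts` (an assumed stand-in score)."""
    total = sum(counts.values())
    if not items or total == 0:
        return 0.0
    return sum(counts[i] / total for i in items) / len(items)


def score(record, cluster):
    """How plausibly the cluster's author 'produced' the record, in [0, 1]."""
    coauthors, keywords = cluster.profile()
    # Equal weights are an arbitrary choice for this sketch.
    return (0.5 * mean_probability(record.coauthors, coauthors)
            + 0.5 * mean_probability(record.keywords, keywords))


def assign(record, clusters, threshold=0.2):
    """Add the record to the best same-name cluster, or open a new cluster."""
    candidates = [c for c in clusters
                  if c.records and c.records[0].name == record.name]
    best = max(candidates, key=lambda c: score(record, c), default=None)
    if best is None or score(record, best) < threshold:
        best = Cluster()  # no existing author profile explains this record well
        clusters.append(best)
    best.records.append(record)
    return best


# Two existing clusters for the same ambiguous name, e.g. from a batch step.
clusters = [
    Cluster([Record("J. Liu", ["Q. Zheng", "Y. Qian"], ["disambiguation", "clustering"])]),
    Cluster([Record("J. Liu", ["A. Smith"], ["protein", "folding"])]),
]
known = assign(Record("J. Liu", ["Q. Zheng"], ["clustering"]), clusters)
unknown = assign(Record("J. Liu", ["B. Jones"], ["geology"]), clusters)
```

In this toy run the first new record shares a coauthor and a keyword with the first cluster and is added to it, while the second record matches no profile and opens a third cluster. A real system would need smoothing, more metadata types and a tuned decision rule.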

    Original language: English
    Pages (from-to): 379-412
    Number of pages: 34
    Journal: Information Retrieval
    Volume: 18
    Issue number: 5
    DOI: 10.1007/s10791-015-9261-3
    Publication status: Published - 2015 Jul 21

    Fingerprint

    Digital libraries
    Metadata
    grouping
    contact

    Keywords

    • Author disambiguation
    • Clustering
    • Data stream
    • Digital library
    • Multi-classification

    ASJC Scopus subject areas

    • Information Systems
    • Library and Information Sciences

    Cite this

    Dynamic author name disambiguation for growing digital libraries. / Qian, Yanan; Zheng, Qinghua; Sakai, Tetsuya; Ye, Junting; Liu, Jun.

    In: Information Retrieval, Vol. 18, No. 5, 21.07.2015, p. 379-412.

    @article{60c36960a79342a8a0f85efcafeba715,
    title = "Dynamic author name disambiguation for growing digital libraries",
    keywords = "Author disambiguation, Clustering, Data stream, Digital library, Multi-classification",
    author = "Yanan Qian and Qinghua Zheng and Tetsuya Sakai and Junting Ye and Jun Liu",
    year = "2015",
    month = "7",
    day = "21",
    doi = "10.1007/s10791-015-9261-3",
    language = "English",
    volume = "18",
    pages = "379--412",
    journal = "Information Retrieval",
    issn = "1386-4564",
    publisher = "Springer Netherlands",
    number = "5",

    }

    TY - JOUR

    T1 - Dynamic author name disambiguation for growing digital libraries

    AU - Qian, Yanan

    AU - Zheng, Qinghua

    AU - Sakai, Tetsuya

    AU - Ye, Junting

    AU - Liu, Jun

    PY - 2015/7/21

    Y1 - 2015/7/21

    KW - Author disambiguation

    KW - Clustering

    KW - Data stream

    KW - Digital library

    KW - Multi-classification

    UR - http://www.scopus.com/inward/record.url?scp=84942503944&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84942503944&partnerID=8YFLogxK

    U2 - 10.1007/s10791-015-9261-3

    DO - 10.1007/s10791-015-9261-3

    M3 - Article

    AN - SCOPUS:84942503944

    VL - 18

    SP - 379

    EP - 412

    JO - Information Retrieval

    JF - Information Retrieval

    SN - 1386-4564

    IS - 5

    ER -