Decomposition of term-document matrix representation for clustering analysis

Jianxiong Yang, Junzo Watada

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    4 Citations (Scopus)

    Abstract

    Latent Semantic Indexing (LSI) is an information retrieval technique that uses a low-rank singular value decomposition (SVD) of the term-document matrix. The aim of the method is to reduce the dimension of the matrix by finding patterns of co-occurring terms in the document collection. The method is used to compute term-document weights in the vector space model (VSM) for document clustering with a fuzzy clustering algorithm. LSI attempts to exploit the underlying semantic structure of word usage in documents. During the query-matching phase of LSI, a user's query is first projected into the term-document space and then compared with all terms and documents represented in the vector space. Using a similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query-matching method requires computing the similarity between the query and every term and document in the vector space. In this paper, the Maximal Tree Algorithm is used within a recent LSI implementation to reduce the computation time and computational complexity of query matching. The Maximal Tree data structure stores the term and document vectors in such a way that only those terms and documents most likely to qualify as nearest neighbors of the query are examined and retrieved. In short, this novel algorithm is suitable for improving the accuracy of data miners.
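
The baseline pipeline described in the abstract (a rank-k SVD of the term-document matrix, projection of a query into the reduced space, and exhaustive similarity comparison) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy matrix, the vocabulary, the choice k = 2, the use of cosine similarity, and the function names `lsi_fit`, `project_query`, and `cosine_rank` are all assumptions made for the example. It does not implement the Maximal Tree Algorithm, whose details are not given in the abstract.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# The vocabulary and counts are invented for illustration only.
terms = ["matrix", "svd", "cluster", "fuzzy", "query"]
A = np.array([
    [2, 1, 0, 0],   # matrix
    [1, 2, 0, 0],   # svd
    [0, 0, 2, 1],   # cluster
    [0, 0, 1, 2],   # fuzzy
    [1, 0, 0, 1],   # query
], dtype=float)


def lsi_fit(A, k):
    """Rank-k truncated SVD: A is approximated by U_k @ diag(s_k) @ Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def project_query(q, U_k, s_k):
    """Fold a term-space query into the k-dimensional latent space:
    q_hat = S_k^{-1} U_k^T q."""
    return (U_k.T @ q) / s_k


def cosine_rank(q_hat, doc_vecs):
    """Exhaustive matching: cosine similarity between the query and
    every document vector, returned with the descending rank order."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    sims = (doc_vecs @ q_hat) / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims), sims


U_k, s_k, Vt_k = lsi_fit(A, k=2)
doc_vecs = Vt_k.T                            # one k-dimensional row per document
q = np.array([1, 1, 0, 0, 0], dtype=float)   # query containing "matrix" and "svd"
order, sims = cosine_rank(project_query(q, U_k, s_k), doc_vecs)

for rank, j in enumerate(order, start=1):
    print(f"{rank}. document {j}  similarity {sims[j]:.3f}")
```

Every query in this sketch is compared against all document vectors, which is the cost the abstract says the tree-based structure is meant to reduce by examining only likely nearest neighbors.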

    Original language: English
    Title of host publication: IEEE International Conference on Fuzzy Systems
    Pages: 976-983
    Number of pages: 8
    DOIs: https://doi.org/10.1109/FUZZY.2011.6007525
    Publication status: Published - 2011
    Event: 2011 IEEE International Conference on Fuzzy Systems, FUZZ 2011 - Taipei
    Duration: 2011 Jun 27 to 2011 Jun 30

    Other

    Other: 2011 IEEE International Conference on Fuzzy Systems, FUZZ 2011
    City: Taipei
    Period: 11/6/27 to 11/6/30

    Fingerprint

    Clustering Analysis
    Matrix Representation
    Latent Semantic Indexing
    Semantics
    Decompose
    Term
    Vector spaces
    Query
    Similarity Measure
    Phase matching
    Vector space
    Fuzzy clustering
    Miners
    Trees (mathematics)
    Singular value decomposition
    Information retrieval
    Clustering algorithms
    Data structures
    Document Clustering
    Computational complexity

    Keywords

    • data mining
    • Fuzzy clustering
    • LSI
    • SVD

    ASJC Scopus subject areas

    • Software
    • Artificial Intelligence
    • Applied Mathematics
    • Theoretical Computer Science

    Cite this

    Yang, J., & Watada, J. (2011). Decomposition of term-document matrix representation for clustering analysis. In IEEE International Conference on Fuzzy Systems (pp. 976-983). [6007525] https://doi.org/10.1109/FUZZY.2011.6007525

    @inproceedings{c3070bf5137041a69947e496e1d8d752,
    title = "Decomposition of term-document matrix representation for clustering analysis",
    keywords = "data mining, Fuzzy clustering, LSI, SVD",
    author = "Jianxiong Yang and Junzo Watada",
    year = "2011",
    doi = "10.1109/FUZZY.2011.6007525",
    language = "English",
    isbn = "9781424473175",
    pages = "976--983",
    booktitle = "IEEE International Conference on Fuzzy Systems",

    }

    TY - GEN

    T1 - Decomposition of term-document matrix representation for clustering analysis

    AU - Yang, Jianxiong

    AU - Watada, Junzo

    PY - 2011

    Y1 - 2011

    KW - data mining

    KW - Fuzzy clustering

    KW - LSI

    KW - SVD

    UR - http://www.scopus.com/inward/record.url?scp=80053084828&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=80053084828&partnerID=8YFLogxK

    U2 - 10.1109/FUZZY.2011.6007525

    DO - 10.1109/FUZZY.2011.6007525

    M3 - Conference contribution

    AN - SCOPUS:80053084828

    SN - 9781424473175

    SP - 976

    EP - 983

    BT - IEEE International Conference on Fuzzy Systems

    ER -