Key word extraction from a document using word co-occurrence statistical information

Yutaka Matsuo, Mitsuru Ishizuka

Research output: Contribution to journal › Article

29 Citations (Scopus)

Abstract

We present a new keyword extraction algorithm that applies to a single document without using a large corpus. Frequent terms are extracted first, then a set of co-occurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The distribution of co-occurrence shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of the biases of the distribution is measured by χ²-measure. We show our algorithm performs well for indexing technical papers.
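
The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `chi2_keywords`, the parameters `num_frequent` and `top_k`, the regex tokenizer, and the choice of expected distribution (each frequent term's share of all co-occurrences) are assumptions made here. The score is the standard Pearson χ² statistic, which compares a term's observed co-occurrence counts with the counts expected if it co-occurred with the frequent terms in their overall proportions.

```python
import re
from collections import Counter


def chi2_keywords(text, num_frequent=3, top_k=5):
    """Rank terms by how biased their sentence-level co-occurrence is.

    Illustrative sketch only; names and defaults are assumptions.
    """
    # Naive sentence and word splitting. A real system would likely
    # also stem words and drop stop words before counting.
    sentences = [re.findall(r"[a-z]+", s.lower())
                 for s in re.split(r"[.!?]+", text)]
    sentences = [s for s in sentences if s]

    # Step 1: pick the most frequent terms.
    term_freq = Counter(w for s in sentences for w in s)
    frequent = [w for w, _ in term_freq.most_common(num_frequent)]

    # Step 2: cooc[w][g] = number of sentences containing both w and g.
    cooc = {w: Counter() for w in term_freq}
    for s in sentences:
        present = set(s)
        for w in present:
            for g in frequent:
                if g != w and g in present:
                    cooc[w][g] += 1

    # Overall co-occurrence total per frequent term (assumed baseline).
    col_total = Counter()
    for counts in cooc.values():
        col_total.update(counts)

    # Step 3: Pearson chi-squared between observed and expected counts.
    scores = {}
    for w, counts in cooc.items():
        others = [g for g in frequent if g != w]
        n_w = sum(counts.values())
        norm = sum(col_total[g] for g in others)
        if n_w == 0 or norm == 0:
            continue  # w never co-occurs with a frequent term
        chi2 = 0.0
        for g in others:
            expected = n_w * col_total[g] / norm
            if expected > 0:
                chi2 += (counts[g] - expected) ** 2 / expected
        scores[w] = chi2

    # Higher score = more biased distribution = more keyword-like.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Calling `chi2_keywords(document_text)` returns a list of `(term, score)` pairs in descending order of score. On very short inputs the expected counts are small and the statistic is unstable, so the ranking is only meaningful for documents with many sentences.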

Original language: English
Pages (from-to): 217-223
Number of pages: 7
Journal: Transactions of the Japanese Society for Artificial Intelligence
Volume: 17
Issue number: 3
DOI: 10.1527/tjsai.17.217
Publication status: Published - 2002
Externally published: Yes

Fingerprint

Probability distributions

Keywords

  • χ² test
  • Keyword extraction
  • Word co-occurrence

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Key word extraction from a document using word co-occurrence statistical information. / Matsuo, Yutaka; Ishizuka, Mitsuru.

In: Transactions of the Japanese Society for Artificial Intelligence, Vol. 17, No. 3, 2002, p. 217-223.

@article{4d7007dba3cf440c9b235d3bccbdb76b,
title = "Key word extraction from a document using word co-occurrence statistical information",
abstract = "We present a new keyword extraction algorithm that applies to a single document without using a large corpus. Frequent terms are extracted first, then a set of co-occurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The distribution of co-occurrence shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of the biases of the distribution is measured by $\chi^2$-measure. We show our algorithm performs well for indexing technical papers.",
keywords = "$\chi^2$ test, Keyword extraction, Word co-occurrence",
author = "Yutaka Matsuo and Mitsuru Ishizuka",
year = "2002",
doi = "10.1527/tjsai.17.217",
language = "English",
volume = "17",
pages = "217--223",
journal = "Transactions of the Japanese Society for Artificial Intelligence",
issn = "1346-0714",
publisher = "Japanese Society for Artificial Intelligence",
number = "3",

}

TY - JOUR

T1 - Key word extraction from a document using word co-occurrence statistical information

AU - Matsuo, Yutaka

AU - Ishizuka, Mitsuru

PY - 2002

Y1 - 2002

N2 - We present a new keyword extraction algorithm that applies to a single document without using a large corpus. Frequent terms are extracted first, then a set of co-occurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The distribution of co-occurrence shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of the biases of the distribution is measured by χ²-measure. We show our algorithm performs well for indexing technical papers.

AB - We present a new keyword extraction algorithm that applies to a single document without using a large corpus. Frequent terms are extracted first, then a set of co-occurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The distribution of co-occurrence shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of the biases of the distribution is measured by χ²-measure. We show our algorithm performs well for indexing technical papers.

KW - χ² test

KW - Keyword extraction

KW - Word co-occurrence

UR - http://www.scopus.com/inward/record.url?scp=0012854928&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0012854928&partnerID=8YFLogxK

U2 - 10.1527/tjsai.17.217

DO - 10.1527/tjsai.17.217

M3 - Article

AN - SCOPUS:0012854928

VL - 17

SP - 217

EP - 223

JO - Transactions of the Japanese Society for Artificial Intelligence

JF - Transactions of the Japanese Society for Artificial Intelligence

SN - 1346-0714

IS - 3

ER -