TY - GEN
T1 - Analysis and improvement of HITS algorithm for detecting Web communities
AU - Nomura, S.
AU - Oyama, S.
AU - Hayamizu, T.
AU - Ishida, T.
PY - 2002/1/1
Y1 - 2002/1/1
N2 - We discuss problems with the HITS (Hyperlink-Induced Topic Search) algorithm, which capitalizes on hyperlinks to extract topic-bound communities of Web pages. Despite its theoretically sound foundations, we observed that the HITS algorithm fails in real applications. To understand this problem, we developed LinkViewer, a visualization tool that graphically presents the extraction process. This tool helped reveal that a large, densely linked set of unrelated Web pages in the base set impeded the extraction; these pages were obtained when the root set was expanded into the base set. As a remedy to this topic-drift problem, prior studies applied textual analysis methods. We propose two methods that use only the structural information of the Web: 1) the projection method, which projects eigenvectors onto the root subspace so that most elements in the root set remain relevant to the original topic; and 2) the base-set downsizing method, which filters out pages without links to multiple pages in the root set. These methods are shown to be robust across a broader range of topics and low in computational cost.
KW - Algorithm design and analysis
KW - Computational efficiency
KW - Data mining
KW - Impedance
KW - Informatics
KW - Information filtering
KW - Information filters
KW - Visualization
KW - Web pages
KW - Web sites
UR - http://www.scopus.com/inward/record.url?scp=84886369304&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84886369304&partnerID=8YFLogxK
U2 - 10.1109/SAINT.2002.994467
DO - 10.1109/SAINT.2002.994467
M3 - Conference contribution
AN - SCOPUS:84886369304
T3 - Proceedings - 2002 Symposium on Applications and the Internet, SAINT 2002
SP - 132
EP - 140
BT - Proceedings - 2002 Symposium on Applications and the Internet, SAINT 2002
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - Symposium on Applications and the Internet, SAINT 2002
Y2 - 28 January 2002 through 1 February 2002
ER -