Abstract
A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting word neighboring characteristics into word-preceding and following directions, multiple (two-dimensional) word classes are assigned to each word. In each side, word classes are merged into larger clusters independently according to preceding or following word distributions. This word-clustering can provide more efficient and statistically reliable word clusters. Further, we extend it to Multi-Class Composite N-gram that unit is Multi-Class 2-gram and joined word. Multi-Class Composite N-gram showed better performance both in perplexity and recognition rates with one thousandth smaller size than conventional word 2-grams.
Original language | English |
---|---|
Pages (from-to) | 533-536 |
Number of pages | 4 |
Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
Volume | 1 |
Publication status | Published - 1999 Jan 1 |
Externally published | Yes |
Event | Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-99) - Phoenix, AZ, USA Duration: 1999 Mar 15 → 1999 Mar 19 |
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering