A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting a word's neighboring characteristics into preceding and following directions, multiple (two-dimensional) word classes are assigned to each word. On each side, word classes are merged into larger clusters independently, according to the preceding or following word distributions. This word clustering can provide more efficient and statistically reliable word clusters. Further, we extend it to the Multi-Class Composite N-gram, whose units are Multi-Class 2-grams and joined words. The Multi-Class Composite N-gram showed better performance in both perplexity and recognition rate, at one thousandth the size of conventional word 2-grams.
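To illustrate the two-dimensional class idea, the following is a minimal sketch, not the authors' implementation. It assumes the usual class 2-gram factorization adapted to two class sets, P(w_i | w_{i-1}) ≈ P(C_prev(w_i) | C_next(w_{i-1})) · P(w_i | C_prev(w_i)), which the abstract does not state explicitly. The names `prev_class`, `next_class`, and `train_multiclass_bigram`, the toy corpus, and the hand-picked class assignments are all hypothetical; no clustering or smoothing is performed.

```python
from collections import Counter

def train_multiclass_bigram(sentences, prev_class, next_class):
    """Estimate a class 2-gram in which each word has two classes.

    prev_class[w]: class of w by its preceding-word distribution
                   (used when w is the predicted word).
    next_class[w]: class of w by its following-word distribution
                   (used when w is the history word).
    Assumed factorization (not taken from the abstract):
        P(cur | prev) ~= P(prev_class[cur] | next_class[prev])
                         * P(cur | prev_class[cur])
    """
    class_pair = Counter()     # (history class, target class) counts
    history = Counter()        # history class counts
    word_as_target = Counter() # predicted-word counts
    target_total = Counter()   # target class counts
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            h, t = next_class[prev], prev_class[cur]
            class_pair[(h, t)] += 1
            history[h] += 1
            word_as_target[cur] += 1
            target_total[t] += 1

    def prob(cur, prev):
        h, t = next_class[prev], prev_class[cur]
        if history[h] == 0 or target_total[t] == 0:
            return 0.0  # unseen class; a real model would smooth here
        class_transition = class_pair[(h, t)] / history[h]
        word_emission = word_as_target[cur] / target_total[t]
        return class_transition * word_emission

    return prob

# Toy corpus (hypothetical).
sentences = [
    ["a", "cat", "sleeps"],
    ["the", "cat", "sleeps"],
    ["a", "dog", "runs"],
    ["the", "dog", "sleeps"],
]

# "cat" and "dog" are preceded by the same words, so they share one
# class on the preceding side ...
prev_class = {"a": "DET", "the": "DET", "cat": "N", "dog": "N",
              "sleeps": "V", "runs": "V"}
# ... but they are followed by different words, so they stay in
# separate classes on the following side.
next_class = {"a": "DET", "the": "DET", "cat": "CAT", "dog": "DOG",
              "sleeps": "V", "runs": "V"}

prob = train_multiclass_bigram(sentences, prev_class, next_class)
print(prob("sleeps", "cat"))  # 0.75
print(prob("runs", "cat"))    # 0.25
```

In this toy setting, merging "cat" and "dog" only on the preceding side pools the counts used to predict them, while keeping them apart on the following side preserves their different histories. This is the kind of independent, per-direction merging the abstract describes.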
|Journal||ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings|
|Publication status||Published - Jan 1 1999|
|Event||Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-99) - Phoenix, AZ, USA|
Duration: Mar 15 1999 → Mar 19 1999
ASJC Scopus subject areas
- Signal Processing
- Electrical and Electronic Engineering