We have developed a method that retrieves Japanese document images more accurately by combining OCR character-candidate information with a conventional plain-text search engine. The document image is first recognized by normal OCR to produce text. Keyword areas are then estimated from this OCR text through morphological analysis. A lattice of candidate-character codes is extracted from these areas, and character strings are extracted from the lattice using a word-matching method in noun areas and a K-th DP-matching method in undefined-word areas. Finally, the extracted character strings are added to the OCR text to improve retrieval accuracy with a conventional plain-text search engine. In searches of 49 OHP sheet images, our method achieved a recall rate of 98.2%, compared with 90.3% for a conventional method using only the normal OCR text, while requiring about the same processing time as normal OCR.
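The noun-area step can be illustrated with a minimal sketch. It is not the authors' implementation: the lattice layout (one ranked candidate list per character position), the tiny dictionary, and the names `top1_text`, `match_words`, and `augment` are all assumptions made for illustration, and the K-th DP-matching step for undefined-word areas is not covered.

```python
from typing import List, Set

# Assumed layout: one ranked list of candidate characters per position,
# with the OCR engine's first choice at index 0.
Lattice = List[List[str]]


def top1_text(lattice: Lattice) -> str:
    """Return the normal OCR output, i.e. the first candidate at each position."""
    return "".join(cands[0] for cands in lattice)


def match_words(lattice: Lattice, dictionary: Set[str]) -> List[str]:
    """Return dictionary words that can be spelled from the lattice.

    A word matches at a start position if each of its characters appears
    among the candidates at the corresponding lattice position.
    """
    found = []
    for word in sorted(dictionary):
        n = len(word)
        for start in range(len(lattice) - n + 1):
            if all(word[i] in lattice[start + i] for i in range(n)):
                found.append(word)
                break  # one hit per word is enough for indexing
    return found


def augment(lattice: Lattice, dictionary: Set[str]) -> str:
    """Append lattice-matched words to the top-1 text for a plain-text index."""
    base = top1_text(lattice)
    extras = [w for w in match_words(lattice, dictionary) if w not in base]
    return " ".join([base] + extras)


# Hypothetical example: the OCR first choice misreads 索 as 素,
# but the correct character survives as a lower-ranked candidate.
lattice = [["検", "険"], ["素", "索"]]
dictionary = {"検索", "険悪"}
indexed_text = augment(lattice, dictionary)
```

Here `indexed_text` holds the misrecognized top-1 string followed by the recovered word 検索, so a plain-text query for 検索 would still hit the document. A real system would restrict matching to the estimated keyword areas rather than scanning the whole lattice.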
|Journal||Proceedings of SPIE - The International Society for Optical Engineering|
|Publication status||Published - 2002|
|Event||Documentation Recognition and Retrieval IX - San Jose, CA, United States|
Duration: Jan 21, 2002 → Jan 22, 2002
ASJC Scopus subject areas
- Computer Science Applications