Highly accurate retrieval of Japanese document images through a combination of morphological analysis and OCR

Yutaka Katsuyama, Hiroaki Takebe, Koji Kurokawa, Takahiro Saitoh, Satoshi Naoi

Research output: Contribution to journal › Conference article

7 Citations (Scopus)

Abstract

We have developed a method that retrieves Japanese document images more accurately by combining OCR character candidate information with a conventional plain text search engine. In this method, the document image is first recognized by normal OCR to produce text. Keyword areas are then estimated from the normal OCR-produced text through morphological analysis. A lattice of candidate-character codes is extracted from these areas, and character strings are then extracted from the lattice using a word-matching method in noun areas and a K-th DP-matching method in undefined word areas. Finally, the extracted character strings are added to the normal OCR-produced text to improve retrieval accuracy with a conventional plain text search engine. In search experiments on 49 OHP sheet images, our method achieved a recall rate of 98.2%, compared with 90.3% for a conventional method that uses only normal OCR-produced text, while requiring about the same processing time as normal OCR.
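
The word-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lattice layout (one ranked candidate list per character position), the function names, and the toy dictionary are all assumptions made for illustration. The morphological analysis that selects keyword areas and the K-th DP-matching used in undefined word areas are omitted.

```python
from typing import List, Set

# Assumed representation: one ranked list of candidate characters per position.
Lattice = List[List[str]]


def top1_text(lattice: Lattice) -> str:
    """Text a plain OCR pass would emit: the first-ranked candidate at each position."""
    return "".join(cands[0] for cands in lattice)


def match_words(lattice: Lattice, dictionary: Set[str]) -> List[str]:
    """Return dictionary words that can be spelled along consecutive lattice
    positions, using any candidate (not only the top-ranked one) at each position."""
    found: List[str] = []
    for start in range(len(lattice)):
        for word in sorted(dictionary):
            if start + len(word) > len(lattice):
                continue  # word would run past the end of the lattice
            if all(ch in lattice[start + i] for i, ch in enumerate(word)):
                if word not in found:
                    found.append(word)
    return found


def augment(lattice: Lattice, dictionary: Set[str]) -> str:
    """Append lattice-matched words that the top-1 OCR text missed,
    so that a plain text search engine can still find them."""
    base = top1_text(lattice)
    extras = [w for w in match_words(lattice, dictionary) if w not in base]
    return base if not extras else base + " " + " ".join(extras)


# Hypothetical example: the top-ranked candidate at position 1 is a misread.
example_lattice = [["文", "丈"], ["害", "書"], ["検", "険"], ["索", "素"]]
example_dictionary = {"文書", "検索"}
augmented = augment(example_lattice, example_dictionary)
# augmented == "文害検索 文書"
```

In this toy case the top-1 text contains a misrecognized character, so a plain search for 文書 would miss the document; appending the lattice-matched word makes it searchable without changing the search engine.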

Original language: English
Pages (from-to): 57-67
Number of pages: 11
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 4670
Publication status: Published - 2002
Externally published: Yes
Event: Document Recognition and Retrieval IX - San Jose, CA, United States
Duration: 2002 Jan 21 to 2002 Jan 22

Keywords

  • Document image
  • Document retrieval
  • Document-management systems
  • Morphological analysis
  • OCR

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Condensed Matter Physics
  • Computer Science Applications
  • Applied Mathematics
  • Electrical and Electronic Engineering
