JP3589007B2 - Document filing system and document filing method

Info

Publication number: JP3589007B2
Application number: JP03554598A
Authority: JP
Inventors: 泰三亀代; 康裕岡田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1998-02-18
Filing date: 1998-02-18
Publication date: 2004-11-17
Anticipated expiration: 2018-02-18
Also published as: JPH11232296A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、文書や図面等の画像を電子的にファイリングする文書ファイリングシステムおよび文書ファイリング方法に関し、特にファイルした文書をキーワードを用いて検索する文書ファイリングシステムおよび文書ファイリング方法に関するものである。
【０００２】
【従来の技術】
従来より、文書画像を電子的に保存し、それを検索、表示するため、文書画像に対して人手でキーワード情報を付加して保存する、という方法が用いられている。また、人手によるキーワード入力の手間を省くために、文字認識機能を有するシステムを使用し、それによって文書画像中の文字を認識して、関連するキーワードあるいは全文を、文書画像とともに保存する方法が用いられている。
【０００３】
後者の場合には、そのシステムの文字認識性能が不完全であることに起因して誤認識が生じる。このため、検索用に入力したキーワードに対して、そのキーワードとは異なる文字列が検索結果として表示される、いわゆる「誤抽出」が発生する。また、文書画像中の文字が、入力したキーワードと同一であるにもかかわらず、文字の誤認識があるために検索結果として表示されない、「検索もれ」も発生する。
【０００４】
そこで、検索精度を向上させるためには、上記の誤抽出および検索もれを極力少なくする必要がある。この検索時の誤抽出、検索もれを減少させる方法として、
（１）文字認識性能を向上させる（正解文字保有率を向上させる）
（２）入力キーワードと検索対象文字列との部分的な不一致を許容し、文字認識性能の不完全性を補助する
という方法がある。
【０００５】
上記（１）の一例として、１つの文字画像に対する文字認識結果を複数保持することで、正解文字を保有する確率を高める方法がある。例えば、「文書認識と全文検索の融合技術に関する実験的検討」（情報処理学会研究会、情報学基礎３９−９，１９９５年９月、丸川他）では、認識候補文字に対して、個々の認識の類似度によって、保持する候補文字数を可変にし、それらを複数保持することで、１文字づつの保持に比べて、より高精度な検索を可能にしている。
【０００６】
また、例えば、特開平８−２７２８１３に示すファイリング装置では、各文字画像に対する認識結果を第４位候補まで固定して保持し、候補文字も含めた文書コードから検索キーワードとのマッチングを行っている。
【０００７】
上記（２）に係る方法の一例としては、上記特開平８−２７２８１３に記載されているように、キーワードと認識結果との一致度ｍを、
ｍ＝（一致した文字数／キーワードの文字数）×１００（％） …（１）
で算出し、文字認識をした結果としての候補文字中に全ての検索文字が含まれていなくても、それを検索結果として出力するものがある。
【０００８】
以下、特開平８−２７２８１３に係る装置の動作を簡単に説明する。
＜データの格納方法の説明＞
図１７は、特開平８−２７２８１３に係るファイリング装置のブロック構成図である。同図において、スキャナ１０１で読み取られた原稿画像は、スキャナインタフェース（Ｉ／Ｆ）回路１０２でディジタル信号に変換される。原稿が文字画像の場合、Ｉ／Ｆ回路１０２より信号を受けたＣＰＵ１０５は、文字認識処理を行い、文字画像の１文字に対し、４文字までの認識結果としての候補文字を、文書保存手段である外部記憶装置１１０に出力する。
【０００９】
なお、ＲＡＭ１０７は、文字画像の展開や文字認識処理のための作業領域である。また、上記の外部記憶装置１１０は、例えば、ハードディスク等、登録されたデータを格納する装置であり、ここではデータの蓄積のみならず、文字認識用の辞書も格納されている。
【００１０】
このファイリング装置では、入力文書画像をファイリングするとき、文字認識処理で得られた文字の第１候補の文字コードだけをテキストデータとして登録するのではなく、第４候補までの文字コードを登録する。つまり、個々の文字画像に対して、文字画像と認識結果の候補文字を、４文字保存する。
【００１１】
＜検索方法の説明＞
図１８は、特開平８−２７２８１３に係る装置におけるテキストデータの候補を示している。同図において、検索のキーワードを「内部処理統合型」とした場合の文字認識結果とキーワードの照合部分（一致部分）を矢印で示す。上述のＣＰＵ１０５は、文書検索手段としてキーワードを、第４位までの候補文字全てと照合する。
【００１２】
上記の式（１）でｍの値が、ある閾値（例えば６０（％））以上のとき、これを検索結果候補とすると、図１８では、７個のキーワード文字数に対して、６文字の一致があるので、
ｍ＝（６／７）×１００≒８６（％）
となり、これらが検索結果候補となる。
【００１３】
【発明が解決しようとする課題】
しかしながら、上記従来の、１つの文字画像に対する文字認識結果を複数保持する方法は、文字認識の精度に関して、保存する候補文字数を少なくすると、候補文字中に正解文字を含む可能性が低くなって、検索もれが起こりやすい。また、候補文字数を多く保持すると、正解文字を含む可能性が高くなるために検索もれは減少するが、正解文字以外の文字をも多く保持するために、誤抽出が多く発生する、という問題がある。また、候補文字を多く保持すると文書保存のためのメモリ容量が増大するという問題もある。
【００１４】
類似度によって保持する候補文字数を可変にする方法では、例えば、文書画像の濃度が適正でなく、つぶれていたり、掠れている場合には、標準パターンと文字画像とで、認識に用いる特徴量の差が大きくなるために、候補文字中に正解文字が含まれず、その認識率が低下して、このために検索もれが生じる、という問題がある。
【００１５】
また、この場合、認識結果の類似度が小さくなる（正解の可能性が低くなる）ため、一定の正解文字含有率を満たすには、より多くの候補文字を保存する必要が生じる。その結果、検索時に誤抽出が大きくなるという問題がある。
【００１６】
一方、特開平８−２７２８１３に開示されているような、ある程度の不一致を許容する検索方法では、キーワードと照合する文字の不一致となる部分がどのような文字であっても、一致する部分が共通である限り、同一の一致度として計算される、という問題がある。
【００１７】
これにより、例えば、検索キーワードが「日本人」の場合、文字列「日本入」「日本語」「日本国」「日本の」「日本は」等に対して、いずれも、ｍ＝（２／３）×１００＝６７％という同一の一致度となり、これを検索結果として出力、表示することになる。ここでは、「日本入」の「入」が誤認識され、実際に検索したい文字列は「日本人」であるのに、上記の「日本語」「日本国」「日本の」「日本は」等と一致度が等しいために、それを一致度の高い順に表示した場合、「日本入」を、これらの中に埋もれて表示してしまうことになる。
【００１８】
そこでユーザは、表示手段であるディスプレイ１０８に表示された、このような誤抽出の中から、さらに希望する結果を探す必要がある。そして、この不一致を許可する閾値が小さいほど、誤抽出も大量に出力されるため、ユーザが真に検索したい文書が誤抽出に埋もれ、結果として、その装置がユーザには使いづらいものとなる、という問題がある。また、閾値を大きくすると検索もれが増える、といった問題もある。
【００１９】
本発明は、上記の課題に鑑みてなされたもので、その目的とするところは、文書検索の検索もれが起こりにくく、候補文字数を多くしても誤抽出が発生しない文書ファイリングシステムおよび文書ファイリング方法を提供することである。また、本発明の他の目的は、キーワードと一部不一致である文字が認識結果に存在しても適正な一致度を算出でき、高精度な検索を実行できる文書ファイリングシステムおよび文書ファイリング方法を提供することである。
【００２０】
【課題を解決するための手段】
上記の目的を達成するため、第１の発明は、文書画像より、キーワードに従って所定の文書を検索する文書ファイリングシステムにおいて、上記文書画像中の文字を認識する手段と、上記キーワードと上記認識した文字とを照合する照合手段と、上記照合の結果をもとに、上記文書画像に関する情報を文書検索結果として出力する出力手段とを備え、上記照合手段は、辞書内の標準パターンを使って、上記キーワード中の文字の文字コードの濃度パターンと上記認識した文字の文字コードの濃度パターンとを比較して、上記キーワード中の文字と上記認識した文字との類似度を文字毎に算出し、この算出された文字毎の類似度の総和を用いて、上記キーワード中の文字で一致している文字の割合を示す一致度を算出し、計算された一致度に基づいて上記照合を行い、また、上記出力手段は、この一致度の示す値をもとに、上記キーワードに対応する文書を上記文書検索結果として出力する文書ファイリングシステムを提供する。
【００２１】
第２の発明は、第１の発明において、さらに、あらかじめ格納した単語辞書を参照して、上記認識した文字に所定の文法解析を施し、この文法解析の結果をもとに、上記文字と単語辞書との一致状態を示す評価値を各文字に付与する手段を備え、上記照合手段は、上記類似度と上記評価値とを重み付け加算して総合評価値を求め、この求めた各文字毎の総合評価値に基づいて上記照合を行う文書ファイリングシステムを提供する。
【００２２】
第３の発明は、第２の発明において、上記照合手段が、上記総合評価値が一定値以上であるか否かをもとに、上記認識した文字について、その文字コードを一意に決定できる文字と一意に決定できない文字との区別を行い、上記一意に決定できる文字を、その文字コードと所定のフラグとを対応付けて保存する文書ファイリングシステムを提供する。
【００２３】
また、第４の発明は、第３の発明において、上記照合手段が、上記文字コードが同一であるか否かをもとに、上記一意に決定できる文字と上記キーワード中の文字との類似度を算出するとともに、この類似度を用いて上記一致度を計算し、また、上記一意に決定できない文字については、上記キーワード中の文字と上記認識した文字との類似度を用いて上記一致度を計算する文書ファイリングシステムを提供する。
【００２４】
第５の発明は、第３の発明において、さらに、上記キーワード中の文字と上記認識した文字それぞれの特徴量を抽出する手段を備え、上記照合手段は、上記一意に決定できない文字について、上記特徴量の類似度を算出し、この類似度を用いて上記一致度を計算する文書ファイリングシステムを提供する。
【００２５】
そして、第６の発明は、第３の発明において、さらに、上記一意に決定できない文字の文字数と上記認識した文字の文字数の比率を算出する手段を備え、上記出力手段は、上記比率が一定値以上の場合、上記所定の文書を適正に検索できない旨の表示をする文書ファイリングシステムを提供する。
【００２６】
第７の発明は、光学的に読み取られた文書画像が保存される文書画像データベースと、前記文書画像中の文字を認識する文字認識部と、前記文字認識された結果の文字コードを保存する文字認識結果データベースとからなる文書ファイリングシステムで、文書画像より、キーワードに従って所定の文書を検索する文書ファイリング方法において、上記文書画像中の文字を認識する工程と、辞書内の標準パターンを使って、上記キーワード中の文字の文字コードの濃度パターンと上記認識した文字の文字コードの濃度パターンとを比較して、上記キーワードの文字と上記認識した文字との類似度を文字毎に算出する工程と、上記算出された文字毎の類似度の総和を用いて、上記キーワード中の文字で一致している文字の割合を示す一致度を計算する工程と、上記計算された一致度に基づいて、上記キーワード中の文字と上記認識した文字とを照合する照合工程と、上記照合の結果をもとに、上記キーワードに対応する文書を文書検索結果として出力する出力工程とを備える文書ファイリング方法を提供する。
【００２７】
第８の発明は、第７の発明において、さらに、あらかじめ格納した単語辞書を参照して、上記認識した文字に所定の文法解析を施し、この文法解析の結果をもとに、上記文字と単語辞書との一致状態を示す評価値を各文字に付与する工程を備え、上記照合工程は、上記類似度と上記評価値とを重み付け加算して総合評価値を求め、上記求められた各文字毎の総合評価値に基づいて上記照合を行う文書ファイリング方法を提供する。
【００２８】
第９の発明は、第８の発明において、上記照合工程が、上記総合評価値が一定値以上であるか否かをもとに、上記認識した文字について、その文字コードを一意に決定できる文字と一意に決定できない文字との区別を行い、上記一意に決定できる文字を、その文字コードと所定のフラグとを対応付けて保存する文書ファイリング方法を提供する。
【００２９】
また、第１０の発明は、第９の発明において、上記照合工程が、上記文字コードが同一であるか否かをもとに、上記一意に決定できる文字と上記キーワード中の文字との類似度を算出するとともに、この類似度を用いて上記一致度を計算し、また、上記一意に決定できない文字については、上記キーワード中の文字と上記認識した文字との類似度を用いて上記一致度を計算する文書ファイリング方法を提供する。
【００３０】
また、第１１の発明は、第９の発明において、さらに、上記キーワード中の文字と上記認識した文字それぞれの特徴量を抽出する工程を備え、上記照合工程は、上記一意に決定できない文字について、上記特徴量の類似度を算出し、この類似度を用いて上記一致度を計算する文書ファイリング方法を提供する。
【００３１】
そして、第１２の発明は、第９の発明において、さらに、上記一意に決定できない文字の文字数と上記認識した文字の文字数の比率を算出する工程を備え、上記出力工程は、上記比率が一定値以上の場合、上記所定の文書を適正に検索できない旨の表示をする文書ファイリング方法を提供する。
【００３２】
【実施の形態】
以下、添付図面を参照して、本発明の実施の形態を説明する。
実施の形態１．
図１は、本発明の実施の形態１に係るファイリングシステムの構成を示すブロック図である。同図に示すシステムでは、入力部１より文書画像が入力され、入力した文書画像と、文字認識部３が認識した結果を文書保存部２に保存する。この文字認識部３は、入力した文書画像より文字領域を求め、その文字領域内の文字を認識する。また、辞書８は、文字認識部３が文字を認識するために保持する、各文字毎の標準パターンの特徴を保持する。
【００３３】
文書画像データベース４は、入力画像を電子的に保持し、文字認識結果データベース５には、文字認識部３が認識した結果の文字コード列が保持される。文書検索部６は、文字認識結果データベース５から、ユーザが入力したキーワード２０と一致する文字列を含む文書を検索する。また、類似度出力部７は、２つの文字コード間の類似度を、辞書８を参照して算出し、表示部９は、文書検索部６による検索結果を表示する。
【００３４】
図２は、本実施の形態に係るシステムにおける文書保存処理手順を示すフローチャートである。
最初に、図２に示すフローチャートを用いて、本実施の形態における文字認識結果の保存方法について説明する。図２のステップＳ１００では、入力部１が画像を入力し、それを文書保存部２へ転送する。なお、この入力部１は、例えば、スキャナを用いて原稿画像を光電変換するものでもよいし、あらかじめ光電変換された画像をネットワーク経由等で入力してもよい。
【００３５】
文書保存部２は、ステップＳ１１０で、入力部１より入力した画像を文字認識部３へ渡し、この文字認識部３から、画像内の文字認識結果である文字コードを受け取る。なお、この文字認識部３は、公知の文字列切り出し、文字切り出し、文字認識を実行し、それによって得られた文字認識結果を文書保存部２に返す。また、文字列切り出しは、例えば、文書画像内の黒画素が連続する部分を連結し、黒画素の連結成分の幅、高さから、それが文字列であるかを決定する、という方法をとる。
【００３６】
文字切り出し方法は、例えば、文字列切り出しで決定した文字列画像を、縦方向と横方向から走査し、黒画素数の周辺分布を求めて、黒画素数の少ない部分を切り出し候補点として、１文字毎の画像に分割する。また、文字認識処理は、文字切り出しによって１文字単位に分割した画像に対し、例えば、８×８次元の濃度特徴を抽出し、標準パターンとの各次元毎の差分の和を求めて、この差分の和の最も小さな標準パターンから、数文字を認識結果として出力する。
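The recognition step described above (compare the extracted density features with every standard pattern and return the few characters with the smallest summed difference) can be sketched as follows. This is only an illustration under stated assumptions: the function name `recognize`, the dictionary layout (character mapped to a flat list of density values), the use of absolute differences, and the tiny 4-dimensional demo patterns are all hypothetical; the text itself specifies 8×8 (64-dimensional) density features.

```python
def recognize(features, standard_patterns, n_candidates=3):
    """Return the n_candidates characters whose standard pattern is closest.

    The distance is the sum over all dimensions of the absolute difference
    between the input features and a standard pattern (smaller is better).
    """
    def distance(pattern):
        return sum(abs(a - b) for a, b in zip(features, pattern))

    ranked = sorted(standard_patterns,
                    key=lambda ch: distance(standard_patterns[ch]))
    return ranked[:n_candidates]


# Hypothetical 4-dimensional patterns (the text uses 64 dimensions).
patterns = {"統": [3, 1, 2, 2], "縦": [3, 1, 2, 3], "内": [0, 4, 4, 0]}
candidates = recognize([3, 1, 2, 3], patterns, n_candidates=2)
```

In Embodiment 1 only the first candidate would be stored; Embodiment 2 passes several candidates on to post-processing.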
【００３７】
次のステップＳ１２０では、文書保存部２が、文字認識結果となる文字コードを文字認識結果データベース５に保存するが、ここでは、各１文字画像に対して１文字の認識結果のみを保存する。なお、図３に、本実施の形態における文字認識結果の保存例を示す。また、文書保存部２は、入力された２値文書画像を文書画像データベース４に保存する。以上で文書保存処理を終了する。
【００３８】
次に、本実施の形態におけるキーワード検索方法について説明する。
図４は、本実施の形態に係るシステムにおける検索処理手順を示すフローチャートである。すなわち、図４は、キーワードと認識結果の文字コードとの照合手順を示すフローチャートである。
【００３９】
図４のステップＳ２００では、ユーザが入力したキーワード２０を文書検索部６へ入力する。続くステップＳ２１０で、認識結果を含む配列（以下、ｔｅｘｔと記す）と、キーワードを含む配列（以下、ｋｅｙと記す）のポインタを初期化する。文書検索部６は、文字認識結果データベース５より、１文書の認識結果を取り出し、それをｔｅｘｔにセットする。
【００４０】
次のステップＳ２２０では、ｔｅｘｔとｋｅｙ内の文字の照合を行う。この照合は、文字ｔｅｘｔ［ｋ］と文字ｋｅｙ［ｊ］の文字コードを類似度出力部７に入力し、類似度出力部７が辞書８を参照して、ｔｅｘｔ［ｋ］とｋｅｙ［ｊ］の文字の類似度ｓを出力することで行う。ここでは、
ｓ＝１−｛Σ｜Ｎｔｅｘｔ［ｋ］ｎ−Ｎｋｅｙ［ｊ］ｎ｜／（ΣＮｔｅｘｔ［ｋ］ｎ＋ΣＮｋｅｙ［ｊ］ｎ）｝ …（２）
によって求めたｓを類似度として返す。なお、上記の式（２）において、Ｎｔｅｘｔ［ｋ］ｎは、辞書８内の文字ｔｅｘｔ［ｋ］における第ｎ次元の濃度値であり、Ｎｋｅｙ［ｊ］ｎは、辞書８内の文字ｋｅｙ［ｊ］における第ｎ次元の濃度値である。また、ｎは１〜６４の値をとり、Σはｎについての総和を、｜｜は絶対値を表す。
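A minimal sketch of the per-character similarity s follows. The normalisation used here (the sum of absolute differences divided by the sum of both patterns' density totals) is an assumption inferred from the worked example given later in the description, s = 1 − {91／(94＋95)} = 0.52; the function name `similarity` and the handling of all-zero patterns are likewise hypothetical.

```python
def similarity(pattern_a, pattern_b):
    """s = 1 - sum|a_n - b_n| / (sum a_n + sum b_n).

    pattern_a and pattern_b are the density values of two standard
    patterns from the dictionary (64 values each in the text).
    """
    diff = sum(abs(a - b) for a, b in zip(pattern_a, pattern_b))
    total = sum(pattern_a) + sum(pattern_b)
    # Two all-zero patterns are treated as identical (an assumption).
    return 1.0 if total == 0 else 1.0 - diff / total


# Totals quoted in the worked example of the text: 91 / (94 + 95).
s_example = 1 - 91 / (94 + 95)
```

With this form, identical patterns give s = 1 and patterns with no overlapping density give s = 0, so s can be averaged directly into the degree of coincidence of equation (3).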
【００４１】
そして、文書検索部６は、類似度出力部７が出力する類似度を用いて、
一致度＝照合した文字のｓの和／照合した文字数 …（３）
を計算する。
【００４２】
ステップＳ２３０では、文書検索部６が、上記の一致度が一定値以上であるか否かを判定する。それが一定値以上である場合は、ステップＳ２４０に進んで、ポインタｋ，ｊそれぞれを１インクリメントし、続くステップＳ２５０で、全てのキーワードとの照合が行われたかどうかを判定する。
【００４３】
このステップＳ２５０で、全てのキーワードとの照合が行われたと判定された場合は、ステップＳ２７０で出力結果を保存し、本処理を終了する。なお、この出力結果とは、一致した文書名、一致度、ページ番号、一致した文字の画像上での座標等を意味しており、本実施の形態では、それらを表示部９に出力し、表示する。
【００４４】
一方、ステップＳ２３０で、一致度が一定値未満と判定された場合は、処理をステップＳ２８０へ進め、文書の終わりであるかどうかをチェックする。文書の終わりである場合には、本処理を終了するが、文書の終わりでない場合には、ステップＳ２９０へ進み、照合開始位置ｉを１インクリメントしてｋに代入し、キーワードのポインタｊを０にして、再びステップＳ２２０へ戻る。
【００４５】
また、ステップＳ２５０で、全てのキーワードとの照合が行われていないと判定された場合には、続くステップＳ２６０で、文書の終わりであるかどうかをチェックする。そして、文書の終わりである場合には、本処理を終了するが、文書の終わりでない場合には、再びステップＳ２２０へ戻る。
以下同様にして、文字認識結果データベース５内の全ての文書に対して、上記の処理を繰り返す。
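The search procedure of steps S200 to S290 can be sketched for a single document as below. This is one reading of the flowchart, not a definitive implementation: the function name `find_keyword`, the similarity table `TABLE`, the default similarity of 0.2, and the sample text are hypothetical, and the sketch assumes the running sum of similarities restarts whenever the keyword pointer j is reset to 0.

```python
def find_keyword(text, key, similarity, threshold=0.6):
    """Slide the keyword over the text; return (start, coincidence) or None.

    coincidence = (sum of per-character similarities) / (characters compared),
    re-evaluated after every character as in steps S220 and S230.
    """
    if not key:
        return None
    for start in range(len(text)):            # S290: advance the start position
        total = 0.0
        for j, key_char in enumerate(key):    # j is the keyword pointer
            k = start + j                     # k is the text pointer
            if k >= len(text):                # S260/S280: end of the document
                return None
            total += similarity(text[k], key_char)    # S220
            if total / (j + 1) < threshold:           # S230: below the threshold
                break
        else:                                 # S250: every keyword character compared
            return start, total / len(key)    # S270: report the hit
    return None


# Hypothetical similarity table standing in for the similarity output unit.
TABLE = {("都", "部"): 0.6, ("縦", "統"): 0.7}


def toy_similarity(a, b):
    return 1.0 if a == b else TABLE.get((a, b), 0.2)


hit = find_keyword("で行う場合内都処理縦合型", "内部処理統合型", toy_similarity)
```

Because the coincidence is checked after every character, a window is abandoned as soon as its running average drops below the threshold, which matches the early exit through S280 and S290.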
【００４６】
ここで、本実施の形態における、文字認識結果とキーワード「内部処理統合型」との照合方法について、例示しながら説明する。
図４のステップＳ２２０まで検索処理が進み、ポインタｋ＝０，ｊ＝０のとき、すなわち、文字認識結果とキーワードのそれぞれ最初の文字を照合する。つまり、ｔｅｘｔ［０］＝「で」と、ｋｅｙ［０］＝「内」の照合を行い、それらの類似度を求める。なお、図５は、辞書８内の標準パターン「で」の特徴を、図６は、同じく「内」の特徴を示しており、これらのパターン内にある数値は、それぞれの特徴量を示す。
【００４７】
類似度出力部７は、図５と図６のパターンから、上記の式（２）に従って類似度ｓを計算し、ｓ＝１−｛９１／（９４＋９５）｝＝０．５２を得て、それを文書検索部６へ出力する。そして、ステップＳ２３０で文書検索部６は、この類似度ｓをもとに一致度を計算し、一致度＝０．５２／１＝０．５２を得る。ここで、一致度の閾値を０．６とすると、ステップＳ２３０での判定は「Ｎｏ」となるので、処理はステップＳ２８０へ進む。
【００４８】
この段階では、文書の終わりではないので、ステップＳ２８０での判定は「Ｎｏ」となり、続くステップＳ２９０でｉのインクリメント、つまり、ｋを１インクリメントして、ステップＳ２２０へ戻る。その結果、ステップＳ２２０では、ｔｅｘｔ［１］＝「行」とｋｅｙ［０］＝「内」の照合を行う。
【００４９】
類似度出力部７が、上記と同様の計算を行い、例えば、類似度０．４を文書検索部６へ出力すると、この場合もステップＳ２３０での判定は「Ｎｏ」となるので、処理はステップＳ２８０へ進む。以下、同様にこれらの処理を実行し、照合の結果、ｔｅｘｔ内の「内都処理縦合型」とキーワードが一定値以上の一致度を示した場合、これを出力候補とする。
【００５０】
以上説明したように、本実施の形態によれば、認識結果を各１文字画像に対して１文字しか保存しないので、文書保存のために必要となるメモリ容量が少なくて済む。
また、照合対象文字を認識結果候補である数文字に限定せず、全ての文字を照合対象とし、かつ、照合時に各文字毎の一致度を全て用いるため、検索もれが少なくなるとともに、キーワードと一部不一致である文字が認識結果に存在しても、適正な一致度を算出することが可能となるため、ユーザが検索結果を確認する手間も省ける、という顕著な効果がある。
【００５１】
なお、文字認識方法については、上記の方法に限定されるものではなく、例えば、「パターン認識」（舟久保登著、共立出版）に記述された構造解析的手法を用いてもよい。また、上記の実施の形態では、類似度出力部７は、文字認識部３が用いる辞書から計算しているが、これに限定されるものではなく、あらかじめ各文字間の類似度の表を保持するようにしてもよい。図７は、各文字間の類似度を表にした例である。
さらに、一致度の閾値および一致度の計算方法も、上記の数値および式に限定されないことは言うまでもない。
【００５２】
実施の形態２．
図８は、本発明の実施の形態２に係るファイリングシステムの構成を示すブロック図である。なお、同図において、図１に示す上記実施の形態１に係るシステムと同一構成要素には同一符号を付し、ここでは、それらの説明を省略する。
図８の後処理部１０は、文書保存部２に保存された、文字認識部３による文字認識結果を文法的に検証し、その結果を出力する。単語辞書１１は、後処理部１０が用いる辞書である。
【００５３】
最初に、本実施の形態に係るシステムにおける文書の登録方法について述べる。
図９は、本実施の形態に係るファイリングシステムで実行される、文字認識結果の保存処理を示すフローチャートである。同図のステップＳ３００で、文字認識部３が文字認識処理を行う。ここでの文字認識の方法は、上記実施の形態１における方法と同じであるので、その説明は省略する。なお、この認識結果は、候補文字を含め、１文字画像に対し複数出力する。
【００５４】
次にステップＳ３１０へ進み、後処理を行う。ここでは、後処理部１０が、候補文字も含めた認識結果に対して、それを文法的に解析し、文章として正しいと思われる組み合わせを決定する。以下、公知の形態素照合を用いた後処理方法について説明する。
【００５５】
図１０は、本実施の形態に係る後処理の手順を示すフローチャートであり、ここでは、図１８に示す認識結果候補文字に対して、この後処理を行う。また、図１１は、単語辞書１１内の単語と形態素番号の対応を示す図である。なお、形態素番号とは、形態素（普通名詞、固有名詞、サ変名詞等の単語）に対して、一意に番号を割り当てたものである。そして、図１２は、これら形態素番号の接続関係を記述したものを示している。
【００５６】
図１０のステップＳ４１０で後処理部１０は、文字の認識結果と単語辞書１１との照合を行う。ここで、図１１に示す単語と図１８に示す認識結果とを照合すると、図１１内の「処理」「統合」「型」と認識結果との照合に成功する。そして、ステップＳ４２０へ進み、図１２に示す記述に従って、形態素接続検定を行う。
【００５７】
図１１に示すように、「処理」の形態素番号は「５５」、「統合」の形態素番号は「５５」，「５９」の２種類、そして、「型」については「４６」である。そこで、図１２に示す記述内容を用いて形態素接続検定をすると、５５と５５、５５と４６の接続がいずれも成立する（図中の○印）。次にステップＳ４３０へ進み、一致した形態素文字数を評価値として、それらを各候補文字に付与する。
【００５８】
図１３は、後処理部１０が出力する、照合結果を示す。同図に示すように、「処」「理」「統」「合」に評価値「２」、「型」に評価値「１」、それ以外の文字には、それらが単語辞書と一致しないので、全て「０」が付与されている。
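The assignment of evaluation values described above can be sketched as follows. The candidate lattice, the word list, and the function name `assign_evaluation` are hypothetical stand-ins for FIG. 18, FIG. 11 and the post-processing unit, and the morpheme connection test of step S420 is deliberately omitted, so this shows only the dictionary matching (S410) and the scoring (S430).

```python
def assign_evaluation(candidates, words):
    """Give each candidate character the length of the longest dictionary
    word it takes part in; characters in no word keep the value 0."""
    scores = [{ch: 0 for ch in column} for column in candidates]
    for word in words:
        for start in range(len(candidates) - len(word) + 1):
            columns = candidates[start:start + len(word)]
            # The word matches if each of its characters is a candidate
            # at the corresponding position.
            if all(ch in col for ch, col in zip(word, columns)):
                for offset, ch in enumerate(word):
                    cell = scores[start + offset]
                    cell[ch] = max(cell[ch], len(word))
    return scores


# Hypothetical lattice: one list of candidate characters per character image.
lattice = [["内"], ["都"], ["処"], ["理"], ["縦", "統"], ["合"], ["型"]]
scores = assign_evaluation(lattice, ["処理", "統合", "型"])
```

With this input, 「処」「理」「統」「合」 receive 2, 「型」 receives 1, and the remaining characters receive 0, which agrees with the values the text reports for FIG. 13.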
【００５９】
上記の後処理が終わると、図９のステップＳ３２０が実行される。すなわち、文書保存部２は、文字認識結果と後処理結果から、各文字毎の総合評価値を算出する。この総合評価値は、以下の式（４）により算出される。
総合評価値＝α×（文字認識の類似度）＋（１−α）×（後処理の評価値） …（４）
なお、ここでαは、０≦α≦１を満たす値をとり、本実施の形態では、α＝０．８とする。
【００６０】
今、図１８に示す候補文字の内、「縦」の類似度を０．７、「統」の類似度を０．５とし、それらと、図１３に示す照合結果とを、上記の式（４）に当てはめると、「縦」についての総合評価値は、０．８×０．７＋０．２×０＝０．５６、「統」についての総合評価値は、０．８×０．５＋０．２×２＝０．８となり、「統」の方が「縦」よりも評価値が高い。
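Equation (4) and the two values worked out above can be checked with a short sketch. The function name `overall_evaluation` is hypothetical; the weight α = 0.8 and the inputs are taken from the text.

```python
def overall_evaluation(similarity, evaluation, alpha=0.8):
    """Equation (4): alpha * similarity + (1 - alpha) * post-processing value."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    return alpha * similarity + (1.0 - alpha) * evaluation


score_tate = overall_evaluation(0.7, 0)   # candidate 「縦」
score_tou = overall_evaluation(0.5, 2)    # candidate 「統」
```

Up to floating-point rounding the two results are 0.56 and 0.8, so 「統」 outranks 「縦」 even though its recognition similarity alone is lower.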
【００６１】
続くステップＳ３３０で、文書保存部２は、一意に決定できる文字と一意に決定できない文字を区別する。この区別の方法は、例えば、総合評価値が一定値以上の場合、一意に決定可能とし、それ以下の場合は、その決定が不可能とする。そして、ステップＳ３４０では、文書保存部２が、一意に決定可能となった文字については、総合評価値が最も高い文字コードのみを、フラグ「０」とともに保存する。しかし、一意に決定できないと判断した文字は、総合評価値が最も高い文字コードを、フラグ「１」とともに保存する。つまり、文字認識および後処理を行って、評価値が最も大きいもの１文字を、一意に決定できるか否かにかかわらず、認識結果として保存する。
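Steps S330 and S340 can be sketched as below: the best-scoring candidate is always stored, and the flag records whether its comprehensive evaluation value reached the threshold. The function name `choose_for_storage`, the threshold of 0.7, and the scores for 「都」 and 「部」 are assumptions made for illustration; the text does not state a concrete threshold.

```python
def choose_for_storage(scored_candidates, threshold):
    """Return (character, flag): flag 0 = uniquely determined, 1 = not."""
    best_char, best_score = max(scored_candidates.items(),
                                key=lambda item: item[1])
    flag = 0 if best_score >= threshold else 1
    return best_char, flag


# Scores for 縦/統 follow the worked example; the rest are hypothetical.
stored_a = choose_for_storage({"縦": 0.56, "統": 0.8}, threshold=0.7)
stored_b = choose_for_storage({"都": 0.5, "部": 0.45}, threshold=0.7)
```

Storing one character plus a one-bit flag keeps the storage cost close to that of Embodiment 1 while still marking which characters need graded matching at search time.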
【００６２】
図１４は、本実施の形態における文字認識結果の保存例を示す。同図に示すように、文字認識結果は、図３の場合に比べて、本実施の形態では「統」の文字が正しく保存されている。
【００６３】
以下、本実施の形態に係るシステムにおける検索方法について説明する。なお、ここでの検索方法は、基本的には、図４に示す実施の形態１に係る検索方法と同一であるが、ステップＳ２２０での処理が異なる。つまり、上記実施の形態１では、全ての文字に対してキーワードとの一致度を、類似度出力部７からの類似度を用いて計算しているが、本実施の形態２では、ｔｅｘｔのフラグが１の文字については、類似度出力部７からの値を類似度として用いて、その一致度を計算するが、フラグが０の文字については、ｔｅｘｔとｋｅｙの文字コードが完全に一致した場合に照合が成功したとして、類似度１を与える。また、フラグが０の文字で、ｔｅｘｔとｋｅｙの文字コードが一致しない場合は、類似度０を返す。
【００６４】
そこで、本実施の形態における検索方法を、図１４に示すｔｅｘｔとキーワード「内部処理統合型」とを照合する場合を例に、図４に示すフローチャートを参照して説明する。
文書検索部６がキーワード「内部処理統合型」を受け取り（図４のステップＳ２００）、ｔｅｘｔとｋｅｙのポインタを初期化して（ステップＳ２１０）、続くステップＳ２２０で、ｔｅｘｔとｋｅｙの照合を行う。ここでも、上記実施の形態１と同様、最初は、ｔｅｘｔ［０］＝「で」と、ｋｅｙ［０］＝「内」との照合を行う。
【００６５】
図１４に示すように、ｔｅｘｔ［０］のフラグは「０」なので、文書検索部６は、ｔｅｘｔ［０］とｋｅｙ［０］が同一の文字コードであるかどうかの比較をする。この場合、同一ではないので、照合に失敗したとして、文書検索部６は類似度０を返す。そして、ステップＳ２３０へ進み、一致度を計算する。ここでの一致度は、０／１＝０であるから、処理をステップＳ２８０以降へ進め、再びステップＳ２２０で、今度はｔｅｘｔ［１］＝「行」とｋｅｙ［０］＝「内」との照合を行う。
【００６６】
以下、同様に処理を進め、ｔｅｘｔ［５］とｋｅｙ［０］との照合を行うと、所定値以上の一致度が得られるので、ステップＳ２３０での判定結果が「Ｙｅｓ」となり、処理はステップＳ２４０へ進む。このステップＳ２４０では、ｔｅｘｔとｋｅｙのポインタを１インクリメントし、その後、ステップＳ２５０，Ｓ２６０を経て、処理はステップＳ２２０へ戻る。
【００６７】
ｔｅｘｔ［６］＝「都」とｋｅｙ［１］＝「部」との照合の場合、ｔｅｘｔ［６］のフラグは「１」なので、文書検索部６は、類似度出力部７の結果を類似度とする。例えば、類似度出力部７が返す類似度（都、部）を０．６とすると、ステップＳ２３０で求められる一致度は、（１＋０．６）／２＝０．８となる。その結果、この一致度が一定値０．６を超えているため、処理はステップＳ２４０へ進む。
以下、同様に処理を行い、ｔｅｘｔである「内都処理縦合型」との照合に成功した場合、それを検索結果として出力する。
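The flag-dependent matching of Embodiment 2 can be sketched as follows. The function names are hypothetical, 「内」 is assumed here to carry flag 0, and the similarity output unit is replaced by a stub that returns the value 0.6 used in the example for (都, 部).

```python
def char_similarity(text_char, flag, key_char, similarity_output):
    """Flag 0: exact code comparison (1 or 0). Flag 1: graded similarity."""
    if flag == 0:
        return 1.0 if text_char == key_char else 0.0
    return similarity_output(text_char, key_char)


def coincidence(similarities):
    """Equation (3): sum of similarities / number of characters compared."""
    return sum(similarities) / len(similarities)


def stub_output(a, b):
    # Stand-in for the similarity output unit (value taken from the example).
    return 0.6


sims = [
    char_similarity("内", 0, "内", stub_output),
    char_similarity("都", 1, "部", stub_output),
]
value = coincidence(sims)
```

The result is (1 + 0.6) / 2 = 0.8, above the threshold of 0.6, whereas a flag-0 mismatch such as 「で」 against 「内」 contributes 0 and ends the window immediately.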
【００６８】
以上説明したように、本実施の形態によれば、一意に決定できる文字に対しては、文字コードが完全に同一である場合のみ、照合成功とし、例えば、総合評価値が２位以下の候補文字を除外すると、それらの照合が不要となるので、誤抽出が検出される確率をさらに小さくできる。
【００６９】
また、常に一文字毎の類似度を用いてキーワードとの一致度を評価するので、不一致部分にどのような文字が来ても、それらが同一の一致度で表示されることがないため、得られた一致度に基づいて、誤抽出と正しい検索結果を、より明確に分離することが可能となる。
【００７０】
なお、上記の実施の形態２では、その文字コードを一意に決定できない文字に対しては、文字コードとフラグ１を保存しているが、これに限定されず、文字コードを一意に決定できないのは、例えば、入力画像の濃度値が適切ではなく、それが掠れていたり、つぶれている場合等が考えられる。この場合、一意に決定できない文字に対しては、文字コードを保存するよりも、文字認識に用いる特徴を保存した方が、より正確な照合が可能となり、正確な検索が可能となる。
【００７１】
図１５は、上記の特徴保存に係る辞書の例を示す。同図に示すように、この辞書では、フラグが１の文字に対して、その特徴値を保存しており、特徴値（図中、２１で示す）は、例えば、図５，図６に示すような、８×８次元（６４次元）特徴値を一列に表示したものである。
【００７２】
また、検索時の処理として、図４のステップＳ２２０で類似度出力部７が、ｔｅｘｔ［ｋ］のフラグが０の場合、ｔｅｘｔ［ｋ］とｋｅｙ［ｊ］との照合を行い、ｔｅｘｔ［ｋ］のフラグが１の場合には、ｔｅｘｔ［ｋ］の文字の特徴として、図１５に示す特徴値を使用し、ｋｅｙ［ｊ］の文字の特徴には、辞書８内の特徴を用いて、上記の式（２）に従って類似度を算出するようにしてもよい。
【００７３】
一方、一意に決定できない文字数が、文書画像内の文字数に比べ大きくなると、その文書は、様々なキーワードに対して許容範囲が広くなるため、誤抽出として検索される確率が高くなる。そこで、このような場合、ユーザに対して、例えば、「この文書は正しく検索されない可能性があります」という旨を、入力部１より入力した画像イメージとともに表示部９に表示するようにしてもよい。
【００７４】
そこで、この動作を、図９，図１６を参照して説明する。
この場合、図９のステップＳ３４０で、文書保存部２は、図１６に示すフローチャートに従って、文書保存手順を実行する。すなわち、図１６のステップＳ５１０で文書保存部２は、文字認識部３が出力する文字数Ａを計算する。そして、ステップＳ５２０で、文字認識部３の結果と後処理部１０での結果から算出した、一意に決定できない文字数（フラグが１の文字数）Ｆを計算する。
【００７５】
次に、ステップＳ５３０で、Ｆ／Ａが一定値以上かどうかを判定する。それが一定値以上の場合、ステップＳ５４０へ進み、表示部９に「この文書は正しく検索されない可能性があります」という旨の警告文を表示して、文書を保存せずに本処理を終了する。また、Ｆ／Ａが一定値より小さければ、上記の警告文を表示せず、文書を保存して本処理を終了する。
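The check of steps S510 to S540 reduces to a ratio test, sketched below. The function name `should_warn` and the limit of 0.5 are hypothetical; the text only says that F/A is compared with a fixed value.

```python
def should_warn(flags, max_ratio):
    """True when the share of flag-1 characters (F / A) reaches max_ratio."""
    total = len(flags)                                   # A: recognized characters
    undetermined = sum(1 for f in flags if f == 1)       # F: flag-1 characters
    return total > 0 and undetermined / total >= max_ratio


warn_clean = should_warn([0, 1, 0, 0, 1, 0, 0], max_ratio=0.5)   # F/A = 2/7
warn_noisy = should_warn([1, 1, 0, 1], max_ratio=0.5)            # F/A = 3/4
```

A caller would display the warning message and skip saving the document when the function returns True, and save the document otherwise.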
【００７６】
このようにすることで、ユーザは、メッセージと画像を見ることによって、文書が正しく検索されない原因を推測することが可能となるため、これが、正しいデータを保存するために有効な手段となる。
【００７７】
【発明の効果】
以上説明したように、第１の発明によれば、文書画像より、キーワードに従って所定の文書を検索する文書ファイリングシステムにおいて、上記文書画像中の文字を認識する手段と、上記キーワードと上記認識した文字とを照合する照合手段と、上記照合の結果をもとに、上記文書画像に関する情報を文書検索結果として出力する出力手段とを備え、上記照合手段は、辞書内の標準パターンを使って、上記キーワード中の文字の文字コードの濃度パターンと上記認識した文字の文字コードの濃度パターンとを比較して、上記キーワード中の文字と上記認識した文字との類似度を文字毎に算出し、この算出された文字毎の類似度の総和を用いて、上記キーワード中の文字で一致している文字の割合を示す一致度を計算し、上記計算された一致度に基づいて上記照合を行い、また、上記出力手段は、この一致度の示す値をもとに、上記キーワードに対応する文書を上記文書検索結果として出力することで、文書保存のために必要となるメモリ容量が少なくて済み、検索もれが少なくなるとともに、キーワードと一部不一致である文字が認識結果に存在しても、適正な一致度を算出することが可能となり、結果的に、ユーザが検索結果を確認する手間も省ける。
【００７８】
第２の発明によれば、さらに、あらかじめ格納した単語辞書を参照して、上記認識した文字に所定の文法解析を施し、この文法解析の結果をもとに、上記文字と単語辞書との一致状態を示す評価値を各文字に付与する手段を備え、また、上記照合手段は、上記類似度と上記評価値とを重み付け加算して総合評価値を求め、上記求められた各文字毎の総合評価値に基づいて上記照合を行うことで、より適正な一致度を算出できる。
【００７９】
第３の発明によれば、上記照合手段が、上記総合評価値が一定値以上であるか否かをもとに、上記認識した文字について、その文字コードを一意に決定できる文字と一意に決定できない文字との区別を行い、上記一意に決定できる文字を、その文字コードと所定のフラグとを対応付けて保存することで、誤抽出が検出される確率を小さくでき、また、不一致部分に対して同一の一致度で表示されることがないため、得られた一致度に基づいて、誤抽出と正しい検索結果を、より明確に分離できる。
【００８０】
第４の発明によれば、上記照合手段が、上記文字コードが同一であるか否かをもとに、上記一意に決定できる文字と上記キーワード中の文字との類似度を算出するとともに、この類似度を用いて上記一致度を計算し、また、上記一意に決定できない文字については、上記キーワード中の文字と上記認識した文字との類似度を用いて上記一致度を計算することで、文書検索のための適正な文字の一致度を算出することができる。
【００８１】
第５の発明によれば、さらに、上記キーワード中の文字と上記認識した文字それぞれの特徴量を抽出する手段を備え、上記照合手段は、上記一意に決定できない文字について、上記特徴量の類似度を算出し、この類似度を用いて上記一致度を計算することで、文字コードを保存する場合に比べて、正確な照合および検索が可能となる。
【００８２】
そして、第６の発明によれば、さらに、上記一意に決定できない文字の文字数と上記認識した文字の文字数の比率を算出する手段を備え、上記出力手段が、上記比率が一定値以上の場合、上記所定の文書を適正に検索できない旨の表示をすることで、その表示を見て、文書が正しく検索されない原因を推測することが可能となる。
【００８３】
第７の発明は、光学的に読み取られた文書画像が保存される文書画像データベースと、前記文書画像中の文字を認識する文字認識部と、前記文字認識された結果の文字コードを保存する文字認識結果データベースとからなる文書ファイリングシステムで、文書画像より、キーワードに従って所定の文書を検索する文書ファイリング方法において、上記文書画像中の文字を認識する工程と、辞書内の標準パターンを使って、上記キーワード中の文字の文字コードの濃度パターンと上記認識した文字の文字コードの濃度パターンとを比較して、上記キーワードの文字と上記認識した文字との類似度を文字毎に算出する工程と、上記算出された文字毎の類似度の総和を用いて、上記キーワード中の文字で一致している文字の割合を示す一致度を計算する工程と、上記一致度に基づいて、上記キーワード中の文字と上記認識した文字とを照合する照合工程と、上記照合の結果をもとに、上記キーワードに対応する文書を文書検索結果として出力する出力工程とを備えることで、文書保存のために必要となるメモリ容量が少なくて済むとともに、検索もれが少なくなり、キーワードと一部不一致である文字が認識結果に存在しても、適正な一致度を算出することができ、ユーザには、検索結果を確認する手間も省けることになる、という効果がある。
【００８４】
第８の発明によれば、さらに、あらかじめ格納した単語辞書を参照して、上記認識した文字に所定の文法解析を施し、この文法解析の結果をもとに、上記文字と単語辞書との一致状態を示す評価値を各文字に付与する工程を備え、上記照合工程は、上記類似度と上記評価値とを重み付け加算して総合評価値を求め、上記求められた各文字毎の総合評価値に基づいて上記照合を行うことで、より適正な一致度を算出できる。
【００８５】
第９の発明によれば、上記照合工程が、上記総合評価値が一定値以上であるか否かをもとに、上記認識した文字について、その文字コードを一意に決定できる文字と一意に決定できない文字との区別を行い、上記一意に決定できる文字を、その文字コードと所定のフラグとを対応付けて保存することで、誤抽出が検出される確率を小さくでき、また、不一致部分に対して同一の一致度で表示されることがないので、得られた一致度に基づいて、誤抽出と正しい検索結果を明確に分離できる。
【００８６】
第１０の発明によれば、上記照合工程が、上記文字コードが同一であるか否かをもとに、上記一意に決定できる文字と上記キーワード中の文字との類似度を算出するとともに、この類似度を用いて上記一致度を計算し、また、上記一意に決定できない文字については、上記キーワード中の文字と上記認識した文字との類似度を用いて上記一致度を計算することで、文書検索のための適正な文字の一致度を算出することができる。
【００８７】
また、第１１の発明によれば、さらに、上記キーワード中の文字と上記認識した文字それぞれの特徴量を抽出する工程を備え、上記照合工程が、上記一意に決定できない文字について、上記特徴量の類似度を算出し、この類似度を用いて上記一致度を計算することで、文字コードを保存する場合に比べて、正確な照合および検索が可能となる。
【００８８】
そして、第１２の発明によれば、さらに、上記一意に決定できない文字の文字数と上記認識した文字の文字数の比率を算出する工程を備え、上記出力工程が、上記比率が一定値以上の場合、上記所定の文書を適正に検索できない旨の表示をすることで、ユーザがその表示を見て、文書が正しく検索されない原因を推測することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態１に係るファイリングシステムの構成を示すブロック図である。
【図２】本発明の実施の形態１に係るシステムにおける文書保存処理手順を示すフローチャートである。
【図３】本発明の実施の形態１における文字認識結果の保存例を示す図である。
【図４】本発明の実施の形態１，２に係るシステムにおける検索処理手順を示すフローチャートである。
【図５】実施の形態１に係る辞書内の標準パターン（文字「で」）の特徴を示す図である。
【図６】実施の形態１に係る辞書内の標準パターン（文字「内」）の特徴を示す図である。
【図７】各文字間の類似度を表形式で表した図である。
【図８】本発明の実施の形態２に係るファイリングシステムの構成を示すブロック図である。
【図９】実施の形態２に係るファイリングシステムで実行される、文字認識結果の保存処理を示すフローチャートである。
【図１０】実施の形態２に係る後処理の手順を示すフローチャートである。
【図１１】実施の形態２に係る単語辞書内の単語と形態素番号の対応を示す図である。
【図１２】実施の形態２に係る単語辞書内の形態素間の接続ルールの記述を示す図である。
【図１３】後処理部の出力する照合結果を示す図である。
【図１４】実施の形態２における文字認識結果の保存例を示す図である。
【図１５】文字認識に用いる特徴保存に係る辞書の例を示す図である。
【図１６】実施の形態２に係る、文書保存の他の手順を示すフローチャートである。
【図１７】従来のファイリング装置のブロック構成を示す図である。
【図１８】従来の装置におけるテキストデータの候補を示す図である。
【符号の説明】
１…入力部、２…文書保存部、３…文字認識部、４…文書画像データベース、５…文字認識結果データベース、６…文書検索部、７…類似度出力部、８…辞書、９…表示部、１０…後処理部、１１…単語辞書、２０…キーワード
[0001]
[Technical field of the invention]
The present invention relates to a document filing system and a document filing method for electronically filing images such as documents and drawings, and more particularly to a document filing system and a document filing method for searching a filed document using a keyword.
[0002]
[Prior art]
Conventionally, in order to store document images electronically and to search and display them, a method has been used in which keyword information is manually added to a document image before it is stored. In addition, to save the effort of manual keyword entry, a method has been used in which a system with a character recognition function recognizes the characters in a document image and stores the relevant keywords or the full text together with the document image.
[0003]
In the latter case, misrecognition occurs because the system's character recognition performance is imperfect. As a result, so-called "erroneous extraction" occurs, in which a character string different from the keyword entered for the search is displayed as a search result. "Search omission" also occurs, in which characters in the document image are identical to the entered keyword but are not displayed as a search result because they were misrecognized.
[0004]
Therefore, to improve search accuracy, the erroneous extraction and search omission described above must be reduced as far as possible. The following methods reduce them during a search:
(1) Improve character recognition performance (raise the rate at which the correct character is retained).
(2) Tolerate partial mismatches between the input keyword and the character string being searched, to compensate for imperfect character recognition performance.
[0005]
As an example of (1) above, there is a method that raises the probability of retaining the correct character by retaining multiple character recognition results for each character image. For example, in "Experimental Study on Fusion Technology of Document Recognition and Full Text Search" (Information Processing Society of Japan SIG report, Information Science Fundamentals 39-9, September 1995, Marukawa et al.), the number of candidate characters retained is varied according to the similarity of each recognition, and multiple candidates are retained, enabling a more accurate search than retaining a single character per image.
[0006]
Further, for example, in the filing apparatus disclosed in Japanese Patent Application Laid-Open No. H8-272813, the recognition results for each character image are retained up to a fixed fourth candidate, and the search keyword is matched against the document code including the candidate characters.
[0007]
As an example of the method according to (2) above, as described in JP-A-8-272813, the degree of coincidence m between a keyword and a recognition result is calculated as
m = (number of matched characters / number of characters in the keyword) × 100 (%) ... (1)
and a string is output as a search result even if not all of the search characters are included among the candidate characters produced by character recognition.
[0008]
Hereinafter, the operation of the apparatus according to JP-A-8-272813 will be briefly described.
<Explanation of data storage method>
FIG. 17 is a block diagram of the filing apparatus according to Japanese Patent Application Laid-Open No. H8-272813. In the figure, a document image read by a scanner 101 is converted into a digital signal by a scanner interface (I/F) circuit 102. When the original is a character image, the CPU 105, on receiving the signal from the I/F circuit 102, performs character recognition processing and outputs up to four candidate characters as recognition results for each character of the character image to an external storage device 110, which serves as the document storage means.
[0009]
The RAM 107 is a work area for expanding character images and performing character recognition processing. The external storage device 110 is a device that stores registered data, such as a hard disk; in addition to the accumulated data, it also stores the dictionary for character recognition.
[0010]
When filing an input document image, this filing apparatus registers as text data not only the character code of the first candidate obtained by character recognition processing but the character codes up to the fourth candidate. That is, for each character image, the character image and four candidate characters of the recognition result are stored.
[0011]
<Description of search method>
FIG. 18 shows text data candidates in the apparatus according to JP-A-8-272813. In the figure, arrows indicate the portions where the character recognition results match the keyword when the search keyword is 「内部処理統合型」 ("internal processing integrated type"). The CPU 105, acting as the document search means, collates the keyword against all candidate characters up to the fourth candidate.
[0012]
If a string is treated as a search result candidate when the value of m in equation (1) is at or above a certain threshold (for example, 60%), then in FIG. 18 six characters match out of the seven characters of the keyword, so
m = (6 / 7) × 100 ≈ 86 (%)
and these become search result candidates.
[0013]
[Problems to be solved by the invention]
However, the conventional method of retaining multiple character recognition results for one character image has the following problems. If fewer candidate characters are stored, the correct character is less likely to be among them, and search omissions become more likely. If more candidate characters are stored, the correct character is more likely to be included and search omissions decrease, but many incorrect characters are stored as well, so erroneous extraction occurs frequently. Storing many candidate characters also increases the memory capacity needed to store documents.
[0014]
In the method that varies the number of retained candidate characters according to similarity, when, for example, the density of the document image is inappropriate and characters are crushed or faint, the difference between the standard pattern and the character image in the feature amounts used for recognition becomes large. The correct character is then not included among the candidate characters, the recognition rate falls, and search omissions result.
[0015]
Also, in this case the similarity of the recognition results decreases (the likelihood of a correct answer falls), so more candidate characters must be stored to maintain a given correct-character content rate. As a result, erroneous extraction during search increases.
[0016]
On the other hand, a search method that tolerates a certain degree of mismatch, such as the one disclosed in Japanese Patent Application Laid-Open No. H8-272813, has the problem that, whatever characters appear in the mismatched portion, the same degree of coincidence is calculated as long as the matching portion is the same.
[0017]
Thus, for example, when the search keyword is 「日本人」 ("Japanese person"), the character strings 「日本入」, 「日本語」, 「日本国」, 「日本の」, 「日本は」 and so on all receive the same degree of coincidence, m = (2 / 3) × 100 = 67%, and are output and displayed as search results. Here 「日本入」 is a misrecognition whose 「入」 should have been 「人」, so it is the string the user actually wants to find. Because its degree of coincidence equals that of 「日本語」, 「日本国」, 「日本の」, 「日本は」 and the like, however, 「日本入」 is buried among them when results are displayed in descending order of coincidence.
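The weakness of equation (1) can be reproduced with a few lines of Python. This is a simplified sketch: the function name `match_degree` is hypothetical, and it compares the keyword with a single recognized string position by position, whereas the prior-art apparatus collates against up to four candidates per character.

```python
def match_degree(keyword, candidate):
    """Equation (1): matched characters / keyword length * 100 (%)."""
    matched = sum(1 for k, c in zip(keyword, candidate) if k == c)
    return matched / len(keyword) * 100


keyword = "日本人"
strings = ["日本入", "日本語", "日本国", "日本の", "日本は"]
degrees = {s: round(match_degree(keyword, s)) for s in strings}
```

Every string receives the same rounded value of 67, so the misrecognized 「日本入」 cannot be ranked above the genuinely different strings; this is the behaviour that a per-character graded similarity is meant to avoid.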
[0018]
Therefore, the user must look through the erroneous extractions shown on the display 108, which serves as the display means, to find the desired result. The smaller the threshold for permitting mismatches, the more erroneous extractions are output, so the document the user really wants is buried among them and the apparatus becomes hard to use. Conversely, raising the threshold increases search omissions.
[0019]
The present invention has been made in view of the above problems. Its object is to provide a document filing system and a document filing method in which search omissions are unlikely to occur in document searches and erroneous extraction does not occur even when the number of candidate characters is increased. Another object of the present invention is to provide a document filing system and a document filing method that can calculate an appropriate degree of coincidence even when the recognition result contains characters that partially mismatch the keyword, and can thereby execute highly accurate searches.
[0020]
[Means for Solving the Problems]
To achieve the above object, a first invention provides a document filing system that searches document images for a predetermined document according to a keyword, comprising: means for recognizing characters in the document image; collating means for collating the keyword with the recognized characters; and output means for outputting information about the document image as a document search result based on the result of the collation. The collating means uses standard patterns in a dictionary to compare the density pattern of the character code of each character in the keyword with the density pattern of the character code of each recognized character, calculates for each character the similarity between the character in the keyword and the recognized character, uses the sum of these per-character similarities to calculate a degree of coincidence indicating the proportion of characters in the keyword that match, and performs the collation based on the calculated degree of coincidence. The output means outputs the document corresponding to the keyword as the document search result based on the value of this degree of coincidence.
[0021]
A second invention provides the document filing system of the first invention, further comprising means for performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and, based on the result of the grammatical analysis, assigning to each character an evaluation value indicating how well the character matches the word dictionary, wherein the collating means obtains a comprehensive evaluation value as a weighted sum of the similarity and the evaluation value and performs the collation based on the comprehensive evaluation value obtained for each character.
[0022]
A third invention provides the document filing system of the second invention, wherein the collating means distinguishes, based on whether the comprehensive evaluation value is at or above a fixed value, between recognized characters whose character code can be uniquely determined and those whose character code cannot, and stores each uniquely determinable character with its character code associated with a predetermined flag.
[0023]
A fourth invention provides the document filing system of the third invention, wherein the collating means calculates the similarity between a uniquely determinable character and a character in the keyword based on whether their character codes are identical and calculates the degree of coincidence using that similarity, and, for a character that cannot be uniquely determined, calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.
[0024]
A fifth invention provides the document filing system of the third invention, further comprising means for extracting a feature amount of each of the characters in the keyword and the recognized characters, wherein, for a character that cannot be uniquely determined, the collating means calculates the similarity of the feature amounts and calculates the degree of coincidence using that similarity.
[0025]
A sixth invention provides the document filing system of the third invention, further comprising means for calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, wherein the output means displays a notice that the predetermined document cannot be properly searched when the ratio is at or above a fixed value.
[0026]
A seventh invention provides a document filing method for searching document images for a predetermined document according to a keyword in a document filing system that comprises a document image database storing optically read document images, a character recognition unit recognizing characters in the document images, and a character recognition result database storing the character codes resulting from the character recognition. The method comprises: a step of recognizing characters in the document image; a step of using standard patterns in a dictionary to compare the density pattern of the character code of each character in the keyword with the density pattern of the character code of each recognized character and calculating, for each character, the similarity between the character of the keyword and the recognized character; a step of using the sum of the calculated per-character similarities to calculate a degree of coincidence indicating the proportion of characters in the keyword that match; a collating step of collating the characters in the keyword with the recognized characters based on the calculated degree of coincidence; and an output step of outputting the document corresponding to the keyword as a document search result based on the result of the collation.
[0027]
In an eighth aspect based on the seventh aspect, the method further comprises a step of performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and assigning to each character, based on the result of the grammatical analysis, an evaluation value indicating the state of matching between the character and the word dictionary. The collation step obtains a comprehensive evaluation value by adding the similarity and the evaluation value, and performs the collation based on the comprehensive evaluation value obtained for each character.
[0028]
In a ninth aspect based on the eighth aspect, the collation step distinguishes, based on whether the comprehensive evaluation value is equal to or greater than a predetermined value, between characters whose character code can be uniquely determined and characters whose character code cannot be uniquely determined, and stores the uniquely determinable characters with their character codes associated with predetermined flags.
[0029]
In a tenth aspect based on the ninth aspect, the collation step calculates, for the uniquely determinable characters, the similarity to the characters in the keyword based on whether or not the character codes are the same, and calculates the degree of coincidence using that similarity. For the characters that cannot be uniquely determined, the collation step calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.
[0030]
In an eleventh aspect based on the ninth aspect, the method further comprises a step of extracting a feature amount of each of the characters in the keyword and the recognized characters, and the collation step calculates, for the characters that cannot be uniquely determined, the similarity between the feature amounts and calculates the degree of coincidence using that similarity.
[0031]
In a twelfth aspect based on the ninth aspect, the method further comprises a step of calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, and, when the ratio is equal to or greater than a predetermined value, the output step displays that the predetermined document cannot be properly searched.
[0032]
Embodiment
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a block diagram showing a configuration of a filing system according to Embodiment 1 of the present invention. In the system shown in FIG. 1, a document image is input from an input unit 1, and the document storage unit 2 stores the input document image together with the result recognized by a character recognition unit 3. The character recognition unit 3 finds the character areas in the input document image and recognizes the characters in them. The dictionary 8 holds, for each character, the features of the standard pattern that the character recognition unit 3 uses for character recognition.
[0033]
The document image database 4 electronically stores the input image, and the character recognition result database 5 stores a character code string as a result of recognition by the character recognition unit 3. The document search unit 6 searches the character recognition result database 5 for a document including a character string that matches the keyword 20 input by the user. Further, the similarity output unit 7 calculates the similarity between the two character codes with reference to the dictionary 8, and the display unit 9 displays the search result by the document search unit 6.
[0034]
FIG. 2 is a flowchart showing a document storage processing procedure in the system according to the present embodiment.
First, a method for storing a character recognition result according to the present embodiment will be described with reference to the flowchart shown in FIG. In step S100 of FIG. 2, the input unit 1 inputs an image and transfers it to the document storage unit 2. The input unit 1 may be, for example, a unit that photoelectrically converts a document image using a scanner, or may input a photoelectrically converted image via a network or the like.
[0035]
In step S110, the document storage unit 2 transfers the image input from the input unit 1 to the character recognition unit 3, and receives from the character recognition unit 3 the character codes obtained by recognizing the characters in the image. The character recognition unit 3 performs known character string segmentation, character segmentation, and character recognition, and returns the resulting character recognition result to the document storage unit 2. Character string segmentation uses, for example, a method that connects the portions of the document image where black pixels are contiguous and determines whether a connected component of black pixels is a character string based on its width and height.
[0036]
Character segmentation, for example, scans the character string image obtained by character string segmentation in the vertical and horizontal directions, finds the marginal distribution of the number of black pixels, takes the portions with few black pixels as candidate cut points, and divides the image into images of individual characters. In the character recognition process, for example, an 8 × 8-dimensional density feature is extracted from each single-character image produced by character segmentation, and the sum of the per-dimension differences from each standard pattern is obtained. Several characters are then output as the recognition result, starting from the standard pattern with the smallest sum.
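The segmentation and recognition steps described above can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the function names `cut_points` and `recognize`, the `max_black` cut-off, and the default of four candidates are assumptions introduced here, and the feature vectors are taken to be already extracted.

```python
from typing import Dict, List, Tuple

Feature = List[int]  # density values of one character (8 x 8 = 64 in the text)


def cut_points(column_counts: List[int], max_black: int = 0) -> List[int]:
    """Return the columns whose black-pixel count is small (candidate cut points).

    max_black is a hypothetical cut-off; the text only says "a small number".
    """
    return [x for x, count in enumerate(column_counts) if count <= max_black]


def recognize(feature: Feature, standard: Dict[str, Feature],
              top_n: int = 4) -> List[Tuple[str, int]]:
    """Rank standard patterns by the sum of per-dimension absolute differences.

    Returns up to top_n (character, distance) pairs, smallest distance first.
    """
    scored = [
        (char, sum(abs(a - b) for a, b in zip(feature, pattern)))
        for char, pattern in standard.items()
    ]
    scored.sort(key=lambda item: item[1])  # stable: ties keep dictionary order
    return scored[:top_n]
```

In Embodiment 1 only the first entry of the ranked list would be stored, while Embodiment 2 passes several candidates on to post-processing.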
[0037]
In the next step S120, the document storage unit 2 stores the character codes of the character recognition result in the character recognition result database 5. Here, only one character recognition result is stored for each character image. FIG. 3 shows an example of storing a character recognition result in the present embodiment. The document storage unit 2 also stores the input binary document image in the document image database 4. The document storage processing is thus completed.
[0038]
Next, a keyword search method according to the present embodiment will be described.
FIG. 4 is a flowchart showing a search processing procedure in the system according to the present embodiment. That is, FIG. 4 is a flowchart showing a procedure for collating the keyword with the character code of the recognition result.
[0039]
In step S200 of FIG. 4, the keyword 20 input by the user is input to the document search unit 6. In the following step S210, the pointers of an array including the recognition result (hereinafter, referred to as text) and an array including the keyword (hereinafter, referred to as key) are initialized. The document retrieval unit 6 extracts the recognition result of one document from the character recognition result database 5 and sets it in text.
[0040]
In the next step S220, the characters in text and key are collated. In this collation, the character codes of the characters text[k] and key[j] are input to the similarity output unit 7, which refers to the dictionary 8 and outputs the similarity s between the characters text[k] and key[j]. Here,

$$ s = 1 - \frac{\sum_{n=1}^{64} \left| N_{\mathrm{text}[k]n} - N_{\mathrm{key}[j]n} \right|}{\sum_{n=1}^{64} N_{\mathrm{text}[k]n} + \sum_{n=1}^{64} N_{\mathrm{key}[j]n}} \qquad \cdots (2) $$

is returned as the similarity. In equation (2), N_text[k]n is the n-th dimension density value of the character text[k] in the dictionary 8, and N_key[j]n is the n-th dimension density value of the character key[j]. Further, n takes a value from 1 to 64, and | | represents an absolute value.
[0041]
Then, using the similarity output from the similarity output unit 7, the document search unit 6 calculates
Matching degree = sum of the similarities of the collated characters / number of collated characters ... (3)
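As a rough sketch of the two calculations, the following assumes that equation (2) has the form suggested by the worked example later in the text, s = 1 − {91 / (94 + 95)}, that is, one minus the summed absolute difference divided by the sum of both patterns' density values. The function names are introduced here for illustration only.

```python
from typing import Sequence


def similarity(n_text: Sequence[float], n_key: Sequence[float]) -> float:
    """Assumed form of equation (2): 1 - sum|a - b| / (sum(a) + sum(b))."""
    diff = sum(abs(a - b) for a, b in zip(n_text, n_key))
    total = sum(n_text) + sum(n_key)
    # Two all-zero patterns are treated as identical (an assumption of this sketch).
    return 1.0 if total == 0 else 1.0 - diff / total


def matching_degree(similarities: Sequence[float]) -> float:
    """Equation (3): sum of per-character similarities / number of collated characters."""
    return sum(similarities) / len(similarities)
```

With this form, identical patterns give a similarity of 1, and the matching degree is simply the running mean of the similarities collated so far.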
[0042]
In step S230, the document search unit 6 determines whether or not the degree of matching is equal to or greater than a certain value. If it is equal to or greater than the predetermined value, the process proceeds to step S240, where each of the pointers k and j is incremented by one, and in a succeeding step S250, it is determined whether or not all keywords have been collated.
[0043]
If it is determined in step S250 that all characters of the keyword have been collated, the output result is saved in step S270, and the process ends. The output result here means the name of the matched document, the matching degree, the page number, the coordinates of the matched characters on the image, and the like. In the present embodiment, these are output to the display unit 9 and displayed.
[0044]
On the other hand, if it is determined in step S230 that the degree of coincidence is less than the predetermined value, the process proceeds to step S280 to check whether the end of the document has been reached. If it is the end of the document, the process ends. If it is not the end of the document, the process proceeds to step S290, where i is incremented by 1 and assigned to k, the keyword pointer j is set to 0, and the process returns to step S220.
[0045]
If it is determined in step S250 that all keywords have not been collated, it is checked in step S260 whether the end of the document has been reached. If it is the end of the document, the process ends, but if it is not the end of the document, the process returns to step S220.
In the same manner, the above processing is repeated for all the documents in the character recognition result database 5.
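The collation loop of FIG. 4 can be sketched as below. This is a simplified reading rather than the flowchart itself: the function name `search`, the callable `sim`, and the default threshold of 0.6 (the value used in the example that follows) are assumptions, and the sketch collects every matching position in one text instead of ending at the first saved result.

```python
from typing import Callable, List, Tuple


def search(text: str, key: str, sim: Callable[[str, str], float],
           threshold: float = 0.6) -> List[Tuple[int, float]]:
    """Return (start index, matching degree) for each position where key matches."""
    hits: List[Tuple[int, float]] = []
    if not key:
        return hits
    # Start positions that leave room for the whole keyword (end-of-document check).
    for i in range(len(text) - len(key) + 1):
        total = 0.0
        for j, key_char in enumerate(key):
            total += sim(text[i + j], key_char)  # S220: per-character similarity
            degree = total / (j + 1)             # equation (3), running mean
            if degree < threshold:               # S230 "No": restart at i + 1 (S290)
                break
        else:
            hits.append((i, degree))             # S250 "Yes": save the result (S270)
    return hits
```

Because the matching degree is checked after every character, a position is abandoned as soon as the running mean drops below the threshold, which mirrors the early exit through steps S280 and S290.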
[0046]
Here, a method of comparing the character recognition result with the keyword “internal processing integrated type” in the present embodiment will be described by way of example.
When the search process reaches step S220 in FIG. 4 with the pointers k = 0 and j = 0, the first character of the character recognition result is compared with the first character of the keyword. In other words, text[0] = “de” and key[0] = “inside” are collated, and their similarity is obtained. FIG. 5 shows the features of the standard pattern “de” in the dictionary 8, and FIG. 6 similarly shows the features of “inside”; the numerical values in these patterns indicate the respective feature amounts.
[0047]
The similarity output unit 7 calculates the similarity s from the patterns of FIGS. 5 and 6 according to the above equation (2), obtains s = 1 − {91 / (94 + 95)} = 0.52, and outputs it to the document search unit 6. Then, in step S230, the document search unit 6 calculates the degree of coincidence based on the similarity s, and obtains the degree of coincidence = 0.52 / 1 = 0.52. Here, assuming that the threshold value of the degree of coincidence is 0.6, the determination in step S230 is “No”, and the process proceeds to step S280.
[0048]
At this stage, the end of the document has not been reached, so the determination in step S280 is “No”. In step S290, i is incremented and assigned to k, so that k is incremented by 1, and the process returns to step S220. As a result, in step S220, text[1] = “row” is collated with key[0] = “inside”.
[0049]
When the similarity output unit 7 performs the same calculation as above and outputs, for example, a similarity of 0.4 to the document search unit 6, the determination in step S230 is again “No”, so the process proceeds to step S280. These processes are then executed in the same way, and when collation shows that the keyword has a matching degree equal to or greater than the threshold with “internal processing vertical type” in the text, that portion is set as an output candidate.
[0050]
As described above, according to the present embodiment, since only one character is stored for each character image in the recognition result, the memory capacity required for storing a document can be reduced.
In addition, the characters to be collated are not limited to the few candidate characters of the recognition result. All characters are subject to collation, and the similarity of every character is used at the time of collation. Search omissions are therefore reduced, and an appropriate matching degree can be calculated even when the recognition result contains characters that partially mismatch the keyword, which has the notable effect of saving the user the trouble of checking the search results.
[0051]
Note that the character recognition method is not limited to the above method. For example, a structural analysis method described in “Pattern Recognition” (Noboru Funakubo, Kyoritsu Shuppan) may be used. Further, in the above embodiment, the similarity output unit 7 calculates the similarity from the dictionary used by the character recognition unit 3, but the invention is not limited to this, and a table of the similarities between characters may be held in advance. FIG. 7 is an example in which the similarities between characters are tabulated.
Further, it goes without saying that the threshold value of the degree of coincidence and the method of calculating the degree of coincidence are not limited to the above numerical values and expressions.
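A precomputed similarity table of the kind mentioned above might be built as in the following sketch. The table layout (a dictionary keyed by character pairs) and the names `similarity` and `build_table` are assumptions made here; the similarity function uses the form of equation (2) inferred from the worked example s = 1 − {91 / (94 + 95)}, and the actual layout of FIG. 7 is not reproduced.

```python
from itertools import combinations_with_replacement
from typing import Dict, Sequence, Tuple


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form of equation (2): 1 - sum|a - b| / (sum(a) + sum(b))."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 1.0 - diff / total


def build_table(standard: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Precompute the similarity of every pair of characters in the dictionary."""
    table: Dict[Tuple[str, str], float] = {}
    for a, b in combinations_with_replacement(sorted(standard), 2):
        s = similarity(standard[a], standard[b])
        table[(a, b)] = table[(b, a)] = s  # the measure is symmetric
    return table
```

Looking up `table[(text_char, key_char)]` at search time then replaces the per-pair feature comparison, at the cost of storing one value per character pair.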
[0052]
Embodiment 2
FIG. 8 is a block diagram showing a configuration of a filing system according to Embodiment 2 of the present invention. In the figure, the same components as those of the system according to the first embodiment shown in FIG. 1 are denoted by the same reference numerals, and the description thereof is omitted here.
The post-processing unit 10 in FIG. 8 grammatically verifies the character recognition result of the character recognition unit 3 stored in the document storage unit 2 and outputs the result. The word dictionary 11 is a dictionary used by the post-processing unit 10.
[0053]
First, a document registration method in the system according to the present embodiment will be described.
FIG. 9 is a flowchart showing a character recognition result saving process executed by the filing system according to the present embodiment. In step S300 in the figure, the character recognition unit 3 performs a character recognition process. Since the method of character recognition here is the same as the method in the first embodiment, description thereof will be omitted. It should be noted that a plurality of recognition results, including candidate characters, are output for one character image.
[0054]
Next, the process proceeds to step S310 to perform post-processing. Here, the post-processing unit 10 grammatically analyzes the recognition result including the candidate character and determines a combination that is considered to be correct as a sentence. Hereinafter, a post-processing method using known morphological matching will be described.
[0055]
FIG. 10 is a flowchart showing the procedure of post-processing according to the present embodiment. Here, this post-processing is performed on the recognition result candidate characters shown in FIG. 18. FIG. 11 is a diagram showing the correspondence between words in the word dictionary 11 and morpheme numbers. Note that a morpheme number is a number uniquely assigned to a morpheme type (a word class such as common noun, proper noun, or sa-hen verbal noun). FIG. 12 shows the connection relationship between these morpheme numbers.
[0056]
In step S410 of FIG. 10, the post-processing unit 10 checks the result of character recognition against the word dictionary 11. Here, when the word shown in FIG. 11 is compared with the recognition result shown in FIG. 18, the “processing”, “integration”, and “type” in FIG. 11 are successfully matched with the recognition result. Then, the process proceeds to step S420, and a morphological connection test is performed according to the description shown in FIG.
[0057]
As shown in FIG. 11, the morpheme number of “processing” is “55”, the morpheme numbers of “integration” are the two types “55” and “59”, and that of “type” is “46”. Therefore, when a morphological connection test is performed using the contents shown in FIG. 12, the connections of 55 and 55, and of 55 and 46, are both established (indicated by a circle in the figure). Next, the process proceeds to step S430, where the number of characters of each matching morpheme is set as an evaluation value and assigned to each candidate character.
[0058]
FIG. 13 shows the collation result output by the post-processing unit 10. As shown in the figure, the evaluation value is “2” for “processing”, “physical”, “to”, and “go”, and “1” for “type”. The other characters do not match the word dictionary, so they are all assigned “0”.
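The evaluation-value assignment can be illustrated with a toy sketch. The single letters below are placeholders standing in for the characters of the figures (P and R for the two characters of “processing”, I and G for “integration”, T for “type”, V for the misrecognized “vertical”), and the function names are hypothetical. The morpheme numbers 55, 59 and 46 and the allowed connections 55-55 and 55-46 are taken from the text. In this sketch the connection test is a separate check and does not change the scores, which is a simplification.

```python
from typing import Dict, List, Set, Tuple


def evaluate(candidates: List[List[str]], words: Dict[str, Set[int]]) -> List[Dict[str, int]]:
    """Give each candidate character the length of the longest dictionary word it forms."""
    scores = [{ch: 0 for ch in column} for column in candidates]
    for start in range(len(candidates)):
        for word in words:
            if start + len(word) > len(candidates):
                continue
            # The word matches if each of its characters is a candidate at its position.
            if all(ch in candidates[start + i] for i, ch in enumerate(word)):
                for i, ch in enumerate(word):
                    scores[start + i][ch] = max(scores[start + i][ch], len(word))
    return scores


def connectable(left: Set[int], right: Set[int], allowed: Set[Tuple[int, int]]) -> bool:
    """Morphological connection test: some pair of morpheme numbers is allowed."""
    return any((a, b) in allowed for a in left for b in right)


words = {"PR": {55}, "IG": {55, 59}, "T": {46}}
allowed = {(55, 55), (55, 46)}
candidates = [["P"], ["R"], ["V", "I"], ["G"], ["T"]]
scores = evaluate(candidates, words)
```

Here the candidate V receives 0 while I receives 2, which corresponds to the way “vertical” and “to” are scored in FIG. 13.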
[0059]
When the above post-processing is completed, step S320 in FIG. 9 is executed. That is, the document storage unit 2 calculates a comprehensive evaluation value for each character from the character recognition result and the post-processing result. This comprehensive evaluation value is calculated by the following equation (4).
Overall evaluation value = α × (similarity of character recognition) + (1−α) × (evaluation value of post-processing) (4)
Here, α takes a value satisfying 0 ≦ α ≦ 1, and in the present embodiment, α = 0.8.
[0060]
Now, of the candidate characters shown in FIG. 18, suppose that the similarity of “vertical” is 0.7 and the similarity of “to” is 0.5, and apply the matching result shown in FIG. 13 to equation (4). The overall evaluation value for “vertical” is 0.8 × 0.7 + 0.2 × 0 = 0.56, and the overall evaluation value for “to” is 0.8 × 0.5 + 0.2 × 2 = 0.8, so the evaluation value of “to” is higher than that of “vertical”.
[0061]
In the following step S330, the document storage unit 2 distinguishes between characters that can be uniquely determined and characters that cannot. For example, a character is regarded as uniquely determinable when its overall evaluation value is equal to or greater than a certain value, and as not uniquely determinable when it is less than that value. Then, in step S340, for a character that can be uniquely determined, the document storage unit 2 stores only the character code having the highest overall evaluation value together with the flag “0”. For a character determined not to be uniquely determinable, the character code having the highest overall evaluation value is stored together with the flag “1”. That is, character recognition and post-processing are performed, and the one character having the largest evaluation value is stored as the recognition result regardless of whether or not it can be uniquely determined.
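Steps S320 to S340 can be sketched as follows, under stated assumptions. Equation (4) and α = 0.8 come from the text; the function names and the threshold of 0.7 used in the usage lines are hypothetical, since the text only speaks of “a certain value”.

```python
from typing import Dict, Tuple


def overall(similarity: float, evaluation: float, alpha: float = 0.8) -> float:
    """Equation (4): alpha * similarity + (1 - alpha) * post-processing evaluation."""
    return alpha * similarity + (1 - alpha) * evaluation


def choose(cands: Dict[str, Tuple[float, float]], threshold: float,
           alpha: float = 0.8) -> Tuple[str, int]:
    """Pick the candidate with the highest overall value and attach a flag.

    cands maps each candidate character to (similarity, evaluation value).
    Flag 0 means uniquely determinable, flag 1 means not uniquely determinable.
    """
    best = max(cands, key=lambda c: overall(cands[c][0], cands[c][1], alpha))
    value = overall(cands[best][0], cands[best][1], alpha)
    return best, 0 if value >= threshold else 1


# Values from the example in the text; the threshold 0.7 is an assumption.
stored = choose({"vertical": (0.7, 0), "to": (0.5, 2)}, threshold=0.7)
```

Only one character is stored per character image in either case; the flag records how much the stored code should be trusted at search time.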
[0062]
FIG. 14 shows an example of storing a character recognition result in the present embodiment. As shown in the figure, as for the character recognition result, in the present embodiment, the character "" is correctly stored as compared with the case of FIG.
[0063]
Hereinafter, a search method in the system according to the present embodiment will be described. The search method here is basically the same as the search method according to the first embodiment shown in FIG. 4, but the processing in step S220 is different. In the first embodiment, the degree of coincidence with the keyword is calculated for all characters using the similarity from the similarity output unit 7. In the second embodiment, the flag of text is used: for a character whose flag is 1, the value from the similarity output unit 7 is used as the similarity. For a character whose flag is 0, a similarity of 1 is given if the character codes of text and key match completely, and a similarity of 0 is returned if they do not match.
[0064]
Therefore, the search method according to the present embodiment will be described with reference to the flowchart shown in FIG. 4 taking as an example a case where the text shown in FIG. 14 is collated with the keyword “internal processing integrated type”.
The document search unit 6 receives the keyword “internal processing integrated type” (step S200 in FIG. 4), initializes the text and key pointers (step S210), and compares the text and the key in the subsequent step S220. Here, as in the first embodiment, text[0] = “de” is first collated with key[0] = “inside”.
[0065]
As shown in FIG. 14, since the flag of text[0] is “0”, the document search unit 6 checks whether text[0] and key[0] have the same character code. In this case they are not the same, so the matching is determined to have failed, and the document search unit 6 returns a similarity of 0. The process then proceeds to step S230, and the degree of coincidence is calculated. Since the degree of coincidence is 0 / 1 = 0, the process proceeds to step S280 and the subsequent steps, and in step S220 text[1] = “row” is collated with key[0] = “inside”.
[0066]
The process continues in the same manner, and when text[5] is compared with key[0], a matching degree equal to or greater than the predetermined value is obtained. The determination in step S230 is therefore “Yes”, and the process proceeds to step S240. In step S240, the pointers of the text and the key are each incremented by one, and the process then returns to step S220 via steps S250 and S260.
[0067]
In the case of text[6] = “city” and key[1] = “part”, the flag of text[6] is “1”, so the document search unit 6 uses the result of the similarity output unit 7 as the similarity. For example, assuming that the similarity (city, part) returned by the similarity output unit 7 is 0.6, the degree of coincidence obtained in step S230 is (1 + 0.6) / 2 = 0.8. Since this exceeds the fixed value 0.6, the process proceeds to step S240.
Hereinafter, the same processing is performed, and when the collation with the text “internal processing vertical type” is successful, it is output as a search result.
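The flag-dependent similarity of step S220 in this embodiment can be sketched as below. The type alias `Stored`, the function name, and the callable `sim` standing in for the similarity output unit 7 are assumptions introduced for illustration.

```python
from typing import Callable, Tuple

Stored = Tuple[str, int]  # (character code, flag) as saved in step S340


def char_similarity(stored: Stored, key_char: str,
                    sim: Callable[[str, str], float]) -> float:
    """Similarity of one stored character to one keyword character."""
    char, flag = stored
    if flag == 0:
        # Uniquely determined: only an exact character-code match counts.
        return 1.0 if char == key_char else 0.0
    # Not uniquely determined: fall back to the pattern-based similarity.
    return sim(char, key_char)


# Worked example from the text: an exact match followed by similarity 0.6.
sims = [
    char_similarity(("inside", 0), "inside", lambda a, b: 0.0),
    char_similarity(("city", 1), "part", lambda a, b: 0.6),
]
degree = sum(sims) / len(sims)  # equation (3)
```

Passing `char_similarity` to the collation loop in place of a plain similarity function is all that distinguishes the second embodiment's search from the first.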
[0068]
As described above, according to the present embodiment, for characters that can be uniquely determined, the matching is determined to be successful only when the character codes are completely the same. This removes collation against unnecessary characters and further reduces the probability that an erroneous extraction is detected.
[0069]
Also, because the degree of matching with the keyword is always evaluated using the similarity of each character, results are not displayed with the same matching degree regardless of which character appears in the non-matching part. Erroneous extractions and correct search results can therefore be separated more clearly on the basis of the matching degree.
[0070]
In the second embodiment, the character code and the flag 1 are stored for a character whose character code cannot be uniquely determined. However, the present invention is not limited to this. A character code may fail to be uniquely determined because, for example, the density of the input image is inappropriate and the character is blurred or crushed. In such a case, storing the feature used for character recognition for a character that cannot be uniquely determined enables more accurate collation and a more accurate search than storing a character code.
[0071]
FIG. 15 shows an example of a dictionary relating to the above feature storage. As shown in the figure, this dictionary stores the feature value of each character whose flag is 1. The feature value (indicated by 21 in the drawings) is, for example, an 8 × 8-dimensional (64-dimensional) feature value such as those shown in FIGS. 5 and 6, written out in a single line.
[0072]
As the process at the time of retrieval, when the flag of text[k] is 0 in step S220 of FIG. 4, the similarity output unit 7 compares the character codes of text[k] and key[j]. When the flag of text[k] is 1, the feature value shown in FIG. 15 is used as the feature of the character text[k], the feature in the dictionary 8 is used as the feature of the character key[j], and the similarity may be calculated according to the above equation (2).
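A sketch of this variant follows. The `StoredChar` record, its field names, and the function names are hypothetical, and `density_similarity` uses the form of equation (2) inferred from the worked example s = 1 − {91 / (94 + 95)}, which is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class StoredChar:
    code: str
    flag: int                                   # 0: uniquely determined, 1: not
    feature: Optional[Sequence[float]] = None   # kept only when flag == 1


def density_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form of equation (2): 1 - sum|a - b| / (sum(a) + sum(b))."""
    diff = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 1.0 - diff / total


def similarity_for(stored: StoredChar, key_char: str,
                   dictionary: Dict[str, Sequence[float]]) -> float:
    """Flag 0: compare character codes. Flag 1: compare the stored feature
    with the standard pattern of the keyword character."""
    if stored.flag == 0:
        return 1.0 if stored.code == key_char else 0.0
    return density_similarity(stored.feature, dictionary[key_char])
```

Compared with storing only a character code, the stored feature keeps the evidence of the original image, so a blurred character is compared against the keyword directly rather than through a possibly wrong code.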
[0073]
On the other hand, if the number of characters that cannot be uniquely determined is large relative to the number of characters in the document image, the document tolerates a wider range of keywords, and the probability that it is retrieved as an erroneous extraction increases. In such a case, for example, a message such as “this document may not be correctly searched” may be displayed to the user on the display unit 9 together with the image input from the input unit 1.
[0074]
This operation will be described with reference to FIGS. 9 and 16.
In this case, in step S340 in FIG. 9, the document storage unit 2 executes a document storage procedure according to the flowchart shown in FIG. 16. That is, in step S510 in FIG. 16, the document storage unit 2 calculates the number of characters A output by the character recognition unit 3. Then, in step S520, the number F of characters that cannot be uniquely determined (the number of characters whose flag is 1) is calculated from the result of the character recognition unit 3 and the result of the post-processing unit 10.
[0075]
Next, in step S530, it is determined whether F / A is equal to or greater than a certain value. If it is, the process proceeds to step S540, a warning message indicating that “this document may not be correctly searched” is displayed on the display unit 9, and the process ends without saving the document. If F / A is smaller than the predetermined value, the warning message is not displayed, the document is saved, and the process ends.
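The check in steps S510 to S530 amounts to a simple ratio test, sketched below. The function name and the limit of 0.5 used in the usage line are hypothetical; the text only refers to “a certain value”.

```python
from typing import Sequence


def should_warn(flags: Sequence[int], limit: float) -> bool:
    """Return True when F / A >= limit, where A is the number of recognized
    characters and F is the number of characters whose flag is 1."""
    a = len(flags)                              # S510: number of characters A
    f = sum(1 for flag in flags if flag == 1)   # S520: number of flagged characters F
    return a > 0 and f / a >= limit             # S530


# Two of four characters could not be uniquely determined.
warn = should_warn([0, 1, 1, 0], limit=0.5)
```

When the function returns True, the flow of FIG. 16 shows the warning message and ends without saving the document; otherwise the document is saved.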
[0076]
By doing so, the user can guess the cause of the incorrect retrieval of the document by looking at the message and the image, and this is an effective means for saving correct data.
[0077]
【Effects of the Invention】
As described above, according to the first aspect, a document filing system that retrieves a predetermined document from document images according to a keyword comprises means for recognizing characters in the document image, collating means for collating the characters in the keyword with the recognized characters, and output means for outputting information on the document image as a document search result based on the result of the collation. The collating means uses standard patterns in a dictionary to compare the density pattern of the character code of each character in the keyword with the density pattern of the character code of each recognized character, calculates the similarity between the character in the keyword and the recognized character for each character, uses the sum of the calculated per-character similarities to calculate a degree of coincidence indicating the proportion of matching characters among the characters in the keyword, and performs the collation based on the calculated degree of coincidence. The output means outputs the document corresponding to the keyword as the document search result based on the value of the degree of coincidence. As a result, the memory capacity required for storing documents can be reduced, search omissions can be reduced, and an appropriate matching degree can be calculated even when the recognition result contains characters that partially mismatch the keyword, which saves the user the trouble of checking the search results.
[0078]
According to the second aspect, the system further comprises means for performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and assigning to each character, based on the result of the grammatical analysis, an evaluation value indicating the state of matching between the character and the word dictionary. The collating means obtains a comprehensive evaluation value by adding the similarity and the evaluation value, and performs the collation based on the comprehensive evaluation value obtained for each character, so that a more appropriate matching degree can be calculated.
[0079]
According to the third aspect, the collating means distinguishes, based on whether the comprehensive evaluation value is equal to or greater than a predetermined value, between characters whose character code can be uniquely determined and characters whose character code cannot be uniquely determined, and stores the uniquely determinable characters with their character codes associated with predetermined flags. The probability that an erroneous extraction is detected can therefore be reduced, and erroneous extractions and correct search results can be separated more clearly on the basis of the obtained matching degree.
[0080]
According to the fourth aspect, the collating means calculates, for the uniquely determinable characters, the similarity to the characters in the keyword based on whether or not the character codes are the same, and calculates the degree of coincidence using that similarity. For the characters that cannot be uniquely determined, it calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character. A character matching degree appropriate for document search can therefore be calculated.
[0081]
According to the fifth aspect, the system further comprises means for extracting a feature amount of each of the characters in the keyword and the recognized characters, and the collating means calculates, for the characters that cannot be uniquely determined, the similarity between the feature amounts and calculates the degree of coincidence using that similarity. More accurate collation and search can therefore be performed than when character codes are stored.
[0082]
According to the sixth aspect, the system further comprises means for calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, and, when the ratio is equal to or greater than a predetermined value, the output means displays that the predetermined document cannot be properly searched. By looking at the display, the user can infer why the document is not retrieved correctly.
[0083]
According to the seventh aspect, a document filing method retrieves a predetermined document from document images according to a keyword in a document filing system comprising a document image database that stores optically read document images, a character recognition unit that recognizes characters in the document images, and a character recognition result database that stores character codes as the result of the character recognition. The method comprises a step of recognizing characters in the document image; a step of comparing, using standard patterns in a dictionary, the density pattern of the character code of each character in the keyword with the density pattern of the character code of each recognized character, and calculating the similarity between the character in the keyword and the recognized character for each character; a step of calculating, using the sum of the calculated per-character similarities, a degree of coincidence indicating the proportion of matching characters among the characters in the keyword; a collation step of collating the characters in the keyword with the recognized characters based on the degree of coincidence; and an output step of outputting a document corresponding to the keyword as a document search result based on the result of the collation. As a result, the memory capacity required for storing documents can be reduced, search omissions can be reduced, and an appropriate matching degree can be calculated even when the recognition result contains characters that partially mismatch the keyword, which saves the user the trouble of checking the search results.
[0084]
According to the eighth aspect, the method further comprises a step of performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and assigning to each character, based on the result of the grammatical analysis, an evaluation value indicating the state of matching between the character and the word dictionary. The collation step obtains a comprehensive evaluation value by weighted addition of the similarity and the evaluation value and performs the collation based on the comprehensive evaluation value obtained for each character, so that a more appropriate degree of coincidence can be calculated.
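A minimal sketch of the weighted addition that yields the comprehensive evaluation value is shown below. The weights `w_similarity` and `w_evaluation`, their default values, and the function names are assumptions made for illustration. The text does not state the actual weights or the scale of the evaluation value, so both inputs are assumed here to lie between 0 and 1.

```python
from typing import List, Sequence, Tuple


def comprehensive_evaluation(similarity: float, evaluation: float,
                             w_similarity: float = 0.5,
                             w_evaluation: float = 0.5) -> float:
    """Weighted addition of the recognition similarity and the dictionary evaluation value."""
    return w_similarity * similarity + w_evaluation * evaluation


def evaluate_characters(chars: Sequence[Tuple[str, float, float]],
                        w_similarity: float = 0.5,
                        w_evaluation: float = 0.5) -> List[Tuple[str, float]]:
    """Map (character code, similarity, evaluation value) to (character code, comprehensive value)."""
    return [(code, comprehensive_evaluation(s, e, w_similarity, w_evaluation))
            for code, s, e in chars]
```

With the assumed equal weights, a character recognized with similarity 0.5 that also fits a dictionary word (evaluation value 1.0) receives a comprehensive value of 0.75, so the grammatical evidence raises confidence in an uncertain recognition result.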
[0085]
According to the ninth aspect, the collation step distinguishes, among the recognized characters, between characters whose character code can be uniquely determined and characters whose character code cannot, based on whether the comprehensive evaluation value is equal to or greater than a certain value, and the uniquely determinable characters are stored with their character codes associated with a predetermined flag. This reduces the probability of erroneous extraction and allows erroneous extractions and correct search results to be clearly separated based on the obtained degree of coincidence.
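The threshold test and the flagged storage can be illustrated as follows. This is a sketch under stated assumptions: the record type `StoredChar`, the function `flag_characters`, and the default threshold of 0.8 are hypothetical, and for simplicity a non-unique character is stored here with its top candidate code and the flag cleared, which the text does not prescribe.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class StoredChar:
    """One stored recognition result: a character code plus a 'uniquely determined' flag."""
    code: str
    unique: bool


def flag_characters(scored: Sequence[Tuple[str, float]],
                    threshold: float = 0.8) -> List[StoredChar]:
    """Flag a character as uniquely determined when its comprehensive value reaches the threshold."""
    return [StoredChar(code, value >= threshold) for code, value in scored]
```

For example, `flag_characters([("A", 0.9), ("B", 0.5)])` marks "A" as uniquely determined and "B" as not, so only "B" needs similarity-based handling at search time.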
[0086]
According to the tenth aspect, for the uniquely determinable characters, the collation step calculates the similarity to the characters in the keyword based on whether the character codes are the same and calculates the degree of coincidence using that similarity; for the characters that cannot be uniquely determined, it calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character. An appropriate degree of coincidence between characters can therefore be calculated for document search.
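The two-way treatment of uniquely determined and non-unique characters might look like the sketch below. It assumes that a uniquely determined character contributes 1 when its code equals the keyword character and 0 otherwise, that a non-unique character contributes a graded similarity between 0 and 1, and that the degree of coincidence is the sum divided by the keyword length as a percentage. The names `coincidence_with_flags` and `example_similarity`, and the value 0.75, are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def coincidence_with_flags(keyword: str,
                           stored: Sequence[Tuple[str, bool]],
                           similarity: Callable[[str, str], float]) -> float:
    """Degree of coincidence (percent) over (character code, unique flag) pairs."""
    if not keyword or len(keyword) != len(stored):
        raise ValueError("keyword and stored characters must be non-empty and of equal length")
    total = 0.0
    for key_char, (code, unique) in zip(keyword, stored):
        if unique:
            # Uniquely determined: a plain character-code comparison is enough.
            total += 1.0 if key_char == code else 0.0
        else:
            # Not uniquely determined: fall back to the graded similarity.
            total += similarity(key_char, code)
    return total / len(keyword) * 100.0


def example_similarity(a: str, b: str) -> float:
    """Hypothetical similarity: 1.0 if equal, 0.75 for the pair O/0, else 0.0."""
    if a == b:
        return 1.0
    return 0.75 if {a, b} == {"O", "0"} else 0.0
```

With the keyword "COOL" and the stored characters `[("C", True), ("0", False), ("0", True), ("L", True)]`, the result is 68.75. The non-unique "0" earns partial credit, while the uniquely determined "0" counts as a definite mismatch, which is what helps separate erroneous extractions from correct hits.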
[0087]
According to the eleventh aspect, the method further comprises a step of extracting a feature amount from each of the characters in the keyword and the recognized characters, and the collation step calculates, for the characters that cannot be uniquely determined, the similarity of the feature amounts and calculates the degree of coincidence using that similarity. Collation and search can therefore be performed more accurately than when character codes are stored.
[0088]
According to the twelfth aspect, the method further includes a step of calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters, and the output step displays, when the ratio is equal to or greater than a predetermined value, that the predetermined document cannot be properly retrieved. By looking at this display, the user can guess why the document was not retrieved correctly.
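The ratio check that triggers this display can be sketched as follows. The function names, the default limit of 0.3, and the message text are assumptions for illustration only. The text says only that a display is produced when the ratio is equal to or greater than a predetermined value.

```python
from typing import Optional, Sequence


def ambiguous_ratio(unique_flags: Sequence[bool]) -> float:
    """Ratio of characters that could not be uniquely determined to all recognized characters."""
    if not unique_flags:
        raise ValueError("at least one recognized character is required")
    ambiguous = sum(1 for unique in unique_flags if not unique)
    return ambiguous / len(unique_flags)


def search_warning(unique_flags: Sequence[bool], limit: float = 0.3) -> Optional[str]:
    """Return a warning message when the ambiguous ratio reaches the limit, else None."""
    if ambiguous_ratio(unique_flags) >= limit:
        return "This document may not be retrieved correctly: too many ambiguous characters."
    return None
```

A document in which half of the characters are ambiguous would produce the warning under the assumed limit, which tells the user that a poor search result is probably caused by low recognition quality rather than by the choice of keyword.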
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a filing system according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing a document storage processing procedure in the system according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of storing a character recognition result according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing a search processing procedure in the systems according to the first and second embodiments of the present invention.
FIG. 5 is a diagram showing characteristics of a standard pattern (character “de”) in the dictionary according to the first embodiment.
FIG. 6 is a diagram showing characteristics of a standard pattern (character “in”) in the dictionary according to the first embodiment.
FIG. 7 is a diagram showing a similarity between characters in a table format.
FIG. 8 is a block diagram showing a configuration of a filing system according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing a character recognition result saving process executed by the filing system according to the second embodiment.
FIG. 10 is a flowchart illustrating a procedure of post-processing according to the second embodiment.
FIG. 11 is a diagram showing the correspondence between words in a word dictionary and morpheme numbers according to the second embodiment.
FIG. 12 is a diagram showing a description of a connection rule between morphemes in the word dictionary according to the second embodiment.
FIG. 13 is a diagram illustrating a comparison result output from a post-processing unit.
FIG. 14 is a diagram showing an example of storing a character recognition result in the second embodiment.
FIG. 15 is a diagram illustrating an example of a dictionary related to feature storage used for character recognition.
FIG. 16 is a flowchart showing another procedure for storing a document according to the second embodiment.
FIG. 17 is a diagram showing a block configuration of a conventional filing apparatus.
FIG. 18 is a diagram showing text data candidates in a conventional device.
[Explanation of symbols]
1 ... Input unit, 2 ... Document storage unit, 3 ... Character recognition unit, 4 ... Document image database, 5 ... Character recognition result database, 6 ... Document search unit, 7 ... Similarity output unit, 8 ... Dictionary, 9 ... Display unit, 10 ... Post-processing unit, 11 ... Word dictionary, 20 ... Keyword

Claims

A document filing system that retrieves a predetermined document from document images according to a keyword, comprising:
means for recognizing characters in the document image;
collating means for collating the keyword with the recognized characters; and
output means for outputting information about the document image as a document search result based on the result of the collation,
wherein the collating means uses standard patterns in a dictionary to compare the density-value pattern for the character code of each character in the keyword with the density-value pattern for the character code of each recognized character, and calculates, for each character, the similarity between the character in the keyword and the recognized character,
calculates, using the sum of the similarities calculated for each character, a degree of coincidence indicating the proportion of characters in the keyword that match, and performs the collation based on the calculated degree of coincidence, and wherein the output means outputs a document corresponding to the keyword as the document search result based on the value of the degree of coincidence.

The document filing system according to claim 1, further comprising means for performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and
assigning to each character, based on the result of the grammatical analysis, an evaluation value indicating the state of matching between the character and the word dictionary,
wherein the collating means obtains a comprehensive evaluation value by weighted addition of the similarity and the evaluation value, and performs the collation based on the comprehensive evaluation value obtained for each character.

The document filing system according to claim 2, wherein the collating means distinguishes, among the recognized characters, between characters whose character code can be uniquely determined and characters whose character code cannot, based on whether the comprehensive evaluation value is equal to or greater than a certain value, and the uniquely determinable characters are stored with their character codes associated with a predetermined flag.

The document filing system according to claim 3, wherein, for the uniquely determinable characters, the collating means calculates the similarity to the characters in the keyword based on whether the character codes are the same and calculates the degree of coincidence using that similarity, and, for the characters that cannot be uniquely determined, calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.

The document filing system according to claim 3, further comprising means for extracting a feature amount from each of the characters in the keyword and the recognized characters,
wherein the collating means calculates, for the characters that cannot be uniquely determined, the similarity of the feature amounts and calculates the degree of coincidence using that similarity.

The document filing system according to claim 3, further comprising means for calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters,
wherein the output means displays, when the ratio is equal to or greater than a predetermined value, that the predetermined document cannot be properly retrieved.

A document filing method for retrieving a predetermined document from document images according to a keyword in a document filing system comprising a document image database that stores optically read document images, a character recognition unit that recognizes characters in the document images, and a character recognition result database that stores the character codes resulting from the character recognition, the method comprising:
a step of recognizing characters in the document image;
a step of using standard patterns in a dictionary to compare the density-value pattern for the character code of each character in the keyword with the density-value pattern for the character code of each recognized character, and calculating, for each character, the similarity between the character in the keyword and the recognized character;
a step of calculating, using the sum of the similarities calculated for each character, a degree of coincidence indicating the proportion of characters in the keyword that match;
a collation step of collating the characters in the keyword with the recognized characters based on the calculated degree of coincidence; and
an output step of outputting a document corresponding to the keyword as a document search result based on the result of the collation.

The document filing method according to claim 7, further comprising a step of performing a predetermined grammatical analysis on the recognized characters with reference to a word dictionary stored in advance and
assigning to each character, based on the result of the grammatical analysis, an evaluation value indicating the state of matching between the character and the word dictionary,
wherein the collation step obtains a comprehensive evaluation value by weighted addition of the similarity and the evaluation value, and performs the collation based on the comprehensive evaluation value obtained for each character.

The document filing method according to claim 8, wherein the collation step distinguishes, among the recognized characters, between characters whose character code can be uniquely determined and characters whose character code cannot, based on whether the comprehensive evaluation value is equal to or greater than a certain value, and the uniquely determinable characters are stored with their character codes associated with a predetermined flag.

The document filing method according to claim 9, wherein, for the uniquely determinable characters, the collation step calculates the similarity to the characters in the keyword based on whether the character codes are the same and calculates the degree of coincidence using that similarity, and, for the characters that cannot be uniquely determined, calculates the degree of coincidence using the similarity between the character in the keyword and the recognized character.

The document filing method according to claim 9, further comprising a step of extracting a feature amount from each of the characters in the keyword and the recognized characters,
wherein the collation step calculates, for the characters that cannot be uniquely determined, the similarity of the feature amounts and calculates the degree of coincidence using that similarity.

The document filing method according to claim 9, further comprising a step of calculating the ratio of the number of characters that cannot be uniquely determined to the number of recognized characters,
wherein the output step displays, when the ratio is equal to or greater than a predetermined value, that the predetermined document cannot be properly retrieved.