JP4283038B2

JP4283038B2 - Document registration device, document search device, program, and storage medium

Info

Publication number: JP4283038B2
Application number: JP2003156116A
Authority: JP
Inventors: 裕一小島; 研策山本; 裕子井田; 雅之亀田; 優希子平岡; 泰嗣小川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-06-02
Filing date: 2003-06-02
Publication date: 2009-06-24
Anticipated expiration: 2023-06-02
Also published as: JP2004362007A

Description

[0001]
[Technical field of the invention]
The present invention relates to a document registration apparatus that registers document data so that documents matching an input search expression can be retrieved from a plurality of registered document data; a document search apparatus that retrieves documents matching an input search expression from a plurality of registered document data; a program that implements these apparatuses; and a storage medium that stores the program.
[0002]
[Prior art]
As a technique for searching in multiple languages, for example, there is one disclosed in Patent Document 1. In such a technique, a bilingual dictionary between languages is used to implement cross-language search.
[0003]
However, even where no full-fledged cross-language search function is provided, documents in a plurality of languages may well be stored in a document database. For example, in a document database used in Japan, it is not uncommon for English text, or text mixed with English, to be stored.
[0004]
Meanwhile, document search technology includes normalization techniques that prevent search omissions by absorbing spelling variations, such as “コンピュータ” and “コンピューター” (two katakana spellings of “computer”), and singular/plural variations, such as “chair” and “chairs”. Such normalization should be performed for each registered document according to the language used in it. Performing only Japanese normalization merely because the document database is used in Japan leads to search omissions for documents in the database that are written in languages other than Japanese. To address this problem, some search engines on the WWW provide an interface for specifying a language.
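As a rough illustration of what such normalization does, the following minimal sketch maps spelling variants to a common form so that they match at search time. The two rules (dropping a trailing katakana long-vowel mark and stripping a final “s” from ASCII words) and the function name `normalize` are assumptions made for this example only; they are not rules taken from this publication.

```python
def normalize(term: str) -> str:
    """Toy normalizer; both rules are illustrative assumptions."""
    # Japanese-style rule: drop a trailing long-vowel mark.
    if term.endswith("ー"):
        term = term[:-1]
    # English-style rule: crude plural stripping for ASCII words.
    if term.isascii() and len(term) > 1 and term.endswith("s"):
        term = term[:-1]
    return term


# Variants collapse to the same key, so a query for one finds the other.
same_katakana = normalize("コンピューター") == normalize("コンピュータ")
same_plural = normalize("chairs") == normalize("chair")
print(same_katakana, same_plural)
```

A real system would need one such rule set per language, which is why the choice of language matters for each document.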
[0005]
[Patent Document 1]
JP-A-8-212229 gazette
[0006]
[Problems to be solved by the invention]
However, in document database technology, the language of the country of sale is the main language, and other languages are sometimes mixed into documents written in that main language. Likewise, even a document written in some other language sometimes contains English, which is widely used. In these cases, a language other than the main language cannot be specified, so sufficient normalization cannot be performed and search omissions may occur.
[0007]
An object of the present invention is to execute searches with few omissions on documents written in a plurality of languages.
[0008]
The invention according to claim 1 is a document registration apparatus that registers document data so that a document matching an input search expression can be retrieved from a plurality of registered document data, the apparatus comprising: document data receiving means for receiving input of the document data to be registered; language designation receiving means for receiving designation of a language for the one document being received; index creation means which, when a plurality of languages including the received designated language each have a different character code area, normalizes the document data for one of the plurality of languages to create normalized data, normalizes that normalized data for another one of the languages to create new normalized data, and likewise creates new normalized data for all the remaining languages, thereby creating a single set of normalized data, and creates an index based on that single set of normalized data, and which, when the plurality of languages have the same character code area, normalizes the document data separately for each of the plurality of languages to create as many sets of normalized data as there are languages, and creates an index based on those sets of normalized data; and registration means for registering the index and the received document data.
[0009]
Therefore, it is possible to normalize one document in a plurality of languages and execute a search with few omissions on a document described in a plurality of languages.
[0010]
The invention according to claim 2 is the document registration apparatus according to claim 1, wherein the index creation means designates a predetermined language in addition to the designated language received by the language designation receiving means, and performs the normalization in those plural languages.
[0011]
Therefore, it is possible to reduce the trouble of specifying a plurality of languages for one document on the user side, and to prevent a search omission due to forgetting to specify a language.
[0012]
According to a third aspect of the present invention, in the document registration apparatus according to the second aspect, the index creating means designates English as the predetermined language.
[0013]
Therefore, by making English, which is likely to appear unexpectedly in a document written in some other language, a default language that requires no explicit designation, search omissions for documents mixed with English can be prevented.
[0018]
The invention according to claim 4 is a document search apparatus that retrieves a document matching an input search expression from a plurality of document data registered by the document registration apparatus according to claim 1, the apparatus comprising: search request receiving means for receiving a search request for the search; language designation receiving means for receiving designation of a language for the one search request being received; and search means for normalizing the received search request in a plurality of languages including the received designated language, and executing a search of the plurality of document data using an index created by normalizing the document data in a plurality of languages.
[0019]
Therefore, normalization can be performed in a plurality of languages with respect to one search request, and a search with less omission can be performed on a document described in a plurality of languages.
[0020]
The invention according to claim 5 is the document search apparatus according to claim 4, wherein the search means designates a predetermined language in addition to the designated language received by the language designation receiving means, and performs the normalization in those plural languages.
[0021]
Accordingly, it is possible to reduce the trouble of designating a plurality of languages for one search request on the user side, and to prevent a search omission due to forgetting to designate a language.
[0022]
The invention according to claim 6 is the document search apparatus according to claim 5, wherein the search means designates English as the predetermined language.
[0023]
Therefore, by making English, which is likely to appear unexpectedly in a document written in some other language, a default language that requires no explicit designation, search omissions for documents mixed with English can be prevented.
[0024]
The invention according to claim 7 is a computer-readable program that causes a computer to execute processing of registering document data so that a document matching an input search expression can be retrieved from a plurality of registered document data, the program causing the computer to execute: document data receiving processing for receiving input of the document data to be registered; language designation receiving processing for receiving designation of a language for the one document being received; index creation processing which, when a plurality of languages including the received designated language each have a different character code area, normalizes the document data for one of the plurality of languages to create normalized data, normalizes that normalized data for another one of the languages to create new normalized data, and likewise creates new normalized data for all the remaining languages, thereby creating a single set of normalized data, and creates an index based on that single set of normalized data, and which, when the plurality of languages have the same character code area, normalizes the document data separately for each of the plurality of languages to create as many sets of normalized data as there are languages, and creates an index based on those sets of normalized data; and registration processing for registering the index and the received document data.
[0025]
Therefore, it is possible to normalize one document in a plurality of languages and execute a search with few omissions on a document described in a plurality of languages.
[0026]
The invention according to claim 8 is a computer-readable program that causes a computer to execute processing of retrieving a document matching an input search expression from a plurality of document data registered by the program according to claim 7, the program causing the computer to execute: search request receiving processing for receiving a search request for the search; language designation receiving processing for receiving designation of a language for the one search request being received; and search processing for normalizing the received search request in a plurality of languages including the received designated language, and executing a search of the plurality of document data using an index created by normalizing the document data in a plurality of languages.
[0027]
Therefore, normalization can be performed in a plurality of languages with respect to one search request, and a search with less omission can be performed on a document described in a plurality of languages.
[0028]
The invention according to claim 9 is a storage medium storing a computer-readable program, wherein the program is the program according to either claim 7 or claim 8.
[0029]
Therefore, the stored program provides the same operations and effects as the invention according to claim 7 or 8.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described.
[0031]
FIG. 1 is a block diagram of the electrical connections of the document search system 1 according to the present embodiment. The document search system 1 is an apparatus that embodies the document registration apparatus and the document search apparatus of the present invention. As shown in FIG. 1, a CPU 11, which performs various operations and centrally controls each part of the document search system 1, and a memory 12, composed of various ROMs and RAMs, are connected by a bus 13.
[0032]
A magnetic storage device 14 such as a hard disk, an input device 15 such as a keyboard and a mouse, a display device 16, and a storage medium reading device 18 that reads a storage medium 17 such as an optical disk are connected to the bus 13 via predetermined interfaces, and a predetermined communication interface 19 for communicating with the network 2 is also connected. Various media can be used as the storage medium 17, such as optical disks (CD, DVD), magneto-optical disks, and flexible disks. Depending on the type of the storage medium 17, an optical disk device, a magneto-optical disk device, a flexible disk device, or the like is used as the storage medium reading device 18.
[0033]
The document search system 1 reads the program 20, which embodies the program of the present invention, from the storage medium 17, which embodies the storage medium of the present invention, and installs it on the magnetic storage device 14. The program may instead be downloaded via the network 2, such as the Internet, and installed. This installation makes the document search system 1 ready to execute the predetermined processing described later. The program 20 may run on a predetermined OS.
[0034]
Next, the processing executed by the document search system 1 is described. FIG. 2 is a functional block diagram of the functions that the document search system 1 realizes based on the program 20. FIGS. 3 and 4 are flowcharts explaining the processing executed by the document search system 1.
[0035]
First, the processing performed when document data is registered by the document search system 1 is described with reference to FIGS. 2 and 3. When storing document data in the document search system 1, the user first designates a specific language type and inputs the document data to be registered (document data receiving means, language designation receiving means, document data receiving processing, language designation receiving processing) (Y in step S1). This designation (language designation) is sent to the language designation unit 21 (step S2), and the language designation unit 21 adds the designation of a specific language, “English” in this example, to the user's language designation to form the language information (step S3).
[0036]
The document data storage unit 22 receives the document data and the language information, and first refers to the language designation/character code area correspondence table 31 shown in FIG. 5 (step S4). This table 31 records a language designation/character code area correspondence as shown in FIG. 5, in which each language 32, such as Japanese or English, is registered in association with the range of character codes used in that language (character code area 33).
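A minimal sketch of such a correspondence table, together with the overlap check that the registration processing performs on it in step S5, might look as follows. The code ranges are the ones this description quotes for Japanese, English, and French; the dictionary layout and the names `CODE_AREAS`, `ranges_overlap`, and `areas_overlap` are assumptions introduced for illustration, not structures defined by the publication.

```python
# Hypothetical stand-in for the language designation/character code area
# correspondence table 31: language 32 -> list of inclusive code ranges 33.
CODE_AREAS = {
    "Japanese": [(0x3000, 0x30FF), (0x3200, 0x33FF), (0x4E00, 0x9FFF),
                 (0xF900, 0xFAFF), (0xFF00, 0xFF9F)],
    "English": [(0x0020, 0x00FF)],
    "French": [(0x0020, 0x00FF)],
}


def ranges_overlap(a, b):
    """Return True if two inclusive (start, end) ranges share a code point."""
    return a[0] <= b[1] and b[0] <= a[1]


def areas_overlap(lang_a, lang_b, table=CODE_AREAS):
    """Step S5 (sketch): do the code areas of the two languages overlap?"""
    return any(ranges_overlap(ra, rb)
               for ra in table[lang_a]
               for rb in table[lang_b])


print(areas_overlap("Japanese", "English"))  # False
print(areas_overlap("French", "English"))    # True
```

Keeping the ranges as data rather than code means a new language can be supported by adding a table row, which matches the table-driven design the description suggests.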
[0037]
Then, the two languages designated by the language information are looked up among the languages 32, and it is determined whether the character code areas 33 corresponding to the two languages overlap each other (step S5). For example, when the languages included in the language information are “Japanese” and “English”, the character code areas 33 do not overlap (N in step S5). In that case, the document data storage unit 22 applies normalization for the user-designated language (“Japanese” in this example) to the received document data (step S6), applies normalization for the language designated by the language designation unit 21 (“English” in this example) to the result (step S7), uses that result to create an index in the document data accumulation unit 23, a database built on the magnetic storage device 14 in which the document data group is accumulated (the index is stored in the index table 41 of FIG. 5, described later) (step S8), and stores the document data (step S9).
[0038]
For example, when the languages included in the language information are “French” and “English”, the character code areas 33 overlap (Y in step S5). In this case, the document data storage unit 22 applies normalization for the user-designated language (“French” in this example) to the received document data (step S10) and uses the result to create an index in the document data accumulation unit 23 (step S11); it then applies normalization for the language designated by the language designation unit 21 (“English” in this example) (step S12), uses that result to create an index in the document data accumulation unit 23 (step S13), and stores the document data (step S14). Steps S6 to S8 and S10 to S13 realize the index creation means and index creation processing, and steps S9 and S14 realize the document registration means and document registration processing.
[0039]
The index created by this processing is stored in the index table 41 of FIG. 5. The index in this index table 41 is described concretely below. The example of FIG. 5 shows the index table 41 for the case where document 1, “messagingマネージャ” (“messaging manager”, with “manager” written in katakana), is stored as “Japanese” and “English”, and document 2, with the same content, is stored as “French” and “English”.
[0040]
For document 1, different normalizations are applied to portions with different character codes, so index entries are created only for the number of words that make up the document. That is, “Japanese” (0x3000-0x30ff, 0x3200-0x33ff, 0x4e00-0x9fff, 0xf900-0xfaff, 0xff00-0xff9f in the character code area 33) and “English” (0x0020-0x00ff in the character code area 33) have no overlapping character code range, so Japanese normalization is first applied to document 1 and English normalization is then applied to the result. Applying “Japanese” normalization to document 1 yields “messagingマネイジャ”; further applying “English” normalization yields “messageマネイジャ”, because the English normalization rules include a rule for the -ing form. As a result, the two words that make up the normalized document 1, “message” and “マネイジャ”, are registered as indexes to document 1. Accordingly, in the index table 41, the two words “message” and “マネイジャ” are registered in the word notation 42 column, and document 1 is registered in the document 43 column in association with each of them.
[0041]
For document 2, on the other hand, an index is created for each language's normalization, so index entries are created for every variation that normalization generates. That is, when document 2, “messagingマネージャ”, is stored as “French” and “English”, the language designation/character code area correspondence table 31 shows that “French” (0x0020-0x00ff in the character code area 33) has the same character code range as “English”, so one normalized document is created by normalizing document 2 as “French” and another by normalizing it as “English”. Normalizing as “French” produces the normalized document “messagingマネージャ”, because the French normalization rules contain no “messaging → message” rule. Normalizing as “English” produces the normalized document “messageマネージャ”. The distinct words are extracted from these two normalized documents, and the three words “message”, “messaging”, and “マネージャ” are registered as indexes to document 2. Accordingly, in the index table 41, the three words “message”, “messaging”, and “マネイジャ” are registered in the word notation 42 column, and document 2 is registered in the document 43 column in association with each of them.
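The two registration paths can be traced on the “messagingマネージャ” example with a short sketch. Everything here is a simplification under stated assumptions: the three toy rule functions, the `tokenize` helper, and the `overlapping` flag (standing in for the result of step S5) are hypothetical names, and each rule set contains only the single rule needed to reproduce the example.

```python
import re


# Toy per-language rules, assumed for illustration only.
def normalize_japanese(token):
    return token.replace("マネージャ", "マネイジャ")


def normalize_english(token):
    if re.fullmatch(r"[a-z]+ing", token):
        return token[:-3] + "e"  # messaging -> message
    return token


def normalize_french(token):
    return token  # this toy French rule set has no -ing rule


NORMALIZERS = {"Japanese": normalize_japanese,
               "English": normalize_english,
               "French": normalize_french}


def tokenize(text):
    """Split into runs of ASCII letters and runs of other characters."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]+", text)


def index_terms(text, languages, overlapping):
    tokens = tokenize(text)
    if not overlapping:
        # Steps S6-S8: chain the normalizations into one normalized text.
        for lang in languages:
            tokens = [NORMALIZERS[lang](t) for t in tokens]
        return set(tokens)
    # Steps S10-S13: normalize separately per language, pool the terms.
    terms = set()
    for lang in languages:
        terms.update(NORMALIZERS[lang](t) for t in tokens)
    return terms


doc = "messagingマネージャ"
doc1_terms = index_terms(doc, ["Japanese", "English"], overlapping=False)
doc2_terms = index_terms(doc, ["French", "English"], overlapping=True)
print(sorted(doc1_terms))  # ['message', 'マネイジャ']
print(sorted(doc2_terms))  # ['message', 'messaging', 'マネージャ']
```

The chained path yields one term per word, while the separate path yields one term per distinct variant, which is the index-size trade-off the description points out.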
[0042]
Next, the case where document data is searched by the document search system 1 is described with reference to FIGS. 2 and 4, taking the word “windows” as an example. First, the user inputs a search request into the request input unit 24 (search request receiving means, search request receiving processing) (Y in step S21) and then designates languages (for example, “Japanese”, “English”, and “French”) (language designation receiving means, language designation receiving processing) (Y in step S22). The language designation unit 21 normally adds “English” as the specific language to the user's language designation (step S23), but in this example “English” is already included in the designation, so nothing is added.
[0043]
The multilingual expansion unit 25 normalizes the word of the search request for each designated language, here “Japanese”, “English”, and “French” (step S24), obtaining “windows”, “window”, and “window” in this example. The request input unit 24 receives these, merges the duplicate “window” into one, and sends the normalization results, here “windows” and “window”, to the search unit 26 (step S25).
[0044]
The search unit 26 performs a search with the normalization results, “windows” or “window” in this example, using the index of the index table 43 (step S26), and outputs the result as the search result (step S27). Steps S23 to S26 realize the search means and search processing.
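The search-side flow of steps S23 to S26 can be sketched in the same spirit. The rule functions, the in-memory `index` dictionary, and the document identifiers `doc_ja` and `doc_en` are hypothetical and exist only to reproduce the “windows” example; the toy rules assume that the Japanese rule set leaves Latin words unchanged and that the English and French rule sets both strip a final “s”.

```python
def normalize_japanese(word):
    return word  # toy rule set: Latin-script words pass through unchanged


def strip_plural_s(word):
    # Toy rule assumed here for both English and French.
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


NORMALIZERS = {"Japanese": normalize_japanese,
               "English": strip_plural_s,
               "French": strip_plural_s}


def expand_query(word, languages, default="English"):
    # Step S23: add the default language unless it is already designated.
    langs = list(languages)
    if default not in langs:
        langs.append(default)
    # Step S24: normalize once per language.
    forms = [NORMALIZERS[lang](word) for lang in langs]
    # Step S25: merge duplicates while keeping first-seen order.
    return list(dict.fromkeys(forms))


def search(word, languages, index):
    # Step S26: look up every normalized form and pool the hits.
    hits = set()
    for form in expand_query(word, languages):
        hits.update(index.get(form, ()))
    return hits


# Hypothetical index: term -> identifiers of documents containing it.
index = {"windows": {"doc_ja"}, "window": {"doc_en"}}
forms = expand_query("windows", ["Japanese", "English", "French"])
print(forms)  # ['windows', 'window']
print(sorted(search("windows", ["Japanese", "English", "French"], index)))
```

Because the query is expanded into every normalized form, it can match a document regardless of which language set that document was registered under.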
[0045]
In this example, for a document containing the text “windows”, the word “windows” hits in the search if the document was registered as “Japanese”, and the word “window” hits if it was registered as “English”, as “French”, or as “Japanese” and “English”, so a search without omissions can be achieved.
[0046]
[Effects of the invention]
The inventions according to claims 1, 6, 9, 10, and 11 normalize one document or one search request in a plurality of languages, and can thereby execute a search with few omissions on documents written in a plurality of languages.
[0047]
The inventions according to claims 2 and 7, in the inventions according to claims 1 and 6, reduce the user's trouble of designating a plurality of languages for one document or one search request, and also prevent search omissions caused by forgetting to designate a language.
[0048]
The inventions according to claims 3 and 8, in the inventions according to claims 2 and 7, make English, which is likely to appear unexpectedly in a document written in some other language, a default language that requires no explicit designation, and can thereby prevent search omissions for documents mixed with English.
[0049]
The invention according to claim 4, in the invention according to any one of claims 1 to 3, creates the index by performing a plurality of normalizations as the document registration processing when a plurality of languages are designated. Although the index becomes larger, a plurality of languages in one document can be handled regardless of the characters the languages use.
[0050]
The invention according to claim 5, in the invention according to any one of claims 1 to 4, applies the normalization of one language and the normalization of another language together to one document, as the document registration processing when a plurality of languages are designated, provided that the normalizations performed for the languages do not affect each other. This reduces the size of the created index while preventing search omissions.
[Brief description of the drawings]
FIG. 1 is a block diagram of electrical connection of a document search system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a document search system.
FIG. 3 is a flowchart of processing when registering document data.
FIG. 4 is a flowchart of processing when retrieving document data.
FIG. 5 is an explanatory diagram of the language designation/character code area correspondence table.
FIG. 6 is an explanatory diagram of an index table.
[Explanation of symbols]
1 Document registration apparatus, document search apparatus
17 Storage medium
20 Program

Claims

1. A document registration apparatus that registers document data so that a document matching an input search expression can be retrieved from a plurality of registered document data, comprising:
document data receiving means for receiving input of the document data to be registered;
language designation receiving means for receiving designation of a language for the one document being received;
index creation means which, when a plurality of languages including the received designated language each have a different character code area, normalizes the document data for one of the plurality of languages to create normalized data, normalizes that normalized data for another one of the languages to create new normalized data, and likewise creates new normalized data for all the remaining languages, thereby creating a single set of normalized data, and creates an index based on that single set of normalized data, and which, when the plurality of languages have the same character code area, normalizes the document data separately for each of the plurality of languages to create as many sets of normalized data as there are languages, and creates an index based on those sets of normalized data; and
registration means for registering the index and the received document data.

2. The document registration apparatus according to claim 1, wherein the index creation means designates a predetermined language in addition to the designated language accepted by the language designation accepting means and performs the normalization in the plurality of languages.

3. The document registration apparatus according to claim 2, wherein the index creation means designates English as the predetermined language.

4. A document search apparatus that searches a plurality of document data registered by the document registration apparatus according to claim 1 for a document matching an input search expression, the apparatus comprising:
search request accepting means for accepting a search request for the search;
language designation accepting means for accepting designation of a language for the one search request being accepted; and
search means for normalizing the accepted search request in a plurality of languages including the accepted designated language, and for searching the plurality of document data using an index created by normalizing the document data in a plurality of languages.

5. The document search apparatus according to claim 4, wherein the search means designates a predetermined language in addition to the designated language accepted by the language designation accepting means and performs the normalization in the plurality of languages.

6. The document search apparatus according to claim 5, wherein the search means designates English as the predetermined language.

7. A computer-readable program that causes a computer to execute processing for registering document data so that a document matching an input search expression can be retrieved from a plurality of registered document data, the program causing the computer to execute:
document data reception processing of receiving input of the document data to be registered;
language designation acceptance processing of accepting designation of a language for the one document being received;
index creation processing of, when a plurality of languages including the accepted designated language occupy mutually different character code areas, normalizing the document data for one of the plurality of languages to create normalized data, normalizing that normalized data for another one of the languages to create new normalized data, likewise creating new normalized data for all the remaining languages so as to create a single piece of normalized data, and creating an index based on that single piece of normalized data, and of, when the plurality of languages occupy the same character code area, normalizing the document data separately for each of the plurality of languages to create as many pieces of normalized data as there are languages, and creating an index based on those pieces of normalized data; and
registration processing of registering the index and the received document data.

8. A computer-readable program that causes a computer to execute processing for searching a plurality of document data registered by the program according to claim 7 for a document matching an input search expression, the program causing the computer to execute:
search request acceptance processing of accepting a search request for the search;
language designation acceptance processing of accepting designation of a language for the one search request being accepted; and
search processing of normalizing the accepted search request in a plurality of languages including the accepted designated language, and of searching the plurality of document data using an index created by normalizing the document data in a plurality of languages.

9. A storage medium storing a computer-readable program, wherein the program is the program according to claim 7 or 8.
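
As an informal illustration of the search-side processing recited in the document search claims, the following sketch normalizes a search term in the designated language and, additionally, in a predetermined language (English), then looks each normalized form up in the index. It is an assumption-laden sketch rather than the claimed implementation: the normalizers, the `search` function, the constant `PREDETERMINED_LANGUAGE`, and the sample index are hypothetical, and looking up each normalized form separately is one possible reading of "normalizing the search request in a plurality of languages".

```python
import re

def normalize_ja(term):
    # Toy rule: drop a term-final katakana long-vowel mark.
    return re.sub(r"(?<=[\u30A1-\u30FA])ー$", "", term)

def normalize_en(term):
    # Toy rule: lowercase and strip a plural "s".
    w = term.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w

NORMALIZERS = {"ja": normalize_ja, "en": normalize_en}

# Language added to the designated one; English per the dependent claims.
PREDETERMINED_LANGUAGE = "en"

def search(term, designated, index):
    # Normalize the term in every applicable language and merge the hits.
    langs = [designated]
    if PREDETERMINED_LANGUAGE not in langs:
        langs.append(PREDETERMINED_LANGUAGE)
    hits = set()
    for lang in langs:
        hits |= index.get(NORMALIZERS[lang](term), set())
    return hits

# Hypothetical index mapping normalized terms to document ids.
sample_index = {"chair": {1}, "コンピュータ": {1, 2}}
english_hits = search("Chairs", "ja", sample_index)
japanese_hits = search("コンピューター", "ja", sample_index)
```

Here a user who designates Japanese still finds documents containing "chair" when searching for "Chairs", because English normalization is applied in addition to the designated language.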