JP2885487B2

JP2885487B2 - Document information retrieval device

Info

Publication number: JP2885487B2
Application number: JP2198737A
Authority: JP
Inventors: 比呂志松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1990-07-26
Filing date: 1990-07-26
Publication date: 1999-04-26
Anticipated expiration: 2014-04-26
Also published as: JPH0484271A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to an in-document information retrieval device that searches the contents of a document file when a sentence, a word string, or a word is entered.

[Conventional technology]

A known conventional device for searching the contents of a document file takes a character string as input and retrieves the portions that contain that character string.

[Problems to be solved by the invention]

However, a conventional device cannot retrieve anything unless the character strings match exactly. For example, entering the character string "書式" (shoshiki, "format") does not retrieve portions containing "フォーマット" (fōmatto, a loanword with the same meaning), and entering "ファイルのオープン" ("opening a file") does not retrieve portions containing "ファイルをオープンする方法" ("how to open a file").

An object of the present invention is to provide an in-document information retrieval device that can easily and quickly retrieve portions containing words with the same meaning as the input, and portions containing sentences with high semantic similarity to the input sentence, even when no character string matches the input exactly.

[Means for solving the problem]

The device comprises at least: a document structure extraction unit that extracts first information on the chapters, headings, and paragraphs in a document file; a registered sentence analysis unit that extracts notations and semantic categories; an index table generation unit that generates an index table based on the notations and semantic categories; a search sentence analysis unit that extracts the notations and semantic categories of the words contained in a search sentence; a similarity calculation unit that calculates similarity; and a candidate display unit that displays the items with high similarity.

[Action]

Referring to the chapter, heading, and paragraph information extracted from the document file by the document structure extraction unit, and to the index table generated by the index table generation unit, the similarity calculation unit calculates the similarity to the search sentence based on the word notations and semantic categories extracted by the search sentence analysis unit. Based on this similarity, the candidate display means displays headline sentences and paragraphs as candidates whose contents should be shown. Contents containing semantically similar portions are thus retrieved even when no character string matches the input.

[Example]

FIG. 1 is a block diagram showing an embodiment of the present invention. The operation of the embodiment is described below with reference to FIG. 1.

FIG. 2 shows an example of a document file. The document file storage unit 1 stores document files such as the one in FIG. 2, which consist of a plurality of chapters. Each chapter consists of a headline sentence and body text, each body text consists of a plurality of paragraphs, and each paragraph consists of a plurality of sentences.

The document structure extraction unit 2 analyzes the document file 21 stored in the document file storage unit 1, extracts the chapters, headline sentences, and paragraphs, and creates a document structure table representing their positions and hierarchical relationships. Chapters, headline sentences, and paragraphs can be extracted in various ways; one example is described here.

First, a character string indicating a [chapter number], such as "1." or "１章" ("Chapter 1"), is located. A [chapter number] can be extracted by searching for a character string that satisfies the following pattern.

[chapter number] = [digit string] + "." OR [digit string] + "章" ("chapter") (where the [digit string] starts at the beginning of a line)

Next, the character string that follows the [chapter number], with the intervening blank characters removed and running up to the line feed code, is extracted as the [headline sentence].
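The pattern above can be sketched as a regular expression. This is only an illustrative sketch, not the patent's implementation: the function name `parse_chapter_line` is hypothetical, and accepting full-width digits, the full-width period, and the ideographic space is an assumption that goes beyond what the text states.

```python
import re

# [chapter number] = [digit string] + "." OR [digit string] + "章",
# with the digit string anchored at the beginning of the line.
# Assumption: full-width digits/period and the ideographic space (U+3000)
# are accepted in addition to their ASCII counterparts.
CHAPTER_RE = re.compile(r"^([0-9０-９]+(?:[.．]|章))[ \t\u3000]*(.*)$")


def parse_chapter_line(line):
    """Return (chapter_number, headline) or None if the line starts no chapter."""
    match = CHAPTER_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    # Group 1 is the chapter number; group 2 is the rest of the line with the
    # blank characters after the chapter number already skipped.
    return match.group(1), match.group(2)
```

For instance, `parse_chapter_line("1. 標準入出力")` yields the chapter number `"1."` and the headline `"標準入出力"`, while a line whose digits do not start at column 0 is rejected.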

A [paragraph] is extracted according to the following rule.

[paragraph] = the character string that follows a [line feed code] and runs up to the next [line feed code] (where this [line feed code] excludes a line feed code located at the end of a line)

Each [chapter] runs from the line containing a [chapter number] to the line immediately before the line containing the next [chapter number].

In this way, the chapters, headline sentences, and paragraphs are extracted from the document file, the position of each in the document file is determined, and the document structure table is created.

FIG. 3 shows an example of the document structure table. The document structure table 22 created by the document structure extraction unit 2 is stored in the document structure table unit 3. In FIG. 3, ID is an identifier for managing chapters, headline sentences, and paragraphs. For example, A1 is the first chapter, A2 is the second chapter, B1 is the headline sentence "標準入出力" ("Standard input/output") shown in FIG. 2, and C1 is the first paragraph of the first chapter, corresponding to "一番簡単な…方法である。" ("The simplest ... method.") in FIG. 2. From table 22 it is also easy to obtain the information that the headline sentence of the first chapter is B1 and that the paragraphs contained in that chapter are C1, C2, ..., C6. S and E denote the start and end positions of a chapter, headline sentence, or paragraph in the document file. By referring to them, the character string corresponding to each chapter, headline sentence, and paragraph can be read from the document file storage unit 1.
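A table of this shape can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the names `Entry` and `build_structure_table` are hypothetical, paragraphs are assumed to be separated by blank lines (one possible reading of the paragraph rule), S and E are taken to be character offsets with E exclusive, and only ASCII chapter numbers are recognized.

```python
import re
from dataclasses import dataclass, field

CHAPTER_RE = re.compile(r"^[0-9]+(?:\.|章)[ \t]*")


@dataclass
class Entry:
    id: str      # "A1" (chapter), "B1" (headline sentence), "C1" (paragraph)
    start: int   # S: offset of the first character
    end: int     # E: offset one past the last character (assumption)
    children: list = field(default_factory=list)  # IDs contained in a chapter


def build_structure_table(text):
    table = {}        # ID -> Entry
    chapter = None    # chapter currently being read
    para = None       # paragraph currently being read
    n_chapters = n_paragraphs = 0
    pos = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        match = CHAPTER_RE.match(body)
        if match:
            para = None
            if chapter is not None:
                chapter.end = pos  # previous chapter ends where this one starts
            n_chapters += 1
            chapter = Entry(f"A{n_chapters}", pos, len(text))
            headline = Entry(f"B{n_chapters}", pos + match.end(), pos + len(body))
            chapter.children.append(headline.id)
            table[chapter.id] = chapter
            table[headline.id] = headline
        elif not body.strip():
            para = None  # a blank line closes the open paragraph (assumption)
        elif chapter is not None:
            if para is None:
                n_paragraphs += 1  # paragraph IDs are numbered across chapters
                para = Entry(f"C{n_paragraphs}", pos, pos)
                chapter.children.append(para.id)
                table[para.id] = para
            para.end = pos + len(body)
        pos += len(line)
    return table
```

Slicing the original text with an entry's `start` and `end` recovers the corresponding chapter, headline sentence, or paragraph, which mirrors how S and E are used to read from the document file storage unit.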

The registered sentence analysis unit 6 sends each headline sentence and each paragraph to the morphological analysis unit 5. The morphological analysis unit 5 refers to the word dictionary unit 4 and extracts the notation and semantic category of each word contained in the sentence being analyzed.

FIG. 4 shows an example of a registered sentence analysis result, namely the result of analyzing the headline sentence "書式付き出力" ("formatted output"). In this example, the word notations "書式" ("format"), "付き" ("with"), and "出力" ("output") are extracted, together with the semantic category [FMT] for "書式" and the semantic category [OUT] for "出力". The semantic category 24 is information used to link synonyms and related words, and it is defined in the word dictionary unit 4 for independent (content) words. However, no category is assigned to words such as "ある" ("to be") and "する" ("to do"), which do not express a concrete meaning on their own.
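The dictionary lookup can be illustrated with a toy analyzer. This is a sketch only: a real morphological analyzer is far more involved, and the greedy longest-match segmentation, the name `analyze`, and the dictionary contents are assumptions made for illustration. The categories for "書式", "出力", "フォーマット", and "指定" follow the examples in the text; mapping "書き込み" to [OUT] is inferred from the FIG. 6 example rather than stated.

```python
# Hypothetical word dictionary: notation -> semantic category.
# None marks words that carry no semantic category.
WORD_DICTIONARY = {
    "書式": "FMT",
    "フォーマット": "FMT",
    "出力": "OUT",
    "書き込み": "OUT",   # inferred from the FIG. 6 example (assumption)
    "指定": "SITEI",
    "付き": None,
    "に": None,
    "よる": None,
}


def analyze(sentence):
    """Greedy longest-match segmentation; returns (notations, categories)."""
    notations, categories = [], []
    max_len = max(map(len, WORD_DICTIONARY))
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in WORD_DICTIONARY or length == 1:
                break
        notations.append(candidate)
        category = WORD_DICTIONARY.get(candidate)
        if category is not None:
            categories.append(category)
        i += length
    return notations, categories
```

Applied to "書式付き出力", this sketch returns the notations "書式", "付き", "出力" and the categories FMT and OUT, matching the analysis result described for FIG. 4.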

The registered sentence analysis unit 6 sends the word notations 23 and semantic categories 24 obtained in this way to the index table creation unit 7, together with the ID of the headline sentence or paragraph to which they belong.

Based on the notations 23 and semantic categories 24, the index table creation unit 7 generates an index table 25 that uses the notations 23 and semantic categories 24 as indexes, so that the headline sentences and paragraphs containing a word with a given notation or semantic category can be extracted.

FIG. 5 shows an example of an index table indexed by semantic category. With this index table 25, the headline sentences B2 and B4 and the paragraphs C8 and C18, which contain words with the semantic category [FMT], can be retrieved easily.
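An index of this kind is essentially an inverted index from semantic category (or notation) to unit IDs. The following is a minimal sketch; the function name `build_index` and the input layout (a mapping from unit ID to the categories found in that unit) are assumptions, and the sample data reproduces only the IDs mentioned in the text.

```python
from collections import defaultdict


def build_index(analyzed_units):
    """Map each key (semantic category or notation) to the IDs containing it."""
    index = defaultdict(list)
    for unit_id, keys in analyzed_units.items():
        # dict.fromkeys removes duplicate keys while keeping their order,
        # so a unit is listed once per key even if the key occurs twice.
        for key in dict.fromkeys(keys):
            index[key].append(unit_id)
    return dict(index)


# Categories per headline sentence / paragraph, as implied by the example.
analyzed_units = {
    "B2": ["FMT", "OUT"],
    "B4": ["FMT"],
    "C8": ["FMT", "OUT"],
    "C18": ["FMT"],
}
index = build_index(analyzed_units)
```

Looking up `index["FMT"]` then returns B2, B4, C8, and C18 in one step, which is what lets the later similarity calculation avoid scanning the document text.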

The index table 25 created in this way is stored in the index table storage unit 8.

The document file 21, the document structure table 22, and the index table 25 stored as described above are used in the search processing described below.

The search sentence input unit 9 has the user enter a sentence, a word string, or a word, and sends the entered character string to the search sentence analysis unit 10 as the search sentence. The search sentence input unit 9 may also be configured so that the search sentence is entered by selecting, with a mouse or similar device, from text that is already displayed.

The search sentence analysis unit 10 sends the search sentence 26 obtained by the search sentence input unit 9 to the morphological analysis unit 5 and extracts the notations 23 and semantic categories 24 of the words contained in the search sentence 26.

FIG. 6 shows an example of the search sentence analysis result when "フォーマット指定による書き込み" ("writing by format specification") is entered as the search sentence. In this example, "フォーマット" ("format"), "指定" ("specification"), "に", "よる" (together, "by"), and "書き込み" ("writing") are extracted as the word notations 23, and [FMT], [SITEI], and [OUT] are extracted as the semantic categories 24.

The similarity calculation unit 11 first calculates the similarity between the search sentence and each headline sentence and each paragraph. Various methods can be used to calculate the similarity; for example, the method described in Japanese Patent Application No. Hei 1-111626 can be used. Here, a method that calculates the similarity according to how many semantic categories 24 match is described. For example, 10n points are given when n semantic categories 24 match. Using the semantic categories extracted from the search sentence as indexes, the index table storage unit 8 is consulted to retrieve the IDs of headline sentences and paragraphs, and the similarity for each ID is calculated. When the index table 25 shown in FIG. 5 is stored, the semantic category [FMT] extracted from the search sentence 26, "フォーマット指定による書き込み" ("writing by format specification"), retrieves headline sentences B2 and B4 and paragraphs C8 and C18, and [OUT] retrieves headline sentence B2 and paragraph C8. The similarities are then calculated as shown in FIG. 7. For example, paragraph C8 is given 20 points because it is retrieved by both semantic categories [FMT] and [OUT].
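The 10n-point scoring can be sketched with a counter over index lookups. This is an illustration under assumptions: the function name `score_units` and the `points_per_match` parameter are hypothetical, and the index literal contains only the entries mentioned in the example.

```python
from collections import Counter

# Index entries taken from the example: semantic category -> unit IDs.
index = {
    "FMT": ["B2", "B4", "C8", "C18"],
    "OUT": ["B2", "C8"],
}


def score_units(index, query_categories, points_per_match=10):
    """Give each unit 10 points per query category it was retrieved by."""
    scores = Counter()
    # A set is used so that a category repeated in the query counts once.
    for category in set(query_categories):
        for unit_id in index.get(category, []):
            scores[unit_id] += points_per_match
    return dict(scores)


# Categories extracted from the search sentence in the FIG. 6 example.
unit_scores = score_units(index, ["FMT", "SITEI", "OUT"])
```

Here [SITEI] has no entry in the sample index and therefore contributes nothing, while B2 and C8 are each reached through two categories and receive 20 points.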

Next, the similarity of each chapter is calculated from these similarities. Various methods can also be applied to calculating chapter similarity; here, a method based on the following calculation criterion is described.

Calculation criterion: similarity of a chapter = (similarity of its headline sentence) + (maximum similarity among the paragraphs contained in that chapter)

By referring to the information stored in the document structure table unit 3, it is found that headline sentence B2 and paragraph C8 belong to the second chapter A2, and that headline sentence B4 and paragraph C18 belong to the fourth chapter A4. The similarities of A2 and A4 are therefore calculated as follows.

Similarity of A2 = (similarity of B2) + (similarity of C8) = 20 + 20 = 40
Similarity of A4 = (similarity of B4) + (similarity of C18) = 10 + 10 = 20

Because the chapter similarity takes both the headline sentence similarity and the paragraph similarity into account, a chapter in which both the headline sentence and a paragraph contain words related to the search sentence is ranked higher as a candidate than a chapter in which only one of them does.
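The calculation criterion can be written out directly. This is a sketch with hypothetical names (`chapter_similarity`, `structure`); the structure mapping lists only the headline sentences and paragraphs mentioned in the text, not the full contents of each chapter.

```python
# Chapter ID -> (headline sentence ID, paragraph IDs), from the structure table.
# Only the IDs named in the example are listed here.
structure = {
    "A1": ("B1", ["C1", "C2", "C3", "C4", "C5", "C6"]),
    "A2": ("B2", ["C8"]),
    "A4": ("B4", ["C18"]),
}

# Unit similarities from the FIG. 7 example.
unit_scores = {"B2": 20, "B4": 10, "C8": 20, "C18": 10}


def chapter_similarity(structure, unit_scores):
    """Chapter score = headline score + best paragraph score in the chapter."""
    result = {}
    for chapter_id, (headline_id, paragraph_ids) in structure.items():
        headline = unit_scores.get(headline_id, 0)
        # default=0 covers a chapter that has no paragraphs at all.
        best_paragraph = max(
            (unit_scores.get(p, 0) for p in paragraph_ids), default=0
        )
        result[chapter_id] = headline + best_paragraph
    return result


chapter_scores = chapter_similarity(structure, unit_scores)
```

Taking the maximum rather than the sum of paragraph scores keeps a long chapter with many weakly related paragraphs from outranking a chapter with one strongly related paragraph.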

The candidate display unit 12 refers to the document structure table unit 3, reads from the document file storage unit 1 the headline sentence belonging to each chapter and the sentences belonging to each paragraph, and displays them in descending order of chapter similarity and, within the same chapter, in descending order of paragraph similarity.

FIG. 8 shows an example of the candidate display. In this example, chapter A2 is ranked first and chapter A4 second. Within chapter A2, paragraph C8 is ranked first. Other paragraphs are not shown here, but if other paragraphs have been retrieved, they are displayed in descending order of similarity.
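The ordering of candidates can be sketched as two nested sorts. The function name `rank_candidates` and the output layout are assumptions for illustration, and omitting chapters and paragraphs with a score of zero is a choice made here rather than something the text specifies.

```python
structure = {
    "A1": ("B1", ["C1", "C2", "C3", "C4", "C5", "C6"]),
    "A2": ("B2", ["C8"]),
    "A4": ("B4", ["C18"]),
}
unit_scores = {"B2": 20, "B4": 10, "C8": 20, "C18": 10}
chapter_scores = {"A1": 0, "A2": 40, "A4": 20}


def rank_candidates(structure, unit_scores, chapter_scores):
    """Return [(chapter, headline, [paragraphs])], best candidates first."""
    ranked = []
    for chapter_id in sorted(chapter_scores, key=chapter_scores.get, reverse=True):
        if chapter_scores[chapter_id] == 0:
            continue  # assumption: unrelated chapters are not displayed
        headline_id, paragraph_ids = structure[chapter_id]
        hits = [p for p in paragraph_ids if unit_scores.get(p, 0) > 0]
        # Within a chapter, paragraphs are ordered by descending similarity.
        hits.sort(key=lambda p: unit_scores[p], reverse=True)
        ranked.append((chapter_id, headline_id, hits))
    return ranked


candidates = rank_candidates(structure, unit_scores, chapter_scores)
```

With the example scores, chapter A2 (with headline sentence B2 and paragraph C8) comes first and chapter A4 second, in line with the ranking described for FIG. 8.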

The document content display unit 13 has the user select a headline sentence or paragraph displayed by the candidate display unit 12. It then refers to the document structure table unit 3, reads from the document file storage unit 1 either the contents of the chapter containing the selected headline sentence or the contents surrounding the selected paragraph, and displays them.

In response to user instructions, the processing control unit 14 starts the search sentence input unit 9, the candidate display unit 12, and the document content display unit 13, and scrolls the displayed contents. The user can thus give instructions such as displaying the candidates again or, after viewing the displayed contents, searching for yet another portion.

[Effects of the invention]

As described above, according to the present invention, headline sentences and paragraphs are extracted from a document file and an index table indexed by semantic attributes and notations is created in advance. Using this index table, the similarity between the search sentence and each headline sentence and paragraph is calculated at high speed, the similarity of each chapter is calculated from these values, and the search results are displayed in order of similarity. As a result, portions containing words related to the entered search sentence can be retrieved at high speed even when the search sentence does not match any character string in the document file. The user can therefore find the desired information easily and quickly using whatever words come to mind.

Furthermore, because the present invention is configured so that the body text can be searched paragraph by paragraph, a passage such as "配列をこのように宣言する。そして，以下のように初期化を行う。" ("Declare the array like this. Then initialize it as follows.") can be retrieved with the search sentence "配列の初期化" ("array initialization"). A search is thus possible even when the search sentence relates to content spread across two or more sentences.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 shows an example of a document file, FIG. 3 shows an example of a document structure table, FIG. 4 shows an example of a registered sentence analysis result, FIG. 5 shows an example of an index table, FIG. 6 shows an example of a search sentence analysis result, FIG. 7 shows an example of similarity calculation, and FIG. 8 shows an example of candidate display.

1: document file storage unit, 2: document structure extraction unit, 3: document structure table unit, 4: word dictionary unit, 5: morphological analysis unit, 6: registered sentence analysis unit, 7: index table creation unit, 8: index table storage unit, 9: search sentence input unit, 10: search sentence analysis unit, 11: similarity calculation unit, 12: candidate display unit, 13: document content display unit, 14: processing control unit.

Continuation of the front page

(56) References: JP-A-2-27478 (JP, A); Matsuo, Oyama, Nakagawa, "User Input Support for Japanese Dialogue Processing," Proceedings of the 38th National Convention of the Information Processing Society of Japan (first half of Showa 64), (I), pp. 400-401 (Hei 1-3-15); Shibano, "SGML and Full-Text Databases," IPSJ Research Report (89-FI, 14-2), Vol. 89, No. 66, 1989 (Hei 1-7-27)

(58) Fields investigated (Int. Cl.6, DB name): G06F 17/27 - 17/30; JICST Science and Technology Literature File

Claims

(57) [Claims]

An in-document information retrieval device characterized by comprising:
a document file storage unit that stores a document file;
a document structure extraction unit that extracts chapters, headline sentences, and paragraphs from the document file and creates a document structure table indicating their positions and hierarchical relationships;
a document structure table unit that stores the document structure table;
a word dictionary unit that defines the notation and semantic category of each word;
a morphological analysis unit that refers to the word dictionary unit and extracts the words constituting an input sentence together with the notation and semantic category of each word;
a registered sentence analysis unit that sends each headline sentence and each sentence in the document file to the morphological analysis unit and extracts the notations and semantic categories of the words contained in each headline sentence and each sentence;
an index table generation unit that, based on the notations and semantic categories extracted by the registered sentence analysis unit, generates an index table that uses the notations and semantic categories as indexes and from which a headline sentence containing a word having a given notation or semantic category, and a paragraph containing that word, can be extracted;
an index table storage unit that stores the index table generated by the index table generation unit;
a search sentence input unit for entering a search sentence;
a search sentence analysis unit that analyzes the search sentence and extracts the notations and semantic categories of the words contained in it;
a similarity calculation unit that, based on the information obtained by the search sentence analysis unit, refers to the index table storage unit, calculates the similarity between the search sentence and each headline sentence and each paragraph, and calculates a similarity based on those similarities; and
a candidate display unit that reads the items of high similarity obtained by the similarity calculation unit from the document file storage unit and displays them.