JP4385697B2

JP4385697B2 - Concept search method and system

Info

Publication number: JP4385697B2
Application number: JP2003330940A
Authority: JP
Inventors: 淳坂田; 十悟野田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-09-24
Filing date: 2003-09-24
Publication date: 2009-12-16
Anticipated expiration: 2023-09-24
Also published as: JP2005099972A

Description

The present invention relates to concept search, in which electronic documents are searched using a search condition registered by a user and documents satisfying the condition are returned to the user. In particular, it relates to a document search method and system having an inter-document relevance calculation function that scans the contents of electronic documents and calculates their degree of relevance to a document the user has registered as a search condition.

In recent years, large volumes of electronic documents (hereinafter referred to as texts) have come to be delivered to users continuously by e-mail, electronic news, and the like. In addition, the number of information sources that publish information on the WWW (World Wide Web) is increasing rapidly, and the volume of text collected from these sources by information-gathering robots and similar tools has become enormous. There is therefore a growing need to retrieve, from among these texts, those that contain the information the user actually wants.

In conventional search systems, the user performs a search by assembling the words considered necessary into a search expression according to a certain syntax and entering that expression. However, it is difficult for a user unfamiliar with searching to enter appropriate words for obtaining the desired information, or to assemble a complex search expression that retrieves only the necessary information and filters out the rest. For this reason, Patent Document 1 proposes a technique (hereinafter referred to as concept search) in which, instead of assembling a search expression, the user inputs a document containing the desired information (hereinafter referred to as a seed document). In this technique, the words needed for the search (hereinafter referred to as feature terms) are automatically extracted from the seed document, appropriate weights are assigned to the extracted feature terms, and the degree of relevance of each candidate text is calculated. Texts whose degree of relevance exceeds a fixed value are returned as the search result.

However, when the seed document is long, it may contain unnecessary concepts in addition to the information the user wants. In that case there is a gap between the search condition and the information the user wants, so the user is not satisfied with the search results and tries to search again after making some adjustment. Typically, the user looks through the result documents, finds a document (or passage) that captures the desired concept better than the seed document, and inputs it to search again. Repeating this operation brings the results closer to what the user is looking for. Alternatively, the user searches again after raising the weights of feature terms considered necessary and lowering the weights of feature terms considered unnecessary.

Other related art relevant to the present invention is described below.
Non-Patent Document 1 describes a technique for clustering (automatic classification) in which the number of occurrences of keywords for each classification item is compared with the number of occurrences of keywords in each document, and each document is assigned to the closest classification item.

Non-Patent Document 2 describes a technique for extracting important sentences from a document based on keyword weights.

Patent Document 1: JP-A-11-143902

Non-Patent Document 1: Automatic document clustering system gnmz (http://icrouton.as.wakwak.ne.jp/pub/kks/cnamazu.html)
Non-Patent Document 2: "Technology for automatically summarizing texts, Part 1: Extracting important sentences in texts", computer science magazine bit, February issue, Kyoritsu Shuppan, pp. 37-42, 2000.2

Because concept search runs without judging whether the information contained in a document is necessary or unnecessary to the user, search accuracy may decline when a long document is used as the seed and only the prior-art methods are applied. Preventing this decline requires addressing the following two issues.

1) A means of classifying the information contained in a document
2) A means of letting the user judge accurately whether information is necessary
In the prior art, the means for 2) are selecting feature terms and adjusting their weights, but it can be difficult to judge whether information is what the user wants from feature terms alone, stripped of their surrounding context.

An object of the present invention is to provide a concept search method whose search accuracy is improved over conventional concept search by executing a more precise search based on the information the user needs.

To achieve this object, processing consisting of the following steps automatically presents candidate seed texts (sentences) suited to the search, even if the user does not know the optimal feature terms, and the user selects the sentences considered appropriate. This solves the above problems and improves search accuracy. The processing procedure is as follows.

Step 1: Extract feature terms and their weights from the input seed document.

Step 2: Decompose the input seed document into sentences.

Step 3: Treating each sentence obtained in Step 2 as one document, classify the sentences into equivalence classes using the feature terms and weights extracted in Step 1 (hereinafter referred to as clustering), and determine the feature terms that represent each equivalence class from among the feature terms extracted in Step 1.

Step 4: For each equivalence class obtained in Step 3, select the sentence that best characterizes the class (hereinafter referred to as the important sentence).

Step 5: Present the important sentences selected in Step 4 to the user, and have the user select the sentences that contain the needed information.

Step 6: Execute a concept search using the feature terms, and their weights, that represent the equivalence classes (from Step 3) corresponding to the sentences selected in Step 5.

Because the concept search is executed using only the information the user needs, a more precise search can be performed. As a result, search accuracy is improved over conventional concept search.

FIG. 1 shows the system configuration of the present invention. The concept search device 30000 is the server of a client-server search system and communicates with the client 10000 via the network 20000. To perform a search, the user inputs a seed document 40000 at the client 10000. The client 10000 sends the input seed document 40000 over the network 20000 to the concept search device 30000, which executes the processing of the present invention.

The concept search device 30000 comprises the following units: a feature term extraction unit 31000, which uses the document information DB 38000 to extract feature terms and their weights from the seed document 40000; a similarity calculation unit 32000, which uses the feature terms and their weights to calculate the similarity between the seed document 40000 and the documents in the document DB 37000; a sentence decomposition unit 33000, which decomposes the seed document 40000 into sentences; a clustering unit 34000, which uses the feature terms and their weights to classify the sentences of the seed document 40000 into equivalence classes of similar content; an important sentence extraction unit 35000, which takes the sentences of each equivalence class produced by the clustering unit 34000 as input and outputs the important sentence with respect to the search condition (the feature terms and weights extracted by the feature term extraction unit); and an output screen data generation unit 36000, which receives the results of the similarity calculation unit 32000 and the important sentence extraction unit 35000 and generates the screen data sent to the client 10000. The concept search procedure of the present invention is described below following the processing flow of FIG. 13.

Step 60000: On the client screen shown in FIG. 2, the user enters a document 41000, with content such as that illustrated in FIG. 3, into the seed document input box 11000. When the user then presses the search start button 12000, the client 10000 sends the input data as the seed document 40000 to the concept search device 30000 via the network 20000.

Step 61000: The concept search device 30000 receives the seed document 40000 sent from the client 10000 and passes it to the feature term extraction unit 31000. Using the document information DB 38000, the feature term extraction unit 31000 extracts the feature terms 42000 and their weights 43000 shown in FIG. 4. The method of Patent Document 1 can be used as the feature term extraction algorithm, in which case the document information DB 38000 is assumed to store the data that method requires. Alternatively, morphological analysis may be used to extract the feature terms, and the number of times each feature term occurs in the seed document may be used as its weight.
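The count-based alternative mentioned in step 61000 can be sketched as follows. This is only an illustration under stated assumptions: a regular-expression tokenizer for lower-cased English stands in for morphological analysis, and the function name `extract_feature_terms`, the stop-word list, and the `max_terms` cutoff are hypothetical and do not come from the patent or from Patent Document 1.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}


def extract_feature_terms(seed_document: str, max_terms: int = 10) -> dict[str, int]:
    """Return {feature term: weight}, where the weight is the number of
    occurrences of the term in the seed document."""
    # Assumption: a simple regex tokenizer replaces morphological analysis.
    tokens = re.findall(r"[a-z0-9]+", seed_document.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Keep only the most frequent terms as feature terms.
    return dict(counts.most_common(max_terms))
```

For Japanese text, the tokenizer would have to be replaced by a morphological analyzer; the counting step would stay the same.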

Step 62000: Next, the concept search device 30000 passes the seed document 40000 to the sentence decomposition unit 33000 and obtains the sentence group 44000 of the seed document 40000, shown in FIG. 6. The sentence decomposition unit 33000 operates according to the processing flow shown in FIG. 12. When processing relatively free-form text such as daily business reports, sentence lengths tend to be irregular, so the sentence decomposition unit 33000 may be adjusted to merge several short sentences into one, or to cut long text at word boundaries so that no piece exceeds a fixed length.
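The length adjustment described in step 62000 might look like the sketch below. It assumes space-separated words (Japanese text would need a word segmenter), and the function name `adjust_sentence_lengths` and the thresholds `min_len` and `max_len` are hypothetical; the patent gives no concrete values.

```python
def adjust_sentence_lengths(sentences, min_len=20, max_len=80):
    """Merge short neighbouring sentences, then split over-long units
    at word boundaries so that pieces stay within max_len characters."""
    merged = []
    for s in sentences:
        # Keep appending to the previous unit while it is still too short.
        if merged and len(merged[-1]) < min_len:
            merged[-1] = merged[-1] + " " + s
        else:
            merged.append(s)

    result = []
    for s in merged:
        chunk = ""
        for word in s.split():
            candidate = word if not chunk else chunk + " " + word
            if chunk and len(candidate) > max_len:
                # Adding this word would exceed max_len: close the chunk.
                result.append(chunk)
                chunk = word
            else:
                chunk = candidate
        if chunk:
            result.append(chunk)
    return result
```

A single word longer than `max_len` is kept whole in this sketch rather than being cut in the middle.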

Step 63000: The feature terms 42000 and weights 43000 extracted in the preceding steps, together with the sentence group 44000, are passed to the clustering unit 34000, which classifies the sentences into equivalence classes and produces the feature term groups 45000 representing each class and the classified sentence groups 46000, shown in FIG. 7. Various document clustering methods exist; the method of Non-Patent Document 1 can be used here. The similarity of each sentence may be calculated from the term occurrence frequency, from the product of the term occurrence frequency and the term uniqueness, or from the product of the term occurrence frequency and the number of distinct terms. FIG. 7 shows the clustering result when the number of classes is 3. Any required information is obtained from the document information DB 38000. The feature term group 45000 representing a classified sentence group 46000 consists of all of the feature terms 42000 extracted in the preceding step that occur in that sentence group. Only sentences containing a feature term shown in FIG. 4 are subject to clustering.
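As a rough illustration of step 63000, the sketch below clusters sentences with a simple agglomerative procedure over weighted term-frequency vectors and cosine similarity. This is a swapped-in technique chosen for brevity, not the method of Non-Patent Document 1. The function names, the English regex tokenizer, and the assumption that feature terms are lower-case single tokens with positive weights are all hypothetical. Two points do follow the text: sentences without any feature term are excluded, and each class is represented by every feature term that occurs in its sentences.

```python
import math
import re


def sentence_vector(sentence, weights):
    """Weighted term-frequency vector restricted to the feature terms."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {t: tokens.count(t) * w for t, w in weights.items() if t in tokens}


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, weights, num_clusters=3):
    # Only sentences containing at least one feature term are clustered.
    vectors = [(s, sentence_vector(s, weights)) for s in sentences]
    clusters = [{"sentences": [s], "vector": v} for s, v in vectors if v]

    # Repeatedly merge the most similar pair until num_clusters remain.
    while len(clusters) > max(num_clusters, 1):
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]]["vector"], clusters[p[1]]["vector"]))
        merged_vector = dict(clusters[i]["vector"])
        for term, value in clusters[j]["vector"].items():
            merged_vector[term] = merged_vector.get(term, 0) + value
        clusters[i] = {
            "sentences": clusters[i]["sentences"] + clusters[j]["sentences"],
            "vector": merged_vector,
        }
        del clusters[j]

    # Representative terms: every feature term occurring in the class.
    return [{"sentences": c["sentences"], "terms": sorted(c["vector"])} for c in clusters]
```

The number of sentences in each returned class corresponds to the class score that the selection screen displays.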

Step 64000: Next, the important sentence of each class is determined. The important sentence extraction unit 35000 does this using the similarity calculation unit 32000. With the feature term group 45000 and the corresponding weights obtained from the table of FIG. 4 as the search condition, each sentence in the class is passed to the similarity calculation unit 32000. Data is obtained from the document information DB 38000 as needed. The similarity calculation unit 32000 may compute the similarity of each sentence according to Patent Document 1, or may use the technique of Non-Patent Document 2; in the latter case, the document information DB 38000 is assumed to store the data that technique requires. FIG. 9 shows the result of taking, for each class, the sentence with the highest similarity 47000 as the important sentence 48000 of the class.
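The selection in step 64000 can be illustrated with a minimal scoring sketch. The score used here, the sum of term frequency times weight over the class's feature terms, is an assumed stand-in for the similarity of Patent Document 1 or Non-Patent Document 2, whose details are not given in this text. The function names and the English tokenizer are hypothetical.

```python
import re


def sentence_score(sentence, terms, weights):
    """Assumed similarity: sum of (term frequency x term weight)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return sum(tokens.count(t) * weights[t] for t in terms)


def pick_important_sentence(sentences, terms, weights):
    """Return (sentence, score) for the highest-scoring sentence of one class."""
    scored = [(sentence_score(s, terms, weights), s) for s in sentences]
    # max() keeps the first sentence when several share the top score.
    best_score, best_sentence = max(scored, key=lambda pair: pair[0])
    return best_sentence, best_score
```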

Step 65000: When the important sentences 48000 are passed to the output screen data generation unit 36000, the concept search device 30000 sends the necessary output data to the client 10000 via the network 20000, and the client 10000 displays it on the search result screen 13000 (FIG. 10).

Step 66000: The search result screen 13000 displays the important sentence 48000 of each class together with the number of sentences assigned to that class, shown as the class score 15000. A higher score indicates that the seed document 40000 contains more of that class's concept. The detail view button 18000 lets the user browse all of the documents (sentences) in a class, as shown in FIG. 8. Following the on-screen instructions, the user marks the needed concepts with the check boxes 17000 and presses the search execution button 14000, whereupon the selection number 49000 is sent. In FIG. 10, "2" is selected, so the client 10000 sends "2" as the selection number 49000 to the concept search device 30000.

Step 67000: After receiving the selection number 49000, the concept search device 30000 obtains the feature term group 45000 corresponding to the selection number 49000 from the table shown in FIG. 9, and the weights 43000 of those terms from the table shown in FIG. 4.

These feature terms 45000 and their weights 43000 are passed to the similarity calculation unit 32000, which computes the similarity to each document stored in the document DB 37000. As one example, the similarity calculation method of Patent Document 1 is used here.

Step 68000: After the similarities are calculated, the document titles are passed to the output screen data generation unit 36000 in descending order of similarity. The output screen data generation unit 36000 sends the output data to the client 10000.
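Steps 67000 and 68000 amount to scoring every stored document against the selected class's feature terms and returning titles in descending order of similarity. The sketch below assumes an in-memory dictionary as the document DB and a weighted term-frequency sum as the similarity. Both are illustrative substitutes, since the similarity calculation of Patent Document 1 is not reproduced in this text. The name `search_documents` and the `threshold` parameter are hypothetical.

```python
import re


def search_documents(document_db, terms, weights, threshold=0.0):
    """Rank documents by an assumed similarity: sum of (term frequency x weight).

    document_db maps a document title to its text.
    """
    results = []
    for title, text in document_db.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        score = sum(tokens.count(t) * weights[t] for t in terms)
        # Keep only documents whose similarity exceeds the threshold.
        if score > threshold:
            results.append((title, score))
    # Descending similarity; ties are broken by title for a stable order.
    return sorted(results, key=lambda r: (-r[1], r[0]))
```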

Step 69000: The client 10000 displays the search result screen shown in FIG. 11. When the search condition save button 19000 is pressed, the search condition is saved on the client with the selected sentence as its key. This spares the user the effort of choosing a suitable name for the saved condition, and when the condition is reused later, the user can decide whether it is applicable from the text of the sentence.

The sentence decomposition process shown in FIG. 12 is described next.

Step 50000: The text operation unit 33100 reads the document to be processed into the working storage area 33200. It sets the operation start character position 33300 and the operation character position 33400 to the initial position (the first character of the document), sets the sentence count 33500 to 0, and proceeds to step 50010.

Step 50010: The text operation unit 33100 determines whether the character of the processed document in the working storage area 33200 at the character position stored in the operation position storage area 33400 is the delimiter (。). If it is the delimiter, step 50020 is performed. If not, processing proceeds to step 50060.

Step 50020: The characters from the operation start character position 33300 through the operation character position 33400 are copied from the working storage area 33200 to the sentence group storage area 33600, and processing proceeds to step 50030.

Step 50030: The sentence count 33500 is incremented by 1, and processing proceeds to step 50040.

Step 50040: The operation character position 33400 is incremented by 1, and processing proceeds to step 50050.

Step 50050: The operation start character position 33300 is set to the value of the operation character position 33400, and processing proceeds to step 50060.

Step 50060: The value of the operation character position 33400 is incremented by 1, and processing proceeds to step 50070.

Step 50070: If the value of the operation character position 33400 is less than the number of characters in the processed document, processing returns to step 50010. If it is equal to the number of characters in the processed document, processing proceeds to step 50080.

Step 50080: The characters from the operation start character position 33300 through the operation character position 33400 are copied to the sentence group storage area 33600. The sentence count 33500 is incremented by 1, each sentence in the sentence group storage area is output, and processing ends.
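The flow of steps 50000 to 50080 can be condensed into a short sketch. It is a simplification, not a literal transcription: every character is tested against the delimiter, and trailing text is emitted only when it is non-empty, so a document that ends with the delimiter does not produce an empty final sentence. The function name `split_sentences` is hypothetical; `start` and `pos` play the roles of the operation start character position 33300 and the operation character position 33400.

```python
def split_sentences(text: str, delimiter: str = "。") -> list[str]:
    """Split text into sentences, keeping the delimiter at the end of each."""
    sentences = []  # plays the role of the sentence group storage area
    start = 0       # operation start character position
    for pos, ch in enumerate(text):  # pos: operation character position
        if ch == delimiter:
            # Copy the characters from start through pos (inclusive).
            sentences.append(text[start:pos + 1])
            start = pos + 1
    if start < len(text):
        # Remaining text that has no closing delimiter.
        sentences.append(text[start:])
    return sentences
```

The sentence count 33500 of the flow corresponds to `len(sentences)` on return.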

FIG. 1 shows the system configuration of the present invention.
FIG. 2 shows the client screen in an embodiment of the present invention.
FIG. 3 shows the seed document input in an embodiment of the present invention.
FIG. 4 shows the feature terms and their weights.
FIG. 5 shows the sentence decomposition unit in an embodiment of the present invention.
FIG. 6 shows the sentence group output by the sentence decomposition unit.
FIG. 7 shows the classified sentence groups in an embodiment of the present invention.
FIG. 8 shows the classified sentence groups and the importance of each sentence in an embodiment of the present invention.
FIG. 9 shows each class and its important sentence in an embodiment of the present invention.
FIG. 10 shows the concept selection screen in an embodiment of the present invention.
FIG. 11 shows the search result screen in an embodiment of the present invention.
FIG. 12 shows the processing flow of the sentence decomposition unit in an embodiment of the present invention.
FIG. 13 shows the overall processing flow in an embodiment of the present invention.

Explanation of symbols

10000: client, 20000: network, 30000: concept search device,
31000: feature term extraction unit, 32000: similarity calculation unit,
33000: sentence decomposition unit, 34000: clustering unit, 35000: important sentence extraction unit,
36000: output screen data generation unit, 37000: document DB, 38000: document information DB, 40000: seed document

Claims

A concept search method for a server-type concept search device that is connected to a client device via a network and that searches for similar documents using a document input as a search condition,
wherein the concept search device:
reads a document, input at the client device and transmitted to the concept search device as the search condition, into a working storage area of the concept search device;
extracts keywords that characterize the document in the working storage area and the number of occurrences of each keyword;
decomposes the document in the working storage area into individual sentences;
copies the decomposed individual sentences to a sentence group storage area of the concept search device;
from the keywords, the numbers of occurrences, and the decomposed individual sentences in the sentence group storage area, associates keyword groups, each consisting of a plurality of the keywords, with a plurality of the sentences corresponding to each keyword group;
calculates the importance of each sentence from the keyword group, the plurality of sentences associated with the keyword group, and the numbers of occurrences;
extracts, for each keyword group, the sentence with the highest calculated importance from the plurality of sentences corresponding to that keyword group as the important sentence;
transmits each keyword group and the important sentence extracted for that keyword group to the client device;
after the plurality of important sentences received from the concept search device have been displayed on the screen of the client device, is notified by the client device, for search processing, of one important sentence selected at the client device;
performs concept search processing on documents stored as electronic documents in a document DB connected to the concept search device, using the keyword group corresponding to the selected important sentence and the weights of that keyword group; and
transmits the result of the concept search processing to the client device.

The concept search method for a server-type concept search device according to claim 1, wherein, in decomposing the document into sentences, either a decomposed sentence whose length exceeds a predetermined length is further decomposed into a plurality of sentences so that none exceeds the predetermined length, or decomposed sentences whose length does not exceed the predetermined length are combined so that the combined sentence exceeds the predetermined length, and the document is decomposed by selecting one of these.

The concept search method for a server-type concept search device according to claim 1, wherein the calculation of the importance of each sentence further includes:
calculating, from one keyword group and the plurality of sentences associated with that keyword group, the uniqueness of each keyword belonging to the keyword group; and
calculating the product of the keyword's frequency of occurrence in each sentence and its uniqueness,
and wherein the importance of each sentence is calculated from the product of the keyword's frequency of occurrence and its uniqueness.

A server-type concept search device that is connected to a client device via a network and that searches for similar documents using a document input as a search condition, comprising:
means for reading a document, input at the client device and transmitted to the concept search device as the search condition, into a working storage area of the concept search device;
means for extracting keywords that characterize the document in the working storage area and the number of occurrences of each keyword;
means for decomposing the document in the working storage area into individual sentences;
means for copying the decomposed individual sentences to a sentence group storage area of the concept search device;
means for associating, from the keywords, the numbers of occurrences, and the decomposed individual sentences in the sentence group storage area, keyword groups, each consisting of a plurality of the keywords, with a plurality of the sentences corresponding to each keyword group;
means for calculating the importance of each sentence from the keyword group, the plurality of sentences associated with the keyword group, and the numbers of occurrences;
means for extracting, for each keyword group, the sentence with the highest calculated importance from the plurality of sentences corresponding to that keyword group as the important sentence;
means for transmitting each keyword group and the important sentence extracted for that keyword group to the client device;
means for being notified by the client device, for search processing, of one important sentence selected at the client device after the plurality of important sentences received from the concept search device have been displayed on the screen of the client device;
means for performing concept search processing on documents stored as electronic documents in a document DB connected to the concept search device, using the keyword group corresponding to the selected important sentence and the weights of that keyword group; and
means for transmitting the result of the concept search processing to the client device.