JP4895645B2 - Information search apparatus and information search program

Info

Publication number: JP4895645B2
Application number: JP2006070256A
Authority: JP
Inventors: 真樹村田; 晃一土井; 智裕三森; 安志福田
Original assignee: National Institute of Information and Communications Technology; Sony Corp
Current assignee: National Institute of Information and Communications Technology; Sony Corp
Priority date: 2006-03-15
Filing date: 2006-03-15
Publication date: 2012-03-14
Anticipated expiration: 2026-03-15
Also published as: JP2007249458A

Description

The present invention relates to a computer-implemented information retrieval technique that uses supervised machine learning to take into account pairs of expressions (words, character strings, articles, etc.) in text data that stand in a binary relation.

Known techniques for extracting information from text databases and the like include extracting binary relations with manually created patterns, and extracting the desired information by focusing on the binary relations between related words and phrases. For example, the method of Non-Patent Document 1 supplies pattern frames for extracting the target information from predicate-argument structures produced by syntactic analysis, extracts patterns from an annotated corpus, discards the inappropriate ones, and uses the remaining patterns to extract matching information. More specifically, that document describes a technique that screens the patterns against a training corpus to raise pattern precision and thereby improve the accuracy of binary relation extraction.

The method of Non-Patent Document 2 is a keyword-based text (article) retrieval method that does not consider binary relations. For each article, it computes a score from the number of times the keywords occur in the article and the number of articles in the database in which the keywords appear, and returns the high-scoring articles as the search results.

Non-Patent Document 1: Akane Yakushiji et al., "Information Extraction in the Medical and Biological Domain Using Predicate-Argument Structure Patterns", Proceedings of the 11th Annual Meeting of the Association for Natural Language Processing, pp. 93-96, The Association for Natural Language Processing, March 2005
Non-Patent Document 2: Masaki Murata et al., "Information Retrieval Using Location Information and Field Information", Journal of Natural Language Processing, Vol. 7, No. 2, pp. 141-160, The Association for Natural Language Processing, April 2000
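
The keyword scoring attributed to Non-Patent Document 2 can be illustrated with a minimal sketch. The exact scoring formula is not given in this text, so a plain TF-IDF-style score is assumed here; the function name `score_articles` and the whitespace tokenization are hypothetical choices made only for illustration.

```python
import math


def score_articles(articles, keywords):
    """Rank articles by an assumed TF-IDF-style score (not the cited formula)."""
    n = len(articles)
    tokenized = [a.lower().split() for a in articles]
    scores = []
    for tokens in tokenized:
        score = 0.0
        for kw in keywords:
            tf = tokens.count(kw)                      # occurrences in this article
            df = sum(1 for t in tokenized if kw in t)  # articles containing the keyword
            if tf and df:
                score += tf * math.log((n + 1) / df)
        scores.append(score)
    # Article indices, highest score first.
    ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranking, scores
```

A score of this kind ranks articles only by how often and how rarely the keywords occur; it carries no information about whether the keywords are related to each other within an article.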

However, when patterns are used as the extraction rules for binary relations, as in the method of Non-Patent Document 1, the patterns become unwieldy as the target problem grows more complex. Pattern-based methods are therefore limited in what they can handle, and the performance of the extraction also remains low.

Likewise, suppose an AND search is run with the two keywords "Kyoto University" and "president". The search returns the articles that contain both keywords, but not all of them are about the president of Kyoto University. For instance, an article containing the sentence "The president of the University of Tokyo attended a ceremony at Kyoto University" is about the president of the University of Tokyo, not the president of Kyoto University. Such an article is an inappropriate search result, yet a retrieval method that ignores binary relations, such as that of Non-Patent Document 2, cannot exclude it.
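
The false-positive problem described here can be reproduced with a minimal sketch of a plain AND search. The sample articles and the helper `and_search` are hypothetical and serve only as an illustration.

```python
def and_search(articles, keywords):
    """Return the indices of articles that contain every keyword (plain AND search)."""
    return [i for i, text in enumerate(articles)
            if all(kw in text for kw in keywords)]


articles = [
    "The president of Kyoto University gave a lecture.",
    "The president of the University of Tokyo attended a ceremony at Kyoto University.",
    "Kyoto University opened a new library.",
]

# Both article 0 and article 1 are returned, although article 1 is about
# the president of the University of Tokyo.
hits = and_search(articles, ["Kyoto University", "president"])
```

Because the search only checks that each keyword is present, it has no way to tell that "president" in the second article belongs to a different university.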

Results closer to what the user wants could also be obtained if, when entering search keywords or when the search results are output, the user could specify that certain keywords must be related to each other. Conventional retrieval methods do not take this into account, so the results are in effect left entirely to the computer. For example, suppose the user wants to search with the keyword "language processing" in addition to the keywords "Kyoto University" and "president" above. A plain AND search merely returns the text data (articles) containing all three keywords. Even when "language processing" need not be closely related to the other keywords, the user still has to pick out the desired articles by hand from the results of the conventional search.

An object of the present invention is therefore to provide an information retrieval apparatus that obtains appropriate search results reflecting the relations between keywords, by using a binary relation extraction technique that is applicable to any problem of extracting binary relations from text data and that extracts binary relations accurately even for complex problems. Another object of the present invention is to provide an information retrieval apparatus that obtains appropriate search results reflecting the relations between keywords by extracting, as features, information on binary relations whose elements are the search keywords, and performing machine learning with those features, for example on a per-article basis. A further object is to provide an information retrieval apparatus that uses the binary relation extraction technique to let the user or a program specify, when the search keywords are entered, whether a relation is required between keywords, so that appropriate results are obtained for two or more, or three or more, keywords; and an information retrieval apparatus that, when the search results are output, narrows the results according to which keywords the user has designated as related. Yet another object of the present invention is to provide a program that causes a computer to function as an information retrieval apparatus using such binary relation extraction.

The present invention is an information retrieval apparatus that performs information retrieval using a plurality of search keywords, comprising:
1) a learning unit that retrieves cases from a training data storage unit storing training data, each case being a pair of a problem and a solution, the training data including cases in which the problem is a binary relation whose elements are search keywords and the solution is that the binary relation is one to be extracted; extracts predetermined information as features for each case; generates pairs of the extracted feature set and the solution; learns, by a predetermined machine learning algorithm, which feature sets lead to that solution; and stores information indicating this as learning result information in a learning result storage unit;
2) an information retrieval unit that generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data from the text data to be searched, by a predetermined method based on the input search keywords;
3) a candidate extraction unit that generates, from each item of text data obtained by the information retrieval unit, pairs made up of the input search keywords, and treats the generated pairs as binary relation candidates;
4) a binary relation selection unit that extracts the predetermined information as features for each binary relation candidate by the same extraction process as that performed by the learning unit; determines, based on the learning result information stored in the learning result storage unit, what the solution should be for the candidate's feature set; and, when the solution is determined to be that the candidate is a binary relation to be extracted, selects the candidate as a binary relation to be extracted; and
5) a search result extraction unit that extracts, as the search results, the text data containing the binary relations selected by the binary relation selection unit.

In the invention described above, the learning unit first retrieves cases from the training data storage unit, which stores training data including cases in which a binary relation whose elements are search keywords is annotated with solution information indicating that it is a binary relation to be extracted. For each case, it extracts predetermined information as features and generates a pair of the extracted feature set and the solution. Then, based on a predetermined machine learning algorithm, it learns from the pairs of solution and feature set which feature sets lead to which solution, and stores information indicating this as learning result information in the learning result storage unit.

After that, the information retrieval unit generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data by a predetermined method based on the input search keywords. Here, the "plurality of input search keywords" is not limited to keywords entered by a user; the input may, for example, come from a programming function (the same applies wherever the "plurality of input search keywords" is mentioned below). The "predetermined method" may be an AND search on the input search keywords or a score-based method (the same applies wherever the "information retrieval unit" is mentioned below). Next, the candidate extraction unit generates, from each item of text data obtained by the search, pairs made up of the input search keywords and treats the generated pairs as binary relation candidates. The binary relation selection unit then extracts the predetermined information as features for each binary relation candidate by the same extraction process as that performed by the learning unit, and determines, based on the learning result information stored in the learning result storage unit, what the solution should be for the candidate's feature set. When the solution is determined to be that the candidate is a binary relation to be extracted, the candidate is selected as a binary relation to be extracted. Finally, the search result extraction unit extracts the text data containing the selected binary relations as the search results.
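
The flow just described (learn from labeled keyword pairs, run a search, form candidate pairs, classify them, and keep only the text data with a selected relation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: the "predetermined information" is assumed to be the words between the two keywords plus a near/far flag, the "predetermined machine learning algorithm" is assumed to be a small naive Bayes classifier, and all names and sample sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def extract_features(text, kw_a, kw_b):
    """Assumed features: near/far flag, plus the words between nearby keywords."""
    tokens = text.lower().split()
    if kw_a not in tokens or kw_b not in tokens:
        return None
    i, j = sorted((tokens.index(kw_a), tokens.index(kw_b)))
    between = tokens[i + 1:j]
    if len(between) > 2:
        return {"far"}
    return {"near"} | {f"between={w}" for w in between}


def train(cases):
    """Learning unit: cases are (text, kw_a, kw_b, label); returns learning result info."""
    label_counts, feat_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, a, b, label in cases:
        feats = extract_features(text, a, b)
        if feats is None:
            continue
        label_counts[label] += 1
        for f in feats:
            feat_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feat_counts, vocab


def classify(model, feats):
    """Pick the label with the highest naive Bayes log-probability (add-one smoothing)."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label, count in label_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        lp = math.log(count / total)
        lp += sum(math.log((feat_counts[label][f] + 1) / denom) for f in feats)
        if lp > best_lp:
            best, best_lp = label, lp
    return best


def search(model, articles, keywords):
    """AND search, then keep articles with at least one keyword pair judged 'to be extracted'."""
    results = []
    for text in articles:
        tokens = text.lower().split()
        if not all(kw in tokens for kw in keywords):
            continue
        for a, b in combinations(keywords, 2):  # binary relation candidates
            feats = extract_features(text, a, b)
            if feats is not None and classify(model, feats):
                results.append(text)
                break
    return results


cases = [
    ("the president of kyoto university spoke", "kyoto", "president", True),
    ("kyoto university president resigned", "kyoto", "president", True),
    ("the president of tokyo university visited kyoto", "kyoto", "president", False),
    ("kyoto hosted a meeting where the osaka president spoke", "kyoto", "president", False),
]
model = train(cases)
articles = [
    "the president of kyoto university announced a plan",
    "the president of tokyo university attended the ceremony at kyoto university",
]
results = search(model, articles, ["kyoto", "president"])
```

The same `extract_features` function is used in both training and search, mirroring the requirement that the selection unit apply the same extraction process as the learning unit.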

The present invention is also an information retrieval apparatus that performs information retrieval using a plurality of search keywords, comprising:
1) a solution-feature pair extraction unit that retrieves cases from a training data storage unit storing training data, each case being a pair of a problem and a solution, the training data including cases in which the problem is a pair of search keywords and text data and the solution is that the text data is to be extracted; extracts, for each case, at least information on binary relations whose elements are the search keywords as features; and generates pairs of the extracted feature set and the solution;
2) a machine learning unit that learns, by a predetermined machine learning algorithm, which feature sets lead to that solution, and stores information indicating this as learning result information in a learning result storage unit;
3) an information retrieval unit that generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data from the text data to be searched, by a predetermined method based on the input search keywords;
4) a feature extraction unit that extracts, for each pair of the search keywords and an extracted item of text data, at least the information on binary relations whose elements are the search keywords as features, by the same extraction process as that performed by the solution-feature pair extraction unit;
5) a solution determination unit that determines, based on the learning result information stored in the learning result storage unit, what the solution should be for the feature set of the pair of the search keywords and the extracted text data; and
6) a search result extraction unit that, when the solution for the pair of the search keywords and the extracted text data is determined to be that the text data is to be extracted, selects that text data as text data to be extracted and extracts the selected text data as a search result.

In the invention described above, the solution-feature pair extraction unit first retrieves cases from the training data storage unit, which stores training data including cases, each a pair of a problem and a solution, in which the problem is a pair of search keywords and text data and the solution is that the text data is to be extracted. For each case, it extracts at least information on binary relations whose elements are the search keywords as features, and generates a pair of the extracted feature set and the solution. Here, the "text data" may be an article or a document. Next, the machine learning unit learns, by a predetermined machine learning algorithm, which feature sets lead to which solution from the pairs of solution and feature set, and stores information indicating this as learning result information in the learning result storage unit. After that, the information retrieval unit generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data from the text data to be searched, by a predetermined method based on the input search keywords. The feature extraction unit then extracts, for each pair of the search keywords and an extracted item of text data, at least the information on binary relations whose elements are the search keywords as features, by the same extraction process as the solution-feature pair extraction unit. The solution determination unit determines, based on the learning result information stored in the learning result storage unit, what the solution should be for the feature set of that pair. When the solution is determined to be that the text data is to be extracted, the search result extraction unit selects that text data as an article to be extracted and extracts the selected text data as a search result.
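
In this second arrangement, the unit of classification is the pair of search keywords and text data rather than an individual keyword pair. A minimal sketch follows, assuming sentence-level co-occurrence features and a simple perceptron as the learner; neither choice is specified in this text, and all names and sample articles are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def article_features(text, keywords):
    """Assumed features describing how the keyword pairs co-occur in the article."""
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    feats = {"bias"}
    for a, b in combinations(keywords, 2):
        shared = [s for s in sentences if a in s and b in s]
        if not shared:
            feats.add("apart")  # the pair never shares a sentence
            continue
        feats.add("same_sentence")
        if any(abs(s.index(a) - s.index(b)) <= 3 for s in shared):
            feats.add("close")
    return feats


def train_perceptron(cases, epochs=10):
    """cases: (keywords, text, label), where label says whether the text should be extracted."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for keywords, text, label in cases:
            feats = article_features(text, keywords)
            predicted = sum(weights[f] for f in feats) > 0
            if predicted != label:
                for f in feats:
                    weights[f] += 1.0 if label else -1.0
    return dict(weights)


def retrieve(weights, articles, keywords):
    """AND search, then keep only the articles the learned model judges 'to be extracted'."""
    results = []
    for text in articles:
        tokens = text.lower().replace(".", " ").split()
        if not all(kw in tokens for kw in keywords):
            continue
        feats = article_features(text, keywords)
        if sum(weights.get(f, 0.0) for f in feats) > 0:
            results.append(text)
    return results


cases = [
    (["kyoto", "president"], "the president of kyoto university spoke. it rained.", True),
    (["kyoto", "president"], "kyoto was sunny. the osaka president spoke.", False),
    (["tokyo", "mayor"], "the mayor of tokyo resigned.", True),
    (["tokyo", "mayor"], "tokyo is large. the mayor of osaka spoke.", False),
]
weights = train_perceptron(cases)
articles = [
    "the president of kyoto university announced a plan.",
    "kyoto held a festival. the president of tokyo university attended.",
]
results = retrieve(weights, articles, ["kyoto", "president"])
```

The features are deliberately independent of the specific keywords, so that what is learned from one keyword pair can carry over to another.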

An information retrieval apparatus serving as a reference example for the present invention is now described. The reference example is an information retrieval apparatus that performs information retrieval using a plurality of search keywords, comprising:
1) a binary relation determination unit that determines, by the supervised machine learning described above or by a processing method using predetermined rules, and using predetermined information relating to the elements of a binary relation, whether that binary relation is one to be extracted;
2) an input unit that accepts the input of a plurality of search keywords and the input of relationship information that is attached to a pair of any two of those search keywords and indicates that the pair is to be given a semantic relationship;
3) an information retrieval unit that generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data from the text data to be searched, by a predetermined method based on the input search keywords;
4) a candidate extraction unit that generates, from each item of text data obtained by the information retrieval unit, pairs made up of two input search keywords to which the relationship information accepted by the input unit is attached, and treats the generated pairs as binary relation candidates;
5) a binary relation selection unit that determines, by means of the binary relation determination unit and using the predetermined information relating to the elements of each binary relation candidate, whether the candidate is a binary relation to be extracted, and selects the candidate as a binary relation to be extracted when it is so determined; and
6) a search result extraction unit that extracts, as the search results, the text data containing the binary relations selected by the binary relation selection unit.

In the reference example described above, the binary relation determination unit determines, by supervised machine learning or by a processing method using predetermined rules, and using predetermined information relating to the elements of a binary relation, whether that binary relation is one to be extracted. Besides machine learning, examples of the "processing method using predetermined rules" include using patterns of keyword-keyword combinations that should be extracted as binary relations, and other rule-based processing. The "predetermined information relating to the elements of a binary relation" means information surrounding those elements, such as character strings and words. The input unit accepts the input of a plurality of search keywords and the input of relationship information that is attached to a pair of any two of those search keywords and indicates that the pair is to be given a semantic relationship. Here too, the "input" is not limited to keywords entered by a user and may, for example, come from a programming function. Next, the information retrieval unit generates pairs of input search keywords from the plurality of input search keywords, and extracts and obtains text data from the text data to be searched, by a predetermined method based on the input search keywords. The candidate extraction unit then generates, from each item of text data obtained by the search, pairs made up of two input search keywords to which the accepted relationship information is attached, and treats these pairs as binary relation candidates. The binary relation selection unit determines, by means of the binary relation determination unit and using the predetermined information relating to the elements of each candidate, whether the candidate is a binary relation to be extracted, and selects it as a binary relation to be extracted when it is so determined. Finally, the search result extraction unit extracts the text data containing the selected binary relations as the search results. A variant is also conceivable in which text data whose input search keywords have no binary relation is extracted as the search results, instead of text data whose input search keywords have a binary relation.
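
The role of the relationship information can be sketched as follows: all keywords take part in the AND search, but only the keyword pairs flagged as related are checked for a binary relation. The judge `is_related` below is a stand-in rule (same sentence, within a few tokens) assumed purely for illustration; in the reference example it would be the machine-learned or rule-based binary relation determination unit. All names and sample articles are hypothetical.

```python
def is_related(text, a, b, max_gap=3):
    """Stand-in judge (assumed rule): both keywords in one sentence, within max_gap tokens."""
    for sentence in text.lower().split("."):
        tokens = sentence.split()
        if a in tokens and b in tokens and abs(tokens.index(a) - tokens.index(b)) <= max_gap:
            return True
    return False


def search_with_relations(articles, keywords, related_pairs):
    """AND search on all keywords; require a relation only for the flagged pairs."""
    results = []
    for text in articles:
        tokens = text.lower().replace(".", " ").split()
        if not all(kw in tokens for kw in keywords):
            continue
        if all(is_related(text, a, b) for a, b in related_pairs):
            results.append(text)
    return results


articles = [
    "the president of kyoto university praised recent nlp research.",
    "the president of tokyo university visited the nlp lab. kyoto was mentioned later.",
    "the president of kyoto university spoke.",
]
# "nlp" must appear, but only ("kyoto", "president") must be related.
results = search_with_relations(
    articles,
    keywords=["kyoto", "president", "nlp"],
    related_pairs=[("kyoto", "president")],
)
```

Passing an empty `related_pairs` list reduces the function to a plain AND search, which corresponds to the case where no relationship information is attached to any pair.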

In this reference example, the input unit may output an input interface that prompts for the plurality of search keywords and for the relationship information attached to a pair of any two of those keywords to indicate that the pair is to be given a semantic relationship, and may receive the input search keywords and relationship information entered through that interface. The user can then indicate, at keyword entry time, which keywords are considered to be related to which, and search results that reflect those relationships can be retrieved and extracted efficiently.

Furthermore, in this reference example, the machine learning may be arranged as follows. The binary relation determination unit has a learning unit that retrieves cases from a training data storage unit storing training data, each case being a pair of a problem and a solution, the training data including cases in which the problem is a binary relation whose elements are search keywords and the solution is that the binary relation is one to be extracted; extracts predetermined information as features for each case; generates pairs of the extracted feature set and the solution; learns, by a predetermined machine learning algorithm, which feature sets lead to that solution; and stores information indicating this as learning result information in a learning result storage unit. In this arrangement, the binary relation selection unit extracts the predetermined information as features for each binary relation candidate by the same extraction process as that performed by the binary relation determination unit, determines, based on the learning result information stored in the learning result storage unit, what the solution should be for the candidate's feature set, and, when the solution is determined to be that the candidate is a binary relation to be extracted, selects the candidate as a binary relation to be extracted.

When binary relations are extracted by patterns instead of machine learning in this reference example, the binary relation determination unit preferably has a pattern storing unit that saves, in a pattern storage unit, patterns of keyword-keyword combinations that should be extracted as binary relations. In that case, the binary relation selection unit preferably checks each binary relation candidate against the patterns stored in the pattern storage unit and selects the candidate as a binary relation to be extracted when the check determines that it is one.
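
The pattern-based alternative can be sketched with a small list of stored patterns that are matched against the text. The two English patterns and the placeholder syntax below are hypothetical; the text does not specify what the stored patterns look like.

```python
import re

# Hypothetical contents of the pattern storage unit: {a} and {b} stand for the two keywords.
PATTERNS = ["{b} of {a}", "{a}'s {b}"]


def matches_pattern(text, a, b, patterns=PATTERNS):
    """Return True if any stored pattern, filled in with the keyword pair, occurs in the text."""
    for pattern in patterns:
        regex = pattern.format(a=re.escape(a), b=re.escape(b))
        if re.search(regex, text, flags=re.IGNORECASE):
            return True
    return False
```

For example, `matches_pattern("The president of Kyoto University gave a lecture.", "Kyoto University", "president")` matches the first pattern, whereas a sentence about the president of the University of Tokyo attending a ceremony at Kyoto University matches neither.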

In the inventions and the reference example described above, except for the invention described second, the output of the search results may include information on which keywords do or do not have a binary relation with which, so that the user can see it. That is, in addition to the components described above, the information retrieval apparatus of the present invention may further include an output unit that outputs the search results extracted by the search result extraction unit, and this output unit may also output, for each search result, information indicating whether a binary relation exists between the input search keywords.

In this case, the information retrieval apparatus of the present invention may also be provided with a re-output unit that accepts, for the information on the presence or absence of a binary relation output by the output unit, a designation of either the information indicating that the relation exists or the information indicating that it does not, and re-outputs only the search results corresponding to the designated information. After seeing the search results, the user can then narrow them down to those in which the keywords are, or are not, considered to be related.
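
The output and re-output behaviour can be sketched with a result record that carries the relation flag. The `SearchResult` type and the function names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    text: str
    related: bool  # whether the input keywords were judged to be in a binary relation


def render(results):
    """Output unit: show each result together with its relation flag."""
    return [f"[{'related' if r.related else 'unrelated'}] {r.text}" for r in results]


def reoutput(results, want_related):
    """Re-output unit: keep only the results whose flag matches the user's designation."""
    return [r for r in results if r.related == want_related]
```

Calling `reoutput(results, want_related=True)` narrows the list to the results in which the keywords were judged to be related, and `want_related=False` gives the complementary set.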

また、本発明は、コンピュータを、前記情報検索処理装置として実行されるための二項関係判定結果を用いた情報検索処理プログラムである。 The present invention is also an information search processing program using a binary relation determination result for executing a computer as the information search processing device.

本発明によれば、ＡＮＤ検索処理等の検索結果の記事に出現する検索キーワードの関係を、二項関係抽出処理を用いて評価することにより、検索キーワードを含んでいることによってヒットされたが、検索キーワード同士の関係がうすく、その結果として内容的に無関係な、いわば検索意図からはずれるような内容の記事を排除することができる。さらに、教師あり機械学習の精度向上によって、情報検索処理の性能の向上が見込める。 According to the present invention, by evaluating the relationship of search keywords appearing in articles of search results such as AND search processing using binary relation extraction processing, it was hit by including a search keyword. As a result, it is possible to eliminate articles whose contents are irrelevant in terms of search keywords and, as a result, are not related to the search intention. Furthermore, the performance of information retrieval processing can be improved by improving the accuracy of supervised machine learning.

また、記事ごとに機械学習処理を行うことで、情報検索に際してキーワードが含まれていても内容的には当該キーワードと関係のない記事を検索しないようにして、有用な検索結果のみをユーザに提示することができるので、検索性能が飛躍的に向上する。 Also, by performing machine learning processing for each article, even if a keyword is included in the information search, articles that are not related to the keyword in terms of content are not searched, and only useful search results are presented to the user. Search performance is greatly improved.

また、情報検索のキーワード入力時に、ユーザにより又はプログラム関数等により、どのキーワード同士は関連がないといけないか、また、どのキーワード同士は関連がなくてもよいかを指定できるように構成することで、より柔軟でユーザにとって必要な情報検索を効率的に行うことができる。 In addition, when entering keywords for information retrieval, it is possible to specify which keywords should be related by the user or program function, etc., and which keywords need not be related. Therefore, it is possible to efficiently perform information retrieval that is more flexible and necessary for the user.

また、情報検索結果の出力時に、単に検索結果を出力するだけでなく、ユーザがどのキーワード同士は関連がないといけないか、また、どのキーワード同士は関連がなくてもよいかを指定できるように構成することで、必要な情報をより絞り込んだ形で得ることができる。 In addition, when outputting information search results, users can not only output search results but also specify which keywords must be related to each other and which keywords need not be related By configuring, necessary information can be obtained in a more narrowed form.

以下、本発明の実施形態を、図面を参照して説明するが、まず後述する各実施例の説明に先立って、各実施例で利用できる二項関係抽出処理について説明する。 DESCRIPTION OF EMBODIMENTS Hereinafter, embodiments of the present invention will be described with reference to the drawings. First, prior to description of each example described later, binary relation extraction processing that can be used in each example will be described.

二項関係の抽出は、例えば機械学習を利用する場合、二項関係抽出プログラムに従って情報処理を行う二項関係抽出装置１によって実現される。この二項関係抽出装置１は、抽出するべき二項関係か否かのタグを付与したテキストデータである教師データを用いて、どのような語句の対が抽出するべき二項関係であるかを機械学習し、与えられたテキストデータ２から、二項関係の候補を取得して、抽出するべき二項関係３を抽出する処理装置である。 For example, when using machine learning, the binary relation extraction is realized by the binary relation extraction apparatus 1 that performs information processing according to a binary relation extraction program. This binary relation extraction device 1 uses a teacher data, which is text data with a tag indicating whether or not a binary relation to be extracted, to determine what pair of words is a binary relation to be extracted. It is a processing device that performs machine learning, acquires binary relation candidates from given text data 2, and extracts a binary relation 3 to be extracted.

図１に、本発明にかかる二項関係抽出装置１の構成例を示す。二項関係抽出装置１は、教師データ記憶部１１、解−素性対抽出部１２、機械学習部１３、学習結果記憶部１４、候補抽出部１５、素性抽出部１６、解推定部１７、および二項関係抽出部１８を備える。 FIG. 1 shows a configuration example of a binary relation extraction apparatus 1 according to the present invention. The binary relation extraction apparatus 1 includes a teacher data storage unit 11, a solution-feature pair extraction unit 12, a machine learning unit 13, a learning result storage unit 14, a candidate extraction unit 15, a feature extraction unit 16, a solution estimation unit 17, and a binary A term relation extraction unit 18 is provided.

教師データ記憶部１１は、機械学習処理において使用される教師データとなるテキストデータを記憶する手段である。 The teacher data storage unit 11 is means for storing text data that is teacher data used in the machine learning process.

教師データとして、テキストデータの文中に出現している二項関係の要素（一方の要素を第１要素、他方の要素を第２要素という）を問題、抽出するべき二項関係であるか否かの情報を解とする事例を用いる。具体的には、テキストデータの一つの文中に二個以上の二項関係の要素を含む文のみについて、その文中の二項関係にある要素の対について、抽出するべき対（正例）であるか、抽出するべきではない対（負例）かのいずれかの解を示すタグを人手によって付与する。一文中に三個以上の二項関係の要素を含む場合には、要素のすべての組み合わせである対それぞれについてタグを付与する。なお、教師データの事例として、抽出するべき対（正例）を示す解のみが付与された二項関係を使用してもよい。 As the teacher data, cases are used in which the problem is a pair of binary relation elements appearing in a sentence of the text data (one element is called the first element and the other the second element) and the solution is information on whether or not the pair is a binary relation that should be extracted. Specifically, only sentences of the text data that contain two or more binary relation elements are considered, and for each pair of elements in such a sentence a tag is manually assigned indicating the solution: either a pair that should be extracted (positive example) or a pair that should not be extracted (negative example). When a sentence contains three or more binary relation elements, a tag is assigned to every pair formed by any combination of the elements. Teacher data in which only the pairs to be extracted (positive examples) carry a solution may also be used.

解−素性対抽出部１２は、教師データ記憶部１１に記憶されているテキストデータ内の事例から、解と素性の集合との組を抽出する処理手段である。 The solution-feature pair extraction unit 12 is a processing unit that extracts a set of a solution and a set of features from cases in text data stored in the teacher data storage unit 11.

素性は、機械学習処理で使用する情報である。解−素性対抽出部１２は、素性として、例えば、二項関係の要素、要素の周囲に出現する単語／文字とその出現位置や順序、要素や周囲の単語の品詞情報、形態素解析情報、構文解析情報、要素間の出現距離、要素間での他の二項関係の要素の有無などの情報を抽出する。 Features are the information used in the machine learning process. The solution-feature pair extraction unit 12 extracts as features, for example, the binary relation elements themselves, the words or characters appearing around the elements together with their positions and order of appearance, part-of-speech information for the elements and surrounding words, morphological analysis information, syntactic analysis information, the distance between the elements, and whether any other binary relation element appears between them.

機械学習部１３は、解−素性対抽出部１２によって抽出された解と素性の集合との組から、どのような素性のときにどのような解になりやすいかを、教師あり機械学習法により学習する処理手段である。その学習結果は、学習結果記憶部１４に保存される。 The machine learning unit 13 uses a supervised machine learning method to determine what type of solution is likely to be generated from a set of the solution extracted by the solution-feature pair extraction unit 12 and the feature set. Processing means for learning. The learning result is stored in the learning result storage unit 14.

素性抽出部１６は、テキストデータ２から抽出された二項関係の候補について、所定の素性を抽出する処理手段である。 The feature extraction unit 16 is a processing unit that extracts a predetermined feature for a binary relation candidate extracted from the text data 2.

解推定部１７は、学習結果記憶部１４の学習結果を参照して、二項関係の各候補について、その素性の集合の場合に、どのような解（分類先）になりやすいかの度合いを推定する処理手段である。 The solution estimation unit 17 refers to the learning result stored in the learning result storage unit 14 and determines the degree of what kind of solution (classification destination) is likely to be obtained in the case of a set of features for each candidate of the binary relation. It is a processing means to estimate.

二項関係抽出部１８は、解推定部１７の推定結果にもとづいて、二項関係の候補から、抽出するべき二項関係であることを示す解となる度合いが高いと推定されたものを、二項関係３として出力する処理手段である。 Based on the estimation result of the solution estimation unit 17, the binary relationship extraction unit 18 estimates from the binomial relationship candidates that are estimated to have a high degree of solution indicating that it is a binary relationship to be extracted. This is a processing means for outputting as binary relation 3.

なお、上述した「解推定部１７」は、二項関係の各候補について、その素性の集合の場合に、どのような解（分類先）になりやすいかの度合いを推定するものであるが、このような解推定部１７を、学習結果記憶部１４に格納された前記学習結果情報に基づいて、前記二項関係の候補の素性の集合の場合にどういう解とすべきかを判定する「解判定部」に置き換えることができる。このような「解判定部」であれば、二項関係の各候補について、その素性の集合の場合に、「どのような解（分類先）になりやすいかの度合いを推定する」だけでなく、「解となるか否か」という判定を行うことができる。この場合、前記「二項関係抽出部１８」は、「解判定部」の判定結果に基づいて、二項関係の候補の素性の集合の場合に、どういう解とすべきかの判定として、「二項関係の候補から、抽出するべき二項関係であることを示す解となる度合いが高いと推定されたものを、二項関係３として出力する」だけでなく、「二項関係の候補から、抽出するべき二項関係であることを示す解となると判定されたものを、二項関係３として出力する」ことになる。 The "solution estimation unit 17" described above estimates, for each binary relation candidate, the degree to which its set of features is likely to lead to each solution (classification). This solution estimation unit 17 can be replaced with a "solution determination unit" that determines, based on the learning result information stored in the learning result storage unit 14, which solution should be assigned to the candidate's set of features. Such a solution determination unit can not only estimate the degree to which each solution is likely but also make a yes-or-no determination of whether the candidate takes a given solution. In this case, based on the determination result of the solution determination unit, the "binary relation extraction unit 18" not only outputs as binary relation 3 those candidates estimated to have a high degree of being a binary relation that should be extracted, but also outputs as binary relation 3 those candidates determined to be a binary relation that should be extracted.

図２に、二項関係抽出装置１の処理の流れを示す。 FIG. 2 shows a processing flow of the binary relation extraction apparatus 1.

二項関係抽出装置１の教師データ記憶部１１には、教師データとして、ある意味を持つ要素の対である二項関係に、抽出するべき二項関係であるか（正）または抽出するべきでない二項関係であるか（負）のいずれかの「解」の情報が付与された事例を含むテキストデータ２を記憶しておく。 In the teacher data storage unit 11 of the binary relation extraction apparatus 1, whether the binary relation that is a pair of elements having a certain meaning is a binary relation to be extracted (positive) or should not be extracted as the teacher data. Text data 2 including a case to which “solution” information of either binary relation (negative) is added is stored.

なお、抽出するべき対にのみ、所定の解を付与した事例を含むテキストデータ２を記憶しておくようにしてもよい。この場合には、テキストデータ２の解が付与された対は、抽出するべき二項関係である（正）の解が与えられているとみなされ、解が付与されていない残りの対は抽出するべきではない二項関係（負）の解が与えられているとみなして扱われる。 In addition, you may make it memorize | store the text data 2 containing the example which provided the predetermined solution only to the pair which should be extracted. In this case, the pair to which the solution of the text data 2 is given is regarded as being given a (positive) solution that is a binary relation to be extracted, and the remaining pairs to which no solution is given are extracted. Treated assuming that a binary (negative) solution that should not be given is given.
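The positive-only convention above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `label_pairs` is hypothetical, and treating a pair as unordered when looking it up among the tagged pairs is an assumption made for brevity.

```python
from itertools import combinations

def label_pairs(elements, positive_pairs):
    """Label every pair of elements in one sentence.

    Pairs that carry a tag are positive; all remaining pairs are
    treated as negative, as described in the text.
    """
    # Assumption: a tagged pair is matched regardless of element order.
    positives = {frozenset(pair) for pair in positive_pairs}
    labels = {}
    # combinations() keeps the order of appearance: (first element, second element)
    for first, second in combinations(elements, 2):
        tagged = frozenset((first, second)) in positives
        labels[(first, second)] = "positive" if tagged else "negative"
    return labels

# Three elements in one sentence, only the pair (A, B) was tagged by hand.
labels = label_pairs(["A", "B", "C"], [("A", "B")])
```

With three elements in a sentence, all three pairs receive a label even though only one was tagged manually.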

まず、解−素性対抽出部１２は、教師データ記憶部１１の教師データから各事例について、所定の素性を抽出し、解（タグによって付与された情報）と抽出した素性の集合との組を生成する（ステップＳ１）。解−素性対抽出部１２は、教師データであるテキストデータから所定のタグによって二項関係を抽出し、抽出した二項関係の要素について、形態素解析処理、構文解析処理、要素の出現位置や要素間の距離の算出処理などを行って、所定の素性を抽出する。 First, the solution-feature pair extraction unit 12 extracts predetermined features for each case from the teacher data in the teacher data storage unit 11 and generates pairs of a solution (the information given by the tag) and the extracted set of features (step S1). The solution-feature pair extraction unit 12 extracts binary relations from the teacher text data by means of the predetermined tags and, for the elements of each extracted binary relation, performs morphological analysis, syntactic analysis, calculation of the positions of the elements and the distance between them, and so on, to extract the predetermined features.

そして、機械学習部１３は、解−素性対抽出部１２により生成された解と素性の集合との組から、どのような素性の集合のときにどのような解（正または負）になりやすいかを機械学習法により学習し、学習結果を学習結果記憶部１４に格納する（ステップＳ２）。機械学習部１３は、教師あり機械学習法として、例えば、ｋ近傍法、シンプルベイズ法、決定リスト法、最大エントロピー法、サポートベクトルマシン法などの手法のいずれかを用いて機械学習処理を行う。 The machine learning unit 13 is likely to be any solution (positive or negative) at any feature set from the set of the solution generated by the solution-feature pair extraction unit 12 and the feature set. Is learned by the machine learning method, and the learning result is stored in the learning result storage unit 14 (step S2). The machine learning unit 13 performs machine learning processing by using any one of techniques such as a k-nearest neighbor method, a simple Bayes method, a decision list method, a maximum entropy method, and a support vector machine method as a supervised machine learning method.

その後、候補抽出部１５は、二項関係を抽出したいテキストデータ２を入力し、入力したテキストデータ２から二項関係の候補を抽出する（ステップＳ３）。候補抽出部１５は、テキストデータを文単位に分割し、一文中に二以上の二項関係の要素が出現する文についてのみ処理対象として扱い、その文から二項関係の候補を抽出する。 Thereafter, the candidate extraction unit 15 inputs the text data 2 from which the binary relation is desired to be extracted, and extracts a binary relation candidate from the input text data 2 (step S3). The candidate extraction unit 15 divides the text data into sentence units, treats only sentences in which two or more binary relation elements appear in one sentence as processing targets, and extracts binary relation candidates from the sentence.
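Step S3 can be sketched roughly as below. The sentence splitter (a split after ".", "!" or "?") and the use of only the first occurrence of each element in a sentence are simplifying assumptions, and `extract_candidates` is a hypothetical name; the text does not specify how sentence boundaries or repeated elements are handled.

```python
import re
from itertools import combinations

def extract_candidates(text, elements):
    """Return (sentence, first_element, second_element) candidates.

    Only sentences containing two or more known elements are processed.
    """
    candidates = []
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Position of the first occurrence of each element present in the sentence.
        found = sorted((sentence.find(e), e) for e in elements if e in sentence)
        if len(found) < 2:
            continue  # fewer than two elements: not a processing target
        for (_, first), (_, second) in combinations(found, 2):
            candidates.append((sentence, first, second))
    return candidates

text = "ProtA binds ProtB. ProtC was purified. ProtA and ProtC and ProtB interact."
pairs = [(a, b) for _, a, b in extract_candidates(text, ["ProtA", "ProtB", "ProtC"])]
```

The second sentence contains a single element and therefore yields no candidate, while the third yields every pair of its three elements.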

素性抽出部１６は、解−素性対抽出部１２での処理とほぼ同様の処理によって、テキストデータ２から抽出した二項関係の各候補について素性を抽出する（ステップＳ４）。 The feature extraction unit 16 extracts features for each binary relation candidate extracted from the text data 2 by a process substantially similar to the process in the solution-feature pair extraction unit 12 (step S4).

解推定部１７は、各候補について、その素性の集合の場合にどのような解になりやすいか、すなわち「正となりやすい」か「負となりやすいか」の度合いを学習結果記憶部１４の学習結果をもとに推定する（ステップＳ５）。そして、二項関係抽出部１８は、より良い度合いで「正となりやすい」と推定された候補のなかから、所定の程度の候補を抽出するべき二項関係３として出力する（ステップＳ６）。 For each candidate, the solution estimation unit 17 estimates, based on the learning result in the learning result storage unit 14, which solution its set of features is likely to yield, that is, the degree to which it is "likely to be positive" or "likely to be negative" (step S5). The binary relation extraction unit 18 then outputs, as the binary relations 3 to be extracted, those candidates whose estimated degree of being "likely to be positive" reaches a predetermined level (step S6).

次に、本発明の二項関係抽出処理の具体例を説明する。本例では、二項関係抽出装置１を、生物医学関係の論文のテキストデータベースから、相互作用のある蛋白質表現（蛋白質名）の二項関係を抽出するものとし、テキストデータベースでの蛋白質表現を１００％の精度で特定しているものと仮定する。 Next, a specific example of the binary relation extraction process of the present invention will be described. In this example, the binary relation extraction apparatus 1 extracts binary relations between interacting protein expressions (protein names) from a text database of biomedical papers, and it is assumed that the protein expressions in the text database have been identified with 100% accuracy.

また、二項関係を構成する要素は同一文中に出現するものとする。なお、二項関係を構成する要素は、同一段落内、同一文書内に出現する要素同士であってもよい。 In addition, the elements constituting the binary relation appear in the same sentence. The elements constituting the binary relation may be elements that appear in the same paragraph or the same document.

教師データを作成する処理において、二項関係の要素となる表現、例えば、蛋白質表現、病名と治療方法などの特定の表現を二項関係の要素として取り出す場合には、以下のようにして行う。 In the process of creating teacher data, when a specific expression such as a protein expression, a disease name and a treatment method is taken out as a binary relation element, for example, an expression that is a binary relation element is performed as follows.

１）ルールを用いて要素を取り出す。人手によって、「ＮＦ−Ｋａｐｐａ［Ａ−Ｚ］，ただし、［Ａ−Ｚ］はＡからＺまでのいずれかの文字」などのパターンを定義して、該当する表現を抽出する。このパターンによって、ＮＦ−ＫａｐｐａＡ，ＮＦ−ＫａｐｐａＢなどの蛋白質名の表現である要素を抽出する。 1) Extract an element using a rule. A pattern such as “NF-Kappa [AZ], where [AZ] is any character from A to Z” is defined manually, and the corresponding expression is extracted. Elements that are expression of protein names such as NF-Kappa A and NF-Kappa B are extracted based on this pattern.

２）辞書を用いて要素を取り出す。病名や治療方法などの表現が記載された辞書を使用して、それらの辞書にあった表現（文字列，単語列など）とまったく同じ文字列等を、病名や治療方法の表現である要素として抽出する。 2) Extract an element using a dictionary. Use dictionaries that describe expressions such as disease names and treatment methods, and use exactly the same character strings as expressions (character strings, word strings, etc.) in those dictionaries as elements that are expressions of disease names and treatment methods. Extract.

３）機械学習処理によって要素を取り出す。蛋白質表現、病名と治療方法などの表現の前後に開始位置タグと終了位置タグとを付与したテキストデータを、学習データとして用意する。そして、このタグ付きの学習データを用いた機械学習処理を行って、その学習結果を利用して、タグが付いていない新しいテキストデータの該当する表現の開始位置と終了位置にタグを挿入することで要素を特定する。 3) Extract elements by machine learning processing. Text data provided with a start position tag and an end position tag before and after the expression such as protein expression, disease name and treatment method is prepared as learning data. Then, perform machine learning processing using the learning data with the tag, and use the learning result to insert the tag at the start position and end position of the corresponding expression of the new text data without the tag. Specify the element with.

４）所定の二項関係を示す情報を用いて取り出す。あらかじめ二項関係の要素になりうる表現にタグが付与されたデータを利用して、そのタグをもとに二項関係の要素である表現を抽出する。 4) Extract using information indicating a predetermined binary relationship. Using data in which a tag is assigned to an expression that can be a binary relation element in advance, an expression that is a binary relation element is extracted based on the tag.
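Methods 1) and 2) above can be sketched as follows. The regular expression is a direct transcription of the "NF-Kappa[A-Z]" pattern quoted in the text; the function names, the sample sentences, and the dictionary contents are illustrative assumptions.

```python
import re

# Rule from method 1): "NF-Kappa" followed by one letter from A to Z.
RULE = re.compile(r"NF-Kappa[A-Z]")

def extract_by_rule(text):
    """Return every substring matching the hand-written pattern, in order."""
    return RULE.findall(text)

def extract_by_dictionary(text, dictionary):
    """Return dictionary entries found verbatim in the text, ordered by position."""
    hits = []
    for term in dictionary:
        start = text.find(term)
        while start != -1:
            hits.append((start, term))
            start = text.find(term, start + len(term))
    return [term for _, term in sorted(hits)]

rule_hits = extract_by_rule("NF-KappaA and NF-KappaB, but not NF-Kappa1 or NF-Kappab")
dict_hits = extract_by_dictionary(
    "Patients with influenza received oseltamivir; influenza resolved.",
    {"influenza", "oseltamivir"},  # hypothetical disease/treatment dictionary
)
```

The dictionary lookup matches exact strings only, in line with the "exactly the same character string" wording of method 2).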

図３に、教師データの例を示す。図３（Ａ）に示すような、相互作用のある蛋白質表現を要素とする二項関係を含む英文テキストデータを、教師データとして使用する。本例では、教師データには、抽出するべき二項関係についてのみ、解（正／ｐｏｓｉｔｉｖｅ）を示すタグが付与される。すなわち、機械学習処理において、正の事例のみを含む教師データが使用される。 FIG. 3 shows an example of teacher data. As shown in FIG. 3A, English text data including a binary relation having an interactive protein expression as an element is used as teacher data. In this example, a tag indicating a solution (positive) is assigned to the teacher data only for the binary relation to be extracted. That is, in the machine learning process, teacher data including only positive cases is used.

図３（Ｂ）に、教師データに付与されているタグの例を示す。教師データには、二つの二項関係の対Ｐ１，対Ｐ２が含まれる。二項関係（対）Ｐ１は、第１要素ｐ１「ｄｅｌｔａ−ｃｅｔｅｎｉｎ」，第２要素ｐ２「ｐｒｅｓｅｎｉｌｉｎ１」で構成されている。また、二項関係（対）Ｐ２は、第１要素ｐ１「ｐｒｅｓｅｎｉｌｉｎ（ＰＳ）１」，第２要素ｐ２「ｄｅｌｔａ−ｃｅｔｅｎｉｎ」で構成されている。 FIG. 3B shows an example of tags attached to teacher data. The teacher data includes two binary relation pairs P1 and P2. The binary relation (pair) P1 includes a first element p1 “delta-cetenin” and a second element p2 “presenilin 1”. The binary relation (pair) P2 includes a first element p1 “presenilin (PS) 1” and a second element p2 “delta-cetenin”.

解−素性対抽出部１２は、教師データ記憶部１１に記憶されているテキストデータ内の事例から、解と素性の集合との組を抽出する。例えば、素性として、以下のような情報を抽出する。
１）二項関係の要素の周囲に出現する単語または文字。例えば、二項関係の第１要素（最初の要素）の前方の所定数の単語／文字，第２要素（二番目の要素）の後方の所定数の単語／文字，第１要素と第２要素の間の所定数の単語／文字；
２）二項関係の要素の周囲に出現する単語／文字の出現位置，出現順序など；
３）二項関係の二つの要素；
４）二項関係の要素または周囲の単語の品詞情報，形態素解析情報など；
５）二項関係の要素または周囲の単語の構文解析情報；
６）二項関係の第１要素と第２要素との出現距離；
７）二項関係の第１要素と第２要素の間での要素の出現の有無。
The solution-feature pair extraction unit 12 extracts pairs of a solution and a set of features from the cases in the text data stored in the teacher data storage unit 11. For example, the following information is extracted as features.
1) Words or characters that appear around the binary relation elements: for example, a predetermined number of words/characters preceding the first element of the binary relation, a predetermined number of words/characters following the second element, and a predetermined number of words/characters between the first and second elements;
2) The positions, order of appearance, etc. of the words/characters appearing around the binary relation elements;
3) The two elements of the binary relation;
4) Part-of-speech information, morphological analysis information, etc. of the binary relation elements or the surrounding words;
5) Syntactic analysis information of the binary relation elements or the surrounding words;
6) The distance between the first element and the second element of the binary relation;
7) Whether or not another element appears between the first element and the second element of the binary relation.

素性のうち、例えば、品詞情報は、形態素解析システム「ＣｈａＳｅｎ」などの既存の形態素解析処理手法を使用して取得する（参照：http://chasen.aist-nara.ac.jp/index.html.ja）。「ＣｈａＳｅｎ」では、日本語文を分割し、さらに、各単語の品詞も推定することができる。例えば、「学校へ行く」を入力すると以下の結果をえる。 Among the features, part-of-speech information, for example, is obtained using an existing morphological analysis tool such as the morphological analysis system "ChaSen" (see: http://chasen.aist-nara.ac.jp/index.html.ja). ChaSen segments a Japanese sentence into words and also estimates the part of speech of each word. For example, entering 「学校へ行く」 ("go to school") yields the following result.
学校	ガッコウ	学校	名詞-一般
へ	ヘ	へ	助詞-格助詞-一般
行く	イク	行く	動詞-自立	五段・カ行促音便	基本形
EOS
このように、各行に一個の単語が入るように分割され、各単語に読みや品詞の情報が付与される。 In this way, the sentence is split so that each line holds one word, and each word is annotated with its reading and part-of-speech information.

英語のテキストデータの場合の品詞情報は、例えば、「Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging」（Eric Brill, Computational Linguistics, Vol.21, No.4, p.543-565, 1995）を使用して取得する。このシステムを利用することで、英語文の各単語の品詞が推定される。 For English text data, part-of-speech information is obtained using, for example, the tagger described in "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill, Computational Linguistics, Vol. 21, No. 4, pp. 543-565, 1995). With this system, the part of speech of each word in an English sentence can be estimated.

ここでは、二項関係の要素が、同一段落中に出現する場合には、素性として、二項関係の要素が文をまたぐか否かという情報を用いてもよい。また、二項関係の要素が、同一文書内に出現する場合には、素性として、二項関係の要素が文をまたぐか否かという情報、段落をまたぐか否かという情報を用いてもよい。 Here, when a binary relation element appears in the same paragraph, information on whether or not the binary relation element crosses a sentence may be used as a feature. In addition, when a binary relation element appears in the same document, information on whether the binary relation element straddles a sentence or information on whether a paragraph straddles a paragraph may be used as a feature. .

解−素性対抽出部１２は、図３（Ｂ）に示すようなタグが付与された教師データの事例から、素性を抽出し、素性の集合と解との組を生成する。例えば、二項関係Ｐ２の事例について、図５に示すように、解（ｐｏｓｉｔｉｖｅ：正）と、以下の素性の集合との組が生成されるとする。
「第１要素の前方３単語内に「for」，「interaction」，「with」が出現；
要素間に「and」，「cloned」，「the」，「full」，「-」，「length」，「cDNA」，「of」，「human」が出現；
第２要素の後方３単語内に「which」，「encoded」，「1225」が出現」。
The solution-feature pair extraction unit 12 extracts features from the teacher data cases tagged as shown in FIG. 3(B) and generates pairs of a feature set and a solution. For example, for the case of binary relation P2, suppose that the pair of the solution (positive) and the following set of features is generated, as shown in FIG. 5:
"for", "interaction", and "with" appear within the three words preceding the first element;
"and", "cloned", "the", "full", "-", "length", "cDNA", "of", and "human" appear between the elements;
"which", "encoded", and "1225" appear within the three words following the second element.
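The window-based features listed for binary relation P2 can be reproduced with a small sketch such as the one below. The token list is reconstructed from the feature lists and element names given in the text, not copied from FIG. 3, and the function name, the span representation (token indices, end exclusive), and the dictionary keys are assumptions.

```python
def extract_features(tokens, first_span, second_span, window=3):
    """Collect context-word features for one pair of elements.

    first_span / second_span are (start, end) token indices, end exclusive.
    """
    f_start, f_end = first_span
    s_start, s_end = second_span
    return {
        "before_first": tokens[max(0, f_start - window):f_start],
        "between": tokens[f_end:s_start],
        "after_second": tokens[s_end:s_end + window],
        "distance": s_start - f_end,  # number of tokens between the elements
        "first": " ".join(tokens[f_start:f_end]),
        "second": " ".join(tokens[s_start:s_end]),
    }

# Tokens reconstructed from the P2 example (an assumption, see above).
tokens = ["for", "interaction", "with", "presenilin", "(PS)", "1",
          "and", "cloned", "the", "full", "-", "length", "cDNA", "of", "human",
          "delta-cetenin", "which", "encoded", "1225"]
features = extract_features(tokens, (3, 6), (15, 16))
```

Other features named in the text, such as part-of-speech or syntactic information, would be added to the same dictionary by running a tagger or parser over `tokens`.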

機械学習部１３は、この解と素性の集合とをもとに、どのような素性の集合の場合に解（ｐｏｓｉｔｉｖｅ）となりやすいかを機械学習処理し、学習結果を学習結果記憶部１４に記憶する。 Based on this solution and the set of features, the machine learning unit 13 performs machine learning processing on which feature set is likely to be a solution, and stores the learning result in the learning result storage unit 14. To do.

機械学習部１３は、教師あり機械学習法として、例えば、ｋ近傍法、シンプルベイズ法、決定リスト法、最大エントロピー法、サポートベクトルマシン法などの手法を用いる。 The machine learning unit 13 uses a technique such as a k-nearest neighbor method, a simple Bayes method, a decision list method, a maximum entropy method, a support vector machine method, or the like as a supervised machine learning method.

ｋ近傍法は、最も類似する一つの事例のかわりに、最も類似するｋ個の事例を用いて、このｋ個の事例での多数決によって分類先（解）を求める手法である。ｋは、あらかじめ定める整数の数字であって、一般的に、１から９の間の奇数を用いる。シンプルベイズ法は、ベイズの定理にもとづいて各分類になる確率を推定し、その確率値が最も大きい分類を求める分類先とする方法である。 The k-nearest neighbor method is a method for obtaining a classification destination (solution) by using the k most similar cases instead of the most similar case, and by majority decision of the k cases. k is a predetermined integer number, and generally an odd number between 1 and 9 is used. The Simple Bayes method is a method of estimating the probability of each classification based on Bayes' theorem and determining the classification having the highest probability value as a classification destination.
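A minimal sketch of the k-nearest-neighbor vote described above follows. The similarity measure (number of shared features) is an assumption, since the text does not define how similarity between cases is computed, and the training cases are invented for illustration.

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Classify a feature set by majority vote among the k most similar cases.

    examples: list of (feature_set, label).
    Assumption: similarity = number of features shared with the query.
    """
    ranked = sorted(examples, key=lambda ex: len(query & ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

examples = [
    ({"with", "interaction"}, "positive"),
    ({"interaction", "binds"}, "positive"),
    ({"purified"}, "negative"),
    ({"with", "buffer"}, "negative"),
]
result = knn_classify({"interaction", "with", "cloned"}, examples, k=3)
```

Choosing an odd k, as the text suggests, avoids tied votes when there are two classes.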

シンプルベイズ法において、文脈ｂで分類ａを出力する確率は、以下の式（１）で与えられる。 In the simple Bayes method, the probability of outputting the classification a in the context b is given by the following equation (1).

ただし、ここで文脈ｂは、あらかじめ設定しておいた素性ｆ_ｊ（∈Ｆ，１≦ｊ≦ｋ）の集合である。ｐ（ｂ）は、文脈ｂの出現確率である。ここで、分類ａに非依存であって定数のために計算しない。Ｐ（ａ）（ここでＰはｐの上部にチルダ）とＰ（ｆ_ｉ｜ａ）は、それぞれ教師データから推定された確率であって、分類ａの出現確率，分類ａのときに素性ｆ_ｉを持つ確率を意味する。Ｐ（ｆ_ｉ｜ａ）として最尤推定を行って求めた値を用いると、しばしば値がゼロとなり、式（２）の値がゼロで分類先を決定することが困難な場合が生じる。そのため、スムージングを行う。ここでは、以下の式（３）を用いてスムージングを行ったものを用いる。 Here, the context b is a set of preset features f_j (∈ F, 1 ≤ j ≤ k). p(b) is the probability of occurrence of the context b; it is not computed because it is a constant that does not depend on the classification a. P(a) (where P denotes p with a tilde above it) and P(f_i | a) are probabilities estimated from the teacher data and denote, respectively, the probability of occurrence of classification a and the probability of having feature f_i given classification a. If maximum likelihood estimates are used for P(f_i | a), the value is often zero, so that equation (2) evaluates to zero and the classification cannot be decided. Smoothing is therefore applied; here, the values smoothed with the following equation (3) are used.

ただし、ｆｒｅｑ（ｆ_ｉ，ａ）は、素性ｆ_ｉを持ちかつ分類がａである事例の個数、ｆｒｅｑ（ａ）は、分類がａである事例の個数を意味する。 Here, freq(f_i, a) is the number of cases that have feature f_i and whose classification is a, and freq(a) is the number of cases whose classification is a.
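The simple Bayes procedure can be sketched as below. Equations (1) to (3) are not reproduced in this text, so the sketch assumes the usual form (class prior multiplied by per-feature conditional probabilities, with the constant p(b) dropped) and uses a generic additive smoothing with a small constant `eps`; that smoothing is an illustrative stand-in and is not claimed to be the patent's equation (3).

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count freq(a) and freq(f_i, a) from (feature_set, label) cases."""
    class_count = Counter()
    feature_count = defaultdict(Counter)
    for features, label in examples:
        class_count[label] += 1
        for f in features:
            feature_count[label][f] += 1
    return class_count, feature_count

def classify_naive_bayes(model, features, eps=0.01):
    class_count, feature_count = model
    total = sum(class_count.values())
    scores = {}
    for a, freq_a in class_count.items():
        score = freq_a / total  # estimate of p(a); p(b) is a constant and omitted
        for f in features:
            # Assumed smoothing: keeps the estimate of p(f|a) strictly positive.
            score *= (feature_count[a][f] + eps) / (freq_a + 2 * eps)
        scores[a] = score
    return max(scores, key=scores.get), scores

model = train_naive_bayes([
    ({"interaction", "with"}, "positive"),
    ({"interaction"}, "positive"),
    ({"purified"}, "negative"),
])
label, scores = classify_naive_bayes(model, {"purified", "unseen"})
```

Without smoothing, the unseen feature would drive every class score to zero, which is the situation the text describes.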

決定リスト法は、素性と分類先の組とを規則とし、それらをあらかじめ定めた優先順序でリストに蓄えおき、検出する対象となる入力が与えられたときに、リストで優先順位の高いところから入力のデータと規則の素性とを比較し、素性が一致した規則の分類先をその入力の分類先とする方法である。 The decision list method uses features and combinations of classification destinations as rules, stores them in the list in a predetermined priority order, and when input to be detected is given, from the highest priority in the list This is a method in which input data is compared with the feature of the rule, and the classification destination of the rule having the same feature is set as the classification destination of the input.

決定リスト方法では、あらかじめ設定しておいた素性ｆ_ｊ(∈Ｆ，１≦ｊ≦ｋ）のうち、いずれか一つの素性のみを文脈として各分類の確率値を求める。ある文脈ｂで分類ａを出力する確率は以下の式によって与えられる。 In the decision list method, the probability value of each classification is obtained using only one of the preset features f_j (∈ F, 1 ≤ j ≤ k) as the context. The probability of outputting classification a in a context b is given by the following equation.

ｐ（ａ｜ｂ）＝ｐ（ａ｜ｆmax）（４） p (a | b) = p (a | fmax) (4)

ただし、ｆmax は以下の式によって与えられる。 However, fmax is given by the following equation.

また、Ｐ（ａ_ｉ｜ｆ_ｊ）（ここでＰはｐの上部にチルダ）は、素性ｆ_ｊを文脈に持つ場合の分類ａ_ｉの出現の割合である。 Further, P (a _i | f _j ) (where P is a tilde at the top of p) is a rate of appearance of the classification a _i when the feature f _j is included in the context.
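A decision list in the sense described above can be sketched as follows. The equation defining f_max is not reproduced in this text; the sketch assumes that f_max is the feature present in the context whose most frequent class has the highest estimated proportion P(a_i | f_j). Function names, the default class, and the training cases are illustrative.

```python
from collections import Counter, defaultdict

def build_decision_list(examples):
    """Build rules (probability, feature, class) sorted by descending probability."""
    counts = defaultdict(Counter)  # feature -> class -> frequency
    for features, label in examples:
        for f in features:
            counts[f][label] += 1
    rules = []
    for f, by_class in counts.items():
        label, hits = by_class.most_common(1)[0]
        # Proportion of class `label` among cases that contain feature f.
        rules.append((hits / sum(by_class.values()), f, label))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules

def classify_decision_list(rules, features, default="negative"):
    """Apply the first (highest-priority) rule whose feature is in the input."""
    for prob, f, label in rules:
        if f in features:
            return label, prob
    return default, 0.0

rules = build_decision_list([
    ({"interaction", "with"}, "positive"),
    ({"interaction", "the"}, "positive"),
    ({"the", "purified"}, "negative"),
    ({"purified"}, "negative"),
])
decision = classify_decision_list(rules, {"the", "purified"})
```

Only one feature decides the outcome: here "purified" (proportion 1.0) outranks "the" (proportion 0.5).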

最大エントロピー法は、あらかじめ設定しておいた素性ｆj （１≦ｊ≦ｋ）の集合をＦとするとき、以下の式（６）を満足しながらエントロピーを意味する式（７）を最大にするときの確率分布ｐ（ａ，ｂ）を求め、その確率分布にしたがって求まる各分類の確率のうち、最も大きい確率値を持つ分類を求める分類先とする方法である。 In the maximum entropy method, assuming that a set of preset features fj (1≤j≤k) is F, expression (7) representing entropy is maximized while satisfying expression (6) below. This is a method of obtaining a probability distribution p (a, b) and determining a classification having the largest probability value among the classification probabilities obtained according to the probability distribution.

ただし、Ａ，Ｂは分類と文脈の集合を意味し、ｇ_ｊ（ａ，ｂ）は文脈ｂに素性ｆ_ｊがあって、なおかつ分類がａの場合１となり、それ以外で０となる関数を意味する。また、Ｐ（ａ，ｂ）（ここでＰはｐの上部にチルダ）は、既知データでの（ａ，ｂ）の出現の割合を意味する。 Here, A and B denote the sets of classifications and contexts, and g_j(a, b) is a function that takes the value 1 when the context b contains the feature f_j and the classification is a, and 0 otherwise. P(a, b) (where P denotes p with a tilde above it) is the rate of occurrence of (a, b) in the known data.

式（６）は、確率ｐと出力と素性の組の出現を意味する関数ｇをかけることで出力と素性の組の頻度の期待値を求めることになっており、右辺の既知データにおける期待値と、左辺の求める確率分布に基づいて計算される期待値が等しいことを制約として、エントロピー最大化(確率分布の平滑化)を行なって、出力と文脈の確率分布を求めるものとなっている。最大エントロピー法の詳細については、以下の参考文献１および参考文献２を参照されたい。 Equation (6) multiplies the probability p by the function g, which indicates the occurrence of a pair of output and feature, to obtain the expected frequency of that pair. Under the constraint that the expected value in the known data on the right side equals the expected value computed from the sought probability distribution on the left side, entropy is maximized (the probability distribution is smoothed) to obtain the probability distribution of outputs and contexts. For details of the maximum entropy method, see Reference 1 and Reference 2 below.
(Reference 1: Eric Sven Ristad, Maximum Entropy Modeling for Natural Language, (ACL/EACL Tutorial Program, Madrid, 1997);
Reference 2: Eric Sven Ristad, Maximum Entropy Modeling Toolkit, Release 1.6beta, (http://www.mnemonic.com/software/memt,1998))
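The quantities that enter the maximum entropy formulation can be illustrated with the sketch below. Equations (6) and (7) are not reproduced in this text, so the sketch assumes their usual form: an expectation of the indicator g under a distribution over (a, b) pairs, and the entropy of that distribution. It only evaluates these quantities for the empirical distribution and does not perform the constrained optimization itself; treating g as indexed by a feature and a target class, and all names and data, are assumptions.

```python
import math
from collections import Counter

def g(feature, target, a, b):
    """Indicator: 1 if context b contains the feature and class a is the target."""
    return 1 if a == target and feature in b else 0

def empirical_distribution(data):
    """Rate of occurrence of each (class, context) pair in the known data."""
    counts = Counter(data)
    return {pair: c / len(data) for pair, c in counts.items()}

def expectation(p, feature, target):
    """Expected value of g under the distribution p over (a, b) pairs."""
    return sum(prob * g(feature, target, a, b) for (a, b), prob in p.items())

def entropy(p):
    return -sum(prob * math.log(prob) for prob in p.values() if prob > 0)

data = [
    ("positive", frozenset({"interaction"})),
    ("positive", frozenset({"interaction", "with"})),
    ("negative", frozenset({"purified"})),
    ("negative", frozenset({"interaction"})),
]
p_tilde = empirical_distribution(data)
target_expectation = expectation(p_tilde, "interaction", "positive")
```

A maximum entropy trainer would search for the distribution p whose `expectation` matches `target_expectation` for every feature while making `entropy(p)` as large as possible.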

サポートベクトルマシン法は、空間を超平面で分割することにより、二つの分類からなるデータを分類する手法である。 The support vector machine method is a method of classifying data composed of two classifications by dividing a space by a hyperplane.

図４にサポートベクトルマシン法のマージン最大化の概念を示す。図４において、白丸は正例、黒丸は負例を意味し、実線は空間を分割する超平面を意味し、破線はマージン領域の境界を表す面を意味する。図４（Ａ）は、正例と負例の間隔が狭い場合（スモールマージン）の概念図、図４（Ｂ）は、正例と負例の間隔が広い場合（ラージマージン）の概念図である。 FIG. 4 shows the concept of margin maximization in the support vector machine method. In FIG. 4, a white circle means a positive example, a black circle means a negative example, a solid line means a hyperplane that divides the space, and a broken line means a surface that represents the boundary of the margin area. 4A is a conceptual diagram when the interval between the positive example and the negative example is narrow (small margin), and FIG. 4B is a conceptual diagram when the interval between the positive example and the negative example is wide (large margin). is there.

このとき、二つの分類が正例と負例からなるものとすると、学習データにおける正例と負例の間隔（マージン) が大きいものほどオープンデータで誤った分類をする可能性が低いと考えられ、図４（Ｂ）に示すように、このマージンを最大にする超平面を求めそれを用いて分類を行なう。 At this time, if the two classifications consist of positive and negative examples, the larger the interval (margin) between the positive and negative examples in the learning data, the less likely it is to make an incorrect classification with open data. As shown in FIG. 4B, a hyperplane that maximizes this margin is obtained, and classification is performed using it.

基本的には上記のとおりであるが、通常、学習データにおいてマージンの内部領域に少数の事例が含まれてもよいとする手法の拡張や、超平面の線形の部分を非線型にする拡張（カーネル関数の導入) がなされたものが用いられる。 Basically, it is as described above. Usually, an extension of the method that the training data may contain a small number of cases in the inner area of the margin, or an extension that makes the linear part of the hyperplane nonlinear ( The one with the introduction of the kernel function is used.

この拡張された方法は、以下の識別関数を用いて分類することと等価であり、その識別関数の出力値が正か負かによって二つの分類を判別することができる。 This extended method is equivalent to classification using the following discriminant function, and the two classes can be discriminated depending on whether the output value of the discriminant function is positive or negative.

ただし、ｘは識別したい事例の文脈（素性の集合) を、ｘ_ｉとｙ_ｊ（ｉ＝１，…，ｌ，ｙ_ｊ∈｛１，−１｝）は学習データの文脈と分類先を意味し、関数ｓｇｎは、
ｓｇｎ（ｘ）＝１（ｘ≧０），−１（otherwise）　（９）
であり、また、各α_ｉは式（１０）と式（１１）の制約のもと式（９）を最大にする場合のものである。
Here, x denotes the context (set of features) of the case to be classified, x_i and y_j (i = 1, ..., l; y_j ∈ {1, −1}) denote the contexts and classifications of the training data, and the function sgn is
sgn(x) = 1 (x ≥ 0), −1 (otherwise)   (9)
In addition, each α_i is the value that maximizes equation (9) under the constraints of equations (10) and (11).

The function K is called a kernel function. Various kernel functions are in use; this embodiment uses the following polynomial kernel.

$$K(x, y) = (x \cdot y + 1)^{d} \qquad (12)$$

C and d are constants set experimentally. In the specific example described later, C was fixed to 1 throughout all processing, and two values of d, 1 and 2, were tried. An x_i for which α_i > 0 is called a support vector, and the summation in Equation (8) is normally computed over these cases only. In other words, only the cases of the learning data called support vectors are used in the actual analysis.
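As a minimal sketch of the polynomial kernel and of a sum that runs over support vectors only, the following Python fragment may help. The function names, the bias argument `b`, and the tuple layout `(alpha_i, y_i, x_i)` are illustrative assumptions and are not taken from the patent.

```python
def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y + 1)^d on equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1) ** d


def decision_value(x, support_vectors, b=0.0, d=2):
    """Kernel sum over support vectors only.

    Each entry of support_vectors is (alpha_i, y_i, x_i) with alpha_i > 0.
    The bias b is an assumed parameter of this sketch.
    """
    return sum(alpha * y * poly_kernel(x_i, x, d)
               for alpha, y, x_i in support_vectors) + b


def sgn(value):
    """Sign function as defined in the text: 1 for value >= 0, otherwise -1."""
    return 1 if value >= 0 else -1
```

The sign of `decision_value` gives the class, and its magnitude can serve as a rough indication of how far a case lies from the separating surface.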

For details of the extended support vector machine method, see Reference 3 and Reference 4 below.
(Reference 4: Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other kernel-based learning methods, (Cambridge University Press, 2000);
Reference 5: Taku Kudoh, Tinysvm: Support Vector machines, (http://cl.aist-nara.ac.jp/taku-ku//software/Tiny SVM/index.html, 2000))

The support vector machine method handles data with two classes. To handle cases with three or more classes, it is therefore usually combined with a technique such as the pairwise method or the one-vs-rest method.

In the pairwise method, for data with n classes, every pair of two different classes (n(n−1)/2 pairs) is generated. For each pair, a binary classifier, that is, a support vector machine processing module, determines which of the two classes is better, and the final class is decided by majority vote over the results of the n(n−1)/2 binary classifications.

In the one-vs-rest method, when there are three classes a, b, and c, for example, three sets are generated: class a versus the rest, class b versus the rest, and class c versus the rest. Each set is learned by the support vector machine method. In the estimation processing, the learning results of the three support vector machines are used: each binary relation candidate to be estimated is evaluated by all three machines, and the solution is the class that is not "the rest" in the machine where the candidate lies farthest from the separating plane. For example, if a candidate lies farthest from the separating plane in the support vector machine created by learning the set "class a versus the rest", the class of the candidate is estimated to be a.
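The two multi-class schemes described above can be sketched as follows. This is only an illustration: `binary_classifiers` and `margins` are hypothetical containers of already trained two-class models, and using the largest signed distance as "farthest from the separating plane on the non-rest side" is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations


def pairwise_predict(x, classes, binary_classifiers):
    """Pairwise method: majority vote over n(n-1)/2 two-class decisions.

    binary_classifiers[(a, b)] is a callable returning either a or b for x.
    """
    votes = Counter(binary_classifiers[pair](x)
                    for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def one_vs_rest_predict(x, margins):
    """One-vs-rest method: pick the class whose 'class vs rest' machine
    places x farthest on the class side of its separating plane.

    margins[c] is a callable returning a signed distance (positive = class c).
    """
    return max(margins, key=lambda c: margins[c](x))
```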

Thereafter, the candidate extraction unit 15 extracts binary relation candidates from the input new text data 2. Specifically, it divides the text data 2 into sentences and extracts the expressions (character strings) in each sentence that can be elements of a binary relation. It then checks whether a sentence contains two or more such expressions and generates every combination of two elements (pair) in the sentence as a binary relation candidate.

Alternatively, the new text data 2 may be divided into paragraphs, the expressions that can be elements of a binary relation may be extracted from each paragraph, and, for each paragraph containing two or more elements, every combination of two elements (pair) may be generated as a binary relation candidate. Or the expressions may be extracted from one whole document of the text data 2, and every combination of two of them (pair) may be generated as a binary relation candidate.

The expressions that can be elements of a binary relation are extracted from the text data 2 by the techniques described above for generating the teacher data, for example, by extracting expressions that match a pattern or a dictionary entry, or by extracting expressions estimated from the learning result of supervised machine learning.

When two or more elements appear in one sentence of the text data 2, the pair of elements is taken as a binary relation candidate. When three or more elements appear in one sentence, every pair that can be formed from the elements is taken as a binary relation candidate.
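A minimal sketch of sentence-level candidate generation is shown below. It assumes the dictionary-matching variant of element extraction; the sentence splitter, the function name, and the `element_dictionary` argument are illustrative assumptions.

```python
import re
from itertools import combinations


def extract_candidates(text, element_dictionary):
    """Return (sentence, element_a, element_b) for every pair of dictionary
    elements that co-occur in one sentence."""
    candidates = []
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        found = [(m.start(), term)
                 for term in element_dictionary
                 for m in re.finditer(re.escape(term), sentence)]
        found.sort()  # order elements by position in the sentence
        if len(found) >= 2:
            for (_, a), (_, b) in combinations(found, 2):
                candidates.append((sentence, a, b))
    return candidates
```

A sentence with three elements yields three candidates, one for each pair, which matches the rule stated above.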

Then, the feature extraction unit 16 extracts features from each binary relation candidate, using the same processing and the same kinds of features as the solution-feature pair extraction unit 12.

Based on the learning result stored in the learning result storage unit 14, the solution estimation unit 17 estimates, for each binary relation candidate, how likely the candidate's set of features is to yield the positive solution (positive). Based on the estimation result of the solution estimation unit 17, the binary relation extraction unit 18 outputs, as binary relation 2, those candidates whose estimated degree of likelihood of being a positive solution is high.

In this example, the above features were extracted and the support vector machine method was used as the machine learning processing. When the accuracy was measured by 10-fold cross-validation, an F value of 47.5% was obtained. The F value is the harmonic mean of recall and precision. Recall is the proportion of the binary relations that should be extracted from the text data 2 that were actually output. Precision is the proportion of the binary relations extracted by the binary relation extraction apparatus 1 that were binary relations that should have been extracted.
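The evaluation measures defined above can be computed as in the following sketch. The function name and the representation of relations as hashable items are assumptions made for illustration; the numbers in the patent are not reproduced here.

```python
def f_measure(extracted, gold):
    """Return (precision, recall, F value) for a set of extracted relations
    against the set of relations that should have been extracted."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)
```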

In the binary relation extraction apparatus 1, the machine learning unit 13 uses the given teacher data and a predetermined machine learning algorithm to learn, from the pairs of a solution and a set of features of each binary relation, which solution tends to result from which set of features. It saves information indicating this in the learning result storage unit 14 as learning result information. Based on this learning result information, the solution estimation unit 17 estimates the degree to which the set of features of a binary relation candidate is likely to yield that solution.

When the k-nearest neighbor method is used as the machine learning technique in the binary relation extraction apparatus 1, the machine learning unit 13 defines the similarity between cases of the teacher data based on the proportion of features that overlap between the sets of features extracted from the cases (the proportion of identical features they share), and stores the defined similarity and the cases in the learning result storage unit 14 as learning result information.

When new text data 2 is input, the solution estimation unit 17 refers to the similarity definition and the cases stored in the learning result storage unit 14. For each binary relation candidate extracted from the text data 2, it selects from the stored cases the k cases most similar to the candidate and estimates the class decided by majority vote among those k cases as the class (solution) of the candidate. In other words, the solution estimation unit 17 takes as the degree of likelihood that the candidate's set of features yields a given solution the number of votes in the majority vote among the selected k cases, here the number of votes obtained by the class "should be extracted".

When the simple Bayes method is used as the machine learning technique, the machine learning unit 13 stores, for the cases of the teacher data, the pairs of each case's solution and set of features in the learning result storage unit 14 as learning result information. When new text data 2 is input, the solution estimation unit 17 uses these pairs to calculate, by Bayes' theorem, the probability of each class given the set of features of the binary relation candidate obtained by the feature extraction unit 16, and estimates the class with the highest probability as the class (solution) of the candidate. In other words, the solution estimation unit 17 takes as the degree of likelihood that the candidate's set of features yields a given solution the probability of each class, here the probability of the class "should be extracted".
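The k-nearest-neighbor estimation described above can be sketched as follows. The text only says that similarity is based on the proportion of shared features; the Jaccard ratio used here, the function names, and the label string "positive" are assumptions of this sketch.

```python
def overlap_similarity(features_a, features_b):
    """Proportion of shared features (Jaccard ratio, an assumed definition)."""
    union = features_a | features_b
    return len(features_a & features_b) / len(union) if union else 0.0


def knn_degree(candidate_features, training_cases, k=3, positive="positive"):
    """Degree of the candidate = number of votes for the positive class
    among the k most similar stored cases.

    training_cases is a list of (feature_set, label) pairs.
    """
    ranked = sorted(
        training_cases,
        key=lambda case: overlap_similarity(candidate_features, case[0]),
        reverse=True,
    )
    return sum(1 for _, label in ranked[:k] if label == positive)
```

A candidate would then be output when its vote count is high, for example when it exceeds k/2; that cut-off is likewise an assumption.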

When the decision list method is used as the machine learning technique, the machine learning unit 13 stores in the learning result storage unit 14 a list in which rules pairing a feature with a class, derived from the cases of the teacher data, are arranged in a predetermined priority order. When new text data 2 is input, the solution estimation unit 17 compares the features of each binary relation candidate extracted from the text data 2 with the features of the rules, in descending order of priority in the list, and estimates the class of the first rule whose feature matches as the class (solution) of the candidate. In other words, the solution estimation unit 17 takes as the degree of likelihood that the candidate's set of features yields a given solution the predetermined priority or an equivalent numerical value or scale, here the priority in the list of the probability of the class "should be extracted".
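A decision list lookup of the kind described above might look like the following sketch. The rule representation, the default class, and the returned rank are illustrative assumptions.

```python
def decision_list_classify(candidate_features, rules, default="negative"):
    """Apply rules in priority order and return (class, rank of matching rule).

    rules is a list of (feature, label) pairs, highest priority first.
    If no rule matches, the assumed default class is returned with rank None.
    """
    for rank, (feature, label) in enumerate(rules, start=1):
        if feature in candidate_features:
            return label, rank
    return default, None
```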

When the maximum entropy method is used as the machine learning technique, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data, obtains the probability distribution over pairs of a set of features and a possible class that satisfies predetermined constraint equations and maximizes the expression for entropy, and stores it in the learning result storage unit 14. When new text data 2 is input, the solution estimation unit 17 uses the stored probability distribution to obtain, for the set of features of each binary relation candidate extracted from the text data 2, the probability of each possible class, identifies the class with the highest probability, and estimates that class as the solution of the candidate. In other words, the solution estimation unit 17 takes as the degree of likelihood that the candidate's set of features yields a given solution the probability of each class, here the probability of the class "should be extracted".

When the support vector machine method is used as the machine learning technique, the machine learning unit 13 identifies the classes that can be solutions from the cases of the teacher data and divides them into positive and negative examples. Then, in a space whose dimensions are the features of the cases, and according to a predetermined function using a kernel function, it obtains the hyperplane that separates the positive examples from the negative examples while maximizing the gap between them, and stores it in the learning result storage unit 14. When new text data 2 is input, the solution estimation unit 17 uses the stored hyperplane to determine whether the set of features of each binary relation candidate extracted from the text data 2 lies on the positive side or the negative side of the space divided by the hyperplane, and estimates the class determined by that result as the solution of the candidate. In other words, the solution estimation unit 17 takes as the degree of likelihood that the candidate's set of features yields a given solution the distance from the separating plane into the space of the positive examples (binary relations that should be extracted). More specifically, when binary relations that should be extracted are the positive examples and binary relations that should not be extracted are the negative examples, a case located in the space on the positive side of the separating plane is judged to be a "case that should be extracted", and the distance of the case from the separating plane is taken as its degree.

The solution-feature pair extraction unit 12 may also use, for example, "the words of the two elements themselves" as features. It may also use "the first and second words/character strings from the front of an element and the first and second words/character strings from its end" as features. In the case of FIG. 3(A), the features are:
"the first element is "presenilin (PS) 1";
the second element is "delta-catenin";
the first word of the first element is "presenilin";
its second word is "(PS)";
the second-to-last word of the first element is "(PS)";
its last word is "1";
the first word of the second element is "delta";
its second word is "-";
the second-to-last word of the second element is "-";
its last word is "catenin"."

Alternatively, the features are:
"the first character of the first element is "p";
its first two characters are "pr";
its first three characters are "pre";
its last character is "1";
its last two characters are "space, 1";
its last three characters are "), space, 1";
the first character of the second element is "d";
its first two characters are "de";
its first three characters are "del";
its last character is "n";
its last two characters are "in";
its last three characters are "nin"."
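The word-based and character-based element features listed above can be generated along the following lines. This is a sketch under stated assumptions: the feature-name strings, the `prefix` argument, and the whitespace tokenization are illustrative, and the patent's own tokenizer evidently splits "delta-catenin" into three tokens, which plain whitespace splitting does not.

```python
def element_features(element, prefix):
    """Build a set of string features for one element of a binary relation."""
    words = element.split()  # whitespace tokenization (an assumption)
    features = {f"{prefix}={element}"}
    # First and second word from the front.
    for i, word in enumerate(words[:2], start=1):
        features.add(f"{prefix}_word_first{i}={word}")
    # First and second word counted from the end.
    for i, word in enumerate(reversed(words[-2:]), start=1):
        features.add(f"{prefix}_word_last{i}={word}")
    # Leading and trailing character strings of length 1 to 3.
    for n in (1, 2, 3):
        features.add(f"{prefix}_char_first{n}={element[:n]}")
        features.add(f"{prefix}_char_last{n}={element[-n:]}")
    return features
```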

When the two words before and after each element, together with their part-of-speech information, are used as features, the features are:
"the word two before the first element is "interaction";
the part of speech of that word is "noun";
the word immediately before the first element is "with";
the part of speech of that word is "preposition";
the word immediately after the first element is "and";
the part of speech of that word is "conjunction";
the word two after the first element is "cloned";
the part of speech of that word is "verb";
the word two before the second element is "of";
the part of speech of that word is "preposition";
the word immediately before the second element is "human";
the part of speech of that word is "noun";
the word immediately after the second element is "which";
the part of speech of that word is "pronoun";
the word two after the second element is "encoded";
the part of speech of that word is "verb"."

When the number of words between the two elements is used as a feature representing the distance between them, the information "the distance between the two elements is "9"" becomes a feature.

When the number of words between the two elements is grouped into states, with 0 to 1 as "small distance", 2 to 4 as "medium distance", 5 to 9 as "large distance", and 10 or more as "extra-large distance", and each state is used as a feature, the information "the distance between the two elements is "large distance"" becomes a feature.

When whether any other element lies between the two elements is used as a feature, the information "there is no other element between the two elements" becomes a feature.
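The grouping of word distances into four states can be written directly from the ranges given above. The function name and the returned labels are illustrative assumptions.

```python
def distance_bucket(words_between):
    """Map the number of words between two elements to a coarse distance state."""
    if words_between <= 1:
        return "small"
    if words_between <= 4:
        return "medium"
    if words_between <= 9:
        return "large"
    return "extra_large"
```

With nine words between the two elements, as in the example, the state is "large".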

Furthermore, when terms of different types are set as the elements of a binary relation, the order in which the elements appear may be used as a feature. For example, for a binary relation between a disease name and a treatment method, the information "the first element is a "disease name" and the second element is a "treatment method"" or "the first element is a "treatment method" and the second element is a "disease name"" becomes a feature.

By being given, as teacher data, cases of various binary relations other than the binary relation between interacting protein expressions, the binary relation extraction apparatus 1 can extract the corresponding binary relations from the text data 2 of biomedical papers. Such relations include the binary relation between a disease name and a treatment method, between a disease name and a protein expression, between a disease name and an organ, between a disease name and an animal species, between a disease name and a related chemical substance, and between a protein expression and the experimental methods applied to that protein so far.

For example, text data containing binary relations such as the following can be used as teacher data.
"Oral corticosteroids (element: treatment method) are the preference of many for the treatment of CIDP (element: disease name), being much less expensive than IVIG (element: treatment method) infusion or TA (element: treatment method)."
"In the CIDP (element: disease name) patient, the IgG antibody (element: protein expression) titer to GD3 (element: chemical substance expression) was remarkably elevated (titer, 1:10,000), indicating maximal avidity to the tetrasaccharide epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-)."
"Ciliated metaplasia (CM) in the stomach (element: organ name) is mainly found in gastric mucosa (element: organ name) that harbours gastric cancer (element: disease name)"
"Variant Creutzfeldt-Jakob disease (CJD) (element: disease name) is a transmissible spongiform encephalopathy believed to be caused by the bovine (element: animal species) spongiform encephalopathy agent, an abnormal isoform of the prion protein (PrP(sc)) (element: protein expression)."
"AIDP (element: disease name) and CIDP (element: disease name) having specific antibodies to the carbohydrate epitope (-NeuAcalpha2-8NeuAcalpha2-3Galbeta1-4Glc-) of gangliosides. (element: chemical substance expression)"
"Gene expression in archived frozen sural nerve biopsies of patients with chronic inflammatory demyelinating polyneuropathy (CIDP) (element: disease name) was compared to that in vasculitic nerve biopsies (VAS) and to normal nerve (NN) by DNA microarray technology (element: experimental method)."
"This novel interaction was identified in a yeast two-hybrid screen (element: experimental method) using PrP(C) (element: protein expression) as bait and confirmed by an in vitro binding assay and co-immunoprecipitations"
"Comparative study of the PrP(BSE) (element: protein expression) distribution in brains (element: organ name) from BSE (element: disease name) field cases using rapid tests (element: test method)."

As another example, pairs of a company's product name and the reputation of that product (for example, information such as whether its reputation is good or bad) can also be extracted as binary relations.

As described above, with the binary relation extraction processing using the binary relation extraction apparatus 1, it is sufficient to prepare, as teacher data for the machine learning processing, text data annotated with an evaluation (solution) of whether each binary relation should be extracted. Binary relations estimated to be worth extracting can then be extracted automatically from new text data. This avoids the burden of creating the patterns used for binary relation extraction. In addition, the performance of the binary relation extraction processing can be expected to improve as the accuracy of supervised machine learning improves.

The machine learning processing described above dealt with the case of two keywords, but binary relations can also be extracted when there are three (or more) keywords. In the machine learning processing described above, a "degree" was introduced and the items with a large product of the degrees were taken. Alternatively, when every binary relation among the keywords is judged to be a binary relation that should be extracted in at least one place in the text data, that text data may be taken as text data to be extracted.
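The alternative rule for three or more keywords, under which an article qualifies when every pair of keywords has been judged a relation to be extracted at least once in the text, can be sketched as follows. The function name and the way judged pairs are passed in are assumptions; the judging itself is left to whichever classifier is used.

```python
from itertools import combinations


def article_qualifies(keywords, positive_pairs):
    """True when every pair of keywords appears in positive_pairs.

    positive_pairs is a set of frozenset({a, b}) items, one for each keyword
    pair judged 'should be extracted' somewhere in the article.
    """
    return all(frozenset(pair) in positive_pairs
               for pair in combinations(keywords, 2))
```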

Besides the mode using machine learning processing described above, binary relations to be extracted can also be obtained by a pattern-based processing method. For example, assuming the words "A" and "B" as keywords, patterns such as
"ＡのＢ" (B of A), "ＡはＢ" (A is B), "Ａは、Ｂ" (A is B, with a comma), "Ａの関係するＢ" (B to which A relates), "Ａに依存するＢ" (B that depends on A)
are created in advance. If the text data (article) matches one of the patterns, the binary relation in that text data is judged to be a binary relation that should be extracted; if it matches none, the relation is judged not to be one that should be extracted. The relations judged to be binary relations that should be extracted are then extracted. This pattern-based extraction of binary relations can also be applied when there are three or more keywords. That is, when every binary relation among the keywords is judged to be a binary relation that should be extracted in at least one place in the text data, that text data is taken as text data to be extracted.
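The pattern-based judgment can be illustrated with simple string templates built from the example patterns. The template syntax with `{a}` and `{b}` placeholders and the function name are assumptions of this sketch; a practical system would likely need more flexible matching.

```python
# Templates corresponding to the example patterns in the text.
PATTERNS = [
    "{a}の{b}",
    "{a}は{b}",
    "{a}は、{b}",
    "{a}の関係する{b}",
    "{a}に依存する{b}",
]


def matches_pattern(text, a, b, patterns=PATTERNS):
    """True if any pattern, instantiated with keywords a and b, occurs in text."""
    return any(p.format(a=a, b=b) in text for p in patterns)
```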

Next, the information search apparatus 4 according to the first embodiment of the present invention will be described.

This information search apparatus 4 is a processing apparatus that regards the relationship between the two search keywords of an AND search as a meaningful binary relation. It performs machine learning using teacher data in which each binary relation whose elements are the search keywords carries a tag indicating one of two solutions, either that the relation should be extracted (positive) or that it should not be extracted (negative). From the search text data 5 to be searched, it outputs as the search result 6 those articles that contain the two search keywords and in which the pair of search keywords is determined to be a binary relation that should be extracted.

FIG. 6 shows a configuration example of the information search apparatus 4 according to the present invention. The information search apparatus 4 includes an information search unit 40, a learning unit 42, a candidate extraction unit 44, a binary relation selection unit 45, and a search result extraction unit 48. The teacher data storage unit 41 and the learning result storage unit 43 may be provided inside the information search apparatus 4 as internal elements, or may be external elements that the information search apparatus 4 can use. In the configuration shown in the figure, the learning unit 42 further includes a solution-feature pair extraction unit 421 and a machine learning unit 422, and the binary relation selection unit 45 further includes a feature extraction unit 451 and a solution determination unit 452.

In this information search apparatus 4, the teacher data storage unit 41, the solution-feature pair extraction unit 421, the machine learning unit 422, the learning result storage unit 43, the candidate extraction unit 44, the feature extraction unit 451, and the solution determination unit 452 are processing means that perform the same processing as, respectively, the teacher data storage unit 11, the solution-feature pair extraction unit 12, the machine learning unit 13, the learning result storage unit 14, the candidate extraction unit 15, the feature extraction unit 16, and the solution estimation unit 17 of the binary relation extraction apparatus 1 shown in FIG. 1. Note, however, that the solution determination unit 452 performs the same processing as the solution estimation unit 17 when it estimates, for each binary relation candidate, the degree to which the candidate's set of features is likely to yield each solution (class). When the solution estimation unit 17 is replaced with a "solution determination unit", as described in the section on the binary relation extraction apparatus 1, the solution determination unit 452 of this information search apparatus 4 performs the same processing as that "solution determination unit". The description below mainly concerns the case where the solution determination unit 452 performs the same processing as the solution estimation unit 17.

The information search unit 40 searches the search text data 5 by AND search using the given search keywords and acquires the matching articles (text data). Search modes other than AND search can also be adopted for the information search unit 40, as described later.

The candidate extraction unit 44 extracts binary relation candidates whose elements are pairs of character strings (words), contained in the articles acquired by the information search unit 40, that are identical to two of the search keywords.

Based on the determination (estimation) result of the solution determination unit 452, the search result extraction unit 46 extracts, from the binary relation candidates of the articles retrieved from the search text data 5, those whose estimated degree of likelihood of the positive solution (being a binary relation that should be extracted) is better than a predetermined level. It then outputs the articles containing the extracted candidates, or information identifying those articles, as the search result 6.

FIG. 7 shows the processing flow of the information search apparatus 4. The teacher data storage unit 41 of the information search apparatus 4 stores, as teacher data, text data containing cases in which a binary relation whose elements are two search keywords given for AND search is annotated with "solution" information: either positive (a binary relation that should be extracted) or negative (a binary relation that should not be extracted).

First, in the learning unit 42, the solution-feature pair extraction unit 421 extracts predetermined features for each case from the teacher data in the teacher data storage unit 41 and generates pairs of a solution (the information attached by the tag) and the set of extracted features (step S11). The solution-feature pair extraction unit 421 extracts the binary relations from the teacher text data by means of the predetermined tags and, for the elements (search keywords) of each extracted binary relation, performs morphological analysis, syntactic analysis, calculation of the positions at which the elements appear and of the distance between the elements, and so on, to extract the predetermined features.

Then, also in the learning unit 42, the machine learning unit 422 learns by a machine learning method, from the pairs of solution and feature set generated by the solution-feature pair extraction unit 421, which solution (positive or negative) is likely for which feature set, and stores the learning result in the learning result storage unit 43 (step S12). As the supervised machine learning method, the machine learning unit 422 uses, for example, any of the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.

Thereafter, the candidate extraction unit 44 generates all two-keyword combinations (pairs) from the input search keywords given for the AND search (step S13). The information search unit 40 performs an AND search of the search text data 5 with each pair of input search keywords and extracts the articles (text data) containing that pair. The candidate extraction unit 44 then extracts, as binary relation candidates, all two-keyword combinations (pairs) of the input search keywords that appear in the extracted articles (step S14).
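Steps S13 and S14 can be illustrated with a minimal Python sketch. It assumes that articles are held in memory as plain strings and that AND search is simple substring containment; the function names, the article ids, and the sample texts are hypothetical and are not taken from the specification.

```python
from itertools import combinations

def keyword_pairs(keywords):
    """Return every unordered pair of input search keywords (step S13)."""
    return list(combinations(keywords, 2))

def and_search(articles, keywords):
    """Return the ids of articles whose text contains every keyword."""
    return [article_id for article_id, text in articles.items()
            if all(keyword in text for keyword in keywords)]

# Hypothetical search text data: article id -> article text.
articles = {
    "a1": "京大の総長が式典に出席した。",
    "a2": "京大で講演会が開かれた。",
    "a3": "総長は京大の入学式で挨拶した。",
}
pairs = keyword_pairs(["京大", "総長"])
hits = and_search(articles, pairs[0])
```

With more than two input keywords, `keyword_pairs` yields one candidate pair per two-keyword combination, and `and_search` would be called once per pair.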

Then, in the binary relation selection unit 45, the feature extraction unit 451 extracts the predetermined set of features for each binary relation candidate appearing in the retrieved articles, by substantially the same processing as that of the solution-feature pair extraction unit 421 (step S15).

Further, in the binary relation selection unit 45, the solution determination unit 452 estimates for each candidate, on the basis of the learning result in the learning result storage unit 43, which solution is likely given the candidate's feature set, that is, the degree to which the candidate is "likely to be positive" or "likely to be negative" (step S16). The search result extraction unit 46 then selects, as binary relations that should be extracted, the candidates estimated to be "likely to be positive" to a degree better than a predetermined level, and selects the articles containing these binary relations, or information identifying those articles, as the search result 6 (step S17).

Next, a specific example of the information search processing performed by the information search apparatus 4 is described. In this example, with respect to the search text data 5, the information search apparatus 4 uses as teacher data text data containing binary relations whose elements are character strings that can serve as the two search keywords of an AND search. Binary relation candidates whose elements are the input search keywords given for the AND search are created, and the search text data 5 is searched with these candidates to extract articles. For the binary relation candidates of input search keywords contained in the retrieved articles, the apparatus estimates whether each should be extracted, and outputs as the search result 6 the articles containing candidates with a good estimated degree of being a relation that should be extracted.

Assume that "京大" (Kyoto University) and "総長" (president) are set as the search keywords for the AND search. Whether a binary relation of the search keywords is positive or negative is judged by a person, and a tag indicating the positive or negative solution is attached manually. The machine learning processing therefore uses teacher data containing both positive and negative cases.

FIGS. 8 to 10 show examples of teacher data stored in the teacher data storage unit 41 and examples of the features extracted from that teacher data by the solution-feature pair extraction unit 421. In this example, the teacher data of FIGS. 8 and 9 carry a tag indicating that the solution is positive for a binary relation that should be extracted. The teacher data of FIG. 10 carry a tag indicating that the solution is negative for a binary relation that should not be extracted.

The teacher data of FIG. 8 contain a binary relation pair P3, which is a pair of two search keywords. The binary relation (pair) P3 consists of a first element p1 (search key K1) "京大" and a second element p2 (search key K2) "総長", and is given the positive solution (positive).

Similarly, the teacher data of FIG. 9 contain a binary relation pair P4, which is a pair of two search keywords. The binary relation (pair) P4 consists of a first element p1 (search key K1) "京大" and a second element p2 (search key K2) "総長", and is given the positive solution (positive). This is because the teacher data of FIGS. 8 and 9 can be judged to be about "the president of Kyoto University".

The teacher data of FIG. 10 contain a binary relation pair P5, which is a pair of two search keywords. The binary relation (pair) P5 consists of a first element p1 (search key K1) "京大" and a second element p2 (search key K2) "総長", and is given the negative solution (negative). Although "京大" and "総長" both appear in the same data, they are not related to each other, and the data can be judged not to be about "the president of Kyoto University".

The solution-feature pair extraction unit 421 extracts pairs of a solution and a feature set from the cases of teacher data stored in the teacher data storage unit 41. For example, the features are the two words before and the two words after each element (search keyword): the words themselves and their parts of speech. In this case, the features are:
the word two before the first element is "今日" (today);
the part of speech of the word two before is noun;
the word one before is "，";
the part of speech of the word one before is comma;
the word one after is "で";
the part of speech of the word one after is particle;
the word two after is "の";
the part of speech of the word two after is particle;
the word two before the second element is "で";
the part of speech of the word two before is particle;
the word one before is "，";
the part of speech of the word one before is comma;
the word one after is "が";
the part of speech of the word one after is particle;
the word two after is "出席" (attendance);
the part of speech of the word two after is noun.
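The window features listed above can be sketched in Python as follows. This is only an illustration: it assumes the sentence has already been split into (surface form, part of speech) tuples by some morphological analyzer, and the token list, the `e1`/`e2` prefixes, and the feature-name format are hypothetical choices rather than anything given in the specification.

```python
def window_features(tokens, index, prefix, width=2):
    """Collect the surface form and part of speech of the words within
    `width` positions before and after the element at tokens[index]."""
    features = {}
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # skip the element itself
        position = index + offset
        if 0 <= position < len(tokens):
            word, part_of_speech = tokens[position]
            features[f"{prefix}:{offset:+d}:word"] = word
            features[f"{prefix}:{offset:+d}:pos"] = part_of_speech
    return features

# Hypothetical pre-tokenized sentence: (surface form, part of speech).
tokens = [("今日", "名詞"), ("，", "読点"), ("京大", "名詞"), ("で", "助詞"),
          ("，", "読点"), ("総長", "名詞"), ("が", "助詞"), ("出席", "名詞")]

features = {}
features.update(window_features(tokens, 2, "e1"))  # first element "京大"
features.update(window_features(tokens, 5, "e2"))  # second element "総長"
```

Keying each feature by element, offset, and kind keeps "one word after the first element" distinct from "two words before the second element" even when both refer to the same token.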

The solution-feature pair extraction unit 421 can also extract, as features, the kinds of information described for the binary relation extraction processing.

Based on these solutions and feature sets, the machine learning unit 422 machine-learns which solution (positive or negative) is likely for which feature set, and stores the learning result in the learning result storage unit 43. As the supervised machine learning method, the machine learning unit 422 uses one of the processing methods mentioned above, for example the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.
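As one illustration of learning "which solution is likely for which feature set", the following Python sketch trains a small naive Bayes classifier (one of the methods named above) and returns the estimated degree of the positive solution. The add-one smoothing, the feature strings, and the three training cases are assumptions made for this example; the specification does not prescribe them.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (solution label, set of feature strings)."""
    label_counts = Counter(label for label, _ in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for label, features in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary

def positive_degree(model, features, positive="positive"):
    """Estimate P(positive | features) with add-one smoothing."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, count in label_counts.items():
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        score = math.log(count / total)
        for feature in features:
            if feature in vocabulary:  # ignore features never seen in training
                score += math.log((feature_counts[label][feature] + 1) / denominator)
        log_scores[label] = score
    top = max(log_scores.values())
    normalizer = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores[positive] - top) / normalizer

# Hypothetical teacher data: solution and feature set for each case.
training = [
    ("positive", {"e1:+1:の", "e2:+1:が"}),
    ("positive", {"e1:+1:の", "e2:+1:は"}),
    ("negative", {"e1:+1:で", "e2:+1:が"}),
]
model = train_naive_bayes(training)
degree = positive_degree(model, {"e1:+1:の", "e2:+1:が"})
```

The returned value plays the role of the "degree" that the solution determination unit estimates; any of the other listed learners that yields a score or probability could be substituted.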

The information search unit 40 then performs an AND search of the search text data 5 with the given input search keywords "京大" and "総長" and acquires the articles containing them. The candidate extraction unit 44 extracts binary relation candidates from the acquired articles; specifically, it extracts them from the input search keywords contained in the articles returned by the AND search. The feature extraction unit 451 extracts from each candidate the same features as the solution-feature pair extraction unit 421, and the solution determination unit 452 estimates, on the basis of the learning result stored in the learning result storage unit 43, the degree to which each candidate is likely to be positive or negative given its feature set. Based on that estimate, the search result extraction unit 46 extracts the candidates with a good estimated degree of being positive, and outputs the articles containing them, or information identifying those articles, as the search result 6.

For example, the candidate extraction unit 44 generates all combinations (pairs) of two input search keywords from the given input search keywords and treats the generated pairs as binary relation candidates. The information search unit 40 performs AND search processing using the elements (the two input search keywords) of each candidate. The feature extraction unit 451 then extracts the predetermined set of features for the binary relation candidates appearing in the extracted articles.

Based on the learning result in the learning result storage unit 43, the solution determination unit 452 estimates, for each binary relation candidate, the degree to which each solution is likely given the candidate's feature set. When each binary relation candidate, that is, each pair of input search keywords, appears only once in a retrieved article, the article, or information identifying it, is taken as the search result 6 if all of those candidates are estimated to have a good degree of being positive (should be extracted).

When a binary relation that is a pair of input search keywords appears several times in a retrieved article, the condition for that binary relation is that at least one of its occurring candidates is estimated to have a good degree of being positive (should be extracted). The article, or information identifying it, is taken as the search result 6 if every binary relation satisfies this condition.
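This acceptance rule can be written compactly. The sketch below assumes that the estimated positive degrees are numbers in [0, 1] and that "a good degree" means exceeding a threshold; the threshold value 0.5, the function name, and the sample degrees are hypothetical.

```python
def article_matches(pair_degrees, threshold=0.5):
    """pair_degrees maps each keyword pair to the positive degrees estimated
    for every occurrence of that pair in one article. The article is accepted
    only if every pair has at least one occurrence above the threshold."""
    return all(
        any(degree > threshold for degree in degrees)
        for degrees in pair_degrees.values()
    )

# One pair occurring twice: the second occurrence is judged positive.
accepted = article_matches({("京大", "総長"): [0.2, 0.8]})
# Two pairs: the second pair has no occurrence judged positive.
rejected = article_matches({
    ("京大", "総長"): [0.9],
    ("京大", "平成１２年"): [0.1, 0.3],
})
```

When every pair occurs exactly once, each list has one entry and the rule reduces to requiring that all candidates be judged positive, which is the single-occurrence case described earlier.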

Furthermore, the candidate extraction unit 44 generates all pairs of two input search keywords from the given input search keywords and treats the generated pairs as binary relation candidates. The information search unit 40 performs AND search processing using the elements (the two input search keywords) of each candidate. The feature extraction unit 451 then extracts the predetermined set of features for the binary relation candidates appearing in the extracted articles.

Based on the learning result in the learning result storage unit 43, the solution determination unit 452 estimates, for each binary relation candidate, the degree to which each solution is likely given the candidate's feature set. When each binary relation candidate, that is, each pair of input search keywords, appears only once in a retrieved article, the degree of being positive (should be extracted) is estimated for all of those candidates, and the product of the estimated positive degrees is taken as the positive degree of the article. Articles estimated to have a good positive degree, or information identifying them, are taken as the search result 6.

When a binary relation that is a pair of input search keywords appears several times in a retrieved article, the positive degree is estimated for each of its occurring candidates, and the best of those estimated degrees is taken as the degree of that binary relation. The degree of each binary relation is obtained in this way, and the product of the obtained degrees is taken as the positive degree of the article. Articles estimated to have a good positive degree, or information identifying them, are taken as the search result 6.
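The product-of-best-degrees scoring can be sketched as follows. The data layout (article id mapped to per-pair lists of degrees), the function names, and the numbers are assumptions for illustration; `math.prod` requires Python 3.8 or later.

```python
import math

def article_degree(pair_degrees):
    """Product, over keyword pairs, of the best positive degree among the
    occurrences of each pair in one article."""
    return math.prod(max(degrees) for degrees in pair_degrees.values())

def rank_articles(per_article):
    """Return article ids sorted by descending article degree."""
    return sorted(per_article,
                  key=lambda article_id: article_degree(per_article[article_id]),
                  reverse=True)

# Hypothetical estimated degrees for two retrieved articles.
per_article = {
    "a1": {("京大", "総長"): [0.9, 0.4], ("京大", "平成１２年"): [0.5]},
    "a2": {("京大", "総長"): [0.3], ("京大", "平成１２年"): [0.6, 0.7]},
}
ranking = rank_articles(per_article)
```

Because the degrees are multiplied, a single weakly related keyword pair pulls the whole article down, which is the intended effect of requiring every pair to be well related.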

As described above, with the information search apparatus 4 of this embodiment, it suffices to prepare, as teacher data for the machine learning processing, text data in which the binary relations of two AND-search keywords are annotated with an evaluation of whether they should be extracted. Articles containing binary relations judged worth extracting can then be extracted automatically from new search text data 5.

By using binary relation extraction processing to evaluate the relation between the search keywords appearing in the articles returned by an AND search, the information search apparatus 4 of this embodiment can exclude articles that were hit because they contain the search keywords but in which the keywords are only weakly related, and whose content is therefore irrelevant to the search intent. Moreover, improvements in the accuracy of supervised machine learning can be expected to improve the performance of the information search processing.

The embodiments above described binary relations, which consist of two elements, for both the binary relation extraction processing and the information search processing. The present invention can also be applied to ternary relations, which consist of three elements.

For example, in the binary relation extraction apparatus 1, data containing ternary relations of three elements is prepared as teacher data. The solution-feature pair extraction unit 12 takes as the features of a ternary relation, for example, the word information of the two words preceding the first element (the element that appears first), the two words following the third element (the element that appears last), all words between the first element and the second element (the element that appears in between), and all words between the second and third elements. The machine learning unit 13 can then learn from the feature sets of ternary relations which solution is likely, and the binary relation extraction unit 18 can handle the extraction of ternary relations. As with binary relations, the solution given to a ternary relation is either "a ternary relation that should be extracted" or "a ternary relation that should not be extracted".

As another example, in the binary relation extraction apparatus 1, data containing ternary relations of three elements is prepared as teacher data. Each processing means of the binary relation extraction apparatus 1 treats the binary relations obtained by decomposing a ternary relation of the teacher data, namely the relation between the first and second elements, between the second and third elements, and between the first and third elements, as separate binary relations. For each of these binary relations, the degree of the solution "a ternary relation that should be extracted" is calculated, and the product of the calculated degrees is taken as the degree of the ternary relation. Ternary relations with a large degree are taken out as ternary relations that should be extracted.

In this case, because the support vector machine method distinguishes only two classes (positive or negative), a machine learning unit 13 that uses the support vector machine method machine-learns the ternary relation using the pairwise method or the one-vs-rest method.

The binary relation extraction unit 18 is also arranged so that a confidence of extraction can be obtained when the binary relation 2 is extracted. The confidence of a ternary relation built by combining several binary relations is the product of the confidences of the combined binary relations, and ternary relations with a high confidence are taken out. The confidence of a binary relation is the confidence calculated in ordinary machine learning processing.
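The decomposition of a ternary relation into three binary relations, and the product of their confidences, can be sketched as below. The confidence table is a hypothetical stand-in for the output of a binary relation classifier, and using `frozenset` keys (so that element order within a pair does not matter) is a design choice of this example only.

```python
from itertools import combinations
import math

def ternary_confidence(triple, pair_confidence):
    """Multiply the confidences of the three binary relations obtained by
    decomposing a ternary relation into its element pairs."""
    return math.prod(pair_confidence[frozenset(pair)]
                     for pair in combinations(triple, 2))

# Hypothetical confidences for each pair of elements.
pair_confidence = {
    frozenset(("平成１２年", "京大")): 0.9,
    frozenset(("京大", "総長")): 0.8,
    frozenset(("平成１２年", "総長")): 0.5,
}
confidence = ternary_confidence(("平成１２年", "京大", "総長"), pair_confidence)
```

Candidate ternary relations would then be ranked by this value, and those with a high confidence taken out.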

Such ternary relation extraction processing can be performed in the same way in the information search apparatus 4. For example, to search for articles about "the president of Kyoto University in Heisei 12 (2000)", teacher data containing ternary relations of the three search keywords "平成１２年", "京大", and "総長" are given, and the search result 6 of an AND search with these three search keywords is output from the search text data 5.

In this example, the solution information attached to the binary or ternary relation of a case was described as "positive (a relation that should be extracted)" or "negative (a relation that should not be extracted)". The solution information may instead be multi-class, for example "has an interaction", "has a counteraction", and "has no action".

In addition to the methods described above, the following processing methods can be adopted for the machine learning processing. For example, the classification of the solution for a problem may be obtained using a neural network (perceptron), an HMM (hidden Markov model), a CRF (conditional random field), multiple regression analysis, decision tree learning, maximum likelihood estimation, or the like.

Besides the machine-learning-based mode described above, the extraction of binary relations by the learning unit 42 of the information search apparatus 4 may use the pattern-based processing method for taking out the binary relations to be extracted, as mentioned in the description of the binary relation extraction apparatus 1.

In particular, when there are three or more keywords, the machine learning processing introduces a "degree" and takes the text whose product of degrees is large. With the pattern-based processing method, text data can instead be taken as text data to be extracted when every binary relation is judged, in at least one place in the text data, to be a binary relation that should be extracted. The same criterion can be applied to the machine learning processing: when the machine learning method judges every binary relation to be a binary relation that should be extracted in at least one place in the text data, the text data may be taken as text data to be extracted.

The information search unit 40 was described above as performing AND search processing, but a mode that performs score-based search processing can be adopted instead. An example of score-based search processing is described below.

As an example of score-based search processing, the TF/IDF method is the one mainly used for automatic extraction of important keywords. TF is the number of times a word appears in a document, and IDF is the reciprocal of the number of documents, among a large collection of documents held in advance, in which the word appears. In general, the larger the product of TF and IDF for a word, the more suitable the word is as a key phrase. IDF requires multiple documents, however, and when a single book is being read there is only that one book, so multiple documents are not available. One possible approach is to bring in several books and treat each of them as a document when computing IDF. Another is to compute IDF from data other than the book and use it for the book's data. That is, document data, not limited to books, are collected, each item is treated as one document, and IDF is computed from them. The IDF obtained in this way can be used wherever IDF is used in the present invention. Since TF is the number of occurrences in the document, the number of occurrences in the whole electronic document may be used when applying it to the present invention.

Strictly speaking, IDF is not simply the reciprocal of the document count (DF); it is calculated by the following expression (12).

idf(w) = log(N / df(w))   (12)

In the basic TF/IDF method, the score is obtained by the following expression (13).

score(D) = Σ_{w ∈ W} tf(w, D) * log(N / df(w))   (13)

Here the sum is taken over w ∈ W, where W is the set of input keywords, tf(w, D) is the number of occurrences of w in document D, df(w) is the number of documents, among all documents, in which w appears, and N is the total number of documents. Documents with a high score(D) are output as search results.
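A minimal Python sketch of this scoring follows. It assumes the per-keyword term is tf(w, D) * log(N / df(w)), which is the tf·idf form the text gives later as expression (15), and that documents are already tokenized into word lists; the document ids, tokens, and function name are hypothetical.

```python
import math

def tfidf_score(doc_tokens, keywords, doc_freq, n_docs):
    """Sum tf(w, D) * log(N / df(w)) over the input keywords w."""
    score = 0.0
    for w in keywords:
        tf = doc_tokens.count(w)   # occurrences of w in this document
        df = doc_freq.get(w, 0)    # number of documents containing w
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

# Hypothetical tokenized collection.
docs = {
    "d1": ["京大", "総長", "京大"],
    "d2": ["京大", "講演"],
    "d3": ["総長", "訪問"],
    "d4": ["天気"],
}
keywords = ["京大", "総長"]
doc_freq = {w: sum(w in tokens for tokens in docs.values()) for w in keywords}
scores = {doc_id: tfidf_score(tokens, keywords, doc_freq, len(docs))
          for doc_id, tokens in docs.items()}
best = max(scores, key=scores.get)
```

Unlike AND search, a document containing only some of the keywords still receives a nonzero score, so the ranking rather than a hard filter decides what is returned.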

A score-based search method regarded as performing particularly well is Robertson's 2-Poisson model, one of the probabilistic methods (Robertson, S. E. and Walker, S. (1994). "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval." In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.). In the method of Robertson et al., the Score given by the following expression (14) is calculated for each article, and the articles with the highest Score are output as the search result (Score(d) below is the Score of article d).

Score(d) = Σ_{keyword t} [ TF(d, t) / (length(d) / Δ + TF(d, t)) ] * [ log(N / DF(t)) ]   (14)

The first bracketed factor is part A and the second is part B.

The keywords here are the input keywords. In expression (14), TF(d, t) is the number of occurrences of keyword t in article d, and DF(t) is the number of articles in the whole database in which keyword t appears. N is the number of articles in the whole database, and length(d) is the length of article d (in characters). Δ is the average article length over the whole database. Part A of expression (14) is called the TF term, since it concerns TF, and part B is called the IDF term, since it concerns IDF (the inverse of DF). The TF term of expression (14) differs from the TF used as part of the weight in the ordinary vector space method and takes the form of part A.

When Score is treated as points added per keyword, a strong TF influence means that a single keyword with a large TF value can earn an article a high score even when hardly any of the other keywords appear and the article is therefore of low relevance. The form of part A guards against this: even as TF grows without bound, the TF term never exceeds 1, so all keywords are weighed evenly. The seemingly complicated term to the left of the + in the denominator of part A corrects for the fact that longer articles tend to have larger TF values. Variants of formula (14) with several additional reinforcing terms are also conceivable.
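The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes formula (14) has the shape implied by formula (16), namely a sum over keywords of TF/(length/Δ + TF) times log(N/DF). The function name `okapi_score`, the toy collection, and the choice to count length in tokens are all assumptions made for the example.

```python
import math

def okapi_score(keywords, doc_tokens, doc_freq, n_docs, avg_len):
    """Score(d): sum over keywords of the TF term (part A) times the IDF term (part B)."""
    length = len(doc_tokens)  # assumption: article length counted in tokens
    score = 0.0
    for t in keywords:
        tf = doc_tokens.count(t)   # TF(d, t)
        df = doc_freq.get(t, 0)    # DF(t)
        if tf == 0 or df == 0:
            continue               # an absent keyword contributes nothing
        tf_term = tf / (length / avg_len + tf)  # part A, always below 1
        idf_term = math.log(n_docs / df)        # part B
        score += tf_term * idf_term
    return score

# Toy, pre-tokenized collection (hypothetical data).
docs = [["kyoto", "president", "kyoto"], ["tokyo", "president"], ["kyoto", "campus"]]
doc_freq = {}
for d in docs:
    for t in set(d):
        doc_freq[t] = doc_freq.get(t, 0) + 1
avg_len = sum(len(d) for d in docs) / len(docs)
scores = [okapi_score(["kyoto", "president"], d, doc_freq, len(docs), avg_len) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the TF term saturates below 1, the first article ranks highest by matching both keywords rather than by repeating one of them.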

This method is also called the Okapi weighting method.

In formula (14), the product of the TF term and the IDF term, taken before the summation by Σ, can also be used as the weight of a word.

Alternatively, vectors may be built with each word as a dimension and each word's score as the element. An article is converted to a vector (vector_x) using the words it contains, the input keyword group is converted to a vector (vector_y) using the words it contains, and the cosine of the two vectors, cos(vector_x, vector_y), is used as the similarity of the article. The score of each word may be computed with tf・idf or Okapi. In each case the expression that follows the Σ in the corresponding formula gives the score of a word:
tf・idf: tf(d,t) * log(N/df(t)) (15)
Okapi: tf(d,t)/(tf(d,t) + length(d)/delta) * log(N/df(t)) (16)
Here delta corresponds to Δ, the average article length, in formula (14).
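As a rough illustration of the vector-based similarity, the sketch below implements the per-word weights of formulas (15) and (16) and the cosine between two sparse vectors. Representing vectors as dictionaries from word to weight, and the function names, are assumptions made for this example.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Formula (15): tf(d,t) * log(N/df(t))."""
    return tf * math.log(n_docs / df)

def okapi_weight(tf, df, n_docs, length, avg_len):
    """Formula (16): tf/(tf + length(d)/delta) * log(N/df(t))."""
    return tf / (tf + length / avg_len) * math.log(n_docs / df)

def cosine(x, y):
    """Cosine of two sparse vectors given as {word: weight} dictionaries."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_x * norm_y)
```

An article vector and a keyword vector built with either weight can then be compared with `cosine(vector_x, vector_y)`, and articles ranked by the result.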

The following information retrieval methods can also be used.
(For Okapi, Reference 6: S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, Okapi at TREC-3, TREC-3, 1994)
(For SMART, Reference 7: Amit Singhal, AT&T at TREC-6, TREC-6, 1997)
The Okapi and SMART formulas described in these references, or variants of them, may be used as more sophisticated retrieval methods.

An information search device may also be configured in which protein expressions serve as the search keywords and, using the method described above for extracting binary relations between protein expressions, only those texts in the search target in which the binary relation between the keyword protein expressions is judged to be one that should be extracted are output as the search result. This is not limited to binary relations between protein expressions. For example, for a binary relation consisting of a company's product name paired with the product's reputation (for example, whether it is well or poorly regarded), the elements of the relation may be used as search keywords, and only those texts in which the relation between the keywords is judged to be one that should be extracted are output as the search result.

Furthermore, for English words, the process of reducing inflected forms such as plurals, past-tense forms, and the third-person singular present "s" to the base form, for example turning "studies" into "study", is called "stemming". The words used as features for machine learning may be the words obtained after stemming.
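To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any algorithm named in the source; the rules and the function name `naive_stem` are assumptions chosen only to cover the plural, past-tense, and third-person "s" cases mentioned above, and they will mis-stem many real words.

```python
def naive_stem(word):
    """Strip a few common English inflectional suffixes (illustrative only)."""
    w = word.lower()
    if len(w) > 4 and w.endswith(("ies", "ied")):
        return w[:-3] + "y"          # studies, studied -> study
    if w.endswith("es") and w[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return w[:-2]                # boxes -> box
    if len(w) > 4 and w.endswith("ed"):
        return w[:-2]                # walked -> walk
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # walks -> walk
    return w
```

Applying such a function to every token before feature extraction makes inflected variants share one feature.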

Next, an information search device 7, which is a second embodiment of the present invention, is described.

The information search device 7 is a processing device that performs machine learning using teacher data in which each article is tagged with one of two solutions: the article should be extracted (positive) or should not be extracted (negative). It then retrieves, from the search text data 5 to be searched, the articles containing the two search keywords and outputs, as the search result 6, those articles estimated to be articles that should be extracted.

FIG. 11 shows a configuration example of the information search device 7 according to the present invention. The information search device 7 comprises an information search unit 70, a solution-feature pair extraction unit 72, a machine learning unit 73, a feature extraction unit 75, a solution determination unit 76, and a search result extraction unit 77. As in the first embodiment, the teacher data storage unit 71 and the learning result storage unit 74 may be internal elements of the information search device 7 or external elements made available to it.

In particular, this embodiment performs machine learning on a per-article basis. Accordingly, in the machine learning processing described earlier, the "problem" is read as "a pair of search keywords and an article" instead of "a binary relation"; the "solution" is read as "an article that should be extracted or an article that should not be extracted" instead of "a binary relation that should be extracted or a binary relation that should not be extracted"; and the "target of feature extraction" is read as "a pair of search keywords and an article" instead of "a binary relation".

With this in mind, the teacher data storage unit 71, solution-feature pair extraction unit 72, machine learning unit 73, learning result storage unit 74, feature extraction unit 75, and solution determination unit 76 of the information search device 7 perform broadly the same processing as, respectively, the teacher data storage unit 11, solution-feature pair extraction unit 12, machine learning unit 13, learning result storage unit 14, feature extraction unit 16, and solution estimation unit 17 of the binary relation extraction device 1 shown in FIG. 1. The information search unit 70 performs broadly the same processing as the information search unit 40 of the information search device 4 of the first embodiment. The information search device 7 has no processing means corresponding to the candidate extraction unit 15 or to the binary relation selection unit 45 of the information search device 4 of the first embodiment. As in the first embodiment, when the solution determination unit 76 estimates, for each candidate, the degree to which its feature set is likely to lead to each solution (classification), it performs the same processing as the solution estimation unit 17 described above; where the solution estimation unit 17 is replaced by a "solution determination unit", the solution determination unit 76 of the information search device 7 performs the same processing as that "solution determination unit".

FIG. 12 shows the processing flow of the information search device 7. The teacher data storage unit 71 stores, as teacher data, text data containing cases in which a pair of search keywords and an article is annotated with "solution" information indicating whether the article should be extracted (positive) or should not be extracted (negative).

First, the solution-feature pair extraction unit 72 retrieves from the teacher data storage unit 71 the cases whose problem is a pair of search keywords and an article and whose solution indicates whether the article should be extracted. For each case, it extracts as features information about the binary relation whose elements are the search keywords, and generates pairs of a feature set and a solution (step S21). For example, the following information is extracted as features.
1) Words or characters appearing around the elements of the binary relation whose elements are the input keywords: for example, a predetermined number of words/characters before the first element, a predetermined number of words/characters after the second element, and a predetermined number of words/characters between the first and second elements;
2) The positions, order of appearance, etc. of the words/characters appearing around the elements of the binary relation whose elements are the input keywords;
3) The two elements of the binary relation whose elements are the input keywords;
4) Part-of-speech information, morphological analysis information, etc. of the elements of the binary relation or of the surrounding words;
5) Syntactic analysis information of the elements of the binary relation or of the surrounding words;
6) The distance between the occurrences of the first and second elements of the binary relation;
7) Whether an element occurs between the first and second elements of the binary relation;
8) Whether, when all binary relations whose elements are the input keywords are extracted from the article and each is checked by either a machine learning method or a pattern-based method, at least one is judged to be a meaningful binary relation;
9) The words or characters appearing around each occurrence of an input keyword in the article.
Of these, features 1) to 7) are the same as those described for the binary relation extraction device 1 above.
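A few of the listed features can be sketched as follows. This is a simplified illustration under stated assumptions: the article is already tokenized (the source relies on morphological analysis, which is not reproduced here), only the first occurrence of each keyword is used, and the feature-string format and the function name `extract_features` are invented for the example. It covers features 1), 3), and 6) only.

```python
def extract_features(tokens, kw1, kw2, window=1):
    """Return a set of string features for a (keywords, article) pair,
    or None when either keyword is missing from the article."""
    if kw1 not in tokens or kw2 not in tokens:
        return None
    i, j = tokens.index(kw1), tokens.index(kw2)   # first occurrences only
    first, second = min(i, j), max(i, j)
    feats = set()
    # 3) the two elements of the binary relation, in order of appearance
    feats.add(f"elem1={tokens[first]}")
    feats.add(f"elem2={tokens[second]}")
    # 1) words before the first element and after the second element
    for k in range(1, window + 1):
        if first - k >= 0:
            feats.add(f"before{k}={tokens[first - k]}")
        if second + k < len(tokens):
            feats.add(f"after{k}={tokens[second + k]}")
    # 1) words between the two elements
    for w in tokens[first + 1:second]:
        feats.add(f"between={w}")
    # 6) distance between the two elements, in tokens
    feats.add(f"distance={second - first}")
    return feats

tokens = ["今日", "，", "京大", "の", "総長", "が", "，"]
features = extract_features(tokens, "京大", "総長")
```

Each feature is a plain string, so the resulting set can be fed to any learner that accepts categorical features.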

An example of 8) is as follows. When the keywords are "京大" (Kyoto University) and "総長" (president) and the article is "…京大の総長…" ("… the president of Kyoto University …"), if the keywords in that article are judged by the machine learning or pattern-based method to form a meaningful binary relation, the feature is "meaningful binary relation present".

An example of 9) is as follows. When the keywords are "京大" and "総長", the article is "…今日，京大の総長が，…" ("… today, the president of Kyoto University …"), and only the immediately preceding and immediately following words are taken as the surrounding words, the preceding-word features are the two tokens "，" (comma) and "の", and the following-word features are the two tokens "の" and "が".
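The example for feature 9) can be reproduced with a short sketch. It assumes a pre-tokenized article; the function name `context_features` and the token list are illustrative choices, not part of the source.

```python
def context_features(tokens, keywords):
    """Collect the token immediately before and immediately after
    every occurrence of a keyword in the article."""
    preceding, following = [], []
    for i, tok in enumerate(tokens):
        if tok in keywords:
            if i > 0:
                preceding.append(tokens[i - 1])
            if i + 1 < len(tokens):
                following.append(tokens[i + 1])
    return preceding, following

tokens = ["今日", "，", "京大", "の", "総長", "が", "，"]
preceding, following = context_features(tokens, {"京大", "総長"})
```

For this article the preceding tokens are "，" and "の" and the following tokens are "の" and "が", matching the example in the text.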

Next, the machine learning unit 73 learns, by a machine learning method, from the pairs of feature set and solution generated by the solution-feature pair extraction unit 72, which solution (positive or negative) is likely for which feature set (or which solution results), and stores the learning result in the learning result storage unit 74 (step S22). Examples of the data learned by the machine learning unit 73 are as follows.
Problem 1: keywords "京大", "総長"; article "…京大の総長…" ("… the president of Kyoto University …")
Answer 1: should be extracted
Problem 2: keywords "京大", "総長"; article "…東大の総長が京大に来た…" ("… the president of the University of Tokyo came to Kyoto University …")
Answer 2: should not be extracted
From such examples the device learns, for instance through feature 6) above (the distance between the occurrences of the first and second elements of the binary relation), that an article tends to be one that should be extracted when the distance is small and one that should not be extracted when the distance is large.
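As a hedged illustration of learning from such examples, the sketch below trains a simplified naive Bayes-style scorer on feature sets. The source names several learners without fixing one, so this choice, the bucketed feature strings ("distance=near", "distance=far"), the labels, and all function names are assumptions. The scorer only considers features present in the input, which is a simplification of a full Bernoulli model.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_set, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        for f in feats:
            feat_counts[label][f] += 1
            vocab.add(f)
    return {"labels": label_counts, "feats": feat_counts, "vocab": vocab}

def log_scores(model, feats):
    """Log prior plus smoothed log likelihood of each known feature, per label."""
    total = sum(model["labels"].values())
    scores = {}
    for label, n in model["labels"].items():
        s = math.log(n / total)
        for f in feats:
            if f in model["vocab"]:
                s += math.log((model["feats"][label][f] + 1) / (n + 2))  # add-one smoothing
        scores[label] = s
    return scores

def classify(model, feats):
    scores = log_scores(model, feats)
    return max(scores, key=scores.get)

# Problems 1 and 2 from the text, reduced to two hypothetical features each.
model = train([
    ({"distance=near", "between=の"}, "extract"),
    ({"distance=far", "between=が"}, "reject"),
])
```

With these two training cases, a new article whose keywords are close together and separated by "の" is classified as one to extract.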

The information search unit 70 then searches the search text data 5 by AND search using the given pair of search keywords and obtains the matching articles (step S23). As described in the first embodiment, search modes other than AND search may also be adopted for the information search unit 70.

The feature extraction unit 75 then extracts, in the same manner as the solution-feature pair extraction unit 72, information about the binary relation whose elements are the search keywords as features for each pair of search keywords and article retrieved from the search text data 5 by the information search unit 70 (step S24). In this respect the processing of the feature extraction unit 75 differs from that of the feature extraction unit 16 of the binary relation extraction device 1 and the feature extraction unit 451 of the information search device 4.

Based on the learning data stored in the learning result storage unit 74, the solution determination unit 76 determines which solution should be assigned to each pair of search keywords and retrieved article (step S25). A concrete case using the learning data example above is as follows. Suppose the new problem
Problem: keywords "阪大" (Osaka University), "総長" (president); article "…阪大の総長…" ("… the president of Osaka University …")
is input. Using the learned tendency for feature
6) the distance between the occurrences of the first and second elements of the binary relation
(an article tends to be one that should be extracted when the distance is small and one that should not be extracted when it is large), the unit determines that, because the distance is small in this case, the solution should be "should be extracted". Feature 6) is used here as the example, but learning can exploit the other features as well. For example, it may be learned that an article tends to be one that should be extracted when the word between the first and second elements of the binary relation is "の".

Based on the determinations of the solution determination unit 76, the search result extraction unit 77 extracts, from the articles retrieved from the search text data 5, the articles determined to be ones that should be extracted, as the search result 6 (step S26).
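Steps S23 to S26 can be put together in a small end-to-end sketch. The judgment step here is a stand-in, not the learned model of the source: it simply accepts an article when the two keywords first occur within a fixed number of characters of each other. The threshold `max_gap`, substring matching for the AND search, and all function names are assumptions for illustration.

```python
def and_search(articles, kw1, kw2):
    """Step S23: keep articles containing both keywords (substring match)."""
    return [a for a in articles if kw1 in a and kw2 in a]

def keyword_distance(article, kw1, kw2):
    """Steps S24: one feature, the character distance between first occurrences."""
    return abs(article.find(kw1) - article.find(kw2))

def should_extract(article, kw1, kw2, max_gap=5):
    """Step S25 stand-in: a fixed threshold replaces the learned classifier."""
    return keyword_distance(article, kw1, kw2) <= max_gap

def search(articles, kw1, kw2):
    """Step S26: output only the articles judged to be worth extracting."""
    return [a for a in and_search(articles, kw1, kw2) if should_extract(a, kw1, kw2)]

articles = [
    "京大の総長が講演した",
    "京大の式典に東大の総長が出席した",
    "東大の総長",
]
results = search(articles, "京大", "総長")
```

The second article passes the AND search but is filtered out by the judgment step, which is the behaviour the embodiment aims for; a trained classifier would take the place of `should_extract`.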

As described above, with the information search device 7 of this embodiment, text data containing cases in which a pair of search keywords and an article is annotated with "solution" information indicating whether the article should be extracted is prepared as teacher data for machine learning, and learning is performed per article. This makes it possible to automatically extract, from new search text data 5, the articles judged worth extracting.

The information search device 7 of this embodiment can judge, during retrieval, whether a semantic relationship exists between the given keywords and can use that judgment in the search. As a concrete example, suppose an AND search is run on an existing web keyword search service with the two keywords "京大" (Kyoto University) and "総長" (president). Pages containing both keywords are returned, but not all of them are about the president of Kyoto University. For example, a page containing the sentence "京大の式典に東大の総長が出席した" ("the president of the University of Tokyo attended a ceremony at Kyoto University") is about the president of the University of Tokyo, not the president of Kyoto University, and should not be returned. The information search device 7, by contrast, uses machine learning to judge whether each page is one in which "京大" and "総長" are closely related, and presents only the wanted pages to the user. More specifically, among articles containing both keywords, the wanted articles, that is, those about the president of Kyoto University, are prepared as positive examples and the rest as negative examples; machine learning on this teacher data makes it possible to judge whether an article is wanted, and that judgment is used to obtain the search result. Whereas conventional search techniques return many unneeded results, the information search device 7 can restrict the results presented to the user to those judged useful, which substantially improves search performance.

Next, an information search device 8, which is a reference example of the present invention, is described.

When a plurality of search keywords is input, the information search device 8 has the user enter, or a program function supply, information indicating that two of the keywords are related. It regards the relation between two or more search keywords of an AND search as a meaningful binary relation and performs machine learning using teacher data in which each binary relation whose elements are search keywords is tagged with one of two solutions: a relation that should be extracted (positive) or a relation that should not be extracted (negative). From the search text data 5 to be searched, it then outputs, as the search result 6, the articles that contain two search keywords whose pair is judged to be a binary relation that should be extracted and was designated at input as being related.

FIG. 13 shows a configuration example of the information search device 8 according to this reference example. The information search device 8 comprises an information search unit 80, a binary relation determination unit 82, an input unit 84, a candidate extraction unit 85, a binary relation selection unit 86, and a search result extraction unit 87. In the configuration shown in the figure, the binary relation determination unit 82 further has a learning unit 820 (comprising a solution-feature pair extraction unit 821 and a machine learning unit 822), and the binary relation selection unit 86 further comprises a feature extraction unit 861 and a solution determination unit 862; in this case the information search device 8 can use a teacher data storage unit 81 and a learning result storage unit 83. The teacher data storage unit 81 and the learning result storage unit 83 may instead be internal elements of the information search device 8. Other configurations of the binary relation determination unit 82 are described later.

In the information search device 8, the binary relation determination unit 82 (the learning unit 820, comprising the solution-feature pair extraction unit 821 and the machine learning unit 822) performs the same processing as the learning unit 42 of the information search device 4 of the first embodiment. The information search unit 80, candidate extraction unit 85, binary relation selection unit 86 (including the feature extraction unit 861 and the solution determination unit 862), and search result extraction unit 87 perform the same processing as the information search unit 40, candidate extraction unit 44, binary relation selection unit 45 (including the feature extraction unit 451 and the solution determination unit 452), and search result extraction unit 46 of the information search device 4. The teacher data storage unit 81 and the learning result storage unit 83 are likewise the same as the teacher data storage unit 41 and the learning result storage unit 43 described in the first embodiment.

In this reference example, the input unit 84 is the processing means that receives a plurality of search keywords together with relationship information, attached to a pair of any two of those keywords, indicating that the pair is to be treated as semantically related. More specifically, the input unit 84 outputs an input interface that prompts for the search keywords and for the relationship information, and receives the search keywords and relationship information entered through that interface. The description here covers the mode in which an input interface is output and the user enters the relationship information, but other modes are possible: for example, both the search keywords and the relationship information may be received automatically from a program function, or the user may enter the search keywords while a program function supplies the relationship information.

Correspondingly, the candidate extraction unit 85 forms a pair from the two search keywords that were input through the input unit 84 with relationship information attached, and treats that pair as a binary relation candidate.

FIG. 14 shows the processing flow of the information search device 8. As in the first embodiment, the teacher data storage unit 81 of the information search device 8 stores, as teacher data, text data containing cases in which a binary relation whose elements are the two search keywords given for an AND search is annotated with "solution" information indicating whether it is a binary relation that should be extracted (positive) or one that should not be extracted (negative).
First, in the learning unit 820 of the binary relation determination unit 82, the solution-feature pair extraction unit 821 extracts predetermined features for each case in the teacher data of the teacher data storage unit 81 and generates pairs of a solution (the information given by the tag) and the extracted feature set (step S31). The solution-feature pair extraction unit 821 extracts binary relations from the teacher text data by means of predetermined tags and, for the elements (search keywords) of each extracted relation, performs morphological analysis, syntactic analysis, and calculation of the positions of the elements and the distance between them, thereby extracting the predetermined features.

Also within the learning unit 820, the machine learning unit 822 learns, by a machine learning method, from the pairs of solution and feature set generated by the solution-feature pair extraction unit 821, which solution (positive or negative) is likely for which feature set, and stores the learning result in the learning result storage unit 83 (step S32). As in the first embodiment, the machine learning unit 822 performs supervised machine learning using one of, for example, the k-nearest neighbor method, the simple Bayes method, the decision list method, the maximum entropy method, or the support vector machine method.

Meanwhile, the input unit 84 receives the search keywords and the relationship information between any two of them (step S33). The candidate extraction unit 85 generates all two-keyword combinations (pairs) of the input search keywords given for the AND search (step S34). The information search unit 80 performs an AND search of the search text data 5 using a pair of input search keywords and extracts the articles (text data) containing that pair, and the candidate extraction unit 85 extracts, as binary relation candidates, all two-keyword combinations (pairs) of the input search keywords appearing in the extracted articles (step S35).
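Pair generation for the candidate step can be sketched with the standard library. The function names and the representation of relationship information as a list of keyword pairs are assumptions for this example; `all_pairs` enumerates every two-keyword combination, and `related_pairs` keeps only those the user marked as related.

```python
from itertools import combinations

def all_pairs(keywords):
    """Every unordered two-keyword combination of the input keywords."""
    return list(combinations(keywords, 2))

def related_pairs(keywords, related):
    """Only the pairs that carry relationship information,
    regardless of the order in which the user named them."""
    wanted = {frozenset(p) for p in related}
    return [p for p in combinations(keywords, 2) if frozenset(p) in wanted]

keywords = ["京大", "総長", "言語処理"]
pairs = all_pairs(keywords)
candidates = related_pairs(keywords, [("総長", "京大")])
```

With three keywords there are three possible pairs, of which only the one marked as related becomes a binary relation candidate.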

Then, in the binary relation selection unit 86, the feature extraction unit 861 extracts a predetermined feature set for each binary relation candidate appearing in the retrieved articles, by processing substantially the same as that of the solution-feature pair extraction unit 821 (step S36).

Further, in the binary relation selection unit 86, the solution determination unit 862 estimates for each candidate, based on the learning result in the learning result storage unit 83, which solution its feature set is likely to lead to, that is, the degree to which it is "likely positive" or "likely negative" (step S37). The search result extraction unit 87 then selects, from the candidates, those estimated to be "likely positive" to a degree exceeding a predetermined level as binary relations that should be extracted, and selects the articles containing those binary relations, or information identifying the articles, as the search result 6 (step S38).
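The selection in steps S37 and S38 amounts to thresholding an estimated degree. A minimal sketch follows, assuming each judged candidate is a tuple of article identifier, keyword pair, and an estimated degree of being positive between 0 and 1; the tuple layout, the default threshold, and the function name are illustrative assumptions.

```python
def select_articles(judged, threshold=0.5):
    """judged: list of (article_id, keyword_pair, positive_degree).
    Return the ids of articles holding at least one candidate whose
    estimated degree exceeds the threshold, in first-seen order."""
    selected = []
    for article_id, _pair, degree in judged:
        if degree > threshold and article_id not in selected:
            selected.append(article_id)
    return selected

judged = [
    ("a1", ("京大", "総長"), 0.9),
    ("a2", ("京大", "総長"), 0.2),
    ("a1", ("京大", "総長"), 0.7),
]
selected = select_articles(judged)
```

Raising the threshold trades recall for precision, which corresponds to the "predetermined level" mentioned in the text.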

Here, an example of the input interface is described for the case where three search keywords are input. FIG. 15(A) shows an input interface in which keyword input windows are arranged on a circle. In the illustrated example, the user has entered the words "Kyoto University", "President", and "Language Processing" in the three keyword input windows. In this interface, any two keyword input windows can be connected by a line, which serves as relationship information indicating the two keywords that the user wants to be semantically related. In the illustrated example, the "Kyoto University" and "President" windows are connected by such a line. This means the user wants to search on the condition that "Kyoto University" and "President" are related, while "Language Processing" need not be closely related to them. To use four or more search keywords, the user should be able to add keyword input windows on the circle by an appropriate operation. The "circle" may be a perfect circle, as illustrated, or an ellipse.

FIG. 15(B) shows an input interface with a single keyword input window. In this window, the keywords entered by the user are separated by ",". To specify relationship information, that is, a semantic relation between keywords, the user enters rel(A, B) as in the illustrated example (rel is a function standing for "relation"). When "C, rel(A, B)" is entered in the keyword input window, as illustrated, it means the user wants to search on the condition that keywords "A" and "B" are related, while keyword "C" need not be closely related to them. The same applies when there are two keywords, or four or more.
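A query string of this form could be parsed along the following lines. This is a sketch under assumptions: the function name `parse_query`, the regular expression, and the use of ASCII commas and parentheses (the figure may use full-width characters) are illustrative choices, not details given in the specification.

```python
import re

# Matches rel(X, Y) where X and Y contain no commas or parentheses.
REL_PATTERN = re.compile(r"rel\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")


def parse_query(text):
    """Return (keywords, related_pairs) for input such as 'C, rel(A, B)'."""
    related = [(a, b) for a, b in REL_PATTERN.findall(text)]
    # Remove the rel(...) groups, then split what remains on commas.
    remainder = REL_PATTERN.sub("", text)
    keywords = [k.strip() for k in remainder.split(",") if k.strip()]
    # Keywords named inside rel(...) are search keywords too.
    for pair in related:
        for keyword in pair:
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords, related
```

For the input "C, rel(A, B)" this yields the keywords C, A, B and the single related pair (A, B), which can then be handed to the candidate extraction and selection steps.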

FIG. 15(C) shows an example of a tabular input interface with rows running horizontally and columns running vertically. In this interface, each column of the first row is a keyword input window, and in the second and subsequent rows the user enters the keywords that should be semantically related to the first-row keyword of that column. The illustrated example shows keywords "A", "B", and "C" entered in the first-row windows, keyword "B" entered in the second row of the first column, and keyword "A" entered in the second row of the second column. This means the user wants to search on the condition that keywords "A" and "B" are related, while keyword "C" need not be closely related to them. The burden of manual input can be reduced by, for example, filling in a cell of the second or later rows automatically when the user designates a first-row keyword with the input device, or popping up the keywords of the other columns automatically when the user designates a cell in the second or later rows. The same applies when there are two keywords, or four or more.

FIG. 15(D) shows another example of a tabular input interface. In this example, each column of the upper row is a keyword input window. Below each upper-row column, all keywords are listed vertically in order, and next to each listed keyword is a radio button (a check box or the like may be used instead) that serves as relationship information indicating a relation with that column's keyword. The radio button for the keyword identical to the column's own keyword cannot be selected. The illustrated example shows the state in which the user has turned ON the radio button for keyword "B" under keyword "A". If the radio button for keyword "A" under keyword "B" is turned ON automatically in conjunction with this input, the user is spared the extra input. In this example, the user wants to search on the condition that keywords "A" and "B" are related, while keyword "C" need not be closely related to them. The same applies when there are two keywords, or four or more.
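The linked radio buttons amount to storing the relation symmetrically. A minimal sketch, assuming a hypothetical `RelationTable` class that is not part of the specification, might look like this:

```python
class RelationTable:
    """Holds which keyword pairs the user marked as related (illustrative only)."""

    def __init__(self, keywords):
        self.keywords = list(keywords)
        self.related = set()  # each entry is frozenset({a, b}), so order is irrelevant

    def set_related(self, a, b, on=True):
        """Turn the A/B button ON or OFF; the B/A button follows automatically."""
        if a == b:
            # Mirrors the rule that a column's own keyword cannot be selected.
            raise ValueError("a keyword cannot be related to itself")
        pair = frozenset((a, b))
        if on:
            self.related.add(pair)
        else:
            self.related.discard(pair)

    def is_on(self, column, row):
        """State of the button for `row` listed under column keyword `column`."""
        return frozenset((column, row)) in self.related
```

Because each pair is stored as an unordered set, turning ON "B" under "A" makes the button for "A" under "B" read as ON without any second input.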

As shown in FIG. 16, the information search device 8', a variation of the above reference example, may adopt a binary relation determination unit 82' having a pattern storing unit 820' in place of the binary relation determination unit 82 having the learning unit 820. In this case, the binary relation selection unit 86' differs in processing function from the binary relation selection unit 86 described above. The pattern storing unit 820' and the binary relation selection unit 86 (in particular the solution determination unit 862) use a pattern storage unit 83' instead of the learning result storage unit 83 shown in FIG. 13. This variation of the information search device 8 does not use the teacher data storage unit 81.

In this case, the pattern storing unit 820' stores in advance, in the pattern storage unit 83', patterns consisting of keyword-keyword combinations that should be extracted as binary relations. The binary relation selection unit 86' checks each binary relation candidate extracted by the candidate extraction unit 85 against the patterns in the pattern storage unit 83' and, when it determines that the candidate is a binary relation to be extracted, selects it as such. The selected binary relations are then extracted by the search result extraction unit 87.
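The pattern-based variation can be pictured as a simple lookup. The sketch below assumes that a pattern is an unordered pair of keywords and that matching is exact; the specification does not state either point, and the function name is hypothetical.

```python
def select_by_patterns(candidates, patterns):
    """Keep the candidate pairs that match a stored keyword-keyword pattern."""
    stored = {frozenset(p) for p in patterns}  # assumption: order does not matter
    return [c for c in candidates if frozenset(c) in stored]
```

Compared with the learned classifier, this trades coverage for predictability: only combinations registered in advance are ever selected.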

When relations between keywords are specified at input time and there are three or more keywords, the processing of the above reference example may be changed so that a text data item is taken as text data to be extracted when every binary relation is judged, in at least one place in that text data, to be a binary relation to be extracted.
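This acceptance rule combines "all relations" with "at least one place". A minimal sketch follows, assuming that "every binary relation" refers to the pairs the user specified and that the per-occurrence judgments have already been computed; the function and parameter names are hypothetical.

```python
def text_qualifies(occurrence_judgments, required_pairs):
    """True when every required pair is judged positive at least once in the text.

    `occurrence_judgments` maps frozenset({a, b}) to a list of booleans, one per
    place in the text where the pair occurs (assumed to be computed upstream).
    """
    return all(
        any(occurrence_judgments.get(frozenset(pair), []))
        for pair in required_pairs
    )
```

Note that with an empty `required_pairs` the function returns True, so a plain AND search result would pass unchanged.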

As described above, the information search devices 8 and 8' of this reference example are configured so that the user (or a preset program) can specify which keywords must be related to each other and which need not be, enabling more flexible information search. To give a concrete example: methods that judge whether a semantic relation exists between given keywords and use that judgment in the search have existed for some time. Suppose, for instance, that an AND search is run on an existing web keyword search service with the two keywords "Kyoto University" and "President". Pages containing both keywords are returned, but not all of them are about the president of Kyoto University. A page containing the sentence "The president of the University of Tokyo attended a Kyoto University ceremony" is about the president of the University of Tokyo, not of Kyoto University, and the user does not want such a page in the results. In this reference example, by contrast, the user specifies at input time that only results in which "Kyoto University" and "President" are related should be returned. Now consider also specifying the keyword "Language Processing", which need not be closely related to the other keywords. In that case the user enters the keywords "Kyoto University", "President", and "Language Processing", and searches on the condition that "Kyoto University" and "President" are related while the remaining combinations need not be. In this way, the information search devices 8 and 8' of this reference example allow conditions on the relations between keywords to be specified when the keywords are entered.

Next, the information search device 9 according to the third embodiment of the present invention is described.

The information search device 9 includes an output unit 91 and a re-output unit 92, which can be applied to the information search devices 4, 7, 8, and 8' of the first and second embodiments and the reference example described above. FIG. 17 shows a configuration example (the processing means upstream of the search result extraction unit are omitted), and FIG. 18 shows the processing flow.

The output unit 91 is a processing means that outputs the search results extracted by the search result extraction units 46, 77, and 87 of the information search devices 4, 7, and 8. Specifically, it displays the search results on a predetermined screen of a display, for example. When doing so, the output unit 91 also outputs, for each search result, information indicating the presence or absence of a binary relation between the search keywords (step S41). FIG. 19 shows an example of the search result output screen displayed by the output unit 91. In this example, the search results are displayed as "1, 2, 3, 4, 5, ...". The labels "AB", "BC", and "CA" displayed in the upper row beside the search results are designation buttons which, for every pair drawn from the three input keywords "A", "B", and "C", indicate that "A and B are related", "B and C are related", and "C and A are related", respectively. A search result with a mark beside it (○ in the illustrated example) corresponds to one of "AB", "BC", and "CA". Instead of or in addition to this display format, buttons or the like indicating that two keywords are not related may be displayed so that the user can select them.

The re-output unit 92 is a processing means that accepts the user's designation of the presence or absence of a relation between keywords, as described above, extracts the corresponding search results from those output by the output unit 91, and outputs them again. In terms of the display example above, the re-output unit 92 accepts the user's designation, made with an input device, of one or more of the "AB", "BC", and "CA" designation buttons (step S42), and extracts and re-outputs the search results that match the designated presence or absence of binary relations (step S43). The screen displayed by the re-output unit 92 may show only the search results, or, like the screen example in FIG. 19, may show both the search results and the information indicating the presence or absence of relations between the search keywords. In the latter case, the search results can be narrowed down further.
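Steps S42 and S43 amount to filtering the already displayed results by the designated buttons. In the sketch below, the data layout (a result identifier paired with the set of relation labels judged present) and the rule that all designated relations must be present when several buttons are chosen are assumptions; the specification leaves both open.

```python
def reoutput(results, designated):
    """Return the ids of results that carry every designated relation label.

    `results` is a list of (result_id, labels) tuples, where `labels` is the set
    of relation labels such as "AB" judged present for that result.
    """
    wanted = set(designated)
    return [result_id for result_id, labels in results if wanted <= labels]
```

Applying the function again to its own output with a further label corresponds to the additional narrowing described for the case where the relation information is shown on the re-output screen.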

When the function of the re-output unit 92 described above is not needed, the information search device 9 can also be realized in a form that adds only the output unit 91 to the information search devices 4, 7, and 8.

As described above, the information search device 9 of this embodiment lets the user specify the presence or absence of relations between search keywords at the time the search results are output. For example, an AND search is run with the three keywords "Kyoto University", "President", and "Language Processing", and buttons representing the relations between the keywords are displayed together with the search results. The user designates with those buttons whether each pair of keywords should be related; the device then checks whether the keywords are related as designated, redoes the search on that basis, and presents the result. This gives the user search results that have been usefully narrowed down.

The present invention is not limited to the embodiments described above.

For example, the information search device 8 described in the reference example may be configured without the input unit 84 for relationship information between keywords. In this case, the information search device is configured to "comprise: a binary relation determination unit that, based on supervised machine learning processing or a processing method using predetermined rules, determines whether a binary relation is one to be extracted, using predetermined information relating to the elements of the binary relation; an information search unit that generates pairs of input search keywords from a plurality of input search keywords, and extracts and acquires, by a predetermined method, text data from the text data to be searched based on the input search keywords; a candidate extraction unit that generates, from each text data item retrieved by the information search unit, pairs composed of two input search keywords to which the relationship information accepted by the input unit has been attached, and treats the generated pairs as binary relation candidates; a binary relation selection unit that, based on the binary relation determination unit, determines for each binary relation candidate whether it is a binary relation to be extracted, using predetermined information relating to the elements of the candidate, and, when it is so determined, selects the candidate as a binary relation to be extracted; and a search result extraction unit that extracts, as a search result, text data containing the binary relation selected by the binary relation determination unit".

Going a step further, since various processing methods are possible for determining the text data to be extracted, an information processing device with the following configuration is also conceivable as an information search device whose concept encompasses that of the information search devices of the first and third embodiments: it "comprises: a text data determination unit that, based on a predetermined method, determines whether text data is a binary relation to be extracted, using predetermined information relating to the elements of a binary relation contained in the text data; an information search unit that generates pairs of input search keywords from a plurality of input search keywords, and extracts and acquires, by a predetermined method, text data from the text data to be searched based on the input search keywords; a candidate extraction unit that generates, from each text data item retrieved by the information search unit, pairs composed of two input search keywords to which the relationship information accepted by the input unit has been attached, and treats the generated pairs as binary relation candidates; a binary relation selection unit that, based on the text data determination unit, determines for each binary relation candidate whether it is a binary relation to be extracted, using predetermined information relating to the elements of the candidate, and, when it is so determined, selects the candidate as a binary relation to be extracted; and a search result extraction unit that extracts, as a search result, text data containing the binary relation selected by the binary relation determination unit".

In the processing described above, "taking items with large values" may be replaced by, for example, "taking items whose value is at or above a threshold", "taking at least a predetermined number of items in descending order of value", or "computing a value by multiplying the maximum of the obtained values by a predetermined ratio and taking items whose value is at or above that result". These thresholds and predetermined values may be fixed in advance, or may be changed and set by the user as appropriate.
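The three interchangeable selection rules can be written side by side as below. The function names and the representation of scored items as (item, value) tuples are illustrative assumptions; the ratio rule as sketched presumes non-negative values.

```python
def take_above_threshold(scored, threshold):
    """Keep items whose value is at or above a fixed threshold."""
    return [item for item, value in scored if value >= threshold]


def take_top_k(scored, k):
    """Keep the k highest-valued items, in descending order of value."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]


def take_within_ratio_of_max(scored, ratio):
    """Keep items whose value is at least `ratio` times the maximum value."""
    if not scored:
        return []
    cutoff = max(value for _, value in scored) * ratio
    return [item for item, value in scored if value >= cutoff]
```

The threshold rule returns a variable number of items, the top-k rule a fixed number, and the ratio rule adapts the cutoff to the best score in each result set, which is why they may behave differently on the same input.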

In addition, the specific configuration of each unit is not limited to the above embodiments, and various modifications are possible without departing from the spirit of the present invention.

FIG. 1 is a diagram showing a configuration example of the binary relation extraction device used in the present invention.
FIG. 2 is a diagram showing the processing flow of the binary relation extraction device.
FIG. 3 is a diagram showing an example of teacher data.
FIG. 4 is a diagram showing the concept of margin maximization in the support vector machine method.
FIG. 5 is a diagram showing an example of pairs of the binary relations shown in FIG. 3 and their feature sets.
FIG. 6 is a diagram showing a configuration example of the information search device according to the first embodiment of the present invention.
FIG. 7 is a diagram showing the processing flow of that information search device.
FIG. 8 is a diagram showing an example of pairs of teacher data and the feature sets of its binary relations in that embodiment.
FIG. 9 is a diagram showing an example of pairs of teacher data and the feature sets of its binary relations in that embodiment.
FIG. 10 is a diagram showing an example of pairs of teacher data and the feature sets of its binary relations.
FIG. 11 is a diagram showing a configuration example of the information search device according to the second embodiment of the present invention.
FIG. 12 is a diagram showing the processing flow of that information search device.
FIG. 13 is a diagram showing a configuration example of the information search device according to the reference example of the present invention.
FIG. 14 is a diagram showing the processing flow of that information search device.
FIG. 15 is a diagram showing examples of the input interface output by that information search device.
FIG. 16 is a diagram showing another configuration example of that information search device.
FIG. 17 is a diagram showing a partial configuration example of the information search device according to the third embodiment of the present invention.
FIG. 18 is a diagram showing the processing flow of that information search device.
FIG. 19 is a diagram showing an example of the search result output screen output by that information search device.

Explanation of symbols

1 ... Binary relation extraction device
11 ... Teacher data storage unit
12 ... Solution-feature pair extraction unit
13 ... Machine learning unit
14 ... Learning result storage unit
15 ... Candidate extraction unit
16 ... Feature extraction unit
17 ... Solution estimation unit
18 ... Binary relation extraction unit
2 ... Text data
3 ... Binary relation
4 ... Information search device
40 ... Information search unit
41 ... Teacher data storage unit
42 ... Learning unit
421 ... Solution-feature pair extraction unit
422 ... Machine learning unit
43 ... Learning result storage unit
44 ... Candidate extraction unit
45 ... Binary relation selection unit
451 ... Feature extraction unit
452 ... Solution determination unit
46 ... Search result extraction unit
5 ... Search text data
6 ... Search result
7 ... Information search device
70 ... Information search unit
71 ... Teacher data storage unit
72 ... Solution-feature pair extraction unit
73 ... Machine learning unit
74 ... Learning result storage unit
75 ... Feature extraction unit
76 ... Solution determination unit
77 ... Search result extraction unit
8, 8' ... Information search device
81 ... Teacher data storage unit
82, 82' ... Binary relation determination unit
820 ... Learning unit
821 ... Solution-feature pair extraction unit
822 ... Machine learning unit
820' ... Pattern storing unit
83 ... Learning result storage unit
83' ... Pattern storage unit
84 ... Input unit
85 ... Candidate extraction unit
86, 86' ... Binary relation selection unit
861 ... Feature extraction unit
862 ... Solution determination unit
87 ... Search result extraction unit
9 ... Information search device
91 ... Output unit
92 ... Re-output unit

Claims

An information search device that performs information search processing using a plurality of search keywords, the device comprising:
a learning unit that takes cases out of a teacher data storage unit storing teacher data, each case being a pair of a problem and a solution in which the problem is a binary relation whose elements are search keywords and the solution is whether that binary relation is one to be extracted; extracts predetermined information from each case as features; generates pairs of the extracted feature set and the solution; performs machine learning, based on a predetermined machine learning algorithm, on which solution results from which feature set; and stores information indicating which solution results from which feature set in a learning result storage unit as learning result information;
an information search unit that generates pairs of input search keywords from a plurality of input search keywords, and extracts and acquires, by a predetermined method, text data from the text data to be searched based on the input search keywords;
a candidate extraction unit that generates, from each text data item retrieved by the information search unit, pairs composed of the input search keywords, and treats the generated pairs as binary relation candidates;
a binary relation selection unit that extracts the predetermined information as features for each binary relation candidate by the same kind of extraction processing as is performed in the learning unit; determines, based on the learning result information stored in the learning result storage unit, which solution the candidate's feature set is likely to yield; and, when the determination indicates that the candidate is a binary relation to be extracted, selects the candidate as a binary relation to be extracted; and
a search result extraction unit that extracts, as a search result, text data containing the binary relation selected by the binary relation selection unit.

The information search device according to claim 1, further comprising an output unit that outputs the search results extracted by the search result extraction unit,
wherein the output unit also outputs, for each search result, information indicating the presence or absence of a binary relation between the input search keywords.

The information search device according to claim 2, further comprising a re-output unit that accepts a designation of any item of the information indicating the presence or absence of a binary relation output by the output unit, and re-outputs only the search results corresponding to the designated information.

An information search apparatus that performs information search processing using a plurality of search keywords, the apparatus comprising:
a solution-feature pair extraction unit that retrieves cases from a teacher data storage unit storing teacher data made up of cases, each case being a pair of a problem and a solution in which the problem is a pair of a search keyword and text data and the solution is text data to be extracted, extracts, as features, information relating to a binary relation having at least the search keyword as an element for each case, and generates pairs of the extracted feature set and the solution;
a machine learning unit that, based on a predetermined machine learning algorithm, performs machine learning on the pairs of feature set and solution to learn which solution tends to result from which kind of feature set, and stores information indicating which solution tends to result from which kind of feature set in a learning result storage unit as learning result information;
an information search unit that generates pairs of input search keywords from a plurality of input search keywords, and searches for and acquires, by a predetermined method, text data from the text data to be searched on the basis of the input search keywords;
a feature extraction unit that, by an extraction process similar to the one performed by the solution-feature pair extraction unit, extracts, as features, information relating to a binary relation having at least the search keyword as an element for each pair of the search keyword and the retrieved text data;
a solution determination unit that determines, based on the learning result information stored in the learning result storage unit, which solution is likely for the feature set of the pair of the search keyword and the retrieved text data; and
a search result extraction unit that, when the determination finds that the solution for a pair of the search keyword and the retrieved text data is text data to be extracted, selects that text data as an article to be extracted and extracts the selected text data as a search result.
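
The flow of units recited in the apparatus claim above can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the claim: the function names (extract_features, train, decide, search), the feature set (keyword distance and the words between the keywords), the labels "extract" and "skip", plain keyword matching as the "predetermined method" of retrieval, and a small naive-Bayes-style scorer standing in for the "predetermined machine learning algorithm".

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def extract_features(pair, text):
    """Hypothetical features for the binary relation between two keywords in a text."""
    a, b = pair
    tokens = text.lower().split()
    if a not in tokens or b not in tokens:
        return {"not_both_present"}
    lo, hi = sorted((tokens.index(a), tokens.index(b)))
    feats = {"both_present", "near" if hi - lo <= 5 else "far"}
    feats.update("between=" + w for w in tokens[lo + 1:hi])
    return feats


def train(cases):
    """Learn from teacher data: each case is ((keyword pair, text), solution)."""
    labels, counts, vocab = Counter(), defaultdict(Counter), set()
    for (pair, text), solution in cases:
        feats = extract_features(pair, text)
        labels[solution] += 1
        counts[solution].update(feats)
        vocab |= feats
    # This dict plays the role of the "learning result information".
    return {"labels": labels, "counts": counts, "vocab": vocab}


def decide(model, feats):
    """Return the solution with the highest smoothed log score (stand-in learner)."""
    total = sum(model["labels"].values())

    def score(label):
        n = model["labels"][label]
        s = math.log(n / total)
        for f in feats & model["vocab"]:
            s += math.log((model["counts"][label][f] + 1) / (n + 2))
        return s

    return max(model["labels"], key=score)


def search(model, keywords, corpus):
    """Pair the input keywords, retrieve matching texts, keep those judged 'extract'."""
    results = []
    for pair in combinations(keywords, 2):
        for text in corpus:
            tokens = text.lower().split()
            # Assumed retrieval method: both keywords must occur in the text.
            if pair[0] in tokens and pair[1] in tokens:
                feats = extract_features(pair, text)
                if decide(model, feats) == "extract" and text not in results:
                    results.append(text)
    return results


# Invented toy teacher data and search target, for illustration only.
TEACHER_CASES = [
    ((("tokyo", "japan"), "tokyo is the capital of japan"), "extract"),
    ((("paris", "france"), "paris is the capital of france"), "extract"),
    ((("tokyo", "france"),
      "tokyo hosted a large expo while a delegation from distant france watched"), "skip"),
    ((("paris", "japan"),
      "paris fashion week drew many visitors and some came from japan"), "skip"),
]
CORPUS = [
    "berlin is the capital of germany",
    "berlin played a friendly match against a team from germany",
    "rome is in italy",
]

model = train(TEACHER_CASES)
print(search(model, ["berlin", "germany"], CORPUS))
```

With this toy data, only the first text in CORPUS is returned: both keywords also occur in the second text, but its features resemble the "skip" cases more closely. The same extract_features function is used for training and for search, mirroring the claim's requirement that the feature extraction at search time be similar to the one used on the teacher data.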

An information search program for causing a computer to perform information search processing using a plurality of search keywords, the program causing the computer to execute:
a learning step of retrieving cases from a teacher data storage unit storing teacher data made up of cases, each case being a pair of a problem and a solution in which the problem is a binary relation having the search keyword as an element and the solution is a binary relation to be extracted, extracting predetermined information as features for each case, generating pairs of the extracted feature set and the solution, performing, based on a predetermined machine learning algorithm, machine learning on the pairs of feature set and solution to learn which solution tends to result from which kind of feature set, and storing information indicating which solution tends to result from which kind of feature set in a learning result storage unit as learning result information;
an information search step of generating pairs of input search keywords from a plurality of input search keywords, and searching for and acquiring, by a predetermined method, text data from the text data to be searched on the basis of the input search keywords;
a candidate extraction step of generating, from each text data retrieved in the information search step, pairs composed of the input search keywords, and treating the generated pairs as binary relation candidates;
a binary relation selection step of extracting, by an extraction process similar to the one performed in the learning step, the predetermined information as features for each binary relation candidate, determining, based on the learning result information stored in the learning result storage unit, which solution is likely for the feature set of the binary relation candidate, and selecting, as a binary relation to be extracted, each binary relation candidate for which the determination yields a solution indicating that it is a binary relation to be extracted; and
a search result extraction step of extracting, as a search result, text data including the binary relation selected in the binary relation selection step.
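
The candidate extraction, binary relation selection, and search result extraction steps of the program claim above can be sketched as follows. This is only an illustration under stated assumptions: the function names (retrieve, candidates, select_relations) are hypothetical, retrieval is assumed to be plain keyword matching, and the decision function close_together is a hand-written stand-in for the decision that the claim derives from the stored learning result information; it is not a learned model.

```python
from itertools import combinations


def retrieve(keywords, corpus):
    """Information search step (assumed method: any input keyword occurs in the text)."""
    return [t for t in corpus if any(k in t.lower().split() for k in keywords)]


def candidates(keywords, text):
    """Candidate extraction step: pairs of input keywords that co-occur in the text."""
    tokens = set(text.lower().split())
    present = [k for k in keywords if k in tokens]
    return list(combinations(present, 2))


def select_relations(keywords, corpus, is_relation):
    """Selection and result extraction: keep candidates the decision function accepts,
    then return the texts that contain at least one accepted binary relation."""
    selected, results = [], []
    for text in retrieve(keywords, corpus):
        kept = [pair for pair in candidates(keywords, text) if is_relation(pair, text)]
        if kept:
            selected.extend((pair, text) for pair in kept)
            results.append(text)
    return selected, results


def close_together(pair, text, window=4):
    """Stand-in decision (NOT learned): accept pairs at most `window` tokens apart."""
    tokens = text.lower().split()
    return abs(tokens.index(pair[0]) - tokens.index(pair[1])) <= window


# Invented toy search target, for illustration only.
CORPUS = [
    "sony and nict filed the patent",
    "sony released a new camera this spring and later nict published a report",
    "nict published a corpus",
]

selected, results = select_relations(["sony", "nict", "patent"], CORPUS, close_together)
print(selected)
print(results)
```

Passing the decision as a callable keeps the candidate generation separate from whatever trained classifier is actually used, so a model trained on the teacher data could be substituted for close_together without changing the other steps.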

An information search program for causing a computer to perform information search processing using a plurality of search keywords, the program causing the computer to execute:
a solution-feature pair extraction step of retrieving cases from a teacher data storage unit storing teacher data made up of cases, each case being a pair of a problem and a solution in which the problem is a pair of a search keyword and text data and the solution is text data to be extracted, extracting, as features, information relating to a binary relation having at least the search keyword as an element for each case, and generating pairs of the extracted feature set and the solution;
a machine learning step of performing, based on a predetermined machine learning algorithm, machine learning on the pairs of feature set and solution to learn which solution tends to result from which kind of feature set, and storing information indicating which solution tends to result from which kind of feature set in a learning result storage unit as learning result information;
an information search step of generating pairs of input search keywords from a plurality of input search keywords, and searching for and acquiring, by a predetermined method, text data from the text data to be searched on the basis of the input search keywords;
a feature extraction step of extracting, by an extraction process similar to the one performed in the solution-feature pair extraction step, information relating to a binary relation having at least the search keyword as an element, as features, for each pair of the search keyword and the retrieved text data;
a solution determination step of determining, based on the learning result information stored in the learning result storage unit, which solution is likely for the feature set of the pair of the search keyword and the retrieved text data; and
a search result extraction step of, when the determination finds that the solution for a pair of the search keyword and the retrieved text data is text data to be extracted, selecting that text data as an article to be extracted and extracting the selected text data as a search result.