JP6495206B2

JP6495206B2 - Document concept base generation device, document concept search device, method, and program

Info

Publication number: JP6495206B2
Application number: JP2016138881A
Authority: JP
Inventors: 克人別所; 淳史大塚; 久子浅野; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2016-07-13
Filing date: 2016-07-13
Publication date: 2019-04-03
Anticipated expiration: 2036-07-13
Also published as: JP2018010482A

Description

The present invention relates to a document concept base generation device, a document concept search device, a method, and a program for retrieving search target documents that conceptually match a search query entered by a user.

Concept search retrieves, from a set of search target documents (the documents to be searched), those search target documents that conceptually match a search query entered by a user.
In Non-Patent Document 1 below, a word concept base, which is a set of pairs of a word and a word concept vector representing the concept of that word, is generated from a corpus. For each search target document, a search target document concept vector is generated by combining the word concept vectors that the word concept base holds for the words in the document. For a search query, a search query concept vector is generated in the same way from the words in the query, and the similarity between the search query concept vector and the concept vector of each search target document is calculated. As the search result, the search target documents are displayed ranked in descending order of similarity. Alternatively, only the search target documents whose similarity is at or above a certain threshold are displayed.

[Non-Patent Document 1] 別所克人, 内山俊郎, 内山匡, 片岡良治, 奥雅博, “単語・意味属性間共起に基づくコーパス概念ベースの生成方式,” 情報処理学会論文誌, Vol.49, No.12, pp.3997-4006, Dec. 2008.
(Katsuto Bessho, Toshiro Uchiyama, Kei Uchiyama, Ryoji Kataoka, Masahiro Oku, “Corpus Concept Base Generation Method Based on Co-occurrence Between Words and Semantic Attributes,” Transactions of the Information Processing Society of Japan, Vol.49, No.12, pp.3997-4006, Dec. 2008.)

Suppose that a set of pairs is given, each pair consisting of a search query and a set of correct documents, that is, search target documents that conceptually match that search query. This correct answer information has the potential to improve search accuracy, but conventional concept search techniques have not been able to make use of it.

An object of the present invention is to provide a document concept base generation device, a document concept search device, a method, and a program that use this correct answer information to improve search accuracy.

To solve the above problem, a document concept base generation device according to a first invention comprises learning means and search target document concept base generation means. The learning means takes as input a word concept base, which is a set of pairs of a word and a word concept vector representing the concept of that word; a set A of search target documents, which are the documents to be searched; and a set B of pairs, each consisting of a search query and a set of correct documents, which are the search target documents in set A that conceptually match that search query. For each correct document in set B, the learning means updates the information of that correct document in set A by associating with it each of the search queries that correspond to it in set B. For each search target document in set A, the search target document concept base generation means generates, for each text among the search target document and the search queries associated with it, a text concept vector (the concept vector of that text) by combining and normalizing the word concept vectors that the word concept base holds for the words in the text. It then generates a search target document concept vector (the concept vector of the search target document) by combining the text concept vectors, and generates a search target document concept base that stores the set of pairs of a search target document and its search target document concept vector.

A document concept search device according to a second invention comprises a word concept base, a search target document concept base, and search means. The word concept base is a set of pairs of a word and a word concept vector representing the concept of that word. The search target document concept base stores a set of pairs of a search target document and its search target document concept vector, for each search target document in a set A of search target documents whose information has been updated as follows: given a set B of pairs, each consisting of a search query and a set of correct documents (the search target documents in set A that conceptually match that search query), each correct document in set B has had associated with it each of the search queries that correspond to it in set B. Each search target document concept vector is generated by combining text concept vectors, where the text concept vector of each text among the search target document and the search queries associated with it is generated by combining and normalizing the word concept vectors that the word concept base holds for the words in that text. For a new search query, the search means generates a search query concept vector (the concept vector of the search query) by combining the word concept vectors that the word concept base holds for the words in the query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A document concept search device according to a third invention comprises learning means, search target document concept base generation means, and search means. The learning means takes as input a word concept base, which is a set of pairs of a word and a word concept vector representing the concept of that word; a set A of search target documents, which are the documents to be searched; and a set B of pairs, each consisting of a search query and a set of correct documents, which are the search target documents in set A that conceptually match that search query. For each correct document in set B, the learning means updates the information of that correct document in set A by associating with it each of the search queries that correspond to it in set B. For each search target document in set A, the search target document concept base generation means generates, for each text among the search target document and the search queries associated with it, a text concept vector by combining and normalizing the word concept vectors that the word concept base holds for the words in the text, generates a search target document concept vector by combining the text concept vectors, and generates a search target document concept base that stores the set of pairs of a search target document and its search target document concept vector. For a new search query, the search means generates a search query concept vector by combining the word concept vectors that the word concept base holds for the words in the query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A document concept base generation method according to a fourth invention is performed in a document concept base generation device that includes a word concept base (a set of pairs of a word and a word concept vector representing the concept of that word), a set A of search target documents (the documents to be searched), learning means, and search target document concept base generation means. The method comprises two steps. In the first step, the learning means takes as input a set B of pairs, each consisting of a search query and a set of correct documents (the search target documents in set A that conceptually match that search query), and, for each correct document in set B, updates the information of that correct document in set A by associating with it each of the search queries that correspond to it in set B. In the second step, the search target document concept base generation means, for each search target document in set A, generates a text concept vector for each text among the search target document and the search queries associated with it by combining and normalizing the word concept vectors that the word concept base holds for the words in the text, generates a search target document concept vector by combining the text concept vectors, and generates a search target document concept base that stores the set of pairs of a search target document and its search target document concept vector.

A document concept search method according to a fifth invention is performed in a document concept search device that includes a word concept base, a search target document concept base, and search means. The word concept base is a set of pairs of a word and a word concept vector representing the concept of that word. The search target document concept base stores a set of pairs of a search target document and its search target document concept vector, for each search target document in a set A of search target documents whose information has been updated as follows: given a set B of pairs, each consisting of a search query and a set of correct documents (the search target documents in set A that conceptually match that search query), each correct document in set B has had associated with it each of the search queries that correspond to it in set B. Each search target document concept vector is generated by combining text concept vectors, where the text concept vector of each text among the search target document and the search queries associated with it is generated by combining and normalizing the word concept vectors that the word concept base holds for the words in that text. The method comprises a step in which the search means, for a new search query, generates a search query concept vector by combining the word concept vectors that the word concept base holds for the words in the query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

A document concept search method according to a sixth invention is performed in a document concept search device that includes a word concept base (a set of pairs of a word and a word concept vector representing the concept of that word), a set A of search target documents (the documents to be searched), learning means, search target document concept base generation means, and search means. The method comprises three steps. In the first step, the learning means takes as input a set B of pairs, each consisting of a search query and a set of correct documents (the search target documents in set A that conceptually match that search query), and, for each correct document in set B, updates the information of that correct document in set A by associating with it each of the search queries that correspond to it in set B. In the second step, the search target document concept base generation means, for each search target document in set A, generates a text concept vector for each text among the search target document and the search queries associated with it by combining and normalizing the word concept vectors that the word concept base holds for the words in the text, generates a search target document concept vector by combining the text concept vectors, and generates a search target document concept base that stores the set of pairs of a search target document and its search target document concept vector. In the third step, the search means, for a new search query, generates a search query concept vector by combining the word concept vectors that the word concept base holds for the words in the query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of that search target document.

The program of the present invention is a program for causing a computer to function as each means of the above document concept base generation device or the above document concept search device, or for causing a computer to execute each step of the above document concept base generation method or the above document concept search method.

In the present invention, the processing of the learning means and the search target document concept base generation means constitutes the preprocessing for search, and the processing of the search means constitutes the search processing itself.

According to the document concept base generation device, document concept search device, method, and program of the present invention, search accuracy can be improved by using correct answer information.

FIG. 1 is a block diagram showing the functional configuration of the document concept search device according to the embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the search target document set.
FIG. 3 is a diagram showing a configuration example of the correct answer information.
FIG. 4 is a diagram showing a configuration example of the search target document set after the update.
FIG. 5 is a diagram showing an example of the word concept base 26.
FIG. 6 is a diagram showing an example of the search target document concept base 30.
FIG. 7 is a flowchart showing the processing routine of the learning means and the search target document concept base generation means of the document concept search device according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the processing routine of the search means of the document concept search device according to the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

<Outline of the Embodiment of the Present Invention>
The learning means of the embodiment updates the information of a search target document X so that it includes each of the search queries corresponding to X. Because the search target document concept base generation means generates the concept vector of X by combining the concept vectors of the texts of X and of the search queries corresponding to X, the concept vector of X moves closer to the concept vector of a search query p corresponding to X than it was before the update. When a new search query g that is conceptually close to p is input to the search means, the concept vector of g is close to the concept vector of p. The concept vector of X is therefore also closer to the concept vector of g than before the update. As a result, the similarity between the new search query g and X, which conceptually matches g, is higher than before the update.
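
The effect described in this outline can be checked numerically with a small sketch. The two-dimensional vectors below are made-up values chosen only for illustration, and the helper names `normalize` and `cosine` are hypothetical; the patent itself does not prescribe any code.

```python
import math

def normalize(v):
    """Scale a vector to length 1 (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical concept vectors (illustrative values, not taken from the patent).
x = normalize([1.0, 0.2])  # text x of search target document X
p = normalize([0.2, 1.0])  # search query p associated with X by the learning means
g = normalize([0.3, 1.0])  # new search query g, conceptually close to p

before = x                                            # concept vector of X before the update
after = normalize([a + b for a, b in zip(x, p)])      # after combining x and p

before_sim = cosine(before, g)
after_sim = cosine(after, g)
print(f"similarity to g before update: {before_sim:.3f}")
print(f"similarity to g after update:  {after_sim:.3f}")
```

With these values the similarity between X and g rises after the update, which is the behavior the outline describes.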

In addition, for each text among X and the search queries corresponding to X, the search target document concept base generation means generates the concept vector of the text by combining and normalizing the word concept vectors that the word concept base holds for the words in the text. Specifically, the word concept vectors are added and the sum is normalized to length 1.

The concept vector of X is generated by combining the text concept vectors. If the text concept vectors were not normalized, the concept vector of a text containing fewer words (a shorter text length) would be shorter than that of a text containing more words, and its contribution to the concept vector of X would be correspondingly smaller. The meaning of a shorter text would therefore be less likely to be reflected in the concept vector of X.

In the embodiment of the present invention, the text concept vectors are normalized, so each text's concept vector contributes equally to the concept vector of X regardless of text length, and the meaning of each text is reflected equally in the concept vector of X. As a result, whichever text a new search query is conceptually close to, its similarity with X is consistently high.
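
The following sketch contrasts the two cases under simplifying assumptions: a long document text whose summed word vectors have length 10 and a short query text whose summed word vectors have length 2, pointing in different directions. All values and helper names are hypothetical and serve only to illustrate the argument above.

```python
import math

def normalize(v):
    """Scale a vector to length 1 (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical un-normalized sums of word concept vectors.
long_text = [10.0, 0.0]   # many words, all pointing along the first axis
short_text = [0.0, 2.0]   # few words, pointing along the second axis

# Document vector without and with per-text normalization.
raw_doc = add(long_text, short_text)
norm_doc = add(normalize(long_text), normalize(short_text))

sim_raw_short = cosine(raw_doc, short_text)    # short text is almost drowned out
sim_norm_short = cosine(norm_doc, short_text)  # short text contributes equally
sim_norm_long = cosine(norm_doc, long_text)

print(f"without normalization, similarity to short text: {sim_raw_short:.3f}")
print(f"with normalization, similarity to short text:    {sim_norm_short:.3f}")
print(f"with normalization, similarity to long text:     {sim_norm_long:.3f}")
```

In this toy setting, normalizing each text first makes the document vector equally similar to both texts, whereas the un-normalized sum is dominated by the longer text.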

<Configuration of the Document Concept Search Device>
The configuration of the document concept search device according to the embodiment of the present invention is described next. FIG. 1 is a configuration example of the document concept search device of claim 3 of the present invention. As shown in FIG. 1, the document concept search device 100 according to the embodiment can be implemented as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the processing routines described later. Functionally, the document concept search device 100 includes input means 10, computing means 20, and output means 40, as shown in FIG. 1.

The input means 10 accepts as input a search target document set, which is a set of search target documents, and correct answer information.

FIG. 2 is a configuration example of the search target document set. Each record consists of a search target document ID, which uniquely identifies a search target document (a document to be searched), and the search target document text. The correct answer information is a set of pairs, each consisting of a search query and a set of correct documents, which are the search target documents in the search target document set that conceptually match that search query. Each of the correct documents conceptually matches the search query. FIG. 3 is a configuration example of the correct answer information. Each record consists of a search query text and the set of IDs of the correct documents that conceptually match it.

The input means 10 also accepts a new search query.

The computing means 20 includes learning means 22, an updated search target document set database 24, a word concept base 26, search target document concept base generation means 28, a search target document concept base 30, and search means 32. The learning means 22, the word concept base 26, and the search target document concept base generation means 28 together are an example of the document concept base generation device.

For each correct document in the correct answer information, the learning means 22 updates the information of that correct document in the search target document set by associating with it each of the search queries that correspond to it in the correct answer information. The details are as follows.

FIG. 4 is a configuration example of the search target document set after the update. In the correct answer information of FIG. 3, the search queries corresponding to correct document X are text p, text q, and text s. Accordingly, as in the record for correct document X in FIG. 4, text p, text q, and text s are associated with correct document X in addition to its own text x. The same processing is performed for the other correct documents in FIG. 3 (Y, Z, ...).

If several of the texts to be associated are exactly the same character string, the second and subsequent such texts may be left unassociated.

In the correct answer information of FIG. 3, a set of correct document IDs is associated with each search query, but the correct answer information may instead be structured so that a set of search queries is associated with each correct document ID. In that case, the set of search query texts to associate with each correct document is already available.
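
The update performed by the learning means can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `update_documents`, the dictionary and list layouts, and the sample records (including which queries match Y and Z) are assumptions made for the example. Only the association of p, q, and s with X follows the description of FIG. 3 and FIG. 4.

```python
def update_documents(documents, correct_info, dedupe=True):
    """Associate each search query with its correct documents.

    documents:    dict mapping document ID -> document text (FIG. 2 layout, assumed)
    correct_info: list of (query text, list of correct document IDs) (FIG. 3 layout, assumed)
    Returns a dict mapping document ID -> list of texts, with the document's
    own text first and the associated query texts after it (FIG. 4 layout, assumed).
    """
    updated = {doc_id: [text] for doc_id, text in documents.items()}
    for query, doc_ids in correct_info:
        for doc_id in doc_ids:
            texts = updated[doc_id]
            # Optional rule: skip a query string already associated with this document.
            if dedupe and query in texts[1:]:
                continue
            texts.append(query)
    return updated

# Hypothetical sample data.
documents = {"X": "x", "Y": "y", "Z": "z"}
correct_info = [
    ("p", ["X", "Y"]),
    ("q", ["X"]),
    ("s", ["X", "Z"]),
    ("p", ["X"]),  # exact duplicate of an earlier query for X
]

updated = update_documents(documents, correct_info)
for doc_id, texts in updated.items():
    print(doc_id, texts)
```

If the correct answer information is already keyed by document ID, the inversion loop is unnecessary and the query lists can be attached to the documents directly.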

The learning means 22 also stores the updated search target document set in the updated search target document set database 24.

For each search target document in the updated search target document set, the search target document concept base generation means 28 generates, for each text among the search target document and the search queries associated with it, a text concept vector (the concept vector of that text) by combining and normalizing the word concept vectors that the word concept base 26 holds for the words in the text. It then generates a search target document concept vector (the concept vector of the search target document) by combining the text concept vectors, and generates the search target document concept base 30, which stores the set of pairs of a search target document and its search target document concept vector. The details are as follows.

The word concept base 26 is a set of pairs of a word and a word concept vector representing the concept of that word. FIG. 5 is an example of the word concept base 26. The word concept base 26 is generated, for example, by the method of Non-Patent Document 1.

Only content words such as nouns, verbs, and adjectives may be registered in the word concept base 26. Words are registered in the word concept base 26 in their dictionary (terminal) form, and the word concept base 26 is looked up using the dictionary form of a word.
The word concept vector of each word is a d-dimensional vector normalized to length 1, and the concept vectors of conceptually close words lie close to each other.

The search target document concept base generation means 28 first performs word segmentation on each text, namely the search target document and each search query corresponding to it. For each text, the word concept base 26 is looked up for each word in the segmentation result, the word concept vectors obtained are added, and the resulting vector, normalized to length 1, is taken as the concept vector of the text.

Here, the text concept vector may be generated using only the content words among the words in the segmentation result. When the same word occurs more than once, its word concept vector may be added once per occurrence or only once. The word concept vectors obtained may also be multiplied, before being added, by a weight that differs according to the text to which the corresponding word belongs.

The search target document concept base generation means 28 then adds the text concept vectors, and the resulting vector, normalized to length 1, is taken as the concept vector of the search target document.
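
The two-stage computation (word vectors to text concept vector, then text concept vectors to document concept vector) might be sketched as below. The tiny two-dimensional word concept base, the function names, and the choice to skip words missing from the base are assumptions for illustration; the texts are assumed to be already segmented into dictionary-form words, and each occurrence of a word is added once per occurrence.

```python
import math

# Hypothetical word concept base: word -> unit-length concept vector (d = 2 here).
WORD_CONCEPT_BASE = {
    "train": [1.0, 0.0],
    "station": [0.8, 0.6],
    "ticket": [0.6, 0.8],
    "buy": [0.0, 1.0],
}

def normalize(v):
    """Scale a vector to length 1 (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def text_concept_vector(words, base):
    """Add the word concept vectors of the words in one text, then normalize."""
    dim = len(next(iter(base.values())))
    total = [0.0] * dim
    for word in words:
        vec = base.get(word)
        if vec is None:
            continue  # assumption: words absent from the base are ignored
        total = [a + b for a, b in zip(total, vec)]
    return normalize(total)

def document_concept_vector(texts, base):
    """Add the normalized concept vectors of all texts of a document, then normalize."""
    dim = len(next(iter(base.values())))
    total = [0.0] * dim
    for words in texts:
        total = [a + b for a, b in zip(total, text_concept_vector(words, base))]
    return normalize(total)

# One document text plus one associated query text, both pre-segmented (hypothetical).
texts_of_x = [["train", "station"], ["buy", "ticket"]]
doc_vec = document_concept_vector(texts_of_x, WORD_CONCEPT_BASE)
print(doc_vec)
```

The optional variants mentioned above (content words only, counting a repeated word once, per-text weights) would each be a small change inside `text_concept_vector`.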

FIG. 6 is a configuration example of the search target document concept base 30. For each search target document, the pair of its ID and its search target document concept vector is registered as one record of the search target document concept base 30.

The search means 32 performs word segmentation on the text of the new search query accepted by the input means 10. The word concept base 26 is looked up for each word in the segmentation result, the word concept vectors obtained are added, and the resulting vector, normalized to length 1, is taken as the concept vector of the search query.

Here, the search query concept vector may be generated using only the content words among the words in the segmentation result. When the same word occurs more than once, its word concept vector may be added once per occurrence or only once.

For each search target document in the search target document concept base 30, the similarity between the search query concept vector and the concept vector of that search target document is calculated. Cosine similarity, for example, can be used as the similarity.
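Cosine similarity between a query vector and a document vector can be sketched as below. The function name and the example vectors are illustrative assumptions; returning 0.0 for a zero vector is also a choice made for this sketch, since the source does not address that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is treated as 0
    return dot / (norm_a * norm_b)

# Hypothetical unit-length concept vectors.
query_vec = [1.0, 0.0]
doc_vec = [0.6, 0.8]

similarity = cosine_similarity(query_vec, doc_vec)
```

Since both the query and the document concept vectors are normalized to length 1 in the processing described above, the cosine similarity reduces to their dot product.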

The output means 40 displays, as the search result, the search target documents ranked in descending order of similarity. Alternatively, it displays the search target documents whose similarity is equal to or greater than a certain threshold.
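Both output options can be sketched with one small helper. The function name, the dictionary of similarities keyed by document ID, and the sample scores are assumptions for illustration only.

```python
def rank_documents(similarities, threshold=None):
    """Return (doc_id, similarity) pairs in descending order of similarity.

    If threshold is given, only documents whose similarity is equal to or
    greater than the threshold are kept.
    """
    items = list(similarities.items())
    if threshold is not None:
        items = [(doc_id, sim) for doc_id, sim in items if sim >= threshold]
    return sorted(items, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarities computed for three search target documents.
scores = {"doc1": 0.42, "doc2": 0.91, "doc3": 0.67}

ranked = rank_documents(scores)
filtered = rank_documents(scores, threshold=0.6)
```

Here `ranked` lists all three documents starting with `doc2`, while `filtered` drops `doc1` because its similarity is below the threshold.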

Note that, in the configuration of the present invention, it is of course also possible to skip the processing of the learning means 22, run the search target document concept base generation means 28 on the search target document set before updating to generate the search target document concept base 30, and perform the processing of the search means 32 using that search target document concept base 30.

FIG. 7 shows an example of the processing flow of the learning means 22 and the search target document concept base generation means 28. When the input means 10 receives a search target document set and correct answer information, the search target document concept base generation processing routine shown in FIG. 7 is executed.

First, in step S100, the learning means 22 acquires the search target document set and the correct answer information received by the input means 10.

Then, in step S102, based on the search target document set and the correct answer information acquired in step S100, the learning means 22 updates, for each correct document in the correct answer information, the information on that correct document in the search target document set by associating with it each search query that corresponds to it in the correct answer information, and stores the result in the updated search target document set database 24.
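The association performed in step S102 can be sketched as follows. The data layout is an assumption made for the example: documents are a dictionary from ID to body text, the correct answer information is a list of `(query, correct_document_ids)` pairs, and the updated set stores the associated queries next to each document. Ignoring IDs that do not appear in the document set is likewise a choice of this sketch.

```python
def associate_queries(documents, correct_info):
    """Attach to each correct document the search queries it answers.

    documents:    dict mapping document ID to body text
    correct_info: list of (query, list of correct document IDs) pairs
    Returns a dict mapping document ID to {"text": ..., "queries": [...]}.
    """
    updated = {doc_id: {"text": text, "queries": []}
               for doc_id, text in documents.items()}
    for query, correct_ids in correct_info:
        for doc_id in correct_ids:
            if doc_id in updated:  # assumption: unknown IDs are ignored
                updated[doc_id]["queries"].append(query)
    return updated

# Hypothetical search target documents and correct answer information.
documents = {"d1": "how to reset a password", "d2": "shipping fees"}
correct_info = [
    ("forgot my login", ["d1"]),
    ("cannot sign in", ["d1"]),
    ("delivery cost", ["d2"]),
]

updated = associate_queries(documents, correct_info)
```

After the update, document `d1` carries two associated queries whose wording differs from its body text, which is what lets the later concept vector reflect how users actually phrase their searches.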

In step S104, for each search target document stored in the updated search target document set database 24 in step S102, the search target document concept base generation means 28 performs word segmentation on each text of the search target document and of the search queries associated with it. For each text, the search target document concept base generation means 28 then looks up the word concept base 26 with each word in the segmentation result, sums the retrieved word concept vectors, and normalizes the resulting vector to length 1 to give the concept vector of that text. It then sums the text concept vectors and normalizes the resulting vector to length 1 to give the concept vector of the search target document. Finally, it stores the pair of the ID of the search target document and its concept vector in the search target document concept base 30 and ends the learning processing routine.

FIG. 8 shows an example of the processing flow of the search means 32. When the input means 10 receives a new search query, the search processing routine shown in FIG. 8 is executed.

First, in step S200, the search means 32 acquires the new search query received by the input means 10.

Next, in step S202, the search means 32 segments the text of the new search query acquired in step S200 into words. The search means 32 then looks up the word concept base 26 with each word in the segmentation result, sums the retrieved word concept vectors, and normalizes the resulting vector to length 1 to give the concept vector of the search query.

Next, in step S204, for each search target document in the search target document concept base 30, the search means 32 calculates the similarity between the concept vector of the new search query generated in step S202 and the concept vector of that search target document.

Then, in step S206, the output means 40 displays, as the search result, the search target documents ranked in descending order of the similarity calculated in step S204. Alternatively, it displays the search target documents whose similarity is equal to or greater than a certain threshold.

The processing described so far can be built as a program, and the program can be installed from a communication line or a recording medium and executed by means such as a CPU.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to concept search technology for retrieving search target documents that conceptually match a search query input by a user.

Description of symbols
10 Input means
20 Calculation means
22 Learning means
24 Updated search target document set database
26 Word concept base
28 Search target document concept base generation means
30 Search target document concept base
32 Search means
40 Output means
100 Document concept search device

Claims

A document concept base generation device comprising:
a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word;
a set A of search target documents, which are the documents to be searched;
learning means that takes as input a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, and that, for each correct document in the set B, updates the information on the correct document in the set A by associating with the correct document each search query that corresponds to it in the set B; and
search target document concept base generation means that, for each search target document in the set A, generates, for each text of the search target document and of the search queries associated with the search target document, a text concept vector, which is the concept vector of the text, by combining and normalizing the word concept vectors that correspond in the word concept base to the words in the text, generates a search target document concept vector, which is the concept vector of the search target document, by combining the text concept vectors, and generates a search target document concept base that stores a set of pairs each consisting of a search target document and its search target document concept vector.

A document concept search device comprising:
a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word;
a search target document concept base that stores a set of pairs each consisting of a search target document and a search target document concept vector, which is the concept vector of the search target document, wherein, given a set A of search target documents, which are the documents to be searched, and a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, the information on each correct document in the set A has been updated by associating with the correct document each search query that corresponds to it in the set B, and wherein, for each search target document in the updated set A, the search target document concept vector has been generated by combining text concept vectors, each text concept vector being the concept vector of one text of the search target document or of the search queries associated with the search target document and having been generated by combining and normalizing the word concept vectors that correspond in the word concept base to the words in that text; and
search means that, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that correspond in the word concept base to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A document concept search device comprising:
a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word;
a set A of search target documents, which are the documents to be searched;
learning means that takes as input a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, and that, for each correct document in the set B, updates the information on the correct document in the set A by associating with the correct document each search query that corresponds to it in the set B;
search target document concept base generation means that, for each search target document in the set A, generates, for each text of the search target document and of the search queries associated with the search target document, a text concept vector, which is the concept vector of the text, by combining and normalizing the word concept vectors that correspond in the word concept base to the words in the text, generates a search target document concept vector, which is the concept vector of the search target document, by combining the text concept vectors, and generates a search target document concept base that stores a set of pairs each consisting of a search target document and its search target document concept vector; and
search means that, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that correspond in the word concept base to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A document concept base generation method in a document concept base generation device that includes a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, and search target document concept base generation means, the method comprising:
a step in which the learning means takes as input a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, and, for each correct document in the set B, updates the information on the correct document in the set A by associating with the correct document each search query that corresponds to it in the set B; and
a step in which the search target document concept base generation means, for each search target document in the set A, generates, for each text of the search target document and of the search queries associated with the search target document, a text concept vector, which is the concept vector of the text, by combining and normalizing the word concept vectors that correspond in the word concept base to the words in the text, generates a search target document concept vector, which is the concept vector of the search target document, by combining the text concept vectors, and generates a search target document concept base that stores a set of pairs each consisting of a search target document and its search target document concept vector.

A document concept search method in a document concept search device that includes a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a search target document concept base, and search means,
the search target document concept base storing a set of pairs each consisting of a search target document and a search target document concept vector, which is the concept vector of the search target document, wherein, given a set A of search target documents, which are the documents to be searched, and a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, the information on each correct document in the set A has been updated by associating with the correct document each search query that corresponds to it in the set B, and wherein, for each search target document in the updated set A, the search target document concept vector has been generated by combining text concept vectors, each text concept vector being the concept vector of one text of the search target document or of the search queries associated with the search target document and having been generated by combining and normalizing the word concept vectors that correspond in the word concept base to the words in that text,
the method comprising a step in which the search means, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that correspond in the word concept base to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A document concept search method in a document concept search device that includes a word concept base that is a set of pairs each consisting of a word and a word concept vector representing the concept of the word, a set A of search target documents, which are the documents to be searched, learning means, search target document concept base generation means, and search means, the method comprising:
a step in which the learning means takes as input a set B of pairs each consisting of a search query and a set of correct documents, a correct document being a search target document in the set A that conceptually matches the search query, and, for each correct document in the set B, updates the information on the correct document in the set A by associating with the correct document each search query that corresponds to it in the set B;
a step in which the search target document concept base generation means, for each search target document in the set A, generates, for each text of the search target document and of the search queries associated with the search target document, a text concept vector, which is the concept vector of the text, by combining and normalizing the word concept vectors that correspond in the word concept base to the words in the text, generates a search target document concept vector, which is the concept vector of the search target document, by combining the text concept vectors, and generates a search target document concept base that stores a set of pairs each consisting of a search target document and its search target document concept vector; and
a step in which the search means, for a new search query, generates a search query concept vector, which is the concept vector of the search query, by combining the word concept vectors that correspond in the word concept base to the words in the search query, and calculates, for each search target document in the search target document concept base, the similarity between the search query concept vector and the concept vector of the search target document.

A program for causing a computer to function as each means of the document concept base generation device according to claim 1 or of the document concept search device according to claim 2 or 3, or for causing a computer to execute each step of the document concept base generation method according to claim 4 or of the document concept search method according to claim 5 or 6.