JP4774016B2

JP4774016B2 - Information search method, information search program, and information search device

Info

Publication number: JP4774016B2
Application number: JP2007140895A
Authority: JP
Inventors: 一生青山; 和巳斉藤
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2007-05-28
Filing date: 2007-05-28
Publication date: 2011-09-14
Anticipated expiration: 2027-05-28
Also published as: JP2008293444A

Description

The present invention relates to an information search method, an information search program, and an information search device.

Let an information space Ω be a set in which the relationship between elements (pieces of information) is defined by a distance, a dissimilarity, or a similarity. The distance between any two elements x, y ∈ Ω is defined by a distance function d(x, y). The distance function in the information space Ω satisfies the following expressions.
d(x, y) ≥ 0 ... Expression (1)
d(x, y) = d(y, x) ... Expression (2)
d(x, y) = 0 if x = y ... Expression (3)
Expression (1) is called the non-negativity condition and Expression (2) the symmetry condition. For an information search set X ⊂ Ω (also called the searched set or the search target set), which is a subset of the information space, the set R(q) ⊂ X of elements with the smallest distance to a query q ∈ Ω, itself an element of the information space, is given by Expression (4).

An information space in which the distance between elements satisfies Expressions (5) and (6) below, in addition to the non-negativity and symmetry conditions, is called a metric space Ωd.
d(x, y) = 0 iff x = y ... Expression (5)
d(x, y) + d(y, z) ≥ d(x, z) ... Expression (6)
Expression (5) is called the reflexivity condition and Expression (6) the triangle inequality.

TLAESA (Tree Linear Approximating and Eliminating Search Algorithm) is a conventional information search method that searches an information search set for elements similar to an input query, where the query is an element (a piece of information) of a metric space (see, for example, Non-Patent Document 1). TLAESA searches for information in two phases: preprocessing performed before a query is input, and post-processing performed after the query is input. The preprocessing consists of a step of selecting a plurality of elements of the information search set (called base prototypes) and calculating the distances between them and all other elements, and a step of constructing a binary tree made up of all elements of the information search set. The post-processing consists of a step of calculating, immediately after a query is input, the distances between the query and the selected base prototypes, and a step of searching while reducing the search space by using the binary tree together with the triangle inequality, a property of metric spaces that relates the distances among three elements. By exploiting the triangle inequality, TLAESA reduces the search space and thereby the search cost. Here, the search cost is a value used to evaluate the efficiency of an information search: the number of similarity or distance calculations performed between the query and elements of the search target set.

Besides TLAESA, other information search methods that reduce the search space by using the triangle inequality have been proposed, such as LAESA (Linear Approximating and Eliminating Search Algorithm; see, for example, Non-Patent Document 2) and AESA (Approximating and Eliminating Search Algorithm; see, for example, Non-Patent Document 3).
Non-Patent Document 1: Luisa Mico, Jose Oncina, Rafael C. Carrasco, "A fast branch & bound nearest neighbour classifier in metric space", Pattern Recognition Letters, vol. 17, pp. 731-739, 1996
Non-Patent Document 2: Luisa Mico, Jose Oncina, "A new version of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory requirements", Pattern Recognition Letters, vol. 15, pp. 9-17, January 1994
Non-Patent Document 3: E. Vidal, "An algorithm for finding nearest neighbours in (approximately) constant average time", Pattern Recognition Letters, vol. 4, pp. 145-157, July 1986

Suppose, for example, that documents similar to an input document are to be searched for in a group of document files, that is, the searched set consists of documents. The feature quantities extracted from the documents, which are used to calculate the similarity or distance that defines the relationship between documents, can then be high-dimensional. This is because a word vector made up of the distinct words appearing in a document is used as the feature quantity of the document, and each component of the word vector counts as one dimension, so there are as many dimensions as there are distinct words across all document files in the information search set (also called the search target set or the searched set).

Next, problems with TLAESA are described with reference to FIGS. 17 and 18.
The information search set in FIGS. 17 and 18 is a set whose elements are document files of 10 years of newspaper articles.
The distance between elements of the information search set is calculated by the following procedure. First, the text in each document file is morphologically analyzed, unnecessary stop words are removed, and words are extracted from the document (document file). A stop word is a word so common that it is unsuitable as a search term and is therefore ignored in information search. In Japanese, single-character hiragana or katakana words, for example, are stop words. Each extracted word is then weighted by the tf-idf (term frequency-inverse document frequency) method. The resulting weighted word vector is used as the feature quantity of the document file. With the document files of the information search set as elements, the distance between elements is then defined as the cosine distance between their feature quantities. When word vectors are used as feature quantities, cosine similarity is widely used as a measure of similarity or dissimilarity.
The number of elements (document files) used in this example was 64,585, and the weighted word vectors serving as feature quantities had 51,030 dimensions. In other words, the metric space can be said to have 51,030 dimensions.
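The feature extraction and distance described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the text does not give the exact tf-idf weighting or the exact definition of the cosine distance, so the sketch assumes the weight tf × log(N / df) and the distance 1 − cosine similarity. The function names are hypothetical, and morphological analysis and stop-word removal are assumed to have been done already.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (stop words already removed).
    Returns one sparse vector (dict word -> weight) per document.
    Assumed weighting: term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each word once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)
```

Documents that share no words get distance 1.0 under this definition, which is consistent with the observation that most pairs in a large vocabulary lie close to the maximum distance.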

FIG. 17 shows the cumulative distribution of the distances between the elements of 1 × 10^6 pairs of elements selected at random from this information search space.
In FIG. 17, the horizontal axis indicates the distance between the elements of a pair, and the vertical axis indicates the cumulative proportion of pairs having the corresponding distance among all pairs.
The distance is the cosine distance, normalized so that the distance between the two farthest elements in the information search set is 1.0.
FIG. 17 shows that very few pairs have a distance of 0.8 or less and that most pairs lie near 1.0. Specifically, pairs with a distance of 0.98 or more account for 90% of all pairs (thick line in FIG. 17).
That is, FIG. 17 shows that the elements of this information search set, whose elements are the document files of 10 years of newspaper articles, are sparsely distributed.

FIG. 18 shows the cumulative distribution of the lower bounds of the distances under the same conditions as in FIG. 17.
The lower bounds were calculated using the same pairs of elements as in FIG. 17 and 200 randomly selected elements (corresponding to the base prototypes of TLAESA).
In FIG. 18, the horizontal axis indicates the lower bound of the distance calculated in this way, and the vertical axis indicates the cumulative proportion of lower bounds with that value among all calculated lower bounds.
FIG. 18 shows that most lower bounds are 0.4 or less. In particular, lower bounds of 0.138 or less account for 90% of the total. The search space is reduced by removing, from the search target set (the set of elements whose distance to the query is still to be calculated), every element whose lower bound is larger than the distance between the query and some element at that point in the search process. Suppose that the distance between the query and some element at a certain point in the search process is 0.98. According to FIG. 18, almost no element has a lower bound larger than 0.98, so almost no elements are eliminated at that point. Thus, when the feature quantities of the elements are high-dimensional, the method of calculating the distance between an element of the information search set and the query, and removing from the set the elements whose lower bound exceeds that distance, does not work effectively.
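The pruning rule discussed here can be sketched as follows. This is a minimal Python illustration, assuming the usual pivot-based lower bound max over p of |d(q, p) − d(x, p)|, which follows from the triangle inequality; the text does not state the exact formula used for FIG. 18, and the function names are hypothetical.

```python
def lower_bound(d_q_pivots, d_x_pivots):
    """Lower bound on d(q, x) in a metric space.
    d_q_pivots[i] is d(q, p_i) and d_x_pivots[i] is d(x, p_i) for the same
    pivots (base prototypes) p_i. By the triangle inequality,
    d(q, x) >= |d(q, p_i) - d(x, p_i)| for every pivot."""
    return max(abs(a - b) for a, b in zip(d_q_pivots, d_x_pivots))

def can_prune(d_q_pivots, d_x_pivots, best_so_far):
    """x can be skipped if its lower bound already exceeds the best
    distance found so far."""
    return lower_bound(d_q_pivots, d_x_pivots) > best_so_far

# Low-dimensional case: points on a line, q = 0, x = 5, pivots at 1 and 10.
print(can_prune([1.0, 10.0], [4.0, 5.0], best_so_far=2.0))   # True

# High-dimensional case as in the text: every distance is close to 1.0,
# so the bound is tiny and nothing is pruned against a best distance of 0.98.
print(can_prune([0.99, 0.98], [0.98, 1.0], best_so_far=0.98))  # False
```

When all pairwise distances concentrate near 1.0, the differences |d(q, p) − d(x, p)| are necessarily small, which is why the bound rarely exceeds the current best distance.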

Thus, for an information search set whose metric space has a large number of dimensions, such as a set of document files, reducing the search space by using the lower bound of the distance calculated from the triangle inequality does not work effectively.
That is, even if TLAESA, LAESA, AESA, or the like is applied to such an information search set, the search space is hardly reduced. As a result, the distance to the query ends up being calculated for the elements one by one, and the information search is not performed efficiently.
Furthermore, TLAESA and similar methods use the triangle inequality as described above, which presupposes that the information search set is a metric space. It is therefore difficult to directly apply an algorithm based on a pruning method that uses the triangle inequality to reduce the search space, such as TLAESA, to information that does not form a metric space.

An object of the present invention is to make it possible to search for information even when the information search set is a high-dimensional metric space or an information space.

The present invention has been devised to solve the above problems. The information search method according to claim 1 is an information search method in an information search device that searches an information search set held in a storage unit for information similar to a query. The method includes a network generation process and an information search process. In the network generation process, a network generation unit of the information search device generates a one-component network by linking any two elements of the information search set either directly or indirectly via elements other than the two, and stores the network in the storage unit. In the information search process, a search processing unit of the information search device searches for elements similar to the query by performing the following steps:
(a1) acquiring from the storage unit the elements directly linked to a predetermined first element of the network, and selecting, from among them, the element with the highest similarity to the query as a second element;
(a2) if the similarity between the second element and the query is greater than a predetermined set similarity held in the storage unit, storing that similarity in the storage unit as the new set similarity;
(a3) taking the second element as a third element and acquiring from the storage unit the elements directly linked to the third element; and
(a4) selecting, from among the elements directly linked to the third element, the element that has never before been a second element and that has the highest similarity to the query as a new second element, and performing step (a2) on the new second element.
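Steps (a1) to (a4) can be read as a walk over the network that always moves to the most similar neighbor not yet used. The following Python sketch is one possible reading, not the claimed implementation. The names `greedy_walk`, `neighbors`, and `similarity` are hypothetical, and the stopping rule (a step budget, or no unused neighbor left) is an assumption, because the claim text does not state when the repetition ends.

```python
def greedy_walk(neighbors, similarity, query, start, max_steps=100):
    """neighbors: dict mapping each element to its directly linked elements.
    similarity: function (element, query) -> float, larger is more similar.
    Returns (best_element, best_similarity)."""
    best, best_sim = None, float("-inf")   # the "set similarity" of step (a2)
    used = set()            # elements that have already been a "second element"
    current = start         # the "first element", later the "third element"
    for _ in range(max_steps):              # assumed stopping rule
        candidates = [e for e in neighbors[current] if e not in used]
        if not candidates:                  # assumed stopping rule
            break
        # (a1)/(a4): most similar neighbor that was never a second element
        second = max(candidates, key=lambda e: similarity(e, query))
        used.add(second)
        sim = similarity(second, query)
        if sim > best_sim:                  # (a2): update the set similarity
            best, best_sim = second, sim
        current = second                    # (a3): second becomes third
    return best, best_sim
```

No triangle inequality appears anywhere in the walk; only similarities between the query and visited neighbors are evaluated, which is what allows the method to be applied outside metric spaces.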

With this method, the search uses a one-component network in which the elements are connected by links and does not use the triangle inequality, a property of metric spaces. The information search can therefore be performed even when the information search set is a high-dimensional metric space or is not a metric space.

Furthermore, the information search method according to claim 2 is the information search method according to claim 1, wherein, in the network generation process, the network generation unit performs the following for every element of the information search set: for an element x, which is an arbitrary element of the information search set, it calculates the similarity between x and each of the other elements, acquires a preset number of those elements in descending order of similarity to x, and links the acquired elements to x, thereby generating a nearest neighbor element network Γ(x) for the element x.

With this method, the network among the elements of the information search set is a collection of nearest neighbor element networks, each linking an arbitrary element x to a predetermined number of elements in its neighborhood, so a network with a small average shortest path length can be generated. Using such a network for information search makes it possible to reduce the search cost.

The information search method according to claim 3 is the information search method according to claim 2, wherein, in the information search process, the search processing unit acquires from the storage unit the nearest neighbor element network Γ(x0) for an element x0, which is an arbitrary element of the information search set, and an arbitrary element xmax and, when the query q is input via the input unit:
(b1) calculates the sets A = Γ(x0) ∪ {x0} and B = {x0};
(b2) acquires from the set A − B an element x1 that satisfies Expression (8);
(b3) if ρ(x1, q) > ρ(xmax, q), takes the element x1 as the new element xmax;
(b4) substitutes A ∪ Γ(x1) for A and B ∪ {x1} for B; and
(b5) repeats steps (b2) to (b4) and, when |A| exceeds a predetermined value β or the query matches the element x1, outputs the element xmax as the final output element.
Here, ρ(a, b) is the similarity between elements a and b of the information search set, |A| is the number of elements in the set A, A − B is the difference between the sets A and B, and Γ(a) is the nearest neighbor element network for the element a.
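Steps (b1) to (b5) can be sketched in Python as follows. Expression (8) is not reproduced in this text, so the sketch assumes that it selects the element of A − B with the highest similarity to q. The names `network_search`, `gamma`, and `rho` are hypothetical, and stopping when A − B becomes empty is an added safeguard that the claim does not mention.

```python
def network_search(gamma, rho, q, x0, beta, x_max=None):
    """gamma: dict mapping each element x to the set Γ(x) of elements
    directly linked to x. rho: similarity function rho(a, b).
    beta: upper limit on |A|. Returns the final output element x_max."""
    if x_max is None:
        x_max = x0                      # any element may serve as initial x_max
    A = set(gamma[x0]) | {x0}           # (b1)
    B = {x0}
    while A - B:                        # safeguard: nothing left to expand
        # (b2): assumed form of Expression (8), the most similar element
        x1 = max(A - B, key=lambda x: rho(x, q))
        if rho(x1, q) > rho(x_max, q):  # (b3)
            x_max = x1
        A |= set(gamma[x1])             # (b4)
        B.add(x1)
        if len(A) > beta or x1 == q:    # (b5)
            break
    return x_max
```

Under this reading, |A| bounds the number of elements whose similarity to the query may have been evaluated, so β acts as a cap on the search cost.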

With this method, the information search can be performed using a one-component network with a small average shortest path length, so the search cost can be reduced.

The information search program according to claim 4 is a program that causes a computer to execute the information search method according to any one of claims 1 to 3.

With this program, the search uses a one-component network in which the elements are connected by links and does not use the triangle inequality, a property of metric spaces. The information search can therefore be performed even when the information search set is a high-dimensional metric space or is not a metric space. In addition, the network among the elements of the information search set can be a collection of nearest neighbor element networks, each linking an arbitrary element x to a predetermined number of elements in its neighborhood, so a network with a small average shortest path length can be generated. Using such a network for information search makes it possible to reduce the search cost.

The information search device according to claim 5 is an information search device that searches an information search set held in a storage unit for information similar to a query, and includes a network generation unit and a search processing unit. The network generation unit has a function of generating a one-component network by linking any two elements of the information search set either directly or indirectly via elements other than the two, and of storing the network in the storage unit. The search processing unit has a function of searching for elements similar to the query by performing the following steps:
(a1) acquiring from the storage unit the elements directly linked to a predetermined first element of the network, and selecting, from among them, the element with the highest similarity to the query as a second element;
(a2) if the similarity between the second element and the query is greater than a predetermined set similarity held in the storage unit, storing that similarity in the storage unit as the new set similarity;
(a3) taking the second element as a third element and acquiring from the storage unit the elements directly linked to the third element; and
(a4) selecting, from among the elements directly linked to the third element, the element that has never before been a second element and that has the highest similarity to the query as a new second element, and performing step (a2) on the new second element.

With this information search device, the search uses a one-component network in which the elements are connected by links and does not use the triangle inequality, a property of metric spaces. The information search can therefore be performed even when the information search set is a high-dimensional metric space or is not a metric space.

According to the present invention, information can be searched for even when the information search set is a high-dimensional metric space or an information space.

The best mode for carrying out the present invention (hereinafter referred to as the "embodiment") is described in detail below with reference to the drawings.

(System configuration)
FIG. 1 is a diagram illustrating a configuration example of the information search system according to the present embodiment.
In the information search system 12, an information search device 1 that searches for information and a terminal 11 that transmits queries to the information search device 1 are connected via a physical network 10 such as a WAN (Wide Area Network) or a LAN (Local Area Network).

The information search device 1 includes a processing unit 2 that processes information, a storage unit 3 that stores the information to be searched and other data, an input unit 4 through which information is input, and an output unit 5 that outputs information search results and the like. The storage unit 3 consists of at least one of various storage media such as an HD (Hard Disk), a non-volatile memory, and a RAM (Random Access Memory), in a combination that depends on the configuration of the computer on which the program is installed.
A query transmitted from the terminal 11 is sent to the processing unit 2 via the physical network 10 and the input unit 4. A user may also input a query directly to the processing unit 2 via the input unit 4.
The processing unit 2 includes a network generation unit 21 that performs the network generation process and a search processing unit 22 that performs the information search process. Here, the network refers to the network among the elements of the information search set that is formed when the elements are connected by links.

The processing unit 2, and the network generation unit 21 and search processing unit 22 within it, are realized when an information search program stored in a storage device (not shown) such as an HD or a ROM (Read Only Memory) is loaded into a RAM (not shown) and executed by a CPU (Central Processing Unit) (not shown).

(Neighbor element network generation process)
First, an overview of the network generation process is described along FIG. 2, with reference to FIG. 1 as needed.
FIG. 2 is a diagram showing an overview of the network generation process according to the present embodiment.
In FIG. 2, reference numeral 100 denotes an element of the information search set, that is, a piece of information stored in the storage unit 3 and subject to search. Specifically, an element is, for example, a text file such as a newspaper article or a patent gazette, or a document file in XML (Extensible Markup Language).
First, the network generation unit 21 calculates the similarity between the elements 100. In the present embodiment the similarity is the cosine similarity, but this is not limiting: a formula based on a general distance definition, typified by the Minkowski distance, or on a similarity other than the cosine similarity may be used instead.
Then, as shown in FIG. 2(a), the network generation unit 21 selects an arbitrary element 100 (denoted x) from among the elements 100.
Then, as shown in FIG. 2(b), the network generation unit 21 selects k elements 100 with high similarity to the element x, where k is the number of neighbor elements input in advance. That is, the network generation unit 21 selects k elements 100 from the elements 100 other than x in descending order of similarity to x. The number of neighbor elements k is a parameter that must be set so that the generated network becomes a one-component network in which all elements 100 are connected by links.
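The selection of the k most similar elements can be sketched as follows. This is a minimal Python illustration; the function name `k_most_similar` and its parameters are hypothetical, and the brute-force scan over all other elements is an assumption made for simplicity.

```python
import heapq

def k_most_similar(x, elements, similarity, k):
    """Return the k elements other than x, in descending order of
    similarity to x. similarity(a, b) returns a float; larger means
    more similar."""
    others = [e for e in elements if e != x]
    return heapq.nlargest(k, others, key=lambda e: similarity(x, e))

# Toy example: elements are numbers, similarity is negative absolute difference.
elements = [0, 1, 2, 5, 9]
print(k_most_similar(2, elements, lambda a, b: -abs(a - b), k=3))  # [1, 0, 5]
```

Running this for every element x costs on the order of n^2 similarity calculations for n elements, but it is part of the network generation performed before any query is input.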

Here, a component is a subset of the information search set in which any two elements 100 are connected by at least one link or by a chain of links. A chain of links is a sequence of links such as the link between a first element 100 and a second element 100, the link between the second element 100 and a third element 100, ..., and the link between an (n−1)-th element 100 and an n-th element 100. In such a case, the first element 100 and the n-th element 100 are indirectly connected by the chain of links.
For example, "the network is one component" means that it is a network in which any two elements 100 are connected to each other by a link or by a chain of links.
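Whether a given network is one component can be checked with a breadth-first traversal. The following Python sketch assumes the network is stored as a dict that maps every element to the set of elements directly linked to it, with undirected links; this representation and the function name `is_one_component` are hypothetical.

```python
from collections import deque

def is_one_component(gamma):
    """gamma: dict mapping each element to the set of elements directly
    linked to it (links assumed undirected, every element present as a key).
    Returns True if every element is reachable from every other one."""
    if not gamma:
        return True                      # an empty network is trivially connected
    start = next(iter(gamma))
    seen = {start}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        for y in gamma[x]:
            if y not in seen:            # follow each chain of links once
                seen.add(y)
                queue.append(y)
    return len(seen) == len(gamma)
```

A check of this kind could be used when choosing the number of neighbor elements k, since a k that is too small may leave the network split into several components.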

The number of neighbor elements k is determined by using test data or the like with the search cost as the evaluation function (described later with reference to FIGS. 7, 8, 12, and 13). In FIG. 2, k = 3.
As shown in FIG. 2(b), the network generation unit 21 forms the element group Nk(x) for the element x by linking the selected k elements 100 to the element x.

After performing this process for every element 100, the network generation unit 21 refers, as shown in FIG. 2(c), to the element group Nk(y) of each element y other than x and extracts the elements y whose element group Nk(y) contains x.
The network generation unit 21 then links the element x shown in FIGS. 2(b) and 2(c) to each such element y, thereby calculating the union of the element group Nk(x) and the elements y. This procedure is called undirecting. The union is the nearest neighbor element network Γ(x) for the element x shown in FIG. 2(d). The nearest neighbor element network Γ(x) generated in this way is the network referred to in the claims and is an undirected network. Here, an undirected network is a network made up of elements connected by undirected links, that is, links with no designated start point or end point.

このよう手順によって、ネットワーク生成部２１が、各要素１００に対して最近傍要素ネットワークを算出すると、この最近傍要素ネットワークには以下のような性質が備わることとなる。
すなわち、任意の要素ｘ０に対する最近傍要素ネットワークΓ（ｘ０）を取得し、その最近傍要素ネットワークΓ（ｘ０）内における要素ｘ０以外の要素ｘ１を取得する。そして、取得した要素ｘ１に対する最近傍要素ネットワークΓ（ｘ１）を取得し、その最近傍要素ネットワークΓ（ｘ１）内における要素ｘ１以外の要素ｘ２を取得する。次に、取得した要素ｘ２に対する最近傍要素ネットワークΓ（ｘ２）を取得し、その最近傍要素ネットワークΓ（ｘ２）内における要素ｘ２以外の要素ｘ３を取得する、といった手順を繰り返すことにより、情報探索集合内の全要素１００をたどることができる。すなわち、図２（ｄ）の破線で示すように、情報探索集合内の全要素１００がリンクによって、直接的または間接的に連結した１コンポーネントのネットワークを生成することができる。ここで、要素ｘに対する最近傍要素ネットワークΓ（ｘ）を構成する要素は、要素ｘに直接的にリンク結合している要素となる。例えば、図２（ｄ）において、要素１０１と、要素１０２とは、直接的にリンク結合している。そして、要素１０１と、要素１０４とは、要素１０２および要素１０３を介することによって間接的に結合している。このように、生成されたネットワークの任意の２つの要素は、直接的または間接的にリンク結合している。このように、最近傍要素ネットワークを連結することによって、生なされる１コンポーネントのネットワークを近傍要素ネットワークΓ（図示せず）と記載することとする。 When the network generation unit 21 calculates the nearest element network for each element 100 according to such a procedure, the nearest element network has the following properties.
That is, the nearest element network Γ (x0) for an arbitrary element x0 is acquired, and an element x1 other than the element x0 in the nearest element network Γ (x0) is acquired. Then, the nearest element network Γ (x1) for the obtained element x1 is obtained, and the element x2 other than the element x1 in the nearest element network Γ (x1) is obtained. Next, an information search is performed by repeating the procedure of obtaining the nearest element network Γ (x2) for the obtained element x2 and obtaining the element x3 other than the element x2 in the nearest element network Γ (x2). All elements 100 in the set can be traced. That is, as shown by a broken line in FIG. 2D, a one-component network in which all elements 100 in the information search set are directly or indirectly connected by a link can be generated. Here, the elements constituting the nearest element network Γ (x) for the element x are elements that are directly linked to the element x. For example, in FIG. 2D, the element 101 and the element 102 are directly linked. The element 101 and the element 104 are indirectly coupled via the element 102 and the element 103. Thus, any two elements of the generated network are linked directly or indirectly. In this way, the one-component network generated by connecting the nearest element networks is described as a neighborhood element network Γ (not shown).

このように、近傍要素ネットワークΓでは、情報探索集合内のすべての前記要素１００間をリンクによって結合させているため（１コンポーネント）、リンクに従って、任意の要素１００よりクエリに近い要素１００を順に取得することにより、情報探索を行う際に、情報探索が途絶することがない。 In this way, in the neighborhood element network Γ, all the elements 100 in the information search set are linked by a link (one component), and thus the element 100 closer to the query than the arbitrary element 100 is sequentially acquired according to the link. Thus, the information search is not interrupted when the information search is performed.
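The one-component property described above can be checked mechanically. The following is a minimal sketch, assuming the neighborhood element network is held as a dictionary that maps each element to the set of elements linked to it; the function name `count_components` and this dictionary layout are illustrative assumptions, not part of the original description.

```python
from collections import deque


def count_components(gamma):
    """Count connected components of an undirected network.

    gamma: dict mapping each element to the set of elements linked to it
    (hypothetical layout; every element must appear as a key).
    """
    seen = set()
    components = 0
    for start in gamma:
        if start in seen:
            continue
        # Each unseen element starts a new component; breadth-first
        # traversal then marks everything reachable through links.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in gamma[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
    return components
```

A return value of 1 corresponds to the "one component" case in which any two elements are connected directly or through a chain of links.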

次に、図１および図２を参照しつつ、図３を参照してネットワーク生成処理の流れを説明する。
図３は、本実施形態に係るネットワーク生成処理の流れを示すフローチャートである。
なお、図３の処理は、例えば、新たな情報探索集合を探索対象とする場合、または、新しい情報（要素）が情報探索集合に加わった場合に行われる処理である。
まず、情報探索装置１の入力部４を介して、近傍要素数ｋと、要素１００（図２参照）および要素１００の特徴量のリストとが、情報探索装置１へ入力され（Ｓ１０１）、記憶部３に格納される。
ネットワーク生成部２１は、記憶部３から、要素１００の特徴量を取得し、探索対象集合Ｘの要素１００である各要素ｘ（ｘ∈Ｘ）に対し、他の要素１００との類似度を、取得した特徴量を使用することによって算出する（Ｓ１０２）。
そして、ネットワーク生成部２１は、記憶部３から、近傍要素数ｋを取得すると、算出した類似度を基に、各要素ｘに対して、類似度の高いｋ個の要素群Ｎｋ（ｘ）を生成し（Ｓ１０３：図２（ｂ）参照）、記憶部３に記憶する。すなわち、要素ｘと、要素群Ｎｋ（ｘ）を構成する各要素とをリンク結合する。要素群Ｎｋ（ｘ）は、要素ｘに対し類似度の最も高い要素から降順にｋ個の要素の集合になる。
次に、ネットワーク生成部２１は、探索対象集合Ｘ中のすべての要素ｘに対して、ステップＳ１０３の処理を行ったか否かを判定する（Ｓ１０４）。判定は、例えば、ステップＳ１０３の処理を行った要素ｘにフラグを付し、このフラグがすべての要素１００に対し付されているか否かを、ネットワーク生成部２１が判定することによって行われる。
ステップＳ１０４の結果、すべての要素ｘに対して、ステップＳ１０３の処理を行っていないと判定された場合（Ｓ１０４→Ｎｏ）、ネットワーク生成部２１は、ステップＳ１０３の処理へ戻る。
ステップＳ１０４の結果、すべての要素ｘに対して、ステップＳ１０３の処理を行っていると判定された場合（Ｓ１０４→Ｙｅｓ）、ネットワーク生成部２１は、ステップＳ１０５の処理へ進む。 Next, the flow of network generation processing will be described with reference to FIG. 3 while referring to FIG. 1 and FIG.
FIG. 3 is a flowchart showing a flow of network generation processing according to the present embodiment.
Note that the process of FIG. 3 is performed, for example, when a new information search set is set as a search target or when new information (an element) is added to the information search set.
First, the number k of neighboring elements and a list of the elements 100 (see FIG. 2) and their feature quantities are input to the information search apparatus 1 via the input unit 4 of the information search apparatus 1 (S101) and stored in the storage unit 3.
The network generation unit 21 acquires the feature quantities of the elements 100 from the storage unit 3 and, for each element x (x∈X) of the search target set X, calculates the similarity to the other elements 100 using the acquired feature quantities (S102).
Then, on acquiring the number k of neighboring elements from the storage unit 3, the network generation unit 21 generates, for each element x, an element group Nk (x) of the k elements with the highest similarity based on the calculated similarities (S103: see FIG. 2B) and stores it in the storage unit 3. That is, the element x is linked to each element constituting the element group Nk (x). The element group Nk (x) is the set of k elements taken in descending order of similarity to the element x, starting from the most similar element.
Next, the network generation unit 21 determines whether or not the process of step S103 has been performed on all the elements x in the search target set X (S104). The determination is made, for example, by adding a flag to the element x that has been processed in step S103, and determining whether or not this flag is assigned to all the elements 100 by the network generation unit 21.
As a result of step S104, when it is determined that the process of step S103 is not performed for all the elements x (S104 → No), the network generation unit 21 returns to the process of step S103.
As a result of step S104, when it is determined that the process of step S103 is performed for all the elements x (S104 → Yes), the network generation unit 21 proceeds to the process of step S105.

次に、ネットワーク生成部２１は、要素ｙ（ｙ∈Ｘ）に関して、記憶部３に記憶されている要素群Ｎｋ（ｙ）を取得し、要素ｙ自身は、Ｎｋ（ｘ）に含まれないが、要素ｘは、Ｎｋ（ｙ）に含まれている要素ｙを求める（図２（ｃ）参照）。そして、ネットワーク生成部２１は、このような要素ｙを、Ｎｋ（ｘ）と対応付けて記憶部３に記憶させることによって、要素ｘと、要素ｙとをリンク結合し、要素ｘに対する無向リンクを設定する（Ｓ１０５：図２（ｄ）参照）。
すなわち、式（７）に示す最近傍要素ネットワークΓ（ｘ）を算出する。 Next, for an element y (y∈X), the network generation unit 21 acquires the element group Nk (y) stored in the storage unit 3 and finds each element y that is not itself included in Nk (x) but whose element group Nk (y) includes the element x (see FIG. 2C). The network generation unit 21 then stores such an element y in the storage unit 3 in association with Nk (x), thereby linking the element x and the element y and setting an undirected link for the element x (S105: see FIG. 2D).
That is, the nearest neighbor element network Γ (x) shown in Expression (7) is calculated.
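Expression (7) appears as an image in the original publication and is not reproduced in this text. Based on the description of steps S103 to S105, a form consistent with it would be the following; the set-builder notation is an assumption, not the published formula.

```latex
\Gamma(x) = N_k(x) \cup \{\, y \in X \mid x \in N_k(y) \,\}
```

Because the right-hand side is a union, elements y that already belong to N_k(x) are simply not added a second time.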

そして、ネットワーク生成部２１は、探索対象集合Ｘ中のすべての要素ｙに対して、ステップＳ１０５の処理を行ったか否かを判定する（Ｓ１０６）。判定は、例えば、ステップＳ１０５の後に、要素ｙにフラグを付し、このフラグがすべての要素１００に対し付されているか否かを、ネットワーク生成部２１が判定することによって行われる。
ステップＳ１０６の結果、すべての要素ｙに対して、ステップＳ１０５の処理を行っていないと判定された場合（Ｓ１０６→Ｎｏ）、ネットワーク生成部２１は、ステップＳ１０５の処理へ戻る。
ステップＳ１０６の結果、すべての要素ｙに対して、ステップＳ１０５の処理を行ったと判定された場合（Ｓ１０６→Ｙｅｓ）、ネットワーク生成部２１は、処理を終了する。 Then, the network generation unit 21 determines whether or not the process of step S105 has been performed on all elements y in the search target set X (S106). The determination is made, for example, by adding a flag to the element y after step S105 and determining whether or not this flag is attached to all the elements 100 by the network generation unit 21.
As a result of step S106, when it is determined that the process of step S105 is not performed for all elements y (S106 → No), the network generation unit 21 returns to the process of step S105.
As a result of step S106, when it is determined that the process of step S105 has been performed on all elements y (S106 → Yes), the network generation unit 21 ends the process.
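The network generation flow of steps S101 to S106 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_neighbor_network`, the dictionary-of-sets representation, and the brute-force computation of all pairwise similarities are assumptions made for the example.

```python
def build_neighbor_network(elements, k, similarity):
    """Build the undirected nearest neighbor element networks.

    elements:   list of hashable elements (the search target set X)
    k:          number of neighboring elements
    similarity: function (a, b) -> float, higher means more similar
    Returns a dict mapping each element x to its set Gamma(x).
    """
    # S102-S104: for every x, select the k most similar other elements, N_k(x).
    n_k = {}
    for x in elements:
        others = [y for y in elements if y != x]
        others.sort(key=lambda y: similarity(x, y), reverse=True)
        n_k[x] = set(others[:k])

    # S105-S106 (undirecting): if x is in N_k(y), also record y in Gamma(x),
    # so that every link can be followed in both directions.
    gamma = {x: set(n_k[x]) for x in elements}
    for y in elements:
        for x in n_k[y]:
            gamma[x].add(y)
    return gamma
```

The size of each resulting set can exceed k, since an element may be chosen as a neighbor by elements that it did not choose itself.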

図４は、ネットワーク生成部によって算出された近傍要素ネットワークの記憶部での記憶状態を示す図である。
図４において、符号２００は、最近傍要素ネットワークの中心となる要素（中心要素：図２（ｄ）における要素ｘ。ただし、中心要素自身は、最近傍要素ネットワークに含まれない）の要素番号であり、符号２０１は、この中心要素に対して最近傍要素ネットワークを構成している要素（図２（ｄ）における符号Γ（ｘ））の要素番号である。なお、ここでは、要素毎に一意の要素番号を予め付されているものとする。
例えば、要素番号「１」の要素に対して最近傍要素ネットワークを構成している要素は、要素番号「３」，「５」，「７」である。そして、要素番号「２」の要素に対して最近傍要素ネットワークを構成している要素は、要素番号「３」，「６」，「８」，「９」，「１１」である。また、要素番号「３」の要素に対して最近傍要素ネットワークを形成している要素は、要素番号「１」，「２」，「９」，「１０」である。
最近傍要素ネットワークを構成する要素の数が異なるのは、図２（ｃ）における要素ｙの数がそれぞれの要素毎に異なるためである。 FIG. 4 is a diagram illustrating a storage state in the storage unit of the neighborhood element network calculated by the network generation unit.
In FIG. 4, reference numeral 200 denotes the element number of the element at the center of a nearest neighbor element network (the central element: element x in FIG. 2D; the central element itself is not included in its nearest neighbor element network), and reference numeral 201 denotes the element numbers of the elements constituting the nearest neighbor element network for that central element (Γ (x) in FIG. 2D). Here, it is assumed that a unique element number is assigned to each element in advance.
For example, the elements constituting the nearest element network with respect to the element with the element number “1” are the element numbers “3”, “5”, and “7”. The elements constituting the nearest element network with respect to the element with the element number “2” are the element numbers “3”, “6”, “8”, “9”, and “11”. In addition, the elements forming the nearest element network with respect to the element with the element number “3” are the element numbers “1”, “2”, “9”, and “10”.
The number of elements constituting a nearest neighbor element network differs from element to element because the number of elements y in FIG. 2C differs for each element.
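The stored table of FIG. 4 amounts to an adjacency list keyed by element number. The sketch below holds only the three rows quoted in the text; the variable name `gamma_table` and the helper `is_reciprocal_on_known_rows` are hypothetical and serve only to illustrate that the stored links are undirected.

```python
# Rows quoted from FIG. 4: central element number -> neighbor element numbers.
gamma_table = {
    1: [3, 5, 7],
    2: [3, 6, 8, 9, 11],
    3: [1, 2, 9, 10],
}


def is_reciprocal_on_known_rows(table):
    """Check x in Gamma(y) whenever y in Gamma(x), for rows that are present.

    Neighbors whose own row is not in the table are skipped, because the
    text quotes only part of FIG. 4.
    """
    return all(
        x in table[y]
        for x, neighbors in table.items()
        for y in neighbors
        if y in table
    )
```

For the quoted rows, elements 1 and 3 list each other, as do elements 2 and 3, which is what the undirecting step is meant to guarantee.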

このようなネットワーク生成処理によれば、所定数の近傍要素の集合である最近傍要素ネットワークの集合として、近傍要素ネットワークを生成するため、平均最短パス長の小さいネットワークの生成が可能となる。ここで、パス長とは、情報探索集合内における任意の２つのノード間のリンクの数である。また、このようなネットワーク生成処理によれば、情報探索集合内のすべての要素に対し、各要素を中心要素とした最近傍要素ネットワークが存在するため、任意の要素を中心要素とする最近傍ネットワークを取得していくことで、情報探索集合内のすべての要素がリンクによって結合している最近傍要素ネットワークを生成することができる。 According to such a network generation process, a neighborhood element network is generated as a set of nearest neighbor element networks, which is a set of a predetermined number of neighborhood elements, so that a network with a small average shortest path length can be generated. Here, the path length is the number of links between any two nodes in the information search set. Also, according to such a network generation process, for every element in the information search set, there is a nearest element network with each element as a central element, so the nearest neighbor network with any element as a central element By acquiring, it is possible to generate a nearest neighbor element network in which all elements in the information search set are connected by links.

また、情報探索集合内における要素ｘとは異なる任意の要素である要素ｙを取得し、要素ｙとは異なる要素のうち、要素ｙとの類似度が高い順に、所定数の要素を取得し、当該取得した要素の集合を最近傍要素ネットワークΓ（ｙ）とし、最近傍要素ネットワークΓ（ｙ）の要素に、要素ｘが含まれている場合、要素ｙと、最近傍要素ネットワークΓ（ｘ）との和集合を、要素ｘに対する最近傍要素ネットワークとすることにより、無向リンクが設定される。この無向リンクが設定されることにより、要素ｘが、要素ｙに対する最近傍要素ネットワークに含まれるが、要素ｘに対する最近傍要素ネットワークに、要素ｙが含まれない状態となることを避けることができ、確実に情報探索集合内のすべての要素がリンクによって結合している近傍要素ネットワークを生成することができる。 Also, an element y, which is an arbitrary element different from the element x in the information search set, is acquired; among the elements different from the element y, a predetermined number of elements are acquired in descending order of similarity to the element y; and the set of acquired elements is taken as the nearest neighbor element network Γ (y). When the element x is included in the elements of the nearest neighbor element network Γ (y), an undirected link is set by taking the union of the element y and the nearest neighbor element network Γ (x) as the nearest neighbor element network for the element x. Setting this undirected link avoids a state in which the element x is included in the nearest neighbor element network for the element y while the element y is not included in the nearest neighbor element network for the element x, so that a neighborhood element network in which all elements in the information search set are connected by links can be generated reliably.

（情報探索処理）
次に、図１を参照しつつ、図５に沿って、情報探索処理の概要について説明する。
図５は、本実施形態に係る情報探索処理の概要を示す図である。
図５において、符号は、図２と同様に、情報探索集合における要素（白丸または黒丸にて表現）、すなわち、記憶部３に格納され、探索対象となる情報である。
また、図５における破線で示すリンクによって全要素が連結した１コンポーネントの近傍要素ネットワークがネットワーク生成処理部２によって生成されているものとする。
探索処理部２２は、予め定められている要素、または、任意の要素を起点要素ｘ０とする。そして、当該起点要素ｘ０に対する最近傍要素ネットワークΓ（ｘ０）を記憶部３から取得する。
続いて、探索処理部２２は、起点要素ｘ０を展開要素集合Ｂの要素とし（Ｂ＝｛ｘ０｝）、取得した最近傍要素ネットワークΓ（ｘ０）と展開要素集合Ｂとの和集合を類似度計算要素集合Ａ（Ａ＝Γ（ｘ０）∪｛ｘ０｝）として求める。ここで、展開要素集合とは、ある要素ｘに対する最近傍要素ネットワークΓ（ｘ）の要素とクエリとの類似度計算を実行する場合の要素ｘから構なされる集合である。要素ｘから直接リンク結合されている要素を要素ｘの子要素と表現するときは、子要素とクエリとの類似度が計算される要素の集合である。一方、類似度計算要素集合とは、クエリとの類似度計算が実行される要素の集合である。以降、展開要素集合Ａを集合Ａと、類似度計算要素集合Ｂを集合Ｂと簡略し表現する。
そして、探索処理部２２は、集合Ａと集合Ｂとの差集合を構成する要素のうち、図示しないクエリとの類似度が最も大きい要素を抽出する。前記差集合は、既にクエリとの類似度計算を実行された要素であって、未だ展開されていない（子要素とクエリとの類似度計算が実行されていない）要素からなる集合である。
この場合、図５（ｂ）に示すように、要素ｘ１が、探索処理部２２によって抽出されたとする。 (Information search process)
Next, the outline of the information search process will be described along FIG. 5 with reference to FIG.
FIG. 5 is a diagram showing an outline of the information search process according to the present embodiment.
In FIG. 5, as in FIG. 2, the reference numerals denote elements in the information search set (represented by white or black circles), that is, information that is stored in the storage unit 3 and is to be searched.
It is also assumed that a one-component neighborhood element network, in which all elements are connected by the links shown by broken lines in FIG. 5, has been generated by the network generation processing unit 2.
The search processing unit 22 sets a predetermined element or an arbitrary element as the starting element x0. Then, it acquires the nearest neighbor element network Γ (x0) for the starting element x0 from the storage unit 3.
Subsequently, the search processing unit 22 makes the starting element x0 an element of the expanded element set B (B = {x0}) and obtains the union of the acquired nearest neighbor element network Γ (x0) and the expanded element set B as the similarity calculation element set A (A = Γ (x0) ∪ {x0}). Here, the expanded element set is the set of elements x for which the similarity between the query and the elements of the nearest neighbor element network Γ (x) has been calculated. If the elements directly linked to an element x are called the child elements of x, it is the set of elements whose child elements have had their similarity to the query calculated. The similarity calculation element set, on the other hand, is the set of elements for which the similarity to the query has been calculated. Hereinafter, the similarity calculation element set A is abbreviated as set A, and the expanded element set B as set B.
Then, the search processing unit 22 extracts, from the elements constituting the difference set of the set A and the set B, the element with the highest similarity to the query (not shown). This difference set consists of elements whose similarity to the query has already been calculated but which have not yet been expanded (the similarity between their child elements and the query has not been calculated).
Here, it is assumed that the element x1 is extracted by the search processing unit 22, as illustrated in FIG. 5B.

次に、類似度ρ（ｘ１，ｑ）＞類似度ρ（ｘｍａｘ，ｑ）を満たしているとしたとき、探索処理部２２は、図５（ｃ）に示すように、要素ｘ１を要素ｘｍａｘ（図示せず）とし、要素ｘｍａｘを更新し保持する。また、要素ｘ１が前記条件を充足しない場合は、要素ｘｍａｘの更新は行われない。探索処理部２２は、要素ｘ１に対して最近傍要素ネットワークΓ（ｘ１）（図５（ｃ）において実線で示されるリンクで結合している要素）を記憶部３から取得する。ここで、類似度ρ（ｘｍａｘ，ｑ）が、請求項における設定類似度であり、情報探索部が、要素ｘｍａｘを保持することにより、設定類似度も保持することになる。
そして、探索処理部２２は、要素ｘ１に対する最近傍要素ネットワークを構成する要素群Γ（ｘ１）の要素からなる集合と集合Ａとの和集合を新たな集合Ａとする。さらに、探索処理部２２は、要素ｘ１を、図５（ａ）に示す集合Ｂの要素に加え、新たな集合Ｂとする。そして、探索処理部２２は、新たな集合Ａと、集合Ｂとの差集合を構成する要素の中で、クエリとの類似度が最も大きい要素を抽出し、当該要素を新たな要素ｘ１とする（図５（ｃ）：Γ（ｘ１→ｘ１）。すなわち、図５（ｃ）に示すように、探索処理部２２は、新たな要素ｘ１を抽出する。 Next, if similarity ρ (x1, q) > similarity ρ (xmax, q) is satisfied, the search processing unit 22 takes the element x1 as the element xmax (not shown), as illustrated in FIG. 5C, updating and holding the element xmax. If the element x1 does not satisfy this condition, the element xmax is not updated. The search processing unit 22 acquires, from the storage unit 3, the nearest neighbor element network Γ (x1) for the element x1 (the elements connected by links indicated by solid lines in FIG. 5C). Here, the similarity ρ (xmax, q) is the set similarity in the claims, and the information search unit, by holding the element xmax, also holds the set similarity.
Then, the search processing unit 22 takes the union of the set A and the set of elements of the element group Γ (x1) constituting the nearest neighbor element network for the element x1 as a new set A. Further, the search processing unit 22 adds the element x1 to the elements of the set B shown in FIG. 5A to form a new set B. Then, the search processing unit 22 extracts, from the elements constituting the difference set of the new set A and the new set B, the element with the highest similarity to the query and takes it as the new element x1 (FIG. 5C: Γ (x1 → x1)). That is, as shown in FIG. 5C, the search processing unit 22 extracts a new element x1.

そして、類似度ρ（ｘ１，ｑ）＞類似度ρ（ｘｍａｘ，ｑ）を満たしているとしたとき、探索処理部２２は、この要素ｘ１を新たな要素ｘｍａｘ（図示せず）として保持する。そして、探索処理部２２は、要素ｘ１に対する最近傍要素ネットワークΓ（ｘ１）の要素（図５（ｄ）中、実線で示されるリンクで結合している要素）からなる集合と集合Ａとの和集合を新たな集合Ａとし、要素ｘ１を、図５（ａ）に示す集合Ｂの要素に加え、新たな集合Ｂとする。そして、新たな集合Ａと、集合Ｂとの差集合を構成する要素の中で、クエリとの類似度が最も大きい要素を抽出する（図５（ｄ））。
このような処理を繰り返し、集合Ａの要素数が上限コストβを超えたとき（第１終了条件）、または、要素ｘ１とクエリとの類似度が１となった（クエリと一致する要素を抽出した：第２終了条件）とき、要素ｘｍａｘを最終出力要素とする。ただし、第２終了条件の設定の有無は情報探索集合に依存する。 If similarity ρ (x1, q) > similarity ρ (xmax, q) is satisfied, the search processing unit 22 holds this element x1 as the new element xmax (not shown). Then, the search processing unit 22 takes the union of the set A and the set of elements of the nearest neighbor element network Γ (x1) for the element x1 (the elements connected by links indicated by solid lines in FIG. 5D) as a new set A, and adds the element x1 to the elements of the set B shown in FIG. 5A to form a new set B. Then, from the elements constituting the difference set of the new set A and the new set B, the element with the highest similarity to the query is extracted (FIG. 5D).
This processing is repeated, and when the number of elements in the set A exceeds the upper limit cost β (first termination condition), or when the similarity between the element x1 and the query becomes 1 (an element matching the query has been extracted: second termination condition), the element xmax is taken as the final output element. Whether the second termination condition is set depends on the information search set.

次に、図１および図５を参照しつつ、図６に沿って、情報探索処理の流れを説明する。
図６は、本実施形態に係る情報探索処理の流れを示すフローチャートである。
情報探索装置１の記憶部３には、予め入力部４を介して入力されたコスト上限βと、要素と特徴量のリストと、起点要素ｘ０と、ネットワーク生成処理で算出された最近傍要素ネットワークΓ（ｘ）が要素ごとに格納されている。
まず、探索処理部２２は、起点要素ｘ０（ｘ０∈Ｘ）を記憶部から取得し、この起点要素ｘ０に対する最近傍要素ネットワークΓ（ｘ０）を記憶部３から取得する（Ｓ２０１）。すなわち、情報探索装置１は、最近傍要素ネットワークΓ（ｘ０）をＲＡＭなどのメモリ上に常駐させている。
次に、入力部４を介して、クエリｑが情報探索装置１に入力される（Ｓ２０２）。クエリの入力は、端末１１から物理的ネットワーク１０を介することによって、入力されてもよいし、直接入力部４から入力されてもよい。また、本実施形態では、探索処理部２２が、起点要素ｘ０を対する最近傍要素ネットワークΓ（ｘ０）を記憶部３から取得した後に、クエリｑが入力されたが、これに限らず、クエリｑが入力されてから、探索処理部２２が、起点要素ｘ０に対する最近傍要素ネットワークΓ（ｘ０）を記憶部３から取得してもよい。
次に、探索処理部２２は、起点要素ｘ０とクエリｑとの類似度ρ（ｘ０，ｑ）を算出し（Ｓ２０３）、記憶部３に格納する。なお、探索処理部２２は、この時点で、起点要素ｘ０に対する最近傍要素ネットワークΓ（ｘ０）を記憶部３から取得してもよい。ここで、ρ（・）は、例えば、コサイン類似度関数などの類似度関数であり、ρ（ａ，ｂ）＝ρ（ｂ，ａ）∈［０，１］、ａ，ｂ∈Ｘの性質を有する。ただし、任意の要素ａは、自分自身との類似度が最も大きくρ（ａ，ａ）＝１である。 Next, the flow of the information search process will be described along FIG. 6 with reference to FIGS. 1 and 5.
FIG. 6 is a flowchart showing a flow of information search processing according to the present embodiment.
The storage unit 3 of the information search device 1 stores the cost upper limit β input in advance via the input unit 4, the list of elements and feature quantities, the starting element x0, and, for each element, the nearest neighbor element network Γ (x) calculated by the network generation process.
First, the search processing unit 22 acquires the starting element x0 (x0∈X) from the storage unit 3 and acquires the nearest neighbor element network Γ (x0) for this starting element x0 from the storage unit 3 (S201). That is, the information search apparatus 1 keeps the nearest neighbor element network Γ (x0) resident in a memory such as a RAM.
Next, the query q is input to the information search apparatus 1 via the input unit 4 (S202). The query may be input from the terminal 11 via the physical network 10 or directly from the input unit 4. In the present embodiment, the query q is input after the search processing unit 22 has acquired the nearest neighbor element network Γ (x0) for the starting element x0 from the storage unit 3; however, this is not a limitation, and the search processing unit 22 may acquire the nearest neighbor element network Γ (x0) for the starting element x0 from the storage unit 3 after the query q has been input.
Next, the search processing unit 22 calculates the similarity ρ (x0, q) between the starting element x0 and the query q (S203) and stores it in the storage unit 3. Note that the search processing unit 22 may instead acquire the nearest neighbor element network Γ (x0) for the starting element x0 from the storage unit 3 at this point. Here, ρ (·) is a similarity function such as the cosine similarity function, and has the property ρ (a, b) = ρ (b, a) ∈ [0, 1] for a, b∈X. In addition, any element a is most similar to itself, with ρ (a, a) = 1.
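As one concrete instance of the similarity function ρ, the cosine similarity mentioned above can be sketched as follows. The function name `cosine_similarity`, the use of plain Python sequences as feature vectors, and the handling of zero vectors are assumptions for illustration. The stated range [0, 1] holds when all vector components are non-negative, as is the case for term-weight vectors; for general vectors the value lies in [-1, 1].

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors.

    Symmetric in its arguments; equals 1 for vectors pointing in the same
    direction (up to floating-point rounding).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Because of rounding, an exact comparison with 1 may fail for vectors that are mathematically identical in direction, so a small tolerance is usually safer when testing for a match.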

そして、探索処理部２２は、集合Ａ＝Γ（ｘ０）∪｛ｘ０｝および集合Ｂ＝｛ｘ０｝を算出する（Ｓ２０４：図５（ａ））。
次に、探索処理部２２は、集合Ａの要素の数｜Ａ｜を算出し、｜Ａ｜＞上限コストβ、または、クエリｑと要素ｘｍａｘとの類似度（設定類似度）が１であること、すなわちρ（ｘｍａｘ，ｑ）＝１（クエリと要素とが一致していること）を満たしているか否かを判定する（Ｓ２０５）。ここで、｜・｜は、該当する集合の要素の数である。なお、要素ｘｍａｘの初期要素は、特に限定しないが、要素ｘ０などを代入しておいてもよい。ここで、算出されたρ（ｘｍａｘ，ｑ）は、記憶部３に格納される。
ステップＳ２０５の結果、｜Ａ｜＞β、または、ρ（ｘｍａｘ，ｑ）＝１を満たしている場合（Ｓ２０５→Ｙｅｓ）、探索処理部２２は、要素ｘｍａｘを最終出力要素ｘ２として出力し（Ｓ２０６）、処理を終了する。なお、本実施形態では、ステップＳ２０４の処理において、｜Ａ｜＞上限コストβ、または、クエリｑと要素ｘｍａｘとの類似度が１であることを判定しているが、これに加え、探索処理部２２が、図示しないタイマなどを監視し、所定の計算時間を越えているか否かを判定してもよい。 Then, the search processing unit 22 calculates a set A = Γ (x0) ∪ {x0} and a set B = {x0} (S204: FIG. 5A).
Next, the search processing unit 22 calculates the number of elements | A | in the set A, and | A |> the upper limit cost β, or the similarity (set similarity) between the query q and the element xmax is 1. That is, it is determined whether or not ρ (xmax, q) = 1 (the query and the element match) is satisfied (S205). Here, | · | is the number of elements of the corresponding set. The initial element of element xmax is not particularly limited, but element x0 or the like may be substituted. Here, the calculated ρ (xmax, q) is stored in the storage unit 3.
When | A | > β or ρ (xmax, q) = 1 is satisfied as a result of step S205 (S205 → Yes), the search processing unit 22 outputs the element xmax as the final output element x2 (S206) and ends the process. In the present embodiment, step S205 determines whether | A | > the upper limit cost β or the similarity between the query q and the element xmax is 1; in addition, the search processing unit 22 may monitor a timer or the like (not shown) and determine whether a predetermined calculation time has been exceeded.

ステップＳ２０５の結果、条件を満たしていない場合（Ｓ２０４→Ｎｏ）、探索処理部２２は、集合Ａと集合Ｂとの差集合を算出し（Ｓ２０７）、当該差集合の要素ｙとクエリｑとの類似度ρ（ｙ，ｑ）を算出する（Ｓ２０８）。
そして、探索処理部２２は、集合Ａと、集合Ｂとの差集合におけるすべての要素ｙ（ｙ∈Ｘ）に対して、ステップＳ２０８の処理を行ったか否かを判定する（Ｓ２０９）。判定は、例えば、ステップＳ２０８の後に、要素ｙにフラグを付し、このフラグがすべての要素に対し付されているか否かを、探索処理部２２が判定することによって行われる。 When the condition is not satisfied as a result of step S205 (S205 → No), the search processing unit 22 calculates the difference set of the set A and the set B (S207) and calculates the similarity ρ (y, q) between each element y of the difference set and the query q (S208).
Then, the search processing unit 22 determines whether or not the process of step S208 has been performed for all elements y (yεX) in the difference set between the set A and the set B (S209). The determination is performed, for example, by adding a flag to the element y after step S208 and determining whether or not the flag is attached to all the elements by the search processing unit 22.

ステップＳ２０９の結果、すべての要素ｙについて、ステップＳ２０８の処理を行っていないと判定された場合（Ｓ２０９→Ｎｏ）、探索処理部２２は、ステップＳ２０８の処理へ戻る。
ステップＳ２０９の結果、すべての要素ｙについて、ステップＳ２０８の処理を行っていると判定された場合（Ｓ２０９→Ｙｅｓ）、探索処理部２２は、ステップＳ２１０の処理へ進む。 As a result of step S209, when it is determined that the process of step S208 has not been performed for all elements y (S209 → No), the search processing unit 22 returns to the process of step S208.
As a result of step S209, when it is determined that the process of step S208 is performed for all elements y (S209 → Yes), the search processing unit 22 proceeds to the process of step S210.

次に、探索処理部２２は、式（８）の要素ｘ１を求める（Ｓ２１０：図５（ｂ））。 Next, the search processing unit 22 obtains an element x1 of Expression (8) (S210: FIG. 5B).
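Expression (8) appears as an image in the original publication and is not reproduced in this text. From the description of steps S207 to S210, a form consistent with it would be the following, where A \ B is the difference set calculated in step S207; this reconstruction is an assumption, not the published formula.

```latex
x_1 = \operatorname*{arg\,max}_{w \in A \setminus B} \rho(w, q)
```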

すなわち、探索処理部２２は、最大の類似度ρ（ｗ，ｑ）を有する要素ｗを算出し、この要素ｗを要素ｘ１（ｘ１∈Ｘ）とする。同時に、探索処理部２２は、ステップＳ２１０における式（８）で求めた要素ｘ１に係る類似度ρ（ｘ１，ｑ）を記憶部３に格納する。
そして、探索処理部２２は、記憶部３から類似度ρ（ｘｍａｘ，ｑ）および類似度ρ（ｘ１，ｑ）を取得し、類似度ρ（ｘ１，ｑ）＞類似度ρ（ｘｍａｘ，ｑ）であるか否かを判定する（Ｓ２１１）。
ステップＳ２１１の結果、類似度ρ（ｘ１，ｑ）＞類似度ρ（ｘｍａｘ，ｑ）ではない場合（Ｓ２１１→Ｎｏ）、探索処理部２２は、ステップＳ２１３の処理へ進む。
ステップＳ２１１の結果、類似度ρ（ｘ１，ｑ）＞類似度ρ（ｘｍａｘ，ｑ）である場合（Ｓ２１１→Ｙｅｓ）、探索処理部２２は、探索処理部２２は、要素ｘ１を、新たな要素ｘｍａｘとして保持する（Ｓ２１２：図５（ｃ））。前記したように、類似度ρ（ｘｍａｘ，ｑ）が、請求項における設定類似度であり、情報探索部が、要素ｘｍａｘを保持することにより、設定類似度も保持することになる。
次に、探索処理部２２は、要素ｘ１に対する最近傍要素ネットワークΓ（ｘ１）を記憶部３から取得すると、集合Ａ’＝Ａ∪Γ（ｘ１）および集合Ｂ’＝Ｂ∪｛ｘ１｝を算出し、集合Ａ’を新たなＡとし、Ｂ’を新たなＢとする（Ａ←Ａ’、Ｂ←Ｂ’：Ｓ２１３：図５（ｃ））。すなわち、集合Ａに集合Ａ’を代入し、集合Ｂに集合Ｂ’を代入する。
そして、探索処理部２２は、ステップＳ２０５の処理へ戻る。 That is, the search processing unit 22 calculates an element w having the maximum similarity ρ (w, q), and sets this element w as an element x1 (x1εX). At the same time, the search processing unit 22 stores the similarity ρ (x1, q) related to the element x1 obtained by the expression (8) in step S210 in the storage unit 3.
Then, the search processing unit 22 acquires the similarity ρ (xmax, q) and the similarity ρ (x1, q) from the storage unit 3 and determines whether similarity ρ (x1, q) > similarity ρ (xmax, q) holds (S211).
When the result of step S211 is that similarity ρ (x1, q) > similarity ρ (xmax, q) does not hold (S211 → No), the search processing unit 22 proceeds to step S213.
When the result of step S211 is that similarity ρ (x1, q) > similarity ρ (xmax, q) holds (S211 → Yes), the search processing unit 22 holds the element x1 as the new element xmax (S212: FIG. 5C). As described above, the similarity ρ (xmax, q) is the set similarity in the claims, and the information search unit, by holding the element xmax, also holds the set similarity.
Next, when the search processing unit 22 obtains the nearest neighbor element network Γ (x1) for the element x1 from the storage unit 3, the search processing unit 22 calculates the set A ′ = A∪Γ (x1) and the set B ′ = B∪ {x1}. Then, set A ′ is a new A and B ′ is a new B (A ← A ′, B ← B ′: S213: FIG. 5C). That is, the set A ′ is assigned to the set A, and the set B ′ is assigned to the set B.
Then, the search processing unit 22 returns to the process of step S205.
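The search loop of steps S201 to S213 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `search`, the dictionary-of-sets network `gamma`, the `stop_on_exact` switch for the optional second termination condition, and the early return when the difference set becomes empty are not taken from the original description. The element xmax is initialized to the starting element x0, which the text permits.

```python
def search(gamma, similarity, query, start, beta, stop_on_exact=True):
    """Best-first search over a neighborhood element network.

    gamma:      dict mapping each element to its set Gamma(x)
    similarity: function (element, query) -> float in [0, 1]
    beta:       upper limit cost on |A|
    Returns the element x_max with the highest similarity found.
    """
    rho = {start: similarity(start, query)}      # S203
    x_max = start
    a = set(gamma[start]) | {start}              # S204: similarity calculation set A
    b = {start}                                  # S204: expanded element set B
    while True:
        # S205: first and (optional) second termination conditions.
        if len(a) > beta or (stop_on_exact and rho[x_max] == 1.0):
            return x_max                         # S206
        frontier = a - b                         # S207
        if not frontier:
            # Assumption, not in the source: stop when nothing is left to expand.
            return x_max
        for y in frontier:                       # S208-S209
            if y not in rho:
                rho[y] = similarity(y, query)
        x1 = max(frontier, key=rho.__getitem__)  # S210
        if rho[x1] > rho[x_max]:                 # S211
            x_max = x1                           # S212
        a |= gamma[x1]                           # S213: A <- A U Gamma(x1)
        b.add(x1)                                # S213: B <- B U {x1}
```

Similarities are cached in `rho`, so each element is compared with the query at most once, which mirrors the idea that an element searched once is excluded from later similarity calculations.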

なお、本実施形態において、情報探索集合の要素内に同一の情報が存在するようなクエリを入力してもよいし、情報探索集合の要素内に同一の情報が存在しないようなクエリを入力してもよい。 In this embodiment, a query in which the same information exists in the elements of the information search set may be input, or a query in which the same information does not exist in the elements of the information search set is input. May be.

本実施形態に係る情報探索処理は、要素間の平均最短パス長が小さいスモールワールドネットワークを使用して情報探索を行うため、情報探索集合に対して距離空間を定義すると、要素間の距離が大きい、すなわち要素同士が疎となり、三角不等式などによる探索空間の削減が不可能な情報探索集合に対しても、探索コストの小さい情報探索を行うことができる。すなわち、探索空間を小さくすることができる。
また、要素間における距離の定義を前提としていないため、距離空間を定義不可能な情報探索集合に対しても効率的な情報探索を行うことが可能となる。例えば、任意の２つの要素間の類似度を、コサイン類似度で定義した情報探索集合は、距離空間ではない。さらに、局所的な要素の集合である最近傍要素ネットワークを連結した近傍要素ネットワークを用いており、全体の情報探索を、処理の軽い最近傍要素ネットワークにおける探索の集まりとすることができ、全体的な処理の負担を軽減することができる。
そして、１度探索した要素は、次回以降の探索対象から外した情報探索を行うため、効率的な情報探索を行うことができる。 The information search processing according to the present embodiment performs the search using a small world network with a small average shortest path length between elements. Therefore, an information search with a low search cost can be performed even on an information search set for which, if a metric space is defined, the distances between elements are large, that is, the elements are sparse with respect to one another and the search space cannot be reduced by the triangle inequality or the like. In other words, the search space can be made small.
In addition, since no definition of distance between elements is presupposed, an efficient information search can be performed even on an information search set for which a metric space cannot be defined. For example, an information search set in which the similarity between any two elements is defined by the cosine similarity is not a metric space. Furthermore, because a neighborhood element network formed by connecting nearest neighbor element networks, each a local set of elements, is used, the overall information search can be carried out as a collection of lightweight searches within nearest neighbor element networks, which reduces the overall processing load.
Moreover, since an element that has been searched once is excluded from subsequent search targets, the information search is efficient.

（ネットワークの特性）
ここで、本実施形態に好適なネットワークの性質について説明する。
まず、本実施形態におけるネットワークは、情報探索を効率よく行うため、近傍要素数ｋと強い相関を有する値である次数が、比較的小さいネットワークであることが望ましい。次数を比較的小さくすることで、情報探索集合内の全要素が結合した１コンポーネントのネットワークであり、次数が比較的小さいことが望ましい。本実施形態で用いた近傍要素ネットワークにおける平均次数は、式（９）で定義される。 (Network characteristics)
Here, the nature of the network suitable for the present embodiment will be described.
First, to perform information search efficiently, the network in the present embodiment is desirably a network whose degree, a value strongly correlated with the number k of neighboring elements, is relatively small. That is, it is desirable that the network be a one-component network in which all elements in the information search set are connected and that its degree be relatively small. The average degree of the neighborhood element network used in this embodiment is defined by Expression (9).

さらに、本実施形態におけるネットワークは、任意の起点要素と、最終出力要素との間に、比較的短いリンクで連結されていることが必要である。探索コストの小さい情報探索を行うためである。
本実施形態で用いた近傍要素ネットワーク全体における平均値である平均最短パス長は、式（１０）で定義される。 Furthermore, in the network of the present embodiment, an arbitrary starting element and the final output element need to be connected by a relatively short chain of links. This is so that information search with a low search cost can be performed.
The average shortest path length, which is an average value in the entire neighborhood element network used in this embodiment, is defined by Expression (10).

ここで、ｄΓ（ｘ，ｙ）は、ネットワークにおける任意の要素における最短パス長である。 Here, dΓ (x, y) is the shortest path length between arbitrary elements x and y in the network.

In addition, when the similarity between the query q and each of the neighboring elements y ∈ Γ(x2) of the final output element x2 is relatively low, information search becomes difficult. This is because, in order to reach the final output element x2 from the starting element x0, the search must pass through an element y in the neighborhood of x2. That is, when the similarity ρ(x2, q) and the similarity ρ(x2, y) are both high, it is desirable that the similarity ρ(y, q) also be high. Generalizing this to three elements x, y, and z with y ∈ Γ(x) and z ∈ Γ(y): when ρ(x, y) and ρ(y, z) are high, it is preferable that ρ(z, x) also be high enough that x ∈ Γ(z). In other words, it is desirable that a link exist between every pair of the three elements x, y, and z.
The cluster coefficient of the network, a measure that quantitatively evaluates this relationship among three elements, is defined by Expression (11). The higher the cluster coefficient, the higher the rate at which a link exists between each pair of elements within a group of three elements.
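
To make the idea of a cluster coefficient concrete, the sketch below uses one common definition: for each element, the fraction of pairs of its neighbors that are themselves linked, averaged over all elements. This is an assumption chosen for illustration and is not necessarily the form of Expression (11). The network is assumed to be undirected, and all names are hypothetical.

```python
from itertools import combinations

def local_cluster_coefficient(adj, x):
    """Fraction of neighbor pairs of x that are directly linked to each other."""
    neighbors = adj[x]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: elements with fewer than two neighbors count as 0
    linked = sum(1 for y, z in combinations(neighbors, 2) if z in adj[y])
    return linked / (k * (k - 1) / 2)

def cluster_coefficient(adj):
    """Average of the local coefficients over all elements of the network."""
    return sum(local_cluster_coefficient(adj, x) for x in adj) / len(adj)

# Hypothetical network: a triangle 0-1-2 with one extra element 3 attached to 2.
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(cluster_coefficient(triangle_with_tail))
```

In the toy network, elements 0 and 1 sit in a fully linked triple, element 2 has only one of its three neighbor pairs linked, and element 3 has a single neighbor, so the average falls between 0 and 1.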

The network characteristics suitable for this embodiment are as follows: (1) the average degree given by Expression (9) is small and the network consists of one component; (2) the average shortest path length is relatively small; and (3) the cluster coefficient is relatively large.
A network having these characteristics is referred to as a small world network. Small world networks include the neighborhood element network described in this embodiment. The average shortest path length and the cluster coefficient of the neighborhood element network used in this embodiment are discussed later with reference to FIGS. 15 and 16.

Next, an example of the present embodiment is described with reference to FIGS. 7 to 14.
FIGS. 7 to 11 relate to a search problem in which information identical to the query is included among the elements of the information search set, and FIGS. 12 to 14 relate to a search problem in which information identical to the query is not included among the elements of the information search set.

FIG. 7 is a diagram showing the change in the number of components with respect to the number k of neighboring elements.
In FIG. 7, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the common logarithm of the number of components.
The information search set in FIG. 7 is a set whose elements are document files of newspaper articles covering 10 years. The similarity between elements was calculated by the following procedure. Each document file is subjected to morphological analysis, unnecessary stop words are deleted, and words are extracted. Each extracted word is then weighted by the tf-idf method. The resulting weighted word vector is used as the feature of the document file.
Then, with the document files as elements, the similarity between elements is defined using the cosine similarity function.
The number of elements (number of document files) used in this example was 64585, and the number of dimensions of the metric space was 51030.
As shown in FIG. 7, the number of components becomes 1 at k = 6. That is, a one-component neighborhood element network can be generated if k ≥ 6.
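
The procedure behind FIG. 7 (link each element to its k most similar elements, then count components) can be sketched as follows. This is a minimal illustration under stated assumptions: feature vectors are sparse dictionaries of term weights, links are treated as undirected, and the function names and toy vectors are hypothetical. The brute-force neighbor ranking is only practical for small sets.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_knn_network(vectors, k):
    """Link each element to its k most similar other elements (undirected)."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine_similarity(vectors[i], vectors[j]),
                        reverse=True)
        for j in ranked[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def count_components(adj):
    """Number of connected components, found by depth-first traversal."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components

# Hypothetical weighted word vectors forming two topical groups.
toy_vectors = [{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}, {"c": 1.0, "d": 0.1}]
for k in (1, 2):
    print(k, count_components(build_knn_network(toy_vectors, k)))
```

Increasing k until `count_components` returns 1 mirrors the way the smallest usable k is read off FIG. 7.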

FIG. 8 is a diagram showing the change in the average search cost with respect to the number k of neighboring elements.
In FIG. 8, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the average search cost.
FIG. 8 shows the result of randomly selecting 100000 element pairs (each a pair of a query and a starting element) from the document files of the 10 years of newspaper articles described above and performing the information search process of this embodiment on the information search set described above.
The cost upper limit was set to infinity. The average search cost is the average of the search costs obtained when information search is performed on the same information search set while varying the query and the starting element.
As shown in FIG. 8, the average search cost reached its minimum of 365.38 at k = 20. This value is 0.57% of the search cost of searching all elements.
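
The search whose cost is measured here can be sketched roughly as below, following steps (b1) to (b5) of the claims. Several points are assumptions made for illustration: the search cost is taken to be |A| (the number of elements gathered for comparison with the query), the initial xmax is the starting element, the termination test runs after each expansion, and the loop also stops when no unexpanded candidate remains. The function names and the toy chain network are hypothetical.

```python
def greedy_network_search(adj, similarity, query, start, beta=float("inf")):
    """Follow links toward the query; return (best element found, cost |A|)."""
    evaluated = set(adj[start]) | {start}    # set A
    expanded = {start}                       # set B
    best = start                             # x_max (assumed initial value)
    while True:
        candidates = evaluated - expanded    # set A - B
        if not candidates:                   # assumption: stop when nothing is left
            break
        # x1: the not-yet-expanded element most similar to the query
        current = max(candidates, key=lambda x: similarity(x, query))
        if similarity(current, query) > similarity(best, query):
            best = current
        evaluated |= set(adj[current])       # A <- A ∪ Γ(x1)
        expanded.add(current)                # B <- B ∪ {x1}
        if len(evaluated) > beta or current == query:
            break
    return best, len(evaluated)

# Hypothetical network: elements 0..9 in a chain, similarity = negative distance.
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 9} for i in range(10)}

def closeness(a, b):
    return -abs(a - b)

print(greedy_network_search(chain, closeness, query=7, start=0))
print(greedy_network_search(chain, closeness, query=7, start=0, beta=4))
```

With an unlimited budget the sketch walks along the chain until it reaches the query; with a small β it stops early and returns the best element seen so far, which corresponds to the role of the cost upper limit.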

The average search cost is considered to have a minimum for the following reason. The search cost in this embodiment is approximately equal to the product of the average degree and the average number of steps. Here, the number of steps is the number of elements, namely the starting element x0 and the elements x1 (FIG. 5), traversed until the final output element is obtained, that is, the number of black circles in FIG. 5.
In general, the number k of neighboring elements is determined by the procedure of FIGS. 7 and 8.

FIG. 9 is a diagram showing the change in the average degree with respect to the number k of neighboring elements, and FIG. 10 is a diagram showing the change in the number of steps with respect to the number k of neighboring elements.
In FIG. 9, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the average degree.
In FIG. 10, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the average value (Average) or the median value (Median) of the number of steps.
Multiplying the average degree by the average or median number of steps at each k in FIGS. 9 and 10 shows that the average search cost is minimized at k = 20.

FIG. 11 is a diagram showing the search cost and the arrival rate to the query for each number k of neighboring elements.
In FIG. 11, the horizontal axis indicates the search cost, and the vertical axis indicates the arrival rate.
The arrival rate is the proportion of the 100000 selected element pairs (each a pair of a query and a starting element, as described above) that reached the query at the corresponding search cost.
The graph shows the cases k = 10, 20, 40, and 60. Focusing on k = 20, which had the lowest average search cost in FIG. 8, the arrival rate is 50% at a search cost of 291 and 90% at a search cost of 633. That is, 50% of the selected element pairs reached the query at a search cost of 291, and 90% reached it at a search cost of 633.
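
A curve of this kind can be derived from a list of per-search costs, as in the sketch below. It assumes the arrival rate at a given cost is cumulative, that is, the share of searches that reached the query at or below that cost; the text does not state this explicitly. The cost values are invented for illustration and are not the experimental data.

```python
def arrival_rate(costs, budget):
    """Share of searches that reached the query with cost at or below budget."""
    return sum(1 for c in costs if c <= budget) / len(costs)

def cost_for_rate(costs, percent):
    """Smallest observed cost at which the arrival rate reaches percent (integer)."""
    ordered = sorted(costs)
    needed = (percent * len(ordered) + 99) // 100   # ceiling of percent% of the count
    return ordered[max(needed - 1, 0)]

# Hypothetical search costs for ten (query, starting element) pairs.
sample_costs = [120, 250, 291, 300, 410, 520, 600, 633, 900, 1500]
print(arrival_rate(sample_costs, 291))
print(cost_for_rate(sample_costs, 50), cost_for_rate(sample_costs, 90))
```

Reading off `cost_for_rate` at 50 and 90 corresponds to the two operating points quoted for k = 20.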

The total number of elements in this example is 64585, as described above, and 1% of this is approximately 646. That is, when the information search process of this embodiment is applied to the information search space of this example, the search succeeds with a probability of 90% even if the upper limit cost β is set to about 1% of the total number of elements.

Next, an example in which this embodiment is applied to a search problem in which information identical to the query is not included among the elements of the information search set is described with reference to FIGS. 12 to 14.
The terms used in FIGS. 12 to 14 are defined in the same way as in FIGS. 7 to 11.
FIG. 12 is a diagram showing the change in the number of components with respect to the number k of neighboring elements.
The conditions in FIG. 12 are as follows.
From the information search set used in FIGS. 7 to 11 (number of elements: 64585), 6458 elements were selected uniformly at random and used as queries. The remaining 58127 elements were used as the information search set.
In FIG. 12, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the common logarithm of the number of components.
As shown in FIG. 12, the number of components becomes 1 at k = 7. That is, a one-component neighborhood element network can be generated if k ≥ 7.
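
The split into queries and a reduced information search set can be sketched as follows. The integer stand-ins for document files, the fixed seed, and the function name are assumptions made for illustration.

```python
import random

def split_queries(elements, num_queries, seed=0):
    """Pick num_queries elements uniformly at random; the rest form the search set."""
    rng = random.Random(seed)
    query_ids = set(rng.sample(range(len(elements)), num_queries))
    queries = [e for i, e in enumerate(elements) if i in query_ids]
    search_set = [e for i, e in enumerate(elements) if i not in query_ids]
    return queries, search_set

documents = list(range(64585))   # stand-ins for the 64585 document files
queries, search_set = split_queries(documents, 6458)
print(len(queries), len(search_set))
```

Because the queries are removed from the search set, no element identical to a query remains in it, which is the condition that distinguishes FIGS. 12 to 14 from FIGS. 7 to 11.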

FIG. 13 is a diagram showing the change in the average search cost with respect to the number k of neighboring elements.
In FIG. 13, the horizontal axis indicates the number k of neighboring elements, and the vertical axis indicates the average search cost.
The conditions in FIG. 13 are the same as those in FIG. 8.
The average search cost is the average of the search costs obtained when information search is performed on the same information search set while varying the query and the starting element.
As shown in FIG. 13, the average search cost reached its minimum of 939.88 at k = 40. This value is 1.62% of the search cost of searching all elements.
In general, the number k of neighboring elements is determined by the procedure of FIGS. 12 and 13.

FIG. 14 is a diagram showing the search cost and the arrival rate to the query for each number k of neighboring elements.
In FIG. 14, the horizontal axis indicates the search cost, and the vertical axis indicates the arrival rate.
Here, the arrival rate is the distance between the element currently being searched and the query divided by the distance between the starting element and the query.
The graph shows the cases k = 10, 20, 40, and 60. Focusing on k = 40, which had the lowest average search cost in FIG. 13, the arrival rate is 50% at a search cost of 494 and 90% at a search cost of 1540.

The total number of elements in this example is 58127, and 3% of this is approximately 1744. That is, when the information search process of this embodiment is applied to the information search space of this example, the search succeeds with a probability of 90% even if the upper limit cost β (see FIG. 6) is set to about 3% of the total number of elements.
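
The percentages quoted for the two experiments can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the helper name and the table layout are hypothetical.

```python
def percent_of_set(cost, total_elements):
    """Search cost expressed as a percentage of an exhaustive search."""
    return 100.0 * cost / total_elements

experiments = [
    # (label, total elements, minimum average cost, cost at 90% arrival rate)
    ("query contained in the search set", 64585, 365.38, 633),
    ("query not contained in the search set", 58127, 939.88, 1540),
]

for label, total, min_avg_cost, cost_at_90 in experiments:
    print(label,
          round(percent_of_set(min_avg_cost, total), 2),
          round(percent_of_set(cost_at_90, total), 2))
```

In both cases the cost at a 90% arrival rate stays below the rounded budget mentioned in the text (about 1% and about 3% of the total number of elements, respectively).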

Next, the characteristics of the neighborhood element network used in this embodiment are described with reference to FIGS. 15 and 16.
FIG. 15 is a diagram showing the change in the average shortest path length with respect to the number k of neighboring elements in a random network, the neighborhood element network, and a regular network.
Here, a random network is a network in which elements in the information search set are linked to one another at random. A regular network is a network in which elements in the information search set are linked according to a predetermined rule.
The horizontal axis in FIG. 15 indicates the number k of neighboring elements, and the vertical axis indicates the average shortest path length.
As shown in FIG. 15, the average shortest path length of the neighborhood element network (k-NN NW) at each number k of neighboring elements is considerably smaller than that of the regular network (Regular NW) and close to that of the random network (Random NW).
In general, the average shortest path length of a small world network preferably has an order satisfying Expression (12).
log10 (average shortest path length of small world network / average shortest path length of random network) < 1 ... Expression (12)
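
Expression (12) amounts to requiring that the two average shortest path lengths differ by less than one order of magnitude. A minimal check is sketched below; the function name and the sample path lengths are hypothetical and are not values read from FIG. 15.

```python
import math

def satisfies_expression_12(l_small_world, l_random):
    """True when log10(L_small_world / L_random) < 1."""
    return math.log10(l_small_world / l_random) < 1

print(satisfies_expression_12(4.2, 3.1))    # ratio of about 1.35
print(satisfies_expression_12(40.0, 3.0))   # ratio of about 13.3
```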

FIG. 16 is a diagram showing the change in the cluster coefficient with respect to the number k of neighboring elements in a random network, the neighborhood element network, and a regular network.
The horizontal axis in FIG. 16 indicates the number k of neighboring elements, and the vertical axis indicates the cluster coefficient.
As shown in FIG. 16, the cluster coefficient of the neighborhood element network (k-NN NW) at each k is larger than that of the random network (Random NW) and close to that of the regular network (Regular NW).

FIG. 1 is a diagram showing a configuration example of the information search system according to this embodiment.
FIG. 2 is a diagram showing an outline of the network generation process according to this embodiment.
FIG. 3 is a flowchart showing the flow of the network generation process according to this embodiment.
FIG. 4 is a diagram showing how the neighborhood element network calculated by the network generation unit is stored in the storage unit.
FIG. 5 is a diagram showing an outline of the information search process according to this embodiment.
FIG. 6 is a flowchart showing the flow of the information search process according to this embodiment.
FIG. 7 is a diagram showing the change in the number of components with respect to the number k of neighboring elements.
FIG. 8 is a diagram showing the change in the average search cost with respect to the number k of neighboring elements.
FIG. 9 is a diagram showing the change in the average degree with respect to the number k of neighboring elements.
FIG. 10 is a diagram showing the change in the number of steps with respect to the number k of neighboring elements.
FIG. 11 is a diagram showing the search cost and the arrival rate to the query for each number k of neighboring elements.
FIG. 12 is a diagram showing the change in the number of components with respect to the number k of neighboring elements.
FIG. 13 is a diagram showing the change in the average search cost with respect to the number k of neighboring elements.
FIG. 14 is a diagram showing the search cost and the arrival rate to the query for each number k of neighboring elements.
FIG. 15 is a diagram showing the change in the average shortest path length with respect to the number k of neighboring elements in a random network, the neighborhood element network, and a regular network.
FIG. 16 is a diagram showing the change in the cluster coefficient with respect to the number k of neighboring elements in a random network, the neighborhood element network, and a regular network.
FIG. 17 is a diagram showing the cumulative distribution of the distances between element pairs when 1 × 10^6 element pairs (pairs of two elements) are selected at random from the information search space.
FIG. 18 is a diagram showing the cumulative distribution of the lower bound of the distance under the same conditions as in FIG. 17.

Explanation of symbols

1 Information search device
2 Processing unit
2 Network generation processing unit
3 Storage unit
4 Input unit
5 Output unit
10 Network
11 Terminal
12 Information search system
21 Network generation unit
22 Search processing unit

Claims

An information search method in an information search device that searches an information search set held in a storage unit for information similar to a query,
the information search method including a network generation process and an information search process, wherein
in the network generation process,
a network generation unit of the information search device
generates a one-component network in which any two elements in the information search set are linked either directly or indirectly via elements other than the two elements, and stores the network in the storage unit, and
in the information search process,
a search processing unit of the information search device
(a1) acquires from the storage unit the elements directly linked to a predetermined first element of the network, and selects, from among those elements, the element having the highest similarity to the query as a second element,
(a2) if the similarity between the second element and the query is greater than a predetermined set similarity held in the storage unit, stores the similarity between the second element and the query in the storage unit as the updated set similarity,
(a3) sets the second element as a third element, and acquires from the storage unit the elements directly linked to the third element, and
(a4) selects, from among the elements directly linked to the third element, the element that has never been a second element in the past and that has the highest similarity to the query as a new second element, and performs the process (a2) on the new second element, thereby searching for an element similar to the query.

The information search method according to claim 1, wherein, in the network generation process, the network generation unit performs, for all elements in the information search set, a process of
calculating the similarity between an element x, which is an arbitrary element in the information search set, and each element other than the element x, and
acquiring a preset number of elements other than the element x in descending order of similarity to the element x and linking the acquired elements with the element x, thereby generating a nearest neighbor element network Γ(x) for the element x.

The information search method according to claim 2, wherein, in the information search process, the search processing unit
acquires from the storage unit the nearest neighbor element network Γ(x0) for an element x0, which is an arbitrary element in the information search set, and an element xmax, which is an arbitrary element,
and, when a query q is input via an input unit,
(b1) calculates a set A and a set B such that A = Γ(x0) ∪ {x0} and B = {x0},
(b2) acquires from the set A−B an element x1 satisfying the following Expression (8),

where ρ(a, b) is the similarity between elements a and b in the information search set, |A| is the number of elements in the set A, A−B is the set difference of the sets A and B, and Γ(a) is the nearest neighbor element network for an element a,
(b3) if ρ(x1, q) > ρ(xmax, q), sets the element x1 as the new element xmax,
(b4) substitutes A ∪ Γ(x1) for A and B ∪ {x1} for B,
(b5) repeats the processes (b2) to (b4), and
outputs the element xmax as the final output element when |A| exceeds a predetermined value β or when the query matches the element x1.

An information search program that causes a computer to execute the information search method according to any one of claims 1 to 3.

An information search device that searches an information search set held in a storage unit for information similar to a query,
the information search device including a network generation unit and a search processing unit, wherein
the network generation unit
has a function of generating a one-component network in which any two elements in the information search set are linked either directly or indirectly via elements other than the two elements, and storing the network in the storage unit, and
the search processing unit has a function of
(a1) acquiring from the storage unit the elements directly linked to a predetermined first element of the network, and selecting, from among those elements, the element having the highest similarity to the query as a second element,
(a2) if the similarity between the second element and the query is greater than a predetermined set similarity held in the storage unit, storing the similarity between the second element and the query in the storage unit as the updated set similarity,
(a3) setting the second element as a third element, and acquiring from the storage unit the elements directly linked to the third element, and
(a4) selecting, from among the elements directly linked to the third element, the element that has never been a second element in the past and that has the highest similarity to the query as a new second element, and performing the process (a2) on the new second element, thereby searching for an element similar to the query.