JP5457946B2

JP5457946B2 - Related word calculation device, related word calculation method, and related word calculation program

Info

Publication number: JP5457946B2
Application number: JP2010133934A
Authority: JP
Inventors: 貴行足立; 俊郎内山
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2010-06-11
Filing date: 2010-06-11
Publication date: 2014-04-02
Anticipated expiration: 2030-06-11
Also published as: JP2011258114A

Description

The present invention relates to a technique for calculating related words based on word co-occurrence frequencies.

A synonym calculation device according to the related art (for example, Patent Document 1) uses at least two types of relatedness dictionaries, initializes word groups based on one of them, and creates a synonym dictionary that reflects at least two types of relatedness. It then creates synonym groups by merging word groups based on these relatedness dictionaries.

Patent Document 1: Specification of Japanese Patent No. 3553795

The prior art aims at creating synonym groups. When related words with a broader meaning than synonyms are sought, simply raising the threshold to a relatively large value merges not only useful related words, such as hypernyms of a synonym or words of the same kind, but also unrelated words.

There is also a clustering technique in which the features of each word are represented as a vector and each word initially forms its own group. The distance of a group is calculated by a method such as the distance to the centroid of the vectors of the words in the group, the same distance is calculated for the group that would result from merging two groups, and the groups whose merging gives the smallest increase in distance are merged in order. In this method too, setting a threshold on the increase in group distance merges words that are close to each other, but when the threshold is raised the same problem arises: unrelated words are merged as well.

In grouping words, the present invention, in addition to merging the pair of groups with the smallest inter-group distance, excludes both groups from merging when the two groups meet a condition that makes them unsuitable for merging. As a result, related words over a broader range than synonyms can be obtained with high accuracy.

One aspect of the present invention is a related word calculation device that groups words, comprising: grouping means that creates a group for each word based on word vectors created from the co-occurrence frequency of each word in a statistical information database storing co-occurrence frequency information between words; inter-group calculation means that, for every pair of groups, calculates, from values computed from the vectors of the created groups, the increase in diffusivity that merging the two groups would cause, as the distance between the two groups, calculates the ratio of that distance to the pre-merge diffusivity of each of the two groups, and excludes from merging both groups of any pair for which both ratios exceed a threshold; and grouping determination means that selects, from the set of groups remaining after the exclusion, the pair of groups whose inter-group distance when merged is smallest, supplies the pair to the grouping means when the distance between the selected groups is less than a threshold, and outputs the data of each group of the pair when the distance is equal to or greater than the threshold.

Another aspect of the present invention is a related word calculation method for grouping words, comprising: a step in which grouping means creates a group for each word based on word vectors created from the co-occurrence frequency of each word in a statistical information database storing co-occurrence frequency information between words; a step in which inter-group calculation means, for every pair of groups, calculates, from values computed from the vectors of the created groups, the increase in diffusivity that merging the two groups would cause, as the distance between the two groups, calculates the ratio of that distance to the pre-merge diffusivity of each of the two groups, and excludes from merging both groups of any pair for which both ratios exceed a threshold; and a step in which grouping determination means selects, from the set of groups remaining after the exclusion, the pair of groups whose inter-group distance when merged is smallest, supplies the pair to the grouping means when the distance between the selected groups is less than a threshold, and outputs the data of each group of the pair when the distance is equal to or greater than the threshold.

In the above invention, even when minimum-distance merging would fail to merge two groups that belong together because of the influence of an erroneous merge, applying a further merging condition excludes that erroneous merge from the candidates, so that related words over a broader range than synonyms can be calculated with still higher accuracy.

The present invention can also take the form of a related word calculation program that causes a computer to function as each of the means constituting the related word calculation device.

According to the above invention, related words over a broader range than synonyms can be obtained with high accuracy.

FIG. 1: Block diagram of the related word calculation device according to an embodiment of the invention.
FIG. 2: Flowchart illustrating an example of operation of the related word calculation device according to the first embodiment of the invention.
FIG. 3: Flowchart illustrating an example of operation of the related word calculation device according to the second embodiment of the invention.
FIG. 4: Example of word statistical information data.
FIG. 5: Example of vector data.
FIG. 6: Example of group data (at initialization).
FIG. 7: Example of group data.
FIG. 8: Example of distance data for the pair giving the minimum distance for each group.
FIG. 9: Example of groups before groups are excluded based on the ratio of the increase in diffusivity at merging to the diffusivity.
FIG. 10: Example of groups after groups are excluded based on the ratio of the increase in diffusivity at merging to the diffusivity.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments.

[Embodiment 1]
(Overview)
The related word calculation device according to Embodiment 1, in addition to merging the pair of groups with the smallest inter-group distance, excludes both groups from merging when the two groups meet a condition that makes them unsuitable for merging.

(Device configuration)
The related word calculation device 100 of this embodiment, shown in FIG. 1, comprises a word statistical information database 110, a vector creation unit 120, a grouping unit 130, an inter-group calculation unit 140, a grouping determination unit 150, a work area 160, and a group database 170.

The functional units 110 to 170 of the related word calculation device 100 are realized, for example, by the hardware resources of a computer. That is, the related word calculation device 100 includes computer hardware resources such as a CPU, memory, a storage device (for example, a hard disk drive), an input device, and an output device. The functional units 110 to 170 are implemented by these hardware resources working together with software resources (OS, applications, and so on).

The word statistical information database 110 stores co-occurrence frequency information between words in advance, as illustrated in FIG. 4. The illustrated database stores co-occurrence information for word 1 and word 2 together with occurrence frequency data.

The vector creation unit 120 creates a vector for each word based on the co-occurrence frequency of each word read from the word statistical information database 110. The created word vectors are output to the work area 160.

The grouping unit 130 creates a group for each word based on the created word vectors. In the initial processing stage, the grouping unit 130 creates the group data for each word from the vector elements read from the work area 160 and outputs it to the work area 160. When the grouping determination unit 150 determines that two groups can be grouped together, the grouping unit 130 merges the two groups, updates the group data, and reflects the update in the work area 160.

The inter-group calculation unit 140 calculates, for every pair of groups, the distance between the two groups from values computed from the group vectors held in the work area 160, and excludes from grouping any group for which a value based on that distance exceeds a threshold.

The grouping determination unit 150 selects, from the set of groups remaining after the exclusion, the pair of groups whose inter-group distance when merged is smallest. When the distance between the selected groups is less than a threshold, it supplies the pair to the grouping unit 130. When the distance is equal to or greater than the threshold, it outputs the data of each group of the pair to the group database 170.

(Description of processing flow)
The processing flow (S100 to S160) of the related word calculation device 100 is described below using a specific example, with reference to FIG. 2.

S100: The vector creation unit 120 creates a vector for each word based on the co-occurrence frequency of each word stored in the word statistical information database 110 and outputs it to the work area 160.

Specifically, for example, based on the data of the word statistical information database 110 shown in FIG. 4, a vector is created whose elements are the co-occurrence frequencies of word w2 with respect to word w1, as shown in FIG. 5. In the vector data 5 shown in FIG. 5, each of the words w1, ..., wn is given its own identifier (ID). The vector that characterizes a target word w1 is represented by the elements for the co-occurring words w1, ..., wn. Further, to normalize each word's vector so that its element values sum to 1, each vector element is divided by the sum of all elements of that word's vector. The created vectors are output to the work area 160.
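The vector creation of S100 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout `(word1, word2, frequency)`, the function name `build_vectors`, and the choice to treat `word1` as the target word and `word2` as the co-occurring word are assumptions made here, since the exact layout of FIG. 4 is not reproduced in the text.

```python
from collections import defaultdict


def build_vectors(cooccurrences):
    """Build one normalized co-occurrence vector per target word.

    cooccurrences: iterable of (word1, word2, frequency) records
    (a hypothetical stand-in for the rows of FIG. 4).
    Returns (vocab, vectors), where vectors[word][i] is the share of
    co-occurrences of `word` that fall on vocab[i].
    """
    records = list(cooccurrences)
    vocab = sorted({w for w1, w2, _ in records for w in (w1, w2)})
    index = {w: i for i, w in enumerate(vocab)}

    raw = defaultdict(lambda: [0.0] * len(vocab))
    for w1, w2, freq in records:
        raw[w1][index[w2]] += freq

    vectors = {}
    for word, counts in raw.items():
        total = sum(counts)
        # Divide each element by the sum of all elements so the vector sums to 1.
        vectors[word] = [c / total for c in counts] if total else counts
    return vocab, vectors
```

For example, `build_vectors([("train", "station", 3), ("train", "bus", 1)])` yields a vector for "train" whose elements sum to 1.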

S110: The grouping unit 130 reads the word vectors created in S100 from the work area 160, creates a group for each word, and outputs the groups to the work area 160.

In the initial processing of S110, each word is made a group of its own, so the group data 6 illustrated in FIG. 6 is created. The group identifier (gID) column of the group data 6 holds an identifier that matches the identifier ID of the corresponding word in the vector data 5 shown in FIG. 5, and the word list column holds the target word corresponding to that identifier ID in the vector data 5. The group data 6 also holds the diffusivity of each group, which is calculated here for use in S120 (the calculation by the inter-group calculation unit 140). In the subsequent steps, the vector data 5 is treated as the vectors corresponding to the groups.
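The initialization of S110 can be sketched as below. The dictionary layout, the function name `init_groups`, and numbering the identifiers from 1 are assumptions for illustration. The formula for the diffusivity K is not reproduced in this text, so the sketch takes it as a caller-supplied function `diffusivity_fn(vector, n_words)` rather than guessing at it.

```python
def init_groups(word_vectors, diffusivity_fn):
    """Create one single-word group per word.

    word_vectors: list of (word, vector) pairs in identifier order.
    diffusivity_fn: hypothetical callable (vector, n_words) -> K,
        standing in for the diffusivity formula, which is not given here.
    Returns {gid: {"words": [...], "vector": [...], "k": float}}.
    """
    groups = {}
    for gid, (word, vector) in enumerate(word_vectors, start=1):
        groups[gid] = {
            "words": [word],          # word list column
            "vector": list(vector),   # the word vector doubles as the group vector
            "k": diffusivity_fn(vector, 1),
        }
    return groups
```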

S120: The inter-group calculation unit 140 reads the vectors of the groups created in S110 from the work area 160 and, for every pair of groups, calculates the distance between the two groups from values computed from these vectors.

One example of an index of the distance between groups is the increase in diffusivity caused by merging the two groups. If a smaller diffusivity value is taken to mean that the data is more strongly concentrated, merging groups acts to increase the diffusivity. Hence the smaller the increase in diffusivity at merging, the better the characteristics of the groups are preserved, and the closer the whole set of groups is to an optimal state. The method is not limited to diffusivity: measures such as group cohesion or the distinctness between groups may be used instead, provided they give similar results.

For example, using the vector data 5 shown in FIG. 5, the diffusivity K(A) of group A is calculated from the diffusivity of each vector element by equation (1) below, where tf(A, i) (i = 1, ..., n) is the vector of group A and NA is the number of words belonging to A. The diffusivity K(B) of group B, used in equation (3) below, is calculated in the same way.

The diffusivity K(AB) after groups A and B are merged is calculated by equation (2) below.

The increase D(A, B) in diffusivity when groups A and B are merged is then calculated by equation (3) below.

D(A, B) = K(AB) − (K(A) + K(B)) … (3)
The above calculation is performed for every pair of groups. Then, as illustrated in FIG. 8, for each group on the gID(A) side, the pair with the smallest increase (D) is selected and its data is stored in the work area 160. When swapping gID(A) and gID(B) yields the same pair, only the pair whose gID(A) is the smaller group identifier is kept, to avoid duplication.
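A small sketch of equation (3) and of the per-group selection described above. It assumes the diffusivities K(A) and the merged diffusivities K(AB) have already been computed and are passed in as dictionaries; the names `increase` and `nearest_partners` are illustrative only.

```python
from itertools import combinations


def increase(k_a, k_b, k_ab):
    """Equation (3): D(A, B) = K(AB) - (K(A) + K(B))."""
    return k_ab - (k_a + k_b)


def nearest_partners(k, k_merged):
    """For each group, keep the partner with the smallest increase D.

    k: {gid: K(gid)}
    k_merged: {(a, b): K(ab)} for every pair with a < b
    Returns {(a, b): D} with a < b, so a pair found from both sides
    is stored only once.
    """
    best = {}  # gid -> (partner gid, D)
    for a, b in combinations(sorted(k), 2):
        d = increase(k[a], k[b], k_merged[(a, b)])
        for g, other in ((a, b), (b, a)):
            if g not in best or d < best[g][1]:
                best[g] = (other, d)

    pairs = {}
    for g, (other, d) in best.items():
        pairs[(min(g, other), max(g, other))] = d
    return pairs
```

Keying the result on `(smaller gID, larger gID)` is one simple way to realize the deduplication rule.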

S130: The inter-group calculation unit 140 reads from the work area 160 the groups for which a value based on the distance calculated in S120 (the increase D in diffusivity) exceeds a threshold, and excludes those groups from merging.

When the diffusivities K(A) and K(B) of two groups A and B before merging are compared with the increase D(A, B) in diffusivity that results from merging them, D may account for a large part of K(A) or K(B). In that case the word occurrence patterns of the two groups differ greatly, so that even if the pair satisfies the merge decision made by the grouping determination unit 150 in S150 (described later), semantically different groups would be merged. Therefore, as the value based on the distance calculated in S120 (the increase D in diffusivity), the ratios of the increase in diffusivity at merging to the diffusivity of each group, D(A, B)/K(A) and D(A, B)/K(B), are used. When both D(A, B)/K(A) and D(A, B)/K(B) exceed a predetermined threshold, the groups are excluded from processing. The set of groups remaining after the groups exceeding the threshold have been excluded is output to the work area 160.
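The exclusion of S130 might be sketched as follows. This sketch reads the rule as "when both ratios of a candidate pair exceed the threshold, both groups of that pair are dropped from the current round's candidates"; that reading, the function name `remaining_pairs`, and the assumption that every K is positive are choices made here, not details confirmed by the text.

```python
def remaining_pairs(pairs, k, ratio_threshold):
    """Drop every candidate pair that involves an excluded group.

    pairs: {(a, b): D(a, b)} candidate pairs from S120
    k: {gid: K(gid)}, assumed to be positive
    A pair is flagged when both D/K(a) and D/K(b) exceed ratio_threshold;
    both of its groups are then excluded.
    """
    excluded = set()
    for (a, b), d in pairs.items():
        if d / k[a] > ratio_threshold and d / k[b] > ratio_threshold:
            excluded.update((a, b))
    return {p: d for p, d in pairs.items() if excluded.isdisjoint(p)}
```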

S140: From the set of groups remaining after the exclusion in S130, the grouping determination unit 150 selects the pair of groups with the smallest inter-group distance, using the distances obtained in S120 and stored in the work area 160.

S150: The grouping determination unit 150 determines, based on the value of the inter-group distance, whether the pair of groups selected in S140 can be merged.

Specifically, whether merging is possible is determined by checking the inter-group distance against a predetermined threshold. When the distance is less than the threshold (Yes), merging is judged to be possible, and the pair of groups is passed to S160 so that the merging process continues. When the distance is equal to or greater than the threshold (No), the pair of groups is passed to S170 and the merging process ends.
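Steps S140 and S150 amount to picking the minimum-distance pair and comparing it with a threshold. A minimal sketch, with the hypothetical name `choose_merge` and the candidate pairs given as a dictionary of distances:

```python
def choose_merge(pairs, distance_threshold):
    """Pick the closest remaining pair and decide whether to merge it.

    pairs: {(a, b): D(a, b)} for the groups left after S130
    Returns (pair, True) if the pair should go to S160 (merge),
    (pair, False) if processing should go to S170 (stop),
    or None when no candidate pair is left.
    """
    if not pairs:
        return None
    pair = min(pairs, key=pairs.get)                 # S140: smallest distance
    return pair, pairs[pair] < distance_threshold    # S150: strictly below threshold
```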

S160: The grouping unit 130 merges the pair of groups passed from S150 and thereby updates the word-set vector and the diffusivity in the group data.

When the group data is updated, the vector data is merged into the group with the smaller identifier ID (group number), and the word occurrence frequencies are added together. The words in the word list of the merged group data are made to belong to the group with the smaller group number, and the diffusivity of that group is updated to the post-merge value. A group whose word list is empty has already been merged and is not needed in later processing, so it is removed from the processing targets. The group data 6 updated in this way is passed to S120.
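The update of S160 can be sketched as below, using the same hypothetical dictionary layout as a stand-in for the group data. Adding the vectors element by element without renormalizing, and deleting the emptied group outright, are assumptions of this sketch; the post-merge diffusivity is passed in rather than computed.

```python
def merge_groups(groups, a, b, k_merged):
    """Merge groups a and b into the one with the smaller identifier.

    groups: {gid: {"words": [...], "vector": [...], "k": float}}
    k_merged: the post-merge diffusivity K(AB), computed elsewhere
    """
    keep, drop = min(a, b), max(a, b)
    kept, dropped = groups[keep], groups[drop]

    # Element-wise addition of the frequency vectors.
    kept["vector"] = [x + y for x, y in zip(kept["vector"], dropped["vector"])]
    # All words now belong to the group with the smaller identifier.
    kept["words"] = kept["words"] + dropped["words"]
    kept["k"] = k_merged

    # The other group has no words left, so it leaves the processing targets.
    del groups[drop]
    return groups
```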

S170: When the grouping determination unit 150 determines in step S150 that the distance is equal to or greater than the threshold (No), it ends the merging process of S120 to S160. It then reads from the work area 160 the group data 7 obtained by this merging process, in which words have been merged as illustrated in FIG. 7, and outputs it to the group database 170.

(Effect of Embodiment 1)
As described above, according to this embodiment, when words are grouped, the groups with the smallest inter-group distance are merged and, in addition, two groups that meet a condition making them unsuitable for merging are both excluded from merging. As a result, related words over a broader range than synonyms are obtained with high accuracy.

In particular, as in S130, calculating for every pair of groups the ratio of the inter-group distance to the diffusivity of each of the two groups, and excluding from merging the groups for which this value exceeds the threshold, achieves a merging of related words that depends on the diffusivity of the groups.

The difference between not performing and performing the exclusion of S130, in which groups are excluded from merging when both ratios of the increase in diffusivity at merging to the diffusivity of each group, D(A, B)/K(A) and D(A, B)/K(B), exceed a predetermined threshold, is described with reference to FIGS. 9 and 10. The example here uses a threshold of 0.3 for the ratio of the increase in diffusivity at merging to the diffusivity of each group, and a threshold of 4 for the distance between two groups in the grouping determination unit 150.

Before exclusion based on the ratio threshold, the groups that contain railway-related words also contain bus-related words, as illustrated in FIGS. 9(a) and 9(b). The numbers shown in the figure are values of the group diffusivity K, and the numbers in parentheses are values of the increase D in diffusivity at merging.

Consider exclusion by the ratio threshold. For the group illustrated in FIG. 9(c), to which "train", "number", and "many" belong, if K(A) = 3.7, K(B) = 2.8, and D(A, B) = 0.9, the ratio is D(A, B)/K(B) = 0.32, which exceeds the threshold of 0.3. When groups are excluded from merging based on this threshold, "train" is no longer merged into the group to which "number" and "many" belong, as shown in FIG. 10(c).
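The arithmetic of this example can be checked with a few lines. The values are the ones quoted above; the variable names are illustrative. The text reports only D(A, B)/K(B); the other ratio, D(A, B)/K(A), is computed here for reference and comes out below 0.3.

```python
k_a, k_b, d = 3.7, 2.8, 0.9   # K(A), K(B), D(A, B) quoted for FIG. 9(c)
threshold = 0.3               # ratio threshold used in the example

ratio_a = d / k_a             # D(A, B) / K(A)
ratio_b = d / k_b             # D(A, B) / K(B)

print(round(ratio_a, 2), round(ratio_b, 2))   # prints: 0.24 0.32
print(ratio_b > threshold)                    # prints: True
```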

Then, as illustrated in FIG. 10(a), "train" comes to belong to the group containing railway-related words, in preference to "bus stop" and "route bus". Accordingly, "bus stop" and "route bus" come to belong to the group that includes "limousine bus" and contains words related to buses and travel by public transport, as illustrated in FIG. 10(b).

In this way, the merging process that depends on the increase in diffusivity groups "train", "bus stop", and "route bus" appropriately.

[Embodiment 2]
(Overview)
The related word calculation device 100 according to Embodiment 2 merges two groups even when they were not grouped together by minimum-distance merging, provided that merging is possible once other conditions are taken into account.

In the grouping decision of S150 in Embodiment 1, even when the inter-group distance (the increase in diffusivity) is determined to be equal to or greater than the threshold, merging may still be possible between a large group in which grouping has already progressed and a small group in which grouping has not yet progressed much.

Therefore, in Embodiment 2, when S150 determines that the inter-group distance (the increase in diffusivity) is equal to or greater than the threshold, a pair of groups is still passed to the merging process if it consists of a large group, meaning a group whose number of positive vector elements is above a threshold, and a small group, meaning a group whose number of positive vector elements is below a threshold. This allows related words over a broader range than synonyms to be calculated more accurately.

(Device configuration)
The related word calculation device 100 according to Embodiment 2 has the same configuration as the related word calculation device 100 according to Embodiment 1, except that its grouping determination unit 150 executes a processing procedure different from that of Embodiment 1.

(Description of processing flow)
The processing flow (S100 to S180) of the related word calculation device 100 according to Embodiment 2 is described with reference to FIG. 3.

S120: The inter-group calculation unit 140 reads the vectors of the groups created in S110 from the work area 160 and, for every pair of groups, calculates the distance between the two groups from values computed from these vectors. Then, for each group, the pair with the smallest increase (D) is selected and its data is stored in the work area 160.

S150: The grouping determination unit 150 determines, based on the value of the inter-group distance, whether the pair of groups selected in S140 can be merged. Specifically, whether merging is possible is determined by checking the inter-group distance against a predetermined threshold. When the distance is less than the threshold (Yes), merging is judged to be possible, and the pair of groups is passed to S160 so that the merging process continues. When the distance is equal to or greater than the threshold (No), the pair of groups is passed to S180.

S160: The grouping unit 130 merges the pair of groups passed from S150 and thereby updates the word-set vector and the diffusivity in the group data. The updated group data 6 is passed to S120.

S180: For the pair of groups passed from S150, when the proportion of positive elements among all vector elements of one group is equal to or greater than a predetermined threshold and the proportion of positive elements among all vector elements of the other group is less than a predetermined threshold (Yes), the grouping determination unit 150 passes the pair to S160 so that the merging process continues. When this condition for continuing the merging process is not satisfied (No), it passes the pair to S170 and the merging process ends.
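The S180 condition can be sketched as follows. The text does not say whether the two "predetermined thresholds" are the same value, so this sketch takes them as two separate parameters (which may be set equal); the function names are illustrative.

```python
def positive_ratio(vector):
    """Proportion of elements in the group vector whose value is positive."""
    return sum(1 for x in vector if x > 0) / len(vector)


def is_large_small_pair(vec_a, vec_b, large_threshold, small_threshold):
    """True when one group is 'large' and the other is 'small'.

    Large: proportion of positive elements >= large_threshold.
    Small: proportion of positive elements <  small_threshold.
    The check is symmetric in the two groups.
    """
    ra, rb = positive_ratio(vec_a), positive_ratio(vec_b)
    return (ra >= large_threshold and rb < small_threshold) or (
        rb >= large_threshold and ra < small_threshold
    )
```

A pair that returns `True` here would continue to the merge of S160 even though its distance failed the S150 test.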

S170: When the grouping determination unit 150 has determined in step S180 that the condition for continuing the integration process is not satisfied (No), it ends the integration process of S120 to S180. It then retrieves from the work area 160 the group data 7 obtained by the integration process, in which words have been integrated as illustrated in FIG. 7, and outputs it to the group database 170.
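The branching among S150, S180, S160, and S170 described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names `positive_ratio` and `next_step`, the parameter names, and the use of a single shared `ratio_threshold` for both groups are assumptions made for the example. How the distance itself is computed is taken as given.

```python
from typing import Sequence


def positive_ratio(vector: Sequence[float]) -> float:
    """Proportion of elements with a positive value among all vector elements."""
    if not vector:
        return 0.0
    return sum(1 for v in vector if v > 0) / len(vector)


def next_step(
    distance: float,
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    distance_threshold: float,  # hypothetical name for the S150 threshold
    ratio_threshold: float,     # assumption: same threshold for both groups
) -> str:
    """Return the step the selected pair is passed to: "S160" or "S170"."""
    # S150: a distance below the threshold means the pair can be integrated.
    if distance < distance_threshold:
        return "S160"
    # S180: continue only if exactly one group reaches the ratio threshold,
    # i.e. one group is at or above it and the other is below it.
    a_dense = positive_ratio(vec_a) >= ratio_threshold
    b_dense = positive_ratio(vec_b) >= ratio_threshold
    return "S160" if a_dense != b_dense else "S170"
```

For example, with both thresholds set to 0.5, a pair at distance 0.9 whose vectors are `[1, 1, 0, 0]` and `[1, 0, 0, 0]` fails S150 but satisfies S180, so it is still passed to S160.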

(Effect of Embodiment 2)
As described above, even when two groups are not integrated by the group integration that minimizes the distance upon integration, the two groups are still passed to integration if other conditions show that integration is possible. Related words covering a wider range than synonyms can therefore be obtained with higher accuracy.

In Embodiment 2, a further restriction may be placed on the integration of a pair of groups consisting of a group in which the number of positive vector elements is greater than the threshold and a group in which the number of positive vector elements is less than the threshold. Specifically, in S180, a pair of groups may be passed to the update process of S160 only if, for one of the selected groups, the proportion of elements with a positive value among all of its vector elements is equal to or greater than a predetermined threshold, for the other group that proportion is less than a predetermined threshold, and the distance between the groups is less than a threshold. This processing makes it possible to calculate related words covering a wider range than synonyms with still higher accuracy.
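The restricted form of S180 can be sketched as a single predicate. This is a sketch under stated assumptions rather than the method as specified: the text does not say whether the distance threshold used here is distinct from the one in S150, so the example assumes a separate, hypothetical parameter `relaxed_distance_threshold`. The names `positive_ratio` and `s180_restricted` and the single shared `ratio_threshold` are likewise assumptions.

```python
from typing import Sequence


def positive_ratio(vector: Sequence[float]) -> float:
    """Proportion of elements with a positive value among all vector elements."""
    if not vector:
        return 0.0
    return sum(1 for v in vector if v > 0) / len(vector)


def s180_restricted(
    distance: float,
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    ratio_threshold: float,             # assumption: shared by both groups
    relaxed_distance_threshold: float,  # hypothetical second distance threshold
) -> bool:
    """Return True if the pair should be passed to the update process of S160."""
    a_dense = positive_ratio(vec_a) >= ratio_threshold
    b_dense = positive_ratio(vec_b) >= ratio_threshold
    # Both conditions must hold: one group at or above the ratio threshold and
    # the other below it, and the inter-group distance under the threshold.
    return (a_dense != b_dense) and distance < relaxed_distance_threshold
```

Under these assumptions, a pair that satisfies the positive-element condition is still rejected when its distance is too large, which is the additional restriction the paragraph describes.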

[Aspect of the Present Invention as a Program]
Some or all of the functions of the functional units 120 to 150 in the related word calculation device 100 of the embodiments described above may be implemented as a computer program, and the present invention can be realized by having a computer execute this related word calculation program. The procedure of the related word calculation method of the embodiments may likewise be implemented as a computer program and executed by a computer. Further, the program for realizing these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and can be stored or distributed in that form. The program can also be provided over a network, for example via the Internet or electronic mail.

DESCRIPTION OF SYMBOLS
100 ... Related word calculation device
110 ... Word statistical information database
120 ... Vector creation unit (vector creation means)
130 ... Grouping unit (grouping means)
140 ... Inter-group calculation unit (inter-group calculation means)
150 ... Grouping determination unit (grouping determination means)
160 ... Work area
170 ... Group database

Claims

A related word calculation device for grouping words, comprising:
grouping means for creating a group for each word based on a vector of each word, the vector being created based on the co-occurrence frequency of each word from a statistical information database storing co-occurrence frequency information between words;
inter-group calculation means for calculating, for every pair of groups, the amount of increase in diffusivity upon integration of the two groups as the distance between the two groups, from values calculated based on the created vector of each group, for separately calculating the ratio of the calculated distance to the diffusivity of each group before integration, and for excluding from the integration targets any pair of groups for which both of the two ratio values exceed a threshold; and
grouping determination means for selecting, from the pairs of groups remaining after the exclusion, the pair of groups for which the distance between the groups upon integration is minimum, for providing the pair of groups to the grouping means when the distance between the selected groups is less than a threshold, and for outputting the data of each group of the pair when the distance is equal to or greater than the threshold.

The related word calculation device according to claim 1, wherein the grouping determination means adds, as a condition for providing the pair of groups to the grouping means, that the proportion of elements with a positive value among all vector elements of one group of the pair is equal to or greater than a predetermined threshold and the proportion of elements with a positive value among all vector elements of the other group is less than a predetermined threshold.

A related word calculation method for grouping words, comprising:
a step in which grouping means creates a group for each word based on a vector of each word, the vector being created based on the co-occurrence frequency of each word from a statistical information database storing co-occurrence frequency information between words;
a step in which inter-group calculation means calculates, for every pair of groups, the amount of increase in diffusivity upon integration of the two groups as the distance between the two groups, from values calculated based on the created vector of each group, separately calculates the ratio of the calculated distance to the diffusivity of each group before integration, and excludes from the integration targets any pair of groups for which both of the two ratio values exceed a threshold; and
a step in which grouping determination means selects, from the pairs of groups remaining after the exclusion, the pair of groups for which the distance between the groups upon integration is minimum, provides the pair of groups to the grouping means when the distance between the selected groups is less than a threshold, and outputs the data of each group of the pair when the distance is equal to or greater than the threshold.

The related word calculation method according to claim 3, wherein, in providing the pair of groups to the grouping means, it is added as a condition for providing the pair of groups to the grouping means that the proportion of elements with a positive value among all vector elements of one group of the pair is equal to or greater than a predetermined threshold and the proportion of elements with a positive value among all vector elements of the other group is less than a predetermined threshold.

A related word calculation program for causing a computer to function as each of the means constituting the related word calculation device according to claim 1.