JP2590141B2

JP2590141B2 - Collocation extraction processing method

Info

Publication number: JP2590141B2
Application number: JP62260710A
Authority: JP
Inventors: 弘子冨士盛
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1987-10-15
Filing date: 1987-10-15
Publication date: 1997-03-12
Anticipated expiration: 2012-03-12
Also published as: JPH01102679A

Description

DETAILED DESCRIPTION OF THE INVENTION [Summary] The invention relates to a collocation extraction processing method in which a computer extracts collocations from digitized text, for use in extracting technical terms when creating dictionaries for a machine translation system, extracting keywords in an information retrieval system, and similar tasks. Its purpose is to provide a means for automatically extracting, from given digitized text, collocations that are considered to have a special meaning. The method comprises: a word segmentation process that divides the text subject to collocation extraction into words; a collocation creation process that, for each segmented word, combines the preceding and following words within a predetermined number of words to create collocations; an occurrence counting process that counts, for each created collocation, the number of times it occurs in the text; and a duplicate word deletion process that removes duplicated collocations among collocations having the same occurrence count and extracts the remaining collocations.

[Industrial applications]

The present invention relates to a collocation extraction processing method in which a computer extracts collocations from digitized text, for use in extracting technical terms when creating a dictionary for a machine translation system, extracting keywords in an information retrieval system, and the like.

[Conventional technology]

For example, when creating a dictionary for a machine translation system that translates English into Japanese, it is necessary to extract collocations in which a particular combination of words has a special meaning, and to assign the most suitable translation to each such collocation.

Likewise, in an information retrieval system with a database of technical documents, if the terms considered important in each document are extracted in advance as keywords, searches using those keywords can be performed efficiently. It is desirable, however, that such keywords can be specified not only as single words but also as collocations.

Conventionally, such collocations have been extracted by having a person read the text and pick them out by empirical judgment appropriate to the field in which each collocation is used. This takes a great deal of time and labor.

Separately, a method has been proposed in which a computer automatically extracts from digitized text not collocations but distinctive single words, that is, words that may carry a special meaning, such as technical terms and keywords.

In this method, to extract words that may serve as technical terms or keywords from digitized text, a computer counts the number of occurrences of each word in the text, identifies the frequently occurring words, uniformly removes the so-called stop words from among them, and treats the remainder as distinctive words. Stop words are words that are used frequently but clearly cannot be distinctive words. In English, examples are articles such as "a" and "the", forms of the verb BE such as "is" and "are", and prepositions such as "to" and "in".
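The conventional single-word method described above can be sketched roughly as follows. This is only an illustration: the function name `distinctive_words`, the tiny `STOP_WORDS` set, and the sample text are assumptions made for the example and do not come from the patent.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (articles, BE verbs, prepositions).
STOP_WORDS = {"a", "the", "is", "are", "to", "in"}

def distinctive_words(text, top_n=5):
    """Count word occurrences, drop stop words, and return the most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

sample = "The core is in the memory. The core is magnetic. A core stores a bit."
print(distinctive_words(sample, top_n=1))  # [('core', 3)]
```

Because the count is taken word by word, a term such as "magnetic core" can never surface as a unit, which is the limitation the following section addresses.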

[Problems to be solved by the invention]

The conventional method, which selects distinctive words simply by occurrence count, cannot automatically extract collocations in which a combination of adjacent words is considered to have a special meaning different from the meanings of the individual words. Moreover, if words from the distinctive-word extraction results are simply combined with one another, many meaningless collocations are output.

For example, suppose that in creating a translation dictionary, "DEEP FREEZER" must be given the technical rendering "冷凍庫" (freezer). If only single words are extracted, "DEEP" and "FREEZER" are extracted separately. "DEEP" is then assigned the translation "深い" (deep) and "FREEZER" the translation "フリーザー" (the loanword for freezer), so that "DEEP FREEZER" comes out as "深いフリーザー" (literally, "deep freezer").

The present invention aims to solve the above problems by providing a means for extracting, from given digitized text, collocations considered to have a special meaning, automatically and with meaningless collocations excluded as far as possible.

[Means for solving the problem]

FIG. 1 is a diagram illustrating the principle of the present invention.

In FIG. 1, 10 denotes an input device such as a keyboard; 11, a computer comprising a CPU, memory, and the like; 12, a document file storing the text data subject to collocation extraction; 13, an extraction table used as a work table for collocation extraction; and 14, an output device, such as a printer or an external storage device, to which the collocation extraction results are output.

In the input process P1, the text subject to collocation extraction is entered from the input device 10, for example in the same way as in word processing, and the document file 12 is created. The document file 12 need not have been created on the computer 11; a file created on another computer may also be used.

In the word segmentation process P2, the text of the document file 12 is read out and each sentence is divided into words. The resulting word information is stored in a work area.

In the collocation creation process P3, for each word of each segmented sentence, the preceding and following words are combined within a predetermined number of words, for example three or four, to create collocations, and the collocation information is stored in the extraction table 13.

In the occurrence counting process P4, the number of times each created collocation occurs in the text subject to extraction is counted, and the result is entered in the extraction table 13. The counting may be performed sentence by sentence, or once at the end for the text as a whole.
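Processes P2 through P4 can be sketched as follows, counting sentence by sentence. This is a minimal sketch under stated assumptions: the name `count_collocations`, the use of a `Counter` to stand in for the extraction table 13, whitespace-only word splitting, and the sample sentences are all illustrative choices, not details taken from the patent.

```python
from collections import Counter

def count_collocations(sentences, max_len=3):
    """Accumulate counts of every run of 2..max_len adjacent words, one sentence at a time."""
    table = Counter()  # stands in for the extraction table 13
    for sentence in sentences:
        words = sentence.upper().split()      # P2: word segmentation (whitespace only)
        for n in range(2, max_len + 1):       # P3: collocation creation
            for i in range(len(words) - n + 1):
                table[tuple(words[i:i + n])] += 1  # P4: occurrence counting
    return table

sentences = [
    "magnetic core storage is fast",
    "the magnetic core storage failed",
]
table = count_collocations(sentences)
print(table[("MAGNETIC", "CORE", "STORAGE")])  # 2
```

Collocations never span a sentence boundary here, because each sentence is segmented and combined on its own.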

In the duplicate word deletion process P5, the extraction table 13 is consulted and updated so that, among collocations having the same occurrence count, duplicated collocations are removed and the remaining collocations are extracted as the desired collocations. The sort-by-occurrence-count process P51 is an auxiliary process called by the duplicate word deletion process P5 in order to gather together collocations having the same occurrence count.

In the output process P6, the collocations finally remaining in the extraction table 13 are output, for example in alphabetical order, to the output device 14, such as a printer or a magnetic disk unit.

[Action]

FIG. 2 is a diagram for explaining the operation of the present invention.

Suppose that, after the word segmentation process P2 and the collocation creation process P3 shown in FIG. 1, the occurrence counting process P4 yields, for example, the extraction table 13 shown in FIG. 2(a). That is, in the text subject to collocation extraction, "CORE STORAGE" occurs 10 times, "MAGNETIC CORE" occurs 10 times, and "MAGNETIC CORE STORAGE" occurs 10 times. From these counts it is clear that "CORE STORAGE" and "MAGNETIC CORE" never occur on their own apart from "MAGNETIC CORE STORAGE", and that "MAGNETIC CORE STORAGE" is always used as a three-word unit. The duplicate word deletion process P5 therefore removes "CORE STORAGE" and "MAGNETIC CORE" and leaves only "MAGNETIC CORE STORAGE".

As another example, suppose that, as shown in FIG. 2(b), "CORE STORAGE" occurs 10 times, "MAGNETIC CORE STORAGE" occurs 10 times, and "MAGNETIC CORE" occurs 30 times. From these counts it is clear that "CORE STORAGE" always occurs as part of "MAGNETIC CORE STORAGE". The duplicate word deletion process P5 therefore removes "CORE STORAGE" and leaves both "MAGNETIC CORE STORAGE" and "MAGNETIC CORE", the latter because it also occurs on its own.
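The two cases of FIG. 2 can be reproduced with a short sketch. Note that this is a simplified formulation of the principle (drop a collocation when a longer collocation with the same count contains it), not the sort-and-compare procedure of the embodiment; the helper names `is_contiguous_part` and `remove_duplicates` are hypothetical.

```python
def is_contiguous_part(short, long):
    """True if `short` occurs as a strictly shorter run of adjacent words inside `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def remove_duplicates(table):
    """Keep a collocation unless a longer one with the same count contains it."""
    kept = {}
    for colloc, count in table.items():
        absorbed = any(
            other_count == count and is_contiguous_part(colloc, other)
            for other, other_count in table.items()
        )
        if not absorbed:
            kept[colloc] = count
    return kept

fig_2a = {
    ("CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE"): 10,
    ("MAGNETIC", "CORE", "STORAGE"): 10,
}
fig_2b = {
    ("CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE"): 30,
}
print(remove_duplicates(fig_2a))
print(remove_duplicates(fig_2b))
```

In the second case "MAGNETIC CORE" survives because its count (30) differs from that of the three-word collocation (10), which signals that it also occurs independently.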

Because processing is based on occurrence counts in this way, meaningless collocations among those created by combining adjacent words are removed, and only collocations whose combination is likely to be meaningful are extracted.

[Example]

FIG. 3 is a diagram explaining the processing of one embodiment of the present invention, FIG. 4 is a diagram explaining word segmentation and collocation creation according to the embodiment, FIG. 5 shows an example of the duplicate word deletion processing according to the embodiment, and FIG. 6 shows a specific example of duplicate word deletion according to the embodiment.

An example of processing according to one embodiment of the present invention is described below, following the steps shown in FIG. 3.

One sentence is read from the document file 12 shown in FIG. 1.

Whether a sentence is available is determined by whether the read succeeded. When the sentences are exhausted, control is transferred to the step in which collocations are read one at a time from the extraction table 13.

The sentence is divided into words. In English, it can be divided at spaces, commas, and the like. In Japanese, it may be divided by a method such as that used for batch kana-to-kanji conversion in word processors.

Within the predetermined number of words, adjacent words are combined to create collocations. For example, if the limit is set at four words, two-word, three-word, and four-word collocations are created.

The occurrence count is accumulated for each collocation. The result is stored in the extraction table 13 shown in FIG. 1 or in another work area.

The next sentence is read from the document file 12, control returns to the check for whether a sentence is available, and the same processing is repeated until all sentences have been processed.

When all sentences have been processed, the collocations are read one at a time from the extraction table 13 or other work area.

It is determined whether any unprocessed collocation remains in the extraction table 13 or other work area. When all collocations have been processed, control is transferred to the duplicate word deletion step.

The collocation is checked to determine whether it is meaningless. This is judged by predetermined rules. For English, for example, a collocation is meaningless if (1) it ends in an article, or (2) it contains a BE verb or a HAVE verb. If the collocation is meaningless, processing skips to reading the next collocation; otherwise it proceeds to the step that stores the collocation.

Collocations that are not meaningless are stored in the extraction table 13.

The next collocation is read, control returns to the check for unprocessed collocations, and the same processing is repeated.

When all collocations have been processed, duplicate words within the extraction table 13 are deleted. The details of this processing are described later with reference to FIG. 5. The desired collocations are determined from the result of the duplicate word deletion.
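The rule-based check for meaningless collocations in the steps above might look like the following sketch. The two rules are the ones given in the text (ends in an article; contains a BE or HAVE verb), but the concrete word lists and the name `is_meaningless` are assumptions for illustration; the patent does not enumerate the words.

```python
# Illustrative word lists; the patent names the categories but not the members.
ARTICLES = {"a", "an", "the"}
BE_HAVE_VERBS = {
    "be", "am", "is", "are", "was", "were", "been", "being",
    "have", "has", "had", "having",
}

def is_meaningless(colloc):
    """Apply rule (1) ends in an article, and rule (2) contains a BE or HAVE verb."""
    words = [w.lower() for w in colloc]
    return words[-1] in ARTICLES or any(w in BE_HAVE_VERBS for w in words)

candidates = [
    ("while", "the"),
    ("sun", "is", "hot"),
    ("magnetic", "core"),
]
kept = [c for c in candidates if not is_meaningless(c)]
print(kept)  # [('magnetic', 'core')]
```

Filtering before duplicate deletion keeps the extraction table smaller, which matches the order of the steps in FIG. 3.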

Concretely, the word segmentation process P2 and the collocation creation process P3 shown in FIG. 1 perform word segmentation and collocation creation as shown in FIG. 4.

For example, suppose the English sentence (a) "Make hay while the sun shines." is to be handled. In word segmentation, the sentence is divided into individual words as shown in FIG. 4(b). Next, each pair of adjacent words is joined to create the two-word collocations (c). Similarly, the collocations (d) consisting of three consecutive words are created. Collocations are created in this way up to the predetermined number of words, and all of the results are passed to the next stage, the occurrence counting process P4.
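The FIG. 4 example can be traced with a small sketch. The function names `split_words` and `make_collocations` and the punctuation set used for splitting are assumptions for illustration; the text only says that English can be divided at spaces, commas, and the like.

```python
import re

def split_words(sentence):
    """Split an English sentence at whitespace and common punctuation marks."""
    return [w for w in re.split(r"[\s,.;:!?]+", sentence) if w]

def make_collocations(words, max_len):
    """Return every run of 2..max_len adjacent words as a tuple, shortest first."""
    result = []
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            result.append(tuple(words[i:i + n]))
    return result

words = split_words("Make hay while the sun shines.")
collocations = make_collocations(words, max_len=3)
for c in collocations:
    print(" ".join(c))
```

For this six-word sentence the sketch yields five two-word collocations ("Make hay" through "sun shines") and four three-word collocations ("Make hay while" through "the sun shines").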

FIG. 5 shows in detail part of the processing performed by the duplicate word deletion process P5 shown in FIG. 1. FIG. 6 shows a specific example of the processing of FIG. 5. To simplify the explanation, the following assumptions are made.

(a) The words E, G, N, O, and R are all different words.

(b) Each appears n times in the text, in the word sequence [sequence not reproduced in the source text].

(c) The maximum length of a collocation is five words.

Under these assumptions, the following steps [i] to [v] are performed to delete duplicated collocations.

[i] The collocations in the extraction table 13 are sorted in ascending order by occurrence count, first word, second word, third word, fourth word, and fifth word. After this step, the contents of the extraction table 13 relating to the word sequence in question are as shown in FIG. 6(a).

[ii] Next, duplicate words are deleted for each collocation in the extraction table 13 by the steps shown in FIG. 5. That is, the collocation currently under consideration is compared with the next collocation word by word (first word, second word, and so on) until its words are exhausted. If every word of the current collocation matches the left-hand part of the next collocation, the current collocation is deleted; otherwise it is left in the extraction table 13.

For example, when a three-word collocation is followed by a collocation whose first through third words all match it, the shorter collocation is deleted and the longer one is kept. Conversely, when a collocation is followed by one whose first word does not match, no deletion is performed.
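Steps [i] and [ii] amount to a sort followed by a comparison of each row with its successor. The sketch below assumes the extraction table is held as a list of `(count, words)` pairs and that the comparison is restricted to rows with the same occurrence count, as the summary of the method states; the function name and the sample rows (taken from FIG. 2(a) rather than FIG. 6, whose word sequence is not available) are illustrative.

```python
def delete_prefix_duplicates(rows):
    """Sort rows by (count, words); drop a row when the next row has the same
    count and begins with all of that row's words."""
    rows = sorted(rows)  # step [i]: tuples compare by count, then word by word
    kept = []
    for i, (count, words) in enumerate(rows):
        if i + 1 < len(rows):
            next_count, next_words = rows[i + 1]
            # step [ii]: does the current row match the left part of the next row?
            if next_count == count and next_words[:len(words)] == words:
                continue  # delete the current (shorter) collocation
        kept.append((count, words))
    return kept

rows = [
    (10, ("MAGNETIC", "CORE", "STORAGE")),
    (10, ("CORE", "STORAGE")),
    (10, ("MAGNETIC", "CORE")),
]
print(delete_prefix_duplicates(rows))
```

Comparing only neighbours is sufficient because, after sorting, a collocation is immediately followed by its extensions with the same count, if any exist. "CORE STORAGE" survives this pass since it is a trailing part, not a leading part, of the longer collocation.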

As a result of this deletion, the collocations in the extraction table 13 shown in FIG. 6(a) are reduced to those shown in FIG. 6(b).

[iii] Next, in order to check for matches at the end of each collocation, each collocation is right-justified as shown in FIG. 6(c).

[iv] The collocations shown in FIG. 6(c) are then re-sorted, this time in ascending order by occurrence count, fifth word, fourth word, third word, second word, and first word. The result is as shown in FIG. 6(d).

[v] The same processing as in step [ii] is performed, deleting any collocation that matches the next one. In the comparison steps of FIG. 5 used in step [ii], however, the words are now compared for match or mismatch in the order fifth word, fourth word, third word, second word.
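Steps [i] to [v] together can be sketched as two passes over the table. In this sketch the right-justification of step [iii] is modelled by reversing each collocation's word order, which is an assumed but equivalent way to compare from the last word backwards; the function names are hypothetical and the data again comes from FIG. 2.

```python
def drop_if_next_extends(rows):
    """Sort by (count, words) and drop each row that the next row extends."""
    rows = sorted(rows)
    return [
        row for i, row in enumerate(rows)
        if not (
            i + 1 < len(rows)
            and rows[i + 1][0] == row[0]                     # same occurrence count
            and rows[i + 1][1][:len(row[1])] == row[1]       # next row starts with this row
        )
    ]

def delete_duplicates(table):
    """Forward pass (steps [i]-[ii]) then backward pass (steps [iii]-[v])."""
    rows = [(count, words) for words, count in table.items()]
    rows = drop_if_next_extends(rows)                # leading-part matches
    rows = [(c, w[::-1]) for c, w in rows]           # reverse words: last word first
    rows = drop_if_next_extends(rows)                # trailing-part matches
    return {w[::-1]: c for c, w in rows}             # restore original word order

fig_2a = {
    ("CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE"): 10,
    ("MAGNETIC", "CORE", "STORAGE"): 10,
}
fig_2b = {
    ("CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE", "STORAGE"): 10,
    ("MAGNETIC", "CORE"): 30,
}
print(delete_duplicates(fig_2a))
print(delete_duplicates(fig_2b))
```

The forward pass removes "MAGNETIC CORE" in the first case, and the backward pass removes "CORE STORAGE" in both cases, reproducing the outcomes described for FIG. 2(a) and FIG. 2(b).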

As a result of the above processing, the collocation shown in FIG. 6(e) finally remains in the extraction table 13. The shorter collocations never occur on their own as isolated fragments; they are always used as part of the complete word sequence. Here, therefore, the collocation shown in FIG. 6(e) is the desired collocation.

The collocations obtained in this way can be used, for example, when creating a translation dictionary for a machine translation system: each collocation is given a translation as if it were a single word, which makes more appropriate translation possible.

In an information retrieval system that treats a collocation as a single keyword, the method can also be used for registering such collocations as keywords.

[Effect of the invention]

As described above, the present invention makes it possible to automatically extract, from digitized text, collocations whose word sequence is considered to have a special meaning. This is useful for developing dictionaries for machine translation, extracting keywords for information retrieval, and similar purposes.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the principle of the present invention, FIG. 2 is a diagram illustrating its operation, FIG. 3 is a diagram explaining the processing of one embodiment, FIG. 4 is a diagram explaining word segmentation and collocation creation according to the embodiment, FIG. 5 shows an example of the duplicate word deletion processing according to the embodiment, and FIG. 6 shows a specific example of duplicate word deletion according to the embodiment. In the figures, P1 denotes the input process, P2 the word segmentation process, P3 the collocation creation process, P4 the occurrence counting process, P5 the duplicate word deletion process, P51 the sort process, P6 the output process, 10 the input device, 11 the computer, 12 the document file, 13 the extraction table, and 14 the output device.

Claims

(57) [Claims]

1. A collocation extraction processing method in which a computer extracts collocations from digitized text, characterized by comprising: a word segmentation process (P2) for dividing the text subject to collocation extraction into words; a collocation creation process (P3) for creating collocations by combining, for each of the divided words, the preceding and following words within a predetermined number of words; an occurrence counting process (P4) for counting, for each created collocation, the number of times it occurs in the text; and a duplicate word deletion process (P5) for removing duplicated collocations among collocations having the same occurrence count and extracting the remaining collocations.