JP3100556B2

JP3100556B2 - Part-of-speech assignment apparatus

Info

Publication number: JP3100556B2
Application number: JP08232993A
Authority: JP
Inventors: Ezra W. Black; Hideki Kashioka; Stephen G. Eubank
Original assignee: ATR Interpreting Telecommunications Research Laboratories
Priority date: 1996-09-03
Filing date: 1996-09-03
Publication date: 2000-10-16
Anticipated expiration: 2016-09-03
Also published as: JPH1078958A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a part-of-speech assignment apparatus that automatically assigns parts of speech to the text data of a sentence consisting of a character string.

[0002]

[Prior Art] Relatively accurate part-of-speech assignment systems (hereinafter referred to as the conventional example) have been reported in Prior Art Document 1, E. Brill et al., "Some Advances in Transformation-Based Part of Speech Tagging", Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 722-727, AAAI, 1994, and in Prior Art Document 2, B. Merialdo et al., "Tagging English Text with a Probabilistic Model", Computational Linguistics, 20-2, pp. 155-171, 1994. The conventional system assigns parts of speech to text data by referring to a part-of-speech dictionary that lists pairs of a word form and the part-of-speech labels that the form can take.

[0003]

[Problems to Be Solved by the Invention] Because the conventional system assigns parts of speech using a dictionary, it is difficult to assign a part of speech to an unknown word that has no dictionary entry, and it is also difficult to handle an unknown combination of a word and a part-of-speech label. In addition, the dictionary must be maintained whenever the part-of-speech system in use changes. There are also part-of-speech assignment apparatuses that do not use a dictionary and instead assign part-of-speech labels to words by heuristics (that is, by rules of thumb or experience), but their assignment accuracy is relatively low.

[0004] An object of the present invention is to solve the above problems and to provide a part-of-speech assignment apparatus that assigns parts of speech automatically and more accurately than the conventional example, without using a part-of-speech dictionary.

[0005]

[Means for Solving the Problems] The part-of-speech assignment apparatus according to claim 1 of the present invention comprises: a first storage device that stores an attribute list having a plurality of attributes, including spelling features of each word, features based on how the word is used in a sentence, and a hierarchical classification based on the mutual information of words; a second storage device that stores a part-of-speech list containing a plurality of parts of speech; a third storage device that stores a generated decision tree with frequency probabilities; decision tree learning means that, on the basis of part-of-speech tagged text data consisting of word sequences and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, generates a decision tree for part-of-speech assignment having a binary tree structure that is split depending on the attribute values of the attributes, computes and attaches frequency probabilities for a plurality of parts of speech to each leaf node, that is, each node of the generated decision tree that is not split further, and thereby generates a decision tree with frequency probabilities and stores it in the third storage device; and part-of-speech assignment means that, using the decision tree with frequency probabilities stored in the third storage device and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, selects, for input text data consisting of a word sequence, the top n frequency probabilities (n being two or more) from among the frequency probabilities attached to the leaf node, assigns them to each word of the text data, and determines and outputs, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word sequence of the text data.

[0006] The part-of-speech assignment apparatus according to claim 2 is the apparatus according to claim 1, wherein, when splitting in binary tree form, the decision tree learning means selects as the split candidate the attribute that maximizes the difference (H_0 - H) between the entropy H_0 before the split by each attribute, which represents the priority of the effectiveness of the attribute, and the entropy H after the split, and updates the decision tree by splitting it in binary tree form when a predetermined split continuation criterion is satisfied.

[0007] The part-of-speech assignment apparatus according to claim 3 is the apparatus according to claim 2, wherein the split continuation criterion is that (I) the entropy difference (H_0 - H) obtained when splitting on the selected attribute is at least a predetermined entropy threshold H_th, and (II) the number of events after the split on the selected attribute, each event being a tuple of an attribute, its attribute value, and a part of speech, is at least a predetermined event count threshold D_th.

[0008] The part-of-speech assignment apparatus according to claim 4 is the apparatus according to claim 1, 2, or 3, wherein the part-of-speech assignment means selects the top n frequency probabilities from among the frequency probabilities attached to the leaf node and assigns them to each word of the text data, then uses a predetermined stack decoder algorithm to restrict the part-of-speech candidates to those whose joint probability for the word sequence processed so far is at least a predetermined joint probability, and determines, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word sequence of the text data at the end of processing.

[0009]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a part-of-speech assignment system according to one embodiment of the present invention, comprising a decision tree learning device and a part-of-speech assignment device. This system assigns parts of speech to English text data without referring to a part-of-speech dictionary, and comprises: (a) a decision tree learning device 10 that, on the basis of the part-of-speech tagged text data stored in the part-of-speech tagged text memory 21 and with reference to the attribute list stored in the attribute list memory 22 and the part-of-speech tag list stored in the part-of-speech tag list memory 23, executes the decision tree learning process described in detail later, thereby generating a decision tree with frequency probabilities and storing it in the probabilistic decision tree file memory 24; and (b) a part-of-speech assignment device 11 that, using the decision tree with frequency probabilities stored in the probabilistic decision tree file memory 24 and with reference to the attribute list stored in the attribute list memory 22 and the part-of-speech tag list stored in the part-of-speech tag list memory 23, executes the part-of-speech assignment process described in detail later on the input text data stored in the text data memory 25, thereby assigning parts of speech, generating part-of-speech tagged text data, and storing it in the part-of-speech tagged text data memory 26. In this embodiment, text data means an English sentence consisting of a sequence of English words.

[0010] The decision tree learning device 10, on the basis of part-of-speech tagged text data consisting of word sequences, uses a plurality of attributes, including spelling features of each word, features based on how the word is used in a sentence, and a hierarchical classification based on the mutual information of words, to generate a decision tree for part-of-speech assignment having a binary tree structure that is split depending on the attribute values of the attributes. It then computes and attaches frequency probabilities for a plurality of parts of speech to each leaf node, that is, each node of the generated decision tree that is not split further, and thereby generates a decision tree with frequency probabilities. The part-of-speech assignment device 11 then uses the decision tree with frequency probabilities generated by the decision tree learning device 10 and, for input text data consisting of a word sequence, selects the top n frequency probabilities from among those attached to the leaf node, assigns them to each word of the text data, and determines and outputs, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word sequence of the text data. When splitting in binary tree form, the decision tree learning device 10 selects as the split candidate the attribute that maximizes the difference (H_0 - H) between the entropy H_0 before the split by each attribute, which represents the priority of the effectiveness of the attribute, and the entropy H after the split, and updates the decision tree by splitting it in binary tree form when a predetermined split continuation criterion is satisfied. After selecting the top n frequency probabilities and assigning them to each word, the part-of-speech assignment device 11 uses a predetermined stack decoder algorithm to restrict the part-of-speech candidates to those whose joint probability for the word sequence processed so far is at least a predetermined joint probability, and determines, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word sequence of the text data at the end of processing.
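The candidate restriction described above can be illustrated with a minimal Python sketch. It is not the stack decoder of the embodiment, whose details are not given here; it only shows left-to-right extension with the top n tags per word and pruning by a joint probability threshold. The names `tag_sentence`, `leaf_distribution`, `min_joint`, and `toy_distribution` are hypothetical, and keeping the single best hypothesis when nothing passes the threshold is an added assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tags so far, joint probability)


def tag_sentence(
    words: Sequence[str],
    leaf_distribution: Callable[[Sequence[str], int, Tuple[str, ...]], Dict[str, float]],
    n: int = 2,
    min_joint: float = 1e-6,
) -> Hypothesis:
    """Extend hypotheses left to right with the top-n tags of each word and
    keep only those whose joint probability is at least min_joint."""
    hypotheses: List[Hypothesis] = [((), 1.0)]
    for i in range(len(words)):
        extended: List[Hypothesis] = []
        for tags, joint in hypotheses:
            # Hypothetical stand-in for reading the leaf reached in the tree.
            dist = leaf_distribution(words, i, tags)
            top_n = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n]
            extended.extend((tags + (tag,), joint * p) for tag, p in top_n)
        survivors = [h for h in extended if h[1] >= min_joint]
        # Assumption: never let the candidate set become empty.
        hypotheses = survivors or [max(extended, key=lambda h: h[1])]
    return max(hypotheses, key=lambda h: h[1])


def toy_distribution(words, i, prev_tags):
    """Toy leaf distributions that depend on the word and the previous tag."""
    if words[i] == "the":
        return {"det": 0.9, "n": 0.1}
    if prev_tags and prev_tags[-1] == "det":
        return {"n": 0.8, "v": 0.2}
    return {"v": 0.6, "n": 0.4}


best_tags, best_joint = tag_sentence(["the", "walks"], toy_distribution)
print(best_tags, best_joint)
```

Because the distribution for each word is requested per hypothesis, the tag history can influence later words, which matches the non-Markov treatment of context in this embodiment.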

[0011] In this embodiment, the decision tree learning process uses knowledge obtained from part-of-speech tagged text data to generate a decision tree with frequency probabilities, having a binary tree structure, and parts of speech are assigned with it. The attributes used in the decision tree are linguistic features and statistical features obtained from a corpus. Conventional part-of-speech assignment generally restricts the candidate parts of speech by dictionary lookup and then selects the most appropriate one from among them, taking into account relations with the neighbouring words. This approach, however, incurs the cost of creating and maintaining the dictionary. It also requires special handling for words that have no dictionary entry (unknown words) and for words used as a part of speech that is not among the dictionary's candidates. The method of this embodiment does not use a dictionary to determine the part of speech of a word, so the cost of creating and maintaining a dictionary does not arise. The decision tree with frequency probabilities is built by learning from part-of-speech tagged text, so the method adapts flexibly to any part-of-speech system for which tagged text data is available. The frequency probabilities also allow the priority of part-of-speech sequences to be determined automatically. A decision tree is a tree-structured model that classifies an object into an appropriate class on the basis of a plurality of attributes and their values. In part-of-speech assignment, the object is a word and the class is a part of speech. The attributes include spelling features of each word, features based on how the word is used in the sentence, and a hierarchical classification based on the mutual information of words. The part-of-speech assignment system of this embodiment is described in detail below.

[0012] In FIG. 1, the decision tree learning device 10, on the basis of the part-of-speech tagged text data stored in the part-of-speech tagged text memory 21 and with reference to the attribute list stored in the attribute list memory 22 and the part-of-speech tag list stored in the part-of-speech tag list memory 23, executes the decision tree learning process described in detail later, thereby generating a decision tree with frequency probabilities and storing it in the probabilistic decision tree file memory 24. The part-of-speech assignment device 11 then uses the decision tree with frequency probabilities stored in the probabilistic decision tree file memory 24 and, with reference to the attribute list stored in the attribute list memory 22 and the part-of-speech tag list stored in the part-of-speech tag list memory 23, executes the part-of-speech assignment process described in detail later on the input text data stored in the text data memory 25, thereby assigning parts of speech, generating part-of-speech tagged text data, and storing it in the part-of-speech tagged text data memory 26. The generated part-of-speech tagged text data may be output to an output device such as a CRT display or a printer.

[0013] The decision tree learning device 10 and the part-of-speech assignment device 11 are each implemented, for example, as a digital computer comprising a CPU that executes each process, a ROM (read-only memory) that stores the program for each process and the data needed to execute it, and a RAM (random access memory) used as the working memory of the CPU. The memories 21 to 26 are implemented, for example, as hard disk memories.

[0014] Table 1 shows an example of the part-of-speech tag list stored in the part-of-speech tag list memory 23. Table 2 shows an example of the attribute list stored in the attribute list memory 22.

[0015]

[Table 1] Part-of-speech tag list
───────────────────────────────────
Part-of-speech tag    Meaning
───────────────────────────────────
NN1INTER-ACT          Singular common noun, interaction
NP1CITYNM             Proper noun, city name
IIIN                  Preposition IN
JJVVGINTER-ACT        Present participle used adjectivally, interaction
VVGCONSUME            Present participle, consumption
VVGRECEIVE            Present participle, receiving
...                   ...
───────────────────────────────────

[0016]

[Table 2]
───────────────────────────────────
Attribute                                             Attribute value
───────────────────────────────────
Classification code based on word mutual information  Hierarchical classification code
Target word ends in "-ing"                            Yes, No
Target word ends in "-ed"                             Yes, No
Length of the target word                             Numerical word length (for example, 4 for "word")
Part-of-speech attribute value of the preceding word  Part-of-speech attribute value
Part-of-speech attribute value of the current word    Part-of-speech attribute value
Sentence ends with "?"                                Yes, No
...                                                   ...
───────────────────────────────────
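As an illustration of how the attributes in Table 2 might be computed for one word, the following Python sketch builds an attribute dictionary. It assumes the sentence is already tokenised with final punctuation as its own token. The function name `extract_attributes`, the dictionary keys, and the `cluster_code` lookup table are hypothetical, not names used in the patent.

```python
from typing import Dict, List


def extract_attributes(
    words: List[str],
    i: int,
    prev_pos_attribute: str,
    cluster_code: Dict[str, str],
) -> Dict[str, object]:
    """Compute a few Table 2 style attributes for the word at position i."""
    word = words[i]
    lowered = word.lower()
    return {
        # Hierarchical classification code; empty string if the word is unseen.
        "cluster_code": cluster_code.get(lowered, ""),
        "ends_with_ing": lowered.endswith("ing"),
        "ends_with_ed": lowered.endswith("ed"),
        "length": len(word),
        # Coarse part-of-speech attribute of the preceding word.
        "prev_pos_attribute": prev_pos_attribute,
        "sentence_ends_with_question": bool(words) and words[-1] == "?",
    }


sentence = ["Are", "you", "checking", "in", "?"]
print(extract_attributes(sentence, 2, "n", {"checking": "0110"}))
```

None of these attributes requires a dictionary entry for the word, which is why unknown words can be handled in the same way as known ones.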

[0017] Here, the part-of-speech attribute is an attribute that coarsely classifies parts of speech into 32 types; its values are, for example, v (verb), n (noun), determin (article), and punct (symbol). The hierarchical classification code based on word mutual information is a code obtained with the word classification method disclosed, for example, in Japanese Patent Application No. H8-027809 and in Prior Art Document 3, Akira Ushioda, "Hierarchical Clustering of Words", Proceedings of COLING '96, The 16th International Conference on Computational Linguistics, Vol. 2, pp. 1159-1162, August 1996. In this method, words with relatively low frequency in the text data are first classified on the criterion that words which are frequently adjacent to the same word are assigned to the same class. The classification result is then divided into three layers, a middle layer, an upper layer, and a lower layer, and words are classified layer by layer, in the order middle, upper, lower, using a predetermined average mutual information, which is a global cost function over all words in the text data. The clustering method based on mutual information assumes a text of T words, a vocabulary of V words, and a vocabulary partition function π, where π is a mapping function from the vocabulary V to the set C of word classes in the vocabulary. The likelihood L(π) of the bigram class model that generates text data consisting of a plurality of words is given by the following equation.

[0018]

[Equation 1] L(\pi) = -H_m + I

[0019] Here, H_m is the entropy of the monogram (unigram) word distribution, and I is the average mutual information (hereinafter abbreviated AMI) of two adjacent classes C_1 and C_2 in the text data, which can be computed by the following equation.

[0020]

[Equation 2] I = \sum_{C_1, C_2} \Pr(C_1, C_2) \log \frac{\Pr(C_1, C_2)}{\Pr(C_1) \Pr(C_2)}

[0021] Here, Pr(C_1) is the probability of occurrence of a word of the first class C_1, Pr(C_2) is the probability of occurrence of a word of the second class C_2, Pr(C_1 | C_2) is the conditional probability that a word of the first class C_1 occurs after a word of the second class C_2 has occurred, and Pr(C_1, C_2) is the probability that a word of the first class C_1 and a word of the second class C_2 occur adjacent to each other. The AMI of Equation 2 therefore expresses a relative frequency: the probability that words of two different classes C_1 and C_2 occur adjacent to each other, divided by the product of the probability of occurrence of a word of class C_1 and that of a word of class C_2. Because the entropy H_m does not depend on the mapping function π, the mapping function that maximizes the AMI also maximizes the likelihood L(π) of the text. The AMI can therefore be used as the objective function for constructing the word classes.
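The average mutual information can be estimated from the class labels of consecutive words, as in the sketch below. It assumes that each word has already been mapped to its class by the partition function, estimates all probabilities by relative frequency over adjacent pairs, and uses base-2 logarithms; the logarithm base and the function name `average_mutual_information` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def average_mutual_information(class_sequence: Sequence[str]) -> float:
    """Estimate I = sum Pr(C1,C2) * log2(Pr(C1,C2) / (Pr(C1) * Pr(C2)))
    over adjacent class pairs in the sequence."""
    bigrams = Counter(zip(class_sequence, class_sequence[1:]))
    total = sum(bigrams.values())
    first = Counter()   # counts of C1 as the first element of a pair
    second = Counter()  # counts of C2 as the second element of a pair
    for (c1, c2), count in bigrams.items():
        first[c1] += count
        second[c2] += count
    ami = 0.0
    for (c1, c2), count in bigrams.items():
        p12 = count / total
        ami += p12 * math.log2(p12 / ((first[c1] / total) * (second[c2] / total)))
    return ami


# Two classes that strictly alternate are fully predictable from each other.
print(average_mutual_information(["A", "B", "A", "B", "A"]))
```

A clustering procedure would evaluate this quantity for candidate partitions and prefer the one with the larger value, since the entropy term of the likelihood does not change with the partition.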

[0022] The above word classification method can generate a tree structure in the form of a balanced binary tree in which words with similar semantic or syntactic features are placed close to one another. At the end of the process, a bit string (word bits) can be assigned to each word in the vocabulary by tracing the path from the root node to the leaf node and assigning one bit to each branch, 0 for a branch to the left and 1 for a branch to the right.
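The path tracing that yields the word bits can be sketched in a few lines of Python. The tree representation used here, a nested tuple `(left, right)` with words as leaves, and the function name `assign_word_bits` are assumptions for illustration.

```python
from typing import Dict, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def assign_word_bits(tree: Tree, prefix: str = "") -> Dict[str, str]:
    """Trace every root-to-leaf path, appending 0 for a left branch and
    1 for a right branch, and return the bit string of each word."""
    if isinstance(tree, str):  # a leaf holds one word
        return {tree: prefix}
    left, right = tree
    codes = assign_word_bits(left, prefix + "0")
    codes.update(assign_word_bits(right, prefix + "1"))
    return codes


toy_tree = (("walk", "run"), ("city", "town"))
print(assign_word_bits(toy_tree))
```

Words that share a long prefix sit in the same subtree, so a prefix of the bit string serves as a coarser class and the full string as the finest one.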

[0023] Next, the algorithm of the decision tree learning process, which builds the decision tree, and the algorithm of the part-of-speech assignment process are described.

[0024] In the decision tree learning process, the effectiveness of each attribute is computed independently of the other attributes, and an efficient order of classification by attribute for determining the class is built as a tree structure split in binary tree form. The effectiveness of an attribute is evaluated by the entropy H after splitting on that attribute. The entropy here represents the priority of the effectiveness of the attribute. That is, when a node is split by an attribute B into a node N_1 and a node N_2, the entropy H_0 before the split, the entropy H after the split, the entropy H_1 of node N_1, and the entropy H_2 of node N_2 are given by the following equations.

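[0025]

[Equation 3] H_0 = -\sum_{tag_{all}} p(tag_{all}) \log p(tag_{all})

[Equation 4] H = p_1 H_1 + (1 - p_1) H_2, where

[Equation 5] H_1 = -\sum_{tag_{N_1}} p(tag_{N_1}) \log p(tag_{N_1})

[Equation 6] H_2 = -\sum_{tag_{N_2}} p(tag_{N_2}) \log p(tag_{N_2})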

[0026] Here, p(tag_all) is the frequency probability (probability of occurrence) of the number of events for each part-of-speech tag before the split, and the sum over tag_all is taken over all part-of-speech tags before the split. p_1 is the sum of the frequency probabilities of the numbers of events of the part-of-speech tags that fall into node N_1 after the split. Further, p(tag_N1) is the frequency probability of the number of events for each part-of-speech tag of node N_1, and the sum over tag_N1 is taken over all part-of-speech tags of node N_1. Likewise, p(tag_N2) is the frequency probability of the number of events for each part-of-speech tag of node N_2, and the sum over tag_N2 is taken over all part-of-speech tags of node N_2.
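The entropies before and after a split can be computed directly from the tags of the events at a node. The sketch below assumes relative-frequency estimates, base-2 logarithms, and tag probabilities normalised within each node; the function names `entropy` and `split_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(tags: Sequence[str]) -> float:
    """Entropy of the tag distribution at one node (0 for an empty node)."""
    total = len(tags)
    return -sum((k / total) * math.log2(k / total) for k in Counter(tags).values())


def split_entropy(tags_n1: Sequence[str], tags_n2: Sequence[str]) -> float:
    """H = p1 * H1 + (1 - p1) * H2, with p1 the share of events in node N1.
    Assumes at least one event in total."""
    p1 = len(tags_n1) / (len(tags_n1) + len(tags_n2))
    return p1 * entropy(tags_n1) + (1 - p1) * entropy(tags_n2)


before = ["n", "n", "v", "v"]
h0 = entropy(before)
# A split that separates the tags perfectly removes all uncertainty.
print(h0 - split_entropy(["n", "n"], ["v", "v"]))
# A split that leaves both nodes mixed gains nothing.
print(h0 - split_entropy(["n", "v"], ["n", "v"]))
```

The attribute whose split yields the largest difference H_0 - H is the split candidate, which is equivalent to choosing the attribute with the smallest entropy H after the split.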

[0027] To compute the effectiveness, event information (hereinafter, an event), consisting of a tuple of an attribute, its attribute value, and a part of speech, is extracted in advance for each word of the training text data. Specifically, for the set of all events, the attribute that minimizes the entropy H after the split is found and assigned to the first node. The set of events is divided according to the value of this attribute, and the corresponding child nodes are created. The tree structure is built by repeating the same process at each child node. Splitting stops when the number of events contained in a node is at most a fixed number, or when the effectiveness of the split is at most a fixed criterion (that is, when the difference between the entropy H after the split and the entropy H_0 before the split does not exceed a predetermined amount). A node that is not split is called a leaf. At each leaf of the learned decision tree, the frequency probability of each part of speech is computed from the set of events assigned to that leaf.

[0028] The part-of-speech assignment system of this embodiment uses the Forward-Backward algorithm disclosed in Prior Art Document 4, L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process", Inequalities, Vol. 3, pp. 1-8, 1972, to perform smoothing on the basis of smoothing training data, such that the difference between the probabilities obtained from the smoothing training data and the probabilities obtained from the decision tree is minimized, and thereby corrects the final frequency probability distribution to be attached to the parts of speech. In addition, the system of this embodiment creates decision trees in two stages according to the decision tree learning algorithm above. The first stage is a decision tree for coarsely classified parts of speech (hereinafter GPOS, Global Part Of Speech), which correspond to one of the attributes of the actual part of speech and are classified, for example, into verb, noun, article, and so on. The second stage creates, for each GPOS, a decision tree that determines the actual part of speech (the part-of-speech tag level shown in Table 1). In this embodiment, the name at the more detailed part-of-speech level is called a part-of-speech tag. Generating the decision trees in two separate stages greatly reduces the storage capacity required for a single run of the process.

【００２９】[0029] In the part-of-speech assignment processing, the text data of the input sentence is processed from left to right, and the part-of-speech sequence that maximizes the joint probability is output. If the input sentence consists of N words w₁, w₂, …, w_N and a part-of-speech sequence {t₁, t₂, …, t_N} is obtained (where t_i is the part of speech of the i-th word), the joint probability P is given by the following equations. Note that this embodiment does not treat the occurrence of parts of speech as a Markov information source, but as an information source that depends on the words and parts of speech that have appeared so far. Therefore, in a sufficiently long sentence, it is possible in principle to derive the part of speech of the last word in dependence on the first word of the sentence and its part of speech.

【００３０】[0030]

【数７】(Equation 7) $P \equiv p(t_1, t_2, \ldots, t_N \mid w_1, w_2, \ldots, w_N)$

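【数８】(Equation 8) $P = p(t_1 \mid w_1, w_2, \ldots, w_N) \prod_{i=2}^{N} p(t_i \mid w_1, w_2, \ldots, w_N, t_1, t_2, \ldots, t_{i-1})$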

【００３１】[0031] The right-hand side of Equation 7 is the joint probability that the part-of-speech sequence t₁, t₂, …, t_N is assigned when the sentence w₁, w₂, …, w_N is input. The right-hand side of Equation 8 is the probability obtained by multiplying together, for i from 1 to N, the probability of the i-th part of speech given the input sentence w₁, w₂, w₃, …, w_N and the part-of-speech sequence t₁, t₂, …, t_{i-1} up to the (i−1)-th word. Here, the symbol Π denotes the product taken as i varies from 2 to N. Then, using context-dependent attributes, a leaf L of the decision tree is reached, the frequency probability distribution associated with L is denoted p_L, and the conditional distribution of the decision tree is used to make the following approximation.

【００３２】[0032]

【数９】(Equation 9) $L_i \equiv$ the leaf reached in the context $w_1, w_2, \ldots, w_N, t_1, t_2, \ldots, t_{i-1}$

【数１０】(Equation 10) $p(t_i \mid w_1, w_2, \ldots, w_N, t_1, t_2, \ldots, t_{i-1}) \approx p_{L_i}(t_i)$

【００３３】[0033] The context w₁, w₂, …, w_N, t₁, t₂, …, t_{i-1} in Equation 9 is the context of the i-th word w_i. The left-hand side of Equation 10 is the frequency probability, or occurrence probability, that the part of speech t_i follows the context w₁, w₂, …, w_N, t₁, t₂, …, t_{i-1}. Equation 10 states that this can be approximated by its right-hand side, the probability of the part of speech t_i at the leaf L_i. Therefore, the joint probability P to be maximized is as follows.

【００３４】[0034]

【数１１】(Equation 11) $P \approx \prod_{i=1}^{N} p_{L_i}(t_i)$

【００３５】[0035] As is clear from Equation 11, the joint probability P is expressed as the product of the probabilities of the parts of speech t_i obtained in dependence on the context at each word of the input sentence. Furthermore, the part-of-speech assignment processing for each word of the input sentence is carried out in the following two stages.
(a) The frequency probability of each GPOS category is calculated.
(b) Using the decision tree corresponding to each GPOS category, the frequency probability of each part-of-speech tag is calculated.
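The product of per-word leaf probabilities described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `joint_probability`, the tag names, and the probability values are all hypothetical, and the leaf distributions are assumed to have been looked up already for the hypothesis being scored. The sum is accumulated in log space, a common choice to avoid underflow on long sentences.

```python
import math

def joint_probability(leaf_dists, tags):
    """Approximate P as the product over i of p_{L_i}(t_i).

    leaf_dists[i] is the (hypothetical) frequency probability distribution
    of the leaf L_i reached for the i-th word; tags[i] is the tag t_i.
    """
    log_p = 0.0
    for dist, tag in zip(leaf_dists, tags):
        p = dist.get(tag, 0.0)
        if p == 0.0:
            return 0.0          # a tag the leaf never saw makes the product zero
        log_p += math.log(p)    # sum of logs instead of a product of small numbers
    return math.exp(log_p)

# Hypothetical leaf distributions for the three words of "meeting in London".
example_dists = [
    {"NN1": 0.7, "VVG": 0.3},
    {"II": 0.9, "RP": 0.1},
    {"NP1": 0.8, "NN1": 0.2},
]
example_p = joint_probability(example_dists, ["NN1", "II", "NP1"])
```

In the system described here the leaf L_i depends on the tags already chosen, so a different hypothesis would generally be scored against different leaf distributions.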

【００３６】[0036] In calculating the frequency probability for each word, all the part-of-speech sequences that may have been obtained up to that point must be considered. Because the search space becomes enormous when a fine-grained part-of-speech system is handled, this system searches for the part-of-speech sequence with the maximum frequency probability or occurrence probability using the stack decoder algorithm disclosed in prior art document 5 (F. Jelinek, "A fast sequential decoding algorithm using a stack", IBM Journal of Research and Development, No. 13, pp. 675-685, 1969) and prior art document 6 (D. Paul, "Algorithms for an optimal A* search and linearizing the search in the stack decoder", Proceedings of the June 1990 DARPA Speech and Natural Language Workshop, 1990). This algorithm is a kind of graph search algorithm that temporarily limits the search range by means of a threshold and can find the candidate with the best evaluation value. That is, selecting the part-of-speech sequence with the highest frequency probability from the multiple parts of speech that may be assigned to each word amounts to searching for the optimal path among the paths of a graph in which each part of speech is a node and the nodes assigned to adjacent words are connected. The stack decoder algorithm handles multiple nodes together as a stack structure on the paths of a tree structure divided in binary tree form, and by changing the search range within the stack structure it can find the optimal path efficiently.
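The search described above can be illustrated with a simplified best-first search over partial tag sequences. This sketch is written in the spirit of a stack decoder and is not the exact algorithm of prior art documents 5 and 6. The names `stack_decode`, `candidates`, `top_n`, and `min_joint`, as well as the toy distributions, are assumptions made for the example. Because every probability is at most 1, extending a hypothesis never raises its score, so the first complete hypothesis taken from the ordered stack has the maximum joint probability among those not pruned.

```python
import heapq
import math

def stack_decode(n_words, candidates, top_n=4, min_joint=0.0):
    """Return (tags, P) for the best tag sequence found.

    candidates(i, history) is a hypothetical callback returning a dict
    {tag: probability} for word i given the tags chosen so far.
    """
    stack = [(0.0, ())]                       # entries: (-log P, partial tag tuple)
    while stack:
        cost, tags = heapq.heappop(stack)     # most probable partial hypothesis
        if len(tags) == n_words:
            return list(tags), math.exp(-cost)
        dist = candidates(len(tags), tags)
        best = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for tag, p in best:
            if p <= 0.0:
                continue
            new_cost = cost - math.log(p)
            if math.exp(-new_cost) >= min_joint:   # threshold on the joint probability
                heapq.heappush(stack, (new_cost, tags + (tag,)))
    return [], 0.0                            # every hypothesis was pruned

def toy_candidates(i, history):
    """Toy model in which the third tag depends on the first (non-Markov)."""
    if i == 0:
        return {"NN1": 0.6, "VVG": 0.4}
    if i == 1:
        return {"II": 0.9, "RP": 0.1}
    return {"NP1": 1.0} if history[0] == "VVG" else {"NP1": 0.5, "NN1": 0.5}

best_tags, best_p = stack_decode(3, toy_candidates)
```

In the toy model the locally best first tag (NN1) does not lead to the best full sequence, which is why whole hypotheses rather than single words are compared.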

【００３７】[0037] FIG. 2 is a flowchart showing the decision tree learning process executed by the decision tree learning device of FIG. 1. In FIG. 2, first, in step S1, the part-of-speech-tagged text data stored in the part-of-speech-tagged text data memory 21 is read out and written to the RAM in the decision tree learning device 10. Next, in step S2, the frequency probabilities of the combinations of each attribute and part-of-speech tag (corresponding to p(tagall), p(tagN₁), and p(tagN₂) above) are calculated and written to the RAM in the decision tree learning device 10. Then, in step S3, the decision tree creation process is executed to generate a decision tree with frequency probabilities, and in step S4 the created decision tree with probabilities is output to the memory 24 and stored there.

【００３８】[0038] FIG. 3 is a flowchart showing the decision tree creation process (step S3), which is a subroutine of FIG. 2. First, in step S11, the entropy H after division by each of the attributes and the entropy H₀ before division are calculated using Equation 4 and Equation 3, respectively. Next, in step S12, the attribute with the largest entropy difference (H₀ − H) is selected as the candidate attribute for division, and in step S13 it is determined whether the selected attribute satisfies the division continuation criterion. The division continuation criterion is that (I) the entropy difference (H₀ − H) obtained when dividing on the selected attribute is equal to or greater than a predetermined entropy threshold Hth, and (II) the number of events after division on the selected attribute is equal to or greater than a predetermined event-count threshold Dth. If the criterion is satisfied in step S13, then in step S14 two nodes divided by the attribute value of the selected attribute are created, that is, the node is divided in binary tree form, and the decision tree is updated. In step S15, each of the newly created nodes is taken as the processing target, and the process returns to step S11 and repeats from there. If the criterion is not satisfied in step S13, the process returns to the main routine.
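Steps S11 to S15 can be sketched as a short recursive routine. Equations 3 and 4 are not reproduced in this part of the text, so the sketch assumes the usual Shannon entropy for H₀ and the event-weighted average of the child entropies for H. It also assumes yes/no attributes and reads criterion (II) as requiring each child to hold at least Dth events. The function names, the attribute names, and the default thresholds are hypothetical.

```python
import math
from collections import Counter

def entropy(events):
    """Shannon entropy (bits) of the tags in a list of (attributes, tag) events."""
    counts = Counter(tag for _, tag in events)
    total = len(events)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build_tree(events, attributes, h_th=0.01, d_th=1):
    h0 = entropy(events)                                  # S11: entropy before division
    best = None
    for attr in attributes:                               # S11: entropy after each division
        yes = [e for e in events if e[0].get(attr)]
        no = [e for e in events if not e[0].get(attr)]
        if not yes or not no:
            continue                                      # this attribute does not divide the node
        h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(events)
        if best is None or h0 - h > best[0]:              # S12: keep the largest H0 - H
            best = (h0 - h, attr, yes, no)
    # S13: division continuation criterion (I) and (II)
    if best and best[0] >= h_th and min(len(best[2]), len(best[3])) >= d_th:
        _, attr, yes, no = best                           # S14: divide in binary form
        return {"attr": attr,
                "yes": build_tree(yes, attributes, h_th, d_th),   # S15: recurse on children
                "no": build_tree(no, attributes, h_th, d_th)}
    counts = Counter(tag for _, tag in events)            # leaf: frequency probability per tag
    return {"leaf": {t: c / len(events) for t, c in counts.items()}}

toy_events = [
    ({"capitalized": True, "ends_ing": False}, "NP1"),
    ({"capitalized": True, "ends_ing": False}, "NP1"),
    ({"capitalized": False, "ends_ing": True}, "VVG"),
    ({"capitalized": False, "ends_ing": False}, "NN1"),
]
toy_tree = build_tree(toy_events, ["capitalized", "ends_ing"])
```

On the toy events the root is divided on `capitalized` (entropy difference 1.0 bit against about 0.81 bit for `ends_ing`), and the lowercase branch is then divided on `ends_ing`.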

【００３９】[0039] FIG. 6 shows an example of a created decision tree with frequency probabilities. As shown in FIG. 6, the decision tree has a tree structure divided in binary tree form at each of the attributes 101 to 105, and frequency probabilities for the part-of-speech tags are attached to the final leaves. In this example, when the input sentence is "meeting in London", the part-of-speech tag NN1INTER-ACT (singular common noun, interaction) is assigned to the word "meeting", and the part-of-speech tag NP1CITYNM (proper noun, city name) is assigned to the word "London".

【００４０】[0040] FIG. 4 is a flowchart showing the part-of-speech assignment process executed by the part-of-speech assignment device of FIG. 1. In FIG. 4, first, in step S21, the decision tree file with frequency probabilities stored in the decision tree file memory 24 is read out and written to the RAM in the part-of-speech assignment device 11. Next, in step S22, the text data to be analyzed, stored in the text data memory 25, is read out and written to the RAM in the part-of-speech assignment device 11. Then, in step S23, the part-of-speech assignment analysis process is executed to generate part-of-speech-tagged text data, which in step S24 is output and written to the part-of-speech-tagged text data memory 26.

【００４１】[0041] FIG. 5 is a flowchart showing the part-of-speech assignment analysis process (step S23), which is a subroutine of FIG. 4. First, in step S31, the first word of the text data to be analyzed is taken as the target word. Next, in step S32, the root node at the top of the decision tree is taken as the current node. In step S33, it is determined whether the current node is a leaf node. If NO, in step S38 the child node corresponding to the attribute value of the current node becomes the current node, and the process returns to step S33. If YES in step S33, then in step S34 the top n frequency probabilities (where n is plural, preferably 3 to 6 and more preferably 4) are selected from the frequency probability list assigned to the leaf node and given to the target word. In step S35, the part-of-speech tag candidates are limited according to the stack decoder algorithm described above, retaining only those whose joint probability P is equal to or greater than a predetermined joint probability. In step S36, it is determined whether there is a next word to process. If there is, in step S37 the next word becomes the target word, and the process returns to step S32 and repeats. If there is no next word in step S36, then in step S39 the part-of-speech tag sequence with the maximum joint probability P is determined to be the correct part-of-speech sequence, and the process returns to the main routine.
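Steps S32 to S34 for a single word can be sketched as follows, assuming a tree stored as nested dictionaries with yes/no attributes. The dictionary layout, the attribute names, the tags, and the probabilities are hypothetical and chosen only for illustration; the default n = 4 follows the preferred value given above.

```python
def find_leaf(tree, features):
    """Walk from the root to a leaf and return its frequency probability list."""
    node = tree                                  # S32: the root is the current node
    while "leaf" not in node:                    # S33: is the current node a leaf?
        branch = "yes" if features.get(node["attr"]) else "no"
        node = node[branch]                      # S38: move to the matching child
    return node["leaf"]

def top_n_tags(leaf, n=4):
    """S34: keep the n most probable tags of the leaf, best first."""
    return sorted(leaf.items(), key=lambda kv: kv[1], reverse=True)[:n]

toy_tree = {
    "attr": "capitalized",
    "yes": {"leaf": {"NP1": 0.8, "NN1": 0.15, "JJ": 0.05}},
    "no": {
        "attr": "ends_ing",
        "yes": {"leaf": {"VVG": 0.6, "NN1": 0.4}},
        "no": {"leaf": {"NN1": 1.0}},
    },
}
meeting_leaf = find_leaf(toy_tree, {"capitalized": False, "ends_ing": True})
meeting_candidates = top_n_tags(meeting_leaf, n=4)
```

The candidates returned for each word are what the subsequent search in step S35 prunes by joint probability.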

【００４２】[0042]

【実施例】EXAMPLES The present inventors carried out the following experiments using the part-of-speech assignment system configured as described above. The experiments compare the differences due to different part-of-speech systems and the differences due to how the training text data is collected. Three part-of-speech systems, shown in Table 3, were used. UPenn is the system used in the treebank data provided by the University of Pennsylvania. ATR Full is the system of the treebank created by the present applicant, which uses a part-of-speech system that includes semantic categories. ATR Syntax is a subset part-of-speech system obtained by removing the semantic categories from ATR Full. For the training text data, experiments were run with data collected at random sentence by sentence (selection "sentence" in Table 4) and with data collected at random document by document (selection "document" in Table 4).

【００４３】[0043]

【表３】[Table 3] Part-of-speech systems handled
───────────────────────────────────
Part-of-speech system   Number of tags   Training data                    Test data
───────────────────────────────────
UPenn                   48               approx. 1,000,000 words          50,000 words
ATR Syntax              441              approx. 300,000 words (400,000)  60,000 words (10,000)
ATR Full                2600             approx. 300,000 words (400,000)  60,000 words (10,000)
───────────────────────────────────

【００４４】[0044] For UPenn, the text data was taken from the Wall Street Journal, with slightly more than one million words of training text data and 50,000 words of test text data. For ATR Syntax and ATR Full, sentence-level selection used 400,000 words of training text data and 10,000 words of test text data, while document-level selection used approximately 300,000 words of training text data and 60,000 words of test text data. Table 4 shows the experimental results.

【００４５】[0045]

【表４】[Table 4] Accuracy (%) of part-of-speech assignment according to this embodiment
───────────────────────────────────
Part-of-speech system   Selection   ALL    KW: ALL   KW: KT   KW: UT   UW
───────────────────────────────────
UPenn                   Sentence    96.0   96.7      99.6     61.0     91.9
ATR Syntax              Sentence    92.6   94.7      95.2     52.2     82.9
ATR Syntax              Document    90.8   93.8      94.6     41.2     79.6
ATR Full                Sentence    76.5   79.4      83.6     8.5      63.7
ATR Full                Document    71.8   76.8      81.7     8.2      53.9
───────────────────────────────────
(Note) KW: known word, UW: unknown word, KT: known part of speech, UT: unknown part of speech

【００４６】[0046] In Table 4, accuracy was examined separately for known words KW, which appeared in the training text data, and unknown words UW, which did not. Among the known words KW, the case in which the combination of the word and its correct part of speech appears in the training text data is labelled known part of speech KT, and the case in which it does not appear is labelled unknown part of speech UT.
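The three categories can be stated compactly in code. The following sketch assumes the training and test data are lists of (word, correct tag) pairs; the function name `categorize` and the sample pairs are hypothetical.

```python
def categorize(train_pairs, test_pairs):
    """Label each test token as 'UW', 'KT', or 'UT' using the definitions above."""
    known_words = {word for word, _ in train_pairs}
    known_pairs = set(train_pairs)
    labels = []
    for word, tag in test_pairs:
        if word not in known_words:
            labels.append("UW")      # word never seen in the training data
        elif (word, tag) in known_pairs:
            labels.append("KT")      # known word, correct tag seen with it
        else:
            labels.append("UT")      # known word, correct tag never seen with it
    return labels

toy_train = [("meeting", "NN1"), ("meeting", "VVG"), ("in", "II")]
toy_test = [("meeting", "NN1"), ("in", "RP"), ("London", "NP1")]
toy_labels = categorize(toy_train, toy_test)
```

Accuracy per column of Table 4 would then be the share of correctly tagged tokens within each label, with KW: ALL covering both KT and UT tokens.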

【００４７】[0047] As is clear from Table 4, the overall accuracy exceeds the baseline accuracy, that is, the accuracy obtained by assigning to each word the part of speech most frequently assigned to it in the training data. The baseline values are as follows. UPenn: 89% with sentence selection. ATR Syntax: 83% with sentence selection (82% with document selection). ATR Full: 69% with sentence selection (69% with document selection). For ATR Syntax and ATR Full, accuracy is lower with document-level selection than with sentence-level selection. This could be due to the difference in the amount of training data. It may also be that documents have coherent characteristics of their own (such as the distribution of words used and turns of phrase), so that differences in characteristics between the training text data and the test text data, which have no effect when the training data is selected sentence by sentence, come into play.
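The baseline described above (assigning each word its most frequent training tag) can be sketched as follows. The text does not say how the baseline treats words absent from the training data, so the `default_tag` parameter is an assumption, as are the function names and the sample data.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Map each word to the tag most frequently assigned to it in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def baseline_accuracy(model, tagged_words, default_tag=None):
    """Share of test tokens whose baseline tag equals the correct tag."""
    correct = sum(1 for word, tag in tagged_words
                  if model.get(word, default_tag) == tag)
    return correct / len(tagged_words)

toy_train = [("meeting", "NN1"), ("meeting", "NN1"), ("meeting", "VVG"), ("in", "II")]
toy_test = [("meeting", "NN1"), ("meeting", "VVG"), ("in", "II"), ("London", "NP1")]
toy_model = train_baseline(toy_train)
toy_accuracy = baseline_accuracy(toy_model, toy_test)
```

With `default_tag=None`, every unknown word counts as an error, which gives a pessimistic baseline; passing the most common tag of the whole corpus is another plausible choice.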

【００４８】[0048] The embodiment above describes an English part-of-speech assignment system, but the present invention is not limited to this and can also be applied to a Japanese part-of-speech assignment system.

【００４９】[0049] As described above, according to this embodiment, the connection relations between parts of speech, the relations between words and parts of speech, and the dependencies on distant words or parts of speech are processed statistically, so parts of speech can be assigned automatically, uniquely, and with high accuracy. In addition, because part-of-speech labels are assigned to words without using a dictionary, the special handling of unknown words that is a problem in the prior art is unnecessary. Furthermore, because learning is performed using text data to which parts of speech have been assigned, many part-of-speech systems can be handled flexibly, for example by introducing priorities among multiple division candidates.

【００５０】[0050]

【発明の効果】[Effects of the Invention] As described in detail above, the part-of-speech assignment device according to claim 1 of the present invention comprises: a first storage device that stores an attribute list having a plurality of attributes, including features of the spelling of each word, features of its usage within the text, and a hierarchical classification using the mutual information of words; a second storage device that stores a part-of-speech list including a plurality of parts of speech; a third storage device that stores the generated decision tree with frequency probabilities; decision tree learning means that, on the basis of part-of-speech-tagged text data consisting of word strings and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, generates a decision tree for part-of-speech assignment having a binary tree structure that is divided depending on the attribute values of the attributes, calculates and attaches frequency probabilities for a plurality of parts of speech to the leaf nodes, which are the undivided nodes of the generated decision tree, and thereby generates a decision tree with frequency probabilities and stores it in the third storage device; and part-of-speech assignment means that, using the decision tree with frequency probabilities stored in the third storage device and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, selects, on the basis of input text data consisting of a word string, the top n (a plurality) of the frequency probabilities attached to the leaf node and assigns them to each word of the text data, and determines and outputs, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word string of the text data. Accordingly, the connection relations between parts of speech, the relations between words and parts of speech, and the dependencies on distant words or parts of speech are processed statistically, so parts of speech can be assigned automatically, uniquely, and with high accuracy. In addition, because part-of-speech labels are assigned to words without using a dictionary, the special handling of unknown words that is a problem in the prior art is unnecessary. Furthermore, because learning is performed using text data to which parts of speech have been assigned, many part-of-speech systems can be handled flexibly, for example by introducing priorities among multiple division candidates.

【００５１】[0051] In the part-of-speech assignment device according to claim 2, in the part-of-speech assignment device according to claim 1, the decision tree learning means, when dividing in binary tree form, selects as the candidate attribute for division the attribute with the largest difference (H₀ − H) between the entropy H₀ before division by each attribute, which represents the priority of the effectiveness of the attribute, and the entropy H after division, and, when a predetermined division continuation criterion is satisfied, divides in binary tree form and updates the decision tree. Accordingly, the connection relations between parts of speech, the relations between words and parts of speech, and the dependencies on distant words or parts of speech are processed statistically, so parts of speech can be assigned automatically, uniquely, and more accurately. In addition, because part-of-speech labels are assigned to words without using a dictionary, the special handling of unknown words that is a problem in the prior art is unnecessary. Furthermore, because learning is performed using text data to which parts of speech have been assigned, many part-of-speech systems can be handled flexibly, for example by introducing priorities among multiple division candidates.

【００５２】[0052] Further, in the part-of-speech assignment device according to claim 3, in the part-of-speech assignment device according to claim 2, the division continuation criterion is that (I) the entropy difference (H₀ − H) obtained when dividing on the selected attribute is equal to or greater than a predetermined entropy threshold Hth, and (II) the number of events, each being a set of an attribute, its attribute value, and a part of speech, after division on the selected attribute is equal to or greater than a predetermined event-count threshold Dth. Accordingly, the connection relations between parts of speech, the relations between words and parts of speech, and the dependencies on distant words or parts of speech are processed statistically, so parts of speech can be assigned automatically, uniquely, and more accurately. In addition, because part-of-speech labels are assigned to words without using a dictionary, the special handling of unknown words that is a problem in the prior art is unnecessary. Furthermore, because learning is performed using text data to which parts of speech have been assigned, many part-of-speech systems can be handled flexibly, for example by introducing priorities among multiple division candidates.

【００５３】[0053] Still further, in the part-of-speech assignment device according to claim 4, in the part-of-speech assignment device according to claim 1, 2, or 3, the part-of-speech assignment means selects the top n (a plurality) of the frequency probabilities attached to the leaf node and assigns them to each word of the text data, then, using a predetermined stack decoder algorithm, limits the part-of-speech candidates by retaining only those whose joint probability over the word string of the text data processed so far is equal to or greater than a predetermined joint probability, and determines as the correct part-of-speech sequence the part-of-speech sequence having the maximum joint probability over the word string of the text data at the end of processing. Accordingly, the connection relations between parts of speech, the relations between words and parts of speech, and the dependencies on distant words or parts of speech are processed statistically, so parts of speech can be assigned automatically, uniquely, and more accurately. In addition, because part-of-speech labels are assigned to words without using a dictionary, the special handling of unknown words that is a problem in the prior art is unnecessary. Furthermore, because learning is performed using text data to which parts of speech have been assigned, many part-of-speech systems can be handled flexibly, for example by introducing priorities among multiple division candidates.

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram of a part-of-speech assignment system comprising a decision tree learning device and a part-of-speech assignment device, according to an embodiment of the present invention.

【図２】FIG. 2 is a flowchart showing the decision tree learning process executed by the decision tree learning device of FIG. 1.

【図３】FIG. 3 is a flowchart showing the decision tree creation process (step S3), which is a subroutine of FIG. 2.

【図４】FIG. 4 is a flowchart showing the part-of-speech assignment process executed by the part-of-speech assignment device of FIG. 1.

【図５】FIG. 5 is a flowchart showing the part-of-speech assignment analysis process (step S23), which is a subroutine of FIG. 4.

【図６】FIG. 6 is a diagram showing an example of a decision tree in the decision tree file with frequency probabilities created by the decision tree learning device of FIG. 1.

[Explanation of symbols]

10: decision tree learning device, 11: part-of-speech assignment device, 21: part-of-speech-tagged text data memory, 22: attribute list memory, 23: part-of-speech tag list memory, 24: decision tree file memory with probabilities, 25: text data memory, 26: part-of-speech-tagged text data memory.

Continuation of the front page
(72) Inventor: Stephen G. Eubank, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(56) References: JP-A-8-55122 (JP, A); JP-A-4-141771 (JP, A); JP-A-1-137296 (JP, A); Eric Brill, "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging", Computational Linguistics, Vol. 21, No. 4, pp. 543-565 (1995)
(58) Fields searched (Int. Cl.⁷, DB name): G06F 17/20 - 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. A part-of-speech assignment device comprising: a first storage device that stores an attribute list having a plurality of attributes, including features of the spelling of each word, features of its usage within the text, and a hierarchical classification using the mutual information of words; a second storage device that stores a part-of-speech list including a plurality of parts of speech; a third storage device that stores a generated decision tree with frequency probabilities; decision tree learning means that, on the basis of part-of-speech-tagged text data consisting of word strings and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, generates a decision tree for part-of-speech assignment having a binary tree structure that is divided depending on the attribute values of the attributes, calculates and attaches frequency probabilities for a plurality of parts of speech to the leaf nodes, which are the undivided nodes of the generated decision tree, and thereby generates a decision tree with frequency probabilities and stores it in the third storage device; and part-of-speech assignment means that, using the decision tree with frequency probabilities stored in the third storage device and with reference to the attribute list stored in the first storage device and the part-of-speech list stored in the second storage device, selects, on the basis of input text data consisting of a word string, the top n (a plurality) of the frequency probabilities attached to the leaf node and assigns them to each word of the text data, and determines and outputs, as the correct part-of-speech sequence, the part-of-speech sequence having the maximum joint probability over the word string of the text data.

2. The part-of-speech assignment device according to claim 1, wherein the decision tree learning means, when dividing in binary tree form, selects as the candidate attribute for division the attribute with the largest difference (H₀ − H) between the entropy H₀ before division by each attribute, which represents the priority of the effectiveness of the attribute, and the entropy H after division, and, when a predetermined division continuation criterion is satisfied, divides in binary tree form and updates the decision tree.

3. The part-of-speech assignment device according to claim 2, wherein the division continuation criterion is that (I) the entropy difference (H₀ − H) obtained when dividing on the selected attribute is equal to or greater than a predetermined entropy threshold Hth, and (II) the number of events, each being a set of an attribute, its attribute value, and a part of speech, after division on the selected attribute is equal to or greater than a predetermined event-count threshold Dth.

4. The part-of-speech assignment device according to claim 1, 2, or 3, wherein the part-of-speech assignment means selects the top n (a plurality) of the frequency probabilities attached to the leaf node and assigns them to each word of the text data, then, using a predetermined stack decoder algorithm, limits the part-of-speech candidates by retaining only those whose joint probability over the word string of the text data processed so far is equal to or greater than a predetermined joint probability, and determines as the correct part-of-speech sequence the part-of-speech sequence having the maximum joint probability over the word string of the text data at the end of processing.