JP3379643B2

JP3379643B2 - Morphological analysis method and recording medium storing morphological analysis program

Info

Publication number: JP3379643B2
Application number: JP2000082551A
Authority: JP
Inventors: 久子浅野; 永小原
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2000-03-23
Filing date: 2000-03-23
Publication date: 2003-02-24
Anticipated expiration: 2020-03-23
Also published as: JP2001265763A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a Japanese morphological analysis method that segments Japanese text into words and assigns information such as readings and parts of speech.

[0002]

[Prior Art] Word identification in Japanese morphological analysis commonly proceeds as follows. Using a word dictionary and grammar rules, words that satisfy connectable part-of-speech chains are extracted. Priorities are assigned to the part-of-speech chains and the candidates are narrowed down by chain. Where homographs exist, the candidates are narrowed further by word priority, and the word chain that becomes the first solution is finally extracted.

[0003] Established techniques for assigning priorities include the minimum bunsetsu number method, which prefers chains with fewer bunsetsu (phrase units) (Yoshimura et al., "Morphological analysis of unsegmented Japanese text using the minimum bunsetsu number method," Transactions of the Information Processing Society of Japan, Vol. 24, No. 1, 1983); the minimum cost method, which assigns costs to word adjacency rules and prefers the chain with the smallest total (Hisamitsu et al., "A proposal of morphological analysis by the minimum connection cost method and an evaluation of its computational complexity," IEICE Technical Report, NLC90-8, 1990); and combinations of these.

[0004] With these conventional methods, however, the word with the most common reading is chosen as the first solution even when several readings are possible, so local proper nouns (for example, the names of colleagues) may not be identified correctly. For example, ordinary morphological analysis normally identifies "中島健司" with the most common readings, "中島 (Nakajima) 健司 (Kenji)", but the correct identification may in fact be "中島 (Nakashima) 健司 (Takeshi)".

[0005] In addition, many proper nouns share their written form with common nouns, and it is difficult to decide which is meant without considering context (for example, "野原の町長", "the mayor of Nohara", where 野原 is a place name, versus "ススキの野原", "a field of pampas grass", where 野原 is a common noun). Ordinary morphological analysis generally fixes such a word to one of the two parts of speech.

[0006]

[Problems to be Solved by the Invention] Conventional Japanese morphological analysis methods cannot assign the correct reading to a word with an uncommon reading. When they are used for text-to-speech synthesis, for example, this leads to problems such as personal names being misread.

[0007] Furthermore, a word whose part of speech is hard to distinguish is fixed to one of the candidate parts of speech. For example, even when the text is known to be a description of a certain product, the product name may be treated as a common noun.

[0008] An object of the present invention is to provide a morphological analysis method that improves word identification accuracy.

[0009]

[Means for Solving the Problems] The morphological analysis method of the present invention comprises a word dictionary search step and a word selection step. In the word dictionary search step, a data processing device receives Japanese text from an input device; extracts, from a word dictionary stored in a storage device (in which the words to be identified by morphological analysis and their word information are registered), the connectable word chains described in grammar rules stored in a storage device (the rules used to generate part-of-speech chains and to assign priorities); creates a word chain candidate sequence; and outputs it to a storage device. In the word selection step, for a character string for which word information has been specified, the data processing device identifies as the first solution the word that satisfies the specified word information and, for the word information that was not specified, has the most probable information; registers the word for which word information was given, in accordance with the designation, in setting information stored in a storage device (which describes how the target character string and the word information to be given are designated in the Japanese text); and outputs a Japanese word information sequence to an output device.

The word selection step includes the following steps, each performed by the data processing device. In a first step, priorities are assigned to the word chain candidate sequence and the chain with the highest priority is taken as the first solution. In a second step, it is determined whether the Japanese text contains a target character string, that is, a character string to which word information has been given or a word registered in the registered word information of the setting information. In a third step, when a target character string exists, it is determined whether a word that exactly matches the target character string exists in the part-of-speech chain ranked first. In a fourth step, if no exactly matching word exists, it is determined whether the word chain candidate sequence contains a word whose word boundaries coincide with those of the target character string and, if a part of speech is specified, whose part of speech also matches. In a fifth step, if a word whose word boundaries coincide with those of the target character string exists, then, when a part of speech is specified for the target character string, a word that matches that part of speech and also satisfies the other specified word information is selected if one exists, and if the other specified word information differs, the word is duplicated together with its word chain and the other specified word information is overwritten; when no part of speech is specified for the target character string, the word of the highest-priority word chain among those found in the fourth step is duplicated together with its part-of-speech chain and the other specified word information is overwritten. In a sixth step, when a part of speech is specified for the target character string, then among the part-of-speech chains that have the boundaries of the target character string as word boundaries, the part-of-speech chains before and after the highest-priority chain that can connect to the specified part of speech at those boundaries are copied, the target character string is filled in as a single word with the specified word information, word information that is not specified is generated from the words of the source part-of-speech chain (where there are several, from the one with the highest word priority), and a chain with an unknown word is generated if nothing connectable exists; when no part of speech is specified, the highest-priority word chain among the part-of-speech chains that have the boundaries of the target character string as word boundaries is copied, the target character string is filled in as a single word with the specified word information, and word information that is not specified is generated from the information of the original words. In a seventh step, the priority of the word handled in the third, fifth, or sixth step is raised so that it is selected in first place.

[0010]

[Embodiments of the Invention] Next, embodiments of the present invention are described with reference to the drawings.

[0011] FIG. 1 is a flowchart showing a morphological analysis method according to one embodiment of the present invention. Japanese text 1 is input, word dictionary search processing 2-1 is performed to output a word chain candidate sequence 2-2, and word selection processing 2-3 is performed to output a Japanese word information sequence 3. The processing uses grammar rules 4, a word dictionary 5, and setting information 6.

[0012] Japanese text 1 is arbitrary Japanese text, which may have word information attached. Word information here means the information that should be registered in the word dictionary 5; it generally covers many items, such as written form, part of speech, reading, semantic attributes, and accent type. Because the Japanese text 1 under analysis naturally contains the written form, the word information attached to it is anything other than the written form.

[0013] The word information designation pattern used to attach word information can be set freely in the word information designation pattern of the setting information 6. For example, FIG. 2(a) specifies that the character string to which word information is attached is enclosed in "｛｝", its part of speech in "＜＞", and its reading in "（）". When designating word information, the target character string and at least one item of word information are required, but not every item registered in the designation pattern has to be written. In the example of FIG. 2, both part of speech and reading are specified for the target character string "中島", whereas only the reading is specified for "健司".

[0014] When word information is designated by a character string pattern in this way, any character used in the pattern that should itself be subject to morphological analysis is preceded by "＼". For example, to have 欧州連合（ＥＵ） analyzed while using the designation pattern of FIG. 2(a), the input Japanese text is written as 欧州連合＼（ＥＵ＼）.
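The designation pattern and the escape rule described above can be sketched as a small scanner. This is only an illustration under stated assumptions: the function name `parse_annotated`, the `Target` record, and the rule that word information must follow its target directly are all hypothetical; the delimiters default to the full-width characters that appear in this text and would in practice come from the setting information.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Target:
    surface: str                   # target character string
    start: int                     # offset in the plain (unannotated) text
    pos: Optional[str] = None      # part of speech, if specified
    reading: Optional[str] = None  # reading, if specified


def parse_annotated(
    text: str,
    target: Tuple[str, str] = ("｛", "｝"),
    pos: Tuple[str, str] = ("＜", "＞"),
    reading: Tuple[str, str] = ("（", "）"),
    escape: str = "＼",
) -> Tuple[str, List[Target]]:
    """Return the plain text and the targets designated in it (hypothetical helper)."""
    n = len(text)

    def read_until(close: str, j: int) -> Tuple[str, int]:
        # Collect characters up to the closing delimiter, honouring escapes.
        buf = []
        while j < n and text[j] != close:
            if text[j] == escape and j + 1 < n:
                j += 1
            buf.append(text[j])
            j += 1
        if j >= n:
            raise ValueError(f"missing closing {close!r}")
        return "".join(buf), j + 1

    plain = ""
    targets: List[Target] = []
    i = 0
    while i < n:
        ch = text[i]
        if ch == escape and i + 1 < n:
            plain += text[i + 1]  # an escaped pattern character is ordinary text
            i += 2
        elif ch == target[0]:
            surface, i = read_until(target[1], i + 1)
            item = Target(surface, start=len(plain))
            # Assumption: word information follows the target directly, in any order.
            while i < n and text[i] in (pos[0], reading[0]):
                if text[i] == pos[0]:
                    item.pos, i = read_until(pos[1], i + 1)
                else:
                    item.reading, i = read_until(reading[1], i + 1)
            if item.pos is None and item.reading is None:
                raise ValueError("a target needs at least one item of word information")
            targets.append(item)
            plain += surface
        else:
            plain += ch
            i += 1
    return plain, targets
```

With these defaults, an input such as ｛中島｝＜固有名詞：姓＞（なかしま）｛健司｝（たけし）さん would yield the plain text 中島健司さん plus two targets, while 欧州連合＼（ＥＵ＼） yields 欧州連合（ＥＵ） with no targets.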

[0015] FIG. 2(b) shows word information designated in XML. The target character string is written as a <word> element, its reading as the yomi attribute of the <word> element, and its part of speech as the hinshi attribute of the <word> element. As in FIG. 2(a), it is sufficient to specify at least one attribute on the <word> element.
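The XML designation can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the exact markup of FIG. 2(b) is not reproduced in this text, so the sample input, the wrapper element `<s>`, and the function name `parse_xml_annotated` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def parse_xml_annotated(sentence_xml: str) -> Tuple[str, List[Dict[str, object]]]:
    """Return the plain text and the <word> targets of one sentence (hypothetical helper)."""
    # Wrap the sentence so that mixed text and <word> elements form one document.
    root = ET.fromstring(f"<s>{sentence_xml}</s>")
    plain = root.text or ""
    targets: List[Dict[str, object]] = []
    for el in root:
        if el.tag != "word":
            raise ValueError(f"unexpected element <{el.tag}>")
        surface = "".join(el.itertext())
        info = {k: el.get(k) for k in ("yomi", "hinshi") if el.get(k) is not None}
        if not info:
            raise ValueError("<word> needs at least one of yomi or hinshi")
        targets.append({"surface": surface, "start": len(plain), **info})
        # Text after the closing tag belongs to the sentence, not to the target.
        plain += surface + (el.tail or "")
    return plain, targets
```

A call such as `parse_xml_annotated('<word hinshi="固有名詞：姓" yomi="なかしま">中島</word><word yomi="たけし">健司</word>さん')` would return the same plain text and targets as the bracket pattern, which keeps the later selection steps independent of the notation chosen.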

[0016] Returning to FIG. 1, morphological analysis 2 is performed sentence by sentence.

[0017] In word dictionary search processing 2-1, all connectable word chains described in grammar rules 4 are extracted from the word dictionary 5, and the word chain candidate sequence 2-2 is output.

[0018] The word chain candidate sequence 2-2 consists of part-of-speech chains and words. For example, in FIG. 5 the items joined by lines are part-of-speech chains (part-of-speech names appear inside <>), and each part-of-speech chain holds one or more words.

[0019] Next, word selection processing 2-3 is described in detail with reference to FIG. 3.

[0020] In step 10, priorities are assigned to the word chain candidate sequence 2-2, and the word chain with the highest priority (the highest-priority word within the highest-priority part-of-speech chain) is taken as the first solution.

[0021] In step 11, it is determined whether Japanese text 1 contains a character string to which word information has been given, or a word registered in the registered word information of the setting information 6 (hereinafter abbreviated as "target character string"). If one exists, processing moves to step 12. If several target character strings exist, the processing from step 12 onward is applied to each of them in turn. If none exists, processing ends.

[0022] In step 12, it is determined whether a word that exactly matches the target character string exists in the part-of-speech chain ranked first in step 10. If so, processing moves to step 16; if not, it moves to step 13.

[0023] Here, "exact match" is explained more specifically.

[0024] "Exact match" covers not only a word whose word boundaries coincide exactly with those of the target character string and which satisfies all the specified word information, but also a word whose word boundaries differ from those of the target character string yet which satisfies all the specified word information.

[0025] For example, suppose "｛山田｝（やまだ）太郎" (using the designation pattern of FIG. 2(a)) is input as Japanese text 1, so that only the reading of "山田" is specified, and the first part-of-speech chain contains the word "山田太郎 (reading = やまだたろう)". The reading from the start of that word matches "やまだ" and no other word information is specified, so this word is treated as an exactly matching word.

[0026] Conversely, suppose "｛山田太郎｝（やまだたろう）" is input as Japanese text 1 and the word chain "山田 (reading = やまだ)" + "太郎 (reading = たろう)" in the word chain candidate sequence 2-2 is contained in the first part-of-speech chain. The written form of these two words matches the target character string and the readings also match, so this word chain is treated as an exactly matching word (chain).
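The three situations that count as an exact match (identical boundaries, a longer word that begins at the target, and a run of words that jointly spans the target) can be sketched as follows. This is a simplified reading of the two examples above, not the patented procedure itself: the dictionary keys, the function name `exact_match`, and the decision to apply the two boundary-relaxed cases only when a reading alone is specified are assumptions.

```python
from typing import Dict, List, Optional

Word = Dict[str, object]  # keys assumed here: start, surface, reading, pos


def exact_match(target: Dict[str, object], chain: List[Word]) -> Optional[List[Word]]:
    """Return the word(s) on the first chain that exactly match the target, or None."""
    start = target["start"]
    end = start + len(target["surface"])
    reading = target.get("reading")
    pos = target.get("pos")

    for w in chain:
        w_end = w["start"] + len(w["surface"])
        # Case 1: same boundaries and every specified item is satisfied.
        if w["start"] == start and w_end == end:
            if reading in (None, w["reading"]) and pos in (None, w["pos"]):
                return [w]
        # Case 2: a longer word starts at the target and its reading
        # begins with the specified reading.
        if w["start"] == start and w_end > end and pos is None:
            if reading is not None and w["reading"].startswith(reading):
                return [w]

    # Case 3: consecutive words inside the target jointly reproduce its
    # written form and its specified reading. The chain is assumed to be
    # ordered by position and free of overlaps.
    run = [w for w in chain
           if start <= w["start"] and w["start"] + len(w["surface"]) <= end]
    if len(run) > 1 and pos is None and reading is not None:
        if ("".join(w["surface"] for w in run) == target["surface"]
                and "".join(w["reading"] for w in run) == reading):
            return run
    return None
```

Returning the matched words rather than a boolean lets the caller promote exactly those words when the priority is raised later in the flow.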

[0027] In step 13, it is determined whether the words of the word chain candidate sequence 2-2 include a word whose word boundaries coincide with those of the target character string and, if a part of speech is specified, whose part of speech also matches. If such a word exists, processing moves to step 14; otherwise it moves to step 15.

[0028] In step 14, when a part of speech is specified for the target character string, if there is a word that matches that part of speech and also satisfies the other specified word information, that word is selected. If the other specified word information differs, the word is duplicated together with its word chain and the other specified word information is overwritten on the copy.

[0029] When no part of speech is specified for the target character string, the word of the highest-priority word chain among those found in step 13 is duplicated together with its part-of-speech chain, and the other specified word information is overwritten on the copy.
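The select-or-duplicate behaviour of step 14 can be illustrated with a short sketch. Several simplifications are assumed here and are not taken from the source: the reading stands in for all "other specified word information", a smaller `priority` number means higher priority, and the names `Candidate` and `select_or_duplicate` are hypothetical. Copying the surrounding part-of-speech chain is omitted.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    surface: str
    reading: str
    pos: str
    priority: int  # assumption: smaller value = higher priority


def select_or_duplicate(
    candidates: List[Candidate],
    reading: Optional[str] = None,
    pos: Optional[str] = None,
) -> Tuple[Candidate, bool]:
    """Return (word, created) for boundary-matching candidates found in step 13."""
    if pos is not None:
        # Part of speech specified: prefer an existing word that already
        # satisfies the remaining specified information.
        pool = sorted((c for c in candidates if c.pos == pos), key=lambda c: c.priority)
        for c in pool:
            if reading is None or c.reading == reading:
                return c, False
        base = pool[0]  # step 13 guarantees at least one such candidate
    else:
        # No part of speech specified: start from the highest-priority word.
        base = min(candidates, key=lambda c: c.priority)
    # Duplicate the word and overwrite the specified information on the copy.
    new_reading = reading if reading is not None else base.reading
    return replace(base, reading=new_reading), True
```

Because the copy is a new object, the original dictionary word stays in the candidate sequence unchanged, which matches the idea of adding a word rather than editing the dictionary.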

[0030] In step 15, when a part of speech is specified for the target character string, then among the part-of-speech chains that have the boundaries of the target character string as word boundaries, the part-of-speech chains before and after the highest-priority chain that can connect to the specified part of speech at those boundaries are copied, and the target character string is filled in as a single word with the specified word information. Word information that is not specified is generated from the words of the source part-of-speech chain (where there are several, from the one with the highest word priority); readings, for example, are all concatenated. If nothing connectable exists, a chain with an unknown word is generated.

[0031] When no part of speech is specified, the highest-priority word chain among the part-of-speech chains that have the boundaries of the target character string as word boundaries is copied, and the target character string is filled in as a single word with the specified word information. Word information that is not specified is generated from the information of the original words.
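The way step 15 fills in a new single word can be sketched for the one rule the text states explicitly, namely that unspecified readings are formed by concatenating the readings of the source words. The function name `synthesize_word` and the dictionary layout are hypothetical, and because the text does not spell out how an unspecified part of speech would be derived, this sketch simply leaves it as `None`.

```python
from typing import Dict, List, Optional


def synthesize_word(
    surface: str,
    constituents: List[Dict[str, str]],
    pos: Optional[str] = None,
    reading: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Build one new word spanning the target from the words it replaces."""
    # The source words must cover the target exactly (shared outer boundaries).
    if "".join(w["surface"] for w in constituents) != surface:
        raise ValueError("constituents must span the target exactly")
    if reading is None:
        # Unspecified reading: join the readings of the source words.
        reading = "".join(w["reading"] for w in constituents)
    return {"surface": surface, "pos": pos, "reading": reading}
```

For a target インタースペース specified as a proper noun and built from インター (いんたー) and スペース (すぺーす), the resulting reading would be いんたーすぺーす.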

[0032] In step 16, the priority of the word (chain) handled in the immediately preceding processing is raised so that it is selected in first place.

[0033] Returning to FIG. 1, grammar rules 4 describe the grammar used to generate part-of-speech chains and to assign priorities.

[0034] The word dictionary 5 registers the words to be identified by morphological analysis together with their word information.

[0035] The setting information 6 consists of three parts: the word information designation pattern (see FIG. 2), which describes how the target character string and the word information to be given are designated in Japanese text 1; word storage information, which indicates whether word information, once designated, is to be kept as registered word information; and registered word information, which stores the written form of each word for which word information was designated together with the designated word information.

[0036] If the word storage information of the setting information 6 is on in step 17, the word for which word information was designated is registered in the registered word information in step 18; if it is off, the word is not registered. Deleting word information from the registered word information cancels the designation.
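The registration behaviour of steps 17 and 18 amounts to a small store that remembers designations and supplies them as targets for later sentences. The class below is a minimal sketch; the class name `SettingInfo`, its method names, and the use of plain substring search to find registered words are assumptions made for illustration.

```python
from typing import Dict, List


class SettingInfo:
    """Hypothetical holder for the word storage flag and registered word information."""

    def __init__(self, save_words: bool = True) -> None:
        self.save_words = save_words                      # word storage information
        self.registered: Dict[str, Dict[str, str]] = {}   # registered word information

    def register(self, surface: str, **info: str) -> None:
        # Steps 17 and 18: keep the designation only while the flag is on.
        if self.save_words and info:
            self.registered[surface] = dict(info)

    def unregister(self, surface: str) -> None:
        # Deleting the entry cancels the designation.
        self.registered.pop(surface, None)

    def find_targets(self, sentence: str) -> List[Dict[str, object]]:
        # Registered words that occur in a later sentence become targets.
        found: List[Dict[str, object]] = []
        for surface, info in self.registered.items():
            start = sentence.find(surface)
            while start != -1:
                found.append({"surface": surface, "start": start, **info})
                start = sentence.find(surface, start + len(surface))
        return sorted(found, key=lambda t: t["start"])
```

Keeping the flag and the store together mirrors the description of the setting information, and makes it straightforward to clear a designation by removing its entry.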

[0037] Japanese word information sequence 3 is the word chain sequence finally selected. It may be narrowed down to the first solution only, or the top several word chain sequences may be presented.

[0038] FIG. 4 shows an example of Japanese text 1 that uses the word information designation pattern of FIG. 2(a).

[0039] Word selection processing 2-3 is described below using the example of FIG. 4. Here, priorities are assigned first by part-of-speech adjacency cost, and word cost is used to choose among several words with the same adjacency cost (for both adjacency cost and word cost, smaller values are preferred).
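The two-level priority described here (adjacency cost across part-of-speech chains, then word cost within a chain) can be sketched in a few lines. The data classes, the function name `first_solution`, and every cost value used with them are invented for illustration; the source gives no concrete costs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    surface: str
    reading: str
    cost: int  # word cost; smaller is preferred


@dataclass
class Slot:
    pos: str           # part of speech at this position in the chain
    words: List[Word]  # homographs sharing this part of speech


@dataclass
class PosChain:
    slots: List[Slot]
    adjacency_cost: int  # total adjacency cost of the chain; smaller is preferred


def first_solution(chains: List[PosChain]) -> List[Tuple[str, Word]]:
    """Pick the cheapest part-of-speech chain, then the cheapest word in each slot."""
    best = min(chains, key=lambda c: c.adjacency_cost)
    return [(slot.pos, min(slot.words, key=lambda w: w.cost)) for slot in best.slots]
```

With a surname/given-name/suffix chain that is cheaper than a place-name chain, and けんじ cheaper than たけし for 健司, this selection would give なかじま けんじ さん, which is the default result that the designation mechanism is meant to override.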

[0040] Assume further that grammar rules 4 state that "proper noun: place name" and "proper noun: given name" cannot be connected, whereas a symbol can connect to a proper noun and a proper noun can connect to a symbol. In addition, connection to an unknown word is allowed only when there is no other word to connect to.

[0041] The word storage information of the setting information 6 is set to on.

[0042] The first sentence contains no designated target character string, so words are selected by the ordinary priority assignment of step 10 and processing ends.

[0043] The second sentence contains two designations of target character strings: the reading "なかしま" (Nakashima) for "中島" and the reading "たけし" (Takeshi) for "健司". The word chain candidate sequence 2-2 at this point is shown in FIG. 5(a). The flow of FIG. 3 is described in detail below, taking "中島" as the example.

[0044] In step 10, the part-of-speech chain drawn with solid lines (the first part-of-speech chain sequence), which has the smallest total adjacency cost, is followed. For "健司", the word with the lower word cost, whose reading is "けんじ", is selected. Priorities are assigned so that the first solution is the following word chain, given as written form / part of speech / reading: space / symbol; 中島 / proper noun: surname / なかじま; 健司 / proper noun: given name / けんじ; さん / personal-name suffix / さん.

[0045] Processing moves from step 11 to step 12. The only word "中島" contained in the first part-of-speech chain sequence has the reading "なかじま" (none has the reading "なかしま"), so processing moves to step 13.

[0046] In step 13, no part of speech is specified for the target character string "中島" (so the part-of-speech condition is satisfied) and a word "中島" exists, so processing moves to step 14.

[0047] In step 14, the highest-priority word chain is "中島 (reading = なかじま, part of speech = proper noun: surname)", so this word is duplicated, its reading is replaced by the specified "なかしま", and processing moves to step 16 (see FIG. 6(a)).

[0048] In step 16, the word 中島 / proper noun: surname / なかしま is made the first solution, and processing moves to step 17.

[0049] Processing moves from step 17 to step 18, and "中島 (reading = なかしま)" is registered in the registered word information.

[0050] For "健司", a word that exactly matches the given reading "たけし" exists in the first part-of-speech chain sequence at step 12 ("健司, reading = たけし"), so that word is selected and processing moves to step 16.

[0051] In step 16, "健司, reading = たけし" is selected as the first-priority word in place of "健司, reading = けんじ".

[0052] Finally, the following word chain is output as the first solution (written form / part of speech / reading): space / symbol; 中島 / proper noun: surname / なかしま; 健司 / proper noun: given name / たけし; さん / personal-name suffix / さん.

[0053] In the third sentence, "中島" at the start of the sentence matches a word registered in the registered word information, so the word given the reading "なかしま" becomes the first solution by the same flow as in the second sentence.

[0054] The flow of FIG. 3 is described in detail below for "インタースペース" (Interspace), for which a part of speech is specified. The word chain candidate sequence 2-2 for this portion is shown in FIG. 5(b).

[0055] In step 10, the part-of-speech chain drawn with solid lines, which has the smallest total adjacency cost, becomes the first solution (here it is also the word chain, since every part of speech in it holds exactly one word).

[0056] Processing moves from step 11 to step 12. No word in the first part-of-speech chain exactly matches the word information given for "インタースペース", namely the part of speech "proper noun", so processing moves to step 13.

[0057] In step 13, no word has the part of speech given for "インタースペース", namely proper noun, so processing moves to step 15.

[0058] In step 15, the front boundary of "インター (prefix)" and the rear boundary of "スペース (noun)", which form the word chain of the first solution, coincide with the front and rear boundaries of the target character string "インタースペース". Grammar rules 4 permit the chain from the symbol "「" to a proper noun and from a proper noun to the symbol "」", so the part-of-speech chain of this first solution is copied and connected to the new word "インタースペース" (FIG. 6(b)). The reading of "インタースペース" is formed by joining the readings of "インター (prefix)" and "スペース (noun)", giving "いんたーすぺーす".

[0059] In step 16, the previous first part-of-speech chain is replaced, where it occurs, by the part-of-speech chain drawn with bold lines, and インタースペース / proper noun / いんたーすぺーす becomes the first solution.

[0060] Processing moves from step 17 to step 18, and "インタースペース (part of speech = proper noun)" is registered in the registered word information.

[0061] The fourth sentence contains no designated target character string, so words are selected by the ordinary priority assignment of step 10 and processing ends.

[0062] FIG. 7 shows the configuration used when the document analysis method of the present invention is carried out on a computer such as a personal computer.

[0063] The input device 21 is an input device, such as a keyboard, for entering Japanese text 1. Storage devices 22, 23, and 24 hold grammar rules 4, the word dictionary 5, and the setting information 6, respectively. The storage device 25 is a hard disk. The output device 26 is an output device, such as a printer or display, to which the Japanese word information sequence is output. The recording medium 27 is a recording medium, such as a floppy disk, CD-ROM, or magneto-optical disk, on which a document analysis program comprising the word dictionary search 2-1 and word selection 2-3 processing is recorded. The data processing device 28 includes a CPU and various interfaces; it reads the document analysis program from the recording medium 27 and executes it.

[Effects of the Invention] As described above, according to the present invention, word information such as reading, part of speech, and accent type may be specified for some of the words in the input text. While giving priority to the part-of-speech chains of the predefined grammar rules, a word that satisfies the specified word information is selected or, if none exists, newly generated, so that the word identified as the first solution satisfies the specified word information and has the most probable information for the items that were not specified. In addition, a word that has been specified once is retained, and words satisfying that specification continue to be identified until it is cancelled. By performing morphological analysis in this way, actively supplying known information improves word identification accuracy.

[Brief description of drawings]

FIG. 1 is a diagram showing a morphological analysis method according to one embodiment of the present invention.

【図２】単語情報指定パターン例とそれに対応する日本
語テキスト例を示す図である。FIG. 2 is a diagram showing an example of a word information designation pattern and an example of Japanese text corresponding thereto.

【図３】単語選択処理２−３の処理を示すフローチャー
トである。FIG. 3 is a flowchart showing processing of word selection processing 2-3.

【図４】日本語テキストの一例を示す図である。FIG. 4 is a diagram showing an example of Japanese text.

【図５】単語連鎖候補列例、および通常の優先度付与に
より与えられた優先度を示す図である。FIG. 5 is a diagram showing an example of a word chain candidate sequence and priorities given by normal priority assignment.

【図６】単語連鎖候補列への生成単語追加例を示す図で
ある。FIG. 6 is a diagram showing an example of adding generated words to a word chain candidate sequence.

【図７】本発明の形態素検索方法を実施する装置のブロ
ック図である。FIG. 7 is a block diagram of an apparatus for implementing the morpheme search method of the present invention.

[Explanation of symbols]

1 Japanese text
2 Morphological analysis
2-1 Word dictionary search processing
2-2 Word chain candidate sequence
2-3 Word selection processing
3 Japanese word information string
4 Grammar rules
5 Word dictionary
6 Setting information
10-18 Steps
21 Input device
22-25 Storage devices
26 Output device
27 Recording medium
28 Data processing device

Continuation of front page

(58) Fields searched (Int. Cl.⁷, DB name): G06F 17/21 - 17/28; JICST file (JOIS)

Claims

(57) [Claims]

1. A morphological analysis method in which a data processing device divides a Japanese text into words and adds word information, the method comprising:
a word dictionary search step in which the data processing device receives the Japanese text from an input device, extracts, from a word dictionary that is stored in a storage device and in which words and their word information are registered, words identified by morphological analysis as satisfying the connectable word chains described in grammar rules, the grammar rules being stored in a storage device and used for generating part-of-speech chains and assigning priorities, creates a word chain candidate sequence, and outputs the word chain candidate sequence to the storage device; and
a word selection step in which the data processing device, for a character string for which word information is specified, identifies as the first solution a word that satisfies the specified word information and has the most probable information for the word information that is not specified, registers the word for which the word information was specified in setting information that is stored in a storage device and that describes the target character strings to which word information is given in the Japanese text and the method of specifying the given word information, and outputs a Japanese word information string to an output device,
wherein the word selection step comprises:
a first step in which the data processing device assigns priorities to the word chain candidate sequence and takes the chain with the highest priority as the first solution;
a second step in which the data processing device determines whether the Japanese text contains a target character string, that is, a character string to which word information has been given or a word registered in the registered word information of the setting information;
a third step in which, when the target character string exists, the data processing device determines whether a word that exactly matches the target character string exists in the part-of-speech chain ranked first in priority;
a fourth step in which, when no word exactly matches the target character string, the data processing device determines whether the word chain candidate sequence contains a word whose word boundaries match the target character string and, if a part of speech is specified, whose part of speech also matches;
a fifth step in which, when a word whose word boundaries match the target character string exists, the data processing device, if a part of speech is specified for the target character string, selects a word that matches the part of speech and also satisfies the other specified word information if such a word exists, and, if the other specified word information differs, duplicates that word together with its word chain and overwrites the other specified word information; and, if no part of speech is specified for the target character string, duplicates, together with its part-of-speech chain, the word of the highest-priority word chain found in the fourth step and overwrites the other specified word information;
a sixth step in which the data processing device, if a part of speech is specified for the target character string, copies the preceding and following part-of-speech chains of the highest-priority chain that, among the part-of-speech chains having the boundaries of the target word string as word boundaries, can connect to the specified part of speech at those boundaries, fills in the specified word information with the target word string treated as one word, generates the word information that is not specified from the words of the original part-of-speech chain, using the word with the highest priority if there are several, and generates a chain with an unknown word if no connectable chain exists; and, if no part of speech is specified, copies the highest-priority word chain among the part-of-speech chains having the boundaries of the target word string as word boundaries, fills in the specified word information with the target word string treated as one word, and generates the word information that is not specified from the original word information; and
a seventh step in which the data processing device raises the priority of the words handled in the third, fifth, and sixth steps so that they are selected in the first rank.
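The cascade of the third to sixth steps can be illustrated with a minimal sketch. Everything named here (`Word`, `Spec`, `select_word`, the branch labels) is hypothetical, and several simplifications are assumed: a larger priority number wins, each chain is a left-to-right segmentation of the text, only reading and part of speech are modeled, the grammar connectivity check of the sixth step is omitted, a merged word without a specified part of speech takes that of its last component, and the promotion of the seventh step is represented only by returning the chosen word.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Word:
    surface: str
    start: int      # offset of the first character in the text
    pos: str        # part of speech
    reading: str

    @property
    def end(self) -> int:
        return self.start + len(self.surface)


@dataclass(frozen=True)
class Spec:
    start: int
    end: int
    pos: Optional[str] = None
    reading: Optional[str] = None


def _span_ok(w, s):
    return w.start == s.start and w.end == s.end


def _pos_ok(w, s):
    return s.pos is None or w.pos == s.pos


def _reading_ok(w, s):
    return s.reading is None or w.reading == s.reading


def select_word(text, chains, spec):
    """chains: list of (priority, [Word, ...]). Returns (word, branch label)."""
    ranked = sorted(chains, key=lambda c: c[0], reverse=True)  # first step
    # Third step: exact match inside the first-ranked chain.
    for w in ranked[0][1]:
        if _span_ok(w, spec) and _pos_ok(w, spec) and _reading_ok(w, spec):
            return w, "exact"
    # Fourth step: same boundaries (and part of speech, if specified).
    matches = [w for _, chain in ranked for w in chain
               if _span_ok(w, spec) and _pos_ok(w, spec)]
    # Fifth step: select a fully matching word, else duplicate and overwrite.
    for w in matches:
        if _reading_ok(w, spec):
            return w, "selected"
    if matches:
        return replace(matches[0], reading=spec.reading), "overwritten"
    # Sixth step: merge the words of the best chain whose boundaries fit.
    for _, chain in ranked:
        inner = [w for w in chain
                 if spec.start <= w.start and w.end <= spec.end]
        if inner and inner[0].start == spec.start and inner[-1].end == spec.end:
            return Word("".join(w.surface for w in inner), spec.start,
                        spec.pos or inner[-1].pos,
                        spec.reading or "".join(w.reading for w in inner)), "generated"
    # No chain has the target boundaries: fall back to an unknown word.
    return Word(text[spec.start:spec.end], spec.start,
                spec.pos or "unknown", spec.reading or ""), "unknown"


chains = [
    (10, [Word("東京", 0, "noun", "トウキョウ"), Word("都", 2, "suffix", "ト")]),
    (5, [Word("東", 0, "noun", "ヒガシ"), Word("京都", 1, "noun", "キョウト")]),
]
print(select_word("東京都", chains, Spec(0, 3, pos="noun")))
```

In the example, no candidate spans all three characters, so the two words of the first-ranked chain are merged into a single generated word carrying the specified part of speech.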

2. The morphological analysis method according to claim 1, wherein, in the word selection step, when word information has been specified once in the input Japanese text and the same specification is to apply to the subsequent text, the data processing device saves that information, and deletes the information when the specification is canceled.

3. The morphological analysis method according to claim 1, wherein the word selection step further comprises a step in which the data processing device registers the word for which word information is specified in the registered word information when the word storage information of the setting information is ON.

4. A computer-readable recording medium on which is recorded a morphological analysis program for causing a computer to execute:
a word dictionary search process of receiving a Japanese text, extracting, from a word dictionary that is stored in a storage device and in which words and their word information are registered, words identified by morphological analysis as satisfying the connectable word chains described in grammar rules, the grammar rules being stored in a storage device and used for generating part-of-speech chains and assigning priorities, creating a word chain candidate sequence, and outputting it; and
a word selection process of identifying, for a character string for which word information is specified, as the first solution a word that satisfies the specified word information and has the most probable information for the word information that is not specified, registering the word for which the word information was specified in setting information that is stored in a storage device and that describes the target character strings to which word information is given in the Japanese text and the method of specifying the given word information, and outputting a Japanese word information string,
the word selection process including:
a first procedure of assigning priorities to the word chain candidate sequence and taking the chain with the highest priority as the first solution;
a second procedure of determining whether the Japanese text contains a target character string, that is, a character string to which word information has been given or a word registered in the registered word information of the setting information;
a third procedure of determining, when the target character string exists, whether a word that exactly matches the target character string exists in the part-of-speech chain ranked first in priority;
a fourth procedure of determining, when no word exactly matches the target character string, whether the word chain candidate sequence contains a word whose word boundaries match the target character string and whose part of speech also matches;
a fifth procedure of, when a word whose word boundaries match the target character string exists, if a part of speech is specified for the target character string, selecting a word that matches the part of speech and also satisfies the other specified word information if such a word exists, and, if the other specified word information differs, duplicating that word together with its word chain and overwriting the other specified word information; and, if no part of speech is specified for the target character string, duplicating, together with its part-of-speech chain, the word of the highest-priority word chain found in the fourth procedure and overwriting the other specified word information;
a sixth procedure of, if a part of speech is specified for the target character string, copying the preceding and following part-of-speech chains of the highest-priority chain that, among the part-of-speech chains having the boundaries of the target word string as word boundaries, can connect to the specified part of speech at those boundaries, filling in the specified word information with the target word string treated as one word, generating the word information that is not specified from the words of the original part-of-speech chain, using the word with the highest priority if there are several, and generating a chain with an unknown word if no connectable chain exists; and, if no part of speech is specified, copying the highest-priority word chain among the part-of-speech chains having the boundaries of the target word string as word boundaries, filling in the specified word information with the target word string treated as one word, and generating the word information that is not specified from the original word information; and
a seventh procedure of raising the priority of the words handled in the third, fifth, and sixth procedures so that they are selected in the first rank.

5. The computer-readable recording medium according to claim 4, wherein, in the word selection process, when word information has been specified once in the input Japanese text and the same specification is to apply to the subsequent text, the information is saved, and the information is deleted when the specification is canceled.

6. The computer-readable recording medium according to claim 4, wherein the word selection process further includes a procedure of registering the word for which word information is specified in the registered word information when the word storage information of the setting information is ON.