JPH0682364B2

JPH0682364B2 - Japanese sentence processing method

Info

Publication number: JPH0682364B2
Application number: JP62315698A
Authority: JP
Inventors: 恒雄安田; 勝美島崎; 伸一郎高木; 悟池原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1987-12-14
Filing date: 1987-12-14
Publication date: 1994-10-19
Anticipated expiration: 2009-10-19
Also published as: JPH01156866A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a method for analyzing Japanese text by computer, and more specifically to a Japanese text processing method that improves analysis accuracy even for Japanese text containing words not in the dictionary, typographical errors, and the like.

[Conventional technology]

FIG. 5 shows a schematic configuration of a conventional system for analyzing Japanese text by computer. In FIG. 5, 10 is a data input unit, 20 is a morphological analysis unit, and 50 is an output editing unit.

The data input unit 10 consists of a kanji OCR, pen-touch device, tablet, keyboard, or the like, and reads in the Japanese character string to be analyzed. The morphological analysis unit 20 performs morphological analysis on this string using a grammar dictionary and a word dictionary prepared in advance, identifying words, segmenting the text into phrases (bunsetsu), and so on. The output editing unit 50 edits and outputs the Japanese character string according to the result of the morphological analysis unit 20. When the Japanese text contains a word not in the dictionary or a typographical error, the morphological analysis unit 20 detects the entire phrase, which is the unit of analysis, as an unparseable character string, and the output editing unit 50 outputs that unparseable string as is, one phrase at a time.

[Problems to be solved by the invention]

In the prior art, when Japanese text is analyzed by computer and the text contains an error such as an input mistake or a word not in the dictionary, the effect of that error cannot be contained, and the phrase or the entire sentence containing it becomes an unparseable character string as is. Consequently, in a speech output system that analyzes Japanese text, identifies it word by word, assigns readings, accents, and pauses, and reads it aloud in synthesized speech, even when only a single location is actually unparseable, that location cannot be pinpointed, so the reading of the whole phrase or sentence goes wrong and the correct-reading rate drops.

Likewise, automatic Japanese error detection systems, which exploit the fact that an error is likely to be present wherever analysis fails, can only flag long character strings. When such a system is used for proofreading newspaper articles, for example, large volumes of data must be processed in a short time, so the extra work of locating the error within a long string reduces the labor savings.

An object of the present invention is to provide a Japanese text processing method that, even for sentences containing words not in the dictionary or errors, narrows down and pinpoints the offending character string as far as possible, thereby minimizing its effect on the analysis of the sentence as a whole.

[Means for solving problems]

In a method for processing Japanese text by computer, the present invention provides, in addition to a morphological analysis unit that identifies and divides the Japanese text to be processed into words, phrases, and the like, an unparseable character string processing unit that analyzes more deeply those character strings that cannot be analyzed because the text contains a word not in the dictionary or an error.

The unparseable character string processing unit takes an unparseable character string detected by the morphological analysis unit and analyzes it from the beginning as far as analysis can go. If the string that remains unparseable begins with a katakana or alphabetic character, the entire run of consecutive characters of that same character type is detected as an unparseable character string. For any other character type (kanji or hiragana), analysis is retried starting from the character after the first one; if the remainder can then be analyzed, the single skipped character that was excluded from analysis is taken to be the unparseable part. If the remainder still cannot be analyzed, the next leading character is also skipped and analysis restarts from the following character. This is repeated until analysis succeeds or the end of the phrase or sentence is reached, and the skipped characters are detected as the unparseable string.
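The two limiting rules described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names (`char_type`, `longest_match`, `limit_unparseable`) and the toy dictionary are assumptions, a dictionary longest match stands in for a real morphological analyzer with a grammar dictionary, and only ASCII letters are treated as alphabetic.

```python
def char_type(ch):
    """Coarse script class of a single character."""
    if "\u30a1" <= ch <= "\u30fc":        # katakana block, incl. the long-vowel mark
        return "katakana"
    if ch.isascii() and ch.isalpha():     # assumption: ASCII letters only
        return "alphabet"
    return "other"                        # kanji, hiragana, punctuation, ...


def longest_match(text, start, lexicon):
    """Longest lexicon entry that begins at text[start], or None."""
    for end in range(len(text), start, -1):
        if text[start:end] in lexicon:
            return text[start:end]
    return None


def limit_unparseable(text, lexicon):
    """Split text into (surface, parsed) pieces, isolating what cannot be parsed."""
    pieces, i = [], 0
    while i < len(text):
        word = longest_match(text, i, lexicon)
        if word is not None:
            pieces.append((word, True))
            i += len(word)
            continue
        kind = char_type(text[i])
        j = i + 1
        if kind in ("katakana", "alphabet"):
            # Rule 1: take the whole run of the same character type.
            while j < len(text) and char_type(text[j]) == kind:
                j += 1
        else:
            # Rule 2: skip one character at a time until parsing resumes.
            while (j < len(text)
                   and longest_match(text, j, lexicon) is None
                   and char_type(text[j]) == "other"):
                j += 1
        pieces.append((text[i:j], False))
        i = j
    return pieces


lexicon = {"の", "ように", "軽い"}   # hypothetical toy dictionary
print(limit_unparseable("キャルビックのように軽い", lexicon))
```

With this toy dictionary the call prints `[('キャルビック', False), ('の', True), ('ように', True), ('軽い', True)]`: only the katakana run is left marked as unparseable, and the rest of the phrase is recovered.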

[Action]

In the present invention, a long unparseable character string detected by the morphological analysis unit is analyzed further, so the location of the error can be pinpointed and the unparseable portion can be narrowed down to a short character string.

[Example]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of the present invention as applied to a speech output system, and FIG. 2 shows a concrete processing example keyed to the parts of FIG. 1.

In FIG. 1, 10 is a Japanese text data input unit, 20 is a morphological analysis unit, 30 is a reading assignment unit, 40 is an unparseable character string processing unit, 50 is an output editing unit, and 60 is a synthesized speech generation unit. The unparseable character string processing unit 40 is further divided into an unparseable character limiting unit 41, an unparseable character part-of-speech assignment unit 42, an approximate-reading (ate-yomi) assignment unit 43, and an accent and pause assignment unit 44.

Suppose the Japanese sentence (Japanese character string) "キャルビックのように軽い" (roughly, "as light as キャルビック") is read in from the data input unit 10, and that "キャルビック" is a word not in the dictionary.

The morphological analysis unit 20 uses the grammar dictionary and word dictionary prepared in advance to divide "キャルビックのように軽い" into words and phrases. Because "キャルビック" is not in the dictionary, the phrase "キャルビックのように" is treated as an unparseable character string, and "軽い" ("light") is identified as an adjective. For the word "軽い" that the morphological analysis unit 20 was able to identify, the reading assignment unit 30 assigns a reading together with suitable accent and pause information.

Meanwhile, the unparseable character string "キャルビックのように" is sent to the unparseable character string processing unit 40. First, the unparseable character limiting unit 41 narrows down the unparseable characters. Here, once "キャルビック", which begins with a katakana character, is treated as a single unit of unparseable characters, "のように" can be analyzed as the case particle "の" and the auxiliary verb "ように". Next, so that accents and pauses can be given to the unparseable string as naturally as possible, the unparseable character part-of-speech assignment unit 42 gives "キャルビック" a suitable part of speech that can connect to the case particle "の" (here, common noun), treating it as if it were a common noun found in the dictionary. The approximate-reading assignment unit 43 and the accent and pause assignment unit 44 then assign a suitable approximate reading (a katakana reading here, since the characters are katakana) and accents and pauses.
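The part-of-speech and reading assignment for the isolated unknown string can be sketched as below. This is only an illustration under stated assumptions: the function names, the part-of-speech labels, and the returned dictionary layout are hypothetical, only the single case described in the text (an unknown string followed by a case particle) is modelled, and accent and pause assignment is left out.

```python
def is_katakana(text):
    """True if every character lies in the katakana block (incl. the long-vowel mark)."""
    return bool(text) and all("\u30a1" <= ch <= "\u30fc" for ch in text)


def annotate_unknown(surface, next_pos):
    """Guess a part of speech and an approximate reading for an unknown string."""
    # Only the case from the text is modelled: before a case particle,
    # assume the unknown string behaves like a common noun.
    pos = "common_noun" if next_pos == "case_particle" else None
    # Katakana is phonetic, so the string can serve as its own reading.
    reading = surface if is_katakana(surface) else None
    return {"surface": surface, "pos": pos, "reading": reading}


print(annotate_unknown("キャルビック", "case_particle"))
```

For strings that are not katakana the sketch returns no reading, because the text does not say how an approximate reading is chosen in that case.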

The output editing unit 50 merges and edits the outputs of the reading assignment unit 30 and the accent and pause assignment unit 44. The edited result is sent to the synthesized speech generation unit 60 and converted to speech.

By narrowing down the unparseable characters in this way, even when a sentence containing an unparseable character string is input, the part of speech of that portion can be estimated reasonably, so a reading that is as close to natural as possible can be output.

FIG. 3 is a block diagram of another embodiment of the present invention, applied to an error detection system, and FIG. 4 shows a concrete processing example keyed to the parts of FIG. 3.

In FIG. 3, 10 is a data input unit, 20 is a morphological analysis unit, 40 is an unparseable character string processing unit, and 70 is an error detection unit. The unparseable character string processing unit 40 contains an unparseable character limiting unit 41.

Suppose the Japanese sentence "すべて分からないこたばかりだ。" is read in from the data input unit 10, and that the "た" in "こたばかりだ" is a typographical error. The morphological analysis unit 20 analyzes this sentence and resolves it into the adverb "すべて" ("all") and the unparseable phrase "分からないこたばかりだ". The unparseable string "分からないこたばかりだ" is sent to the unparseable character string processing unit 40. There, the unparseable character limiting unit 41 analyzes the string as far as it can, up to the character where analysis fails, skips that first failing character "た" (no "た" can connect to the ka-irregular verb form "こ"), and analyzes again, this time successfully parsing "ばかりだ". Sending the result from the unparseable character limiting unit 41 to the error detection unit 70 allows the single character "た" to be detected as the error. Narrowing down the unparseable characters in this way narrows down the erroneous characters.
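The skip-and-retry step in this example can be sketched with a toy connectivity check. Everything here is an assumption made for illustration: the lexicon, the part-of-speech labels, and the single forbidden pair stand in for a real grammar dictionary, and the katakana and alphabet run rule is not modelled.

```python
# Hypothetical toy lexicon: surface form -> part-of-speech label.
LEXICON = {
    "分から": "verb",
    "ない": "aux",
    "こ": "ka_verb_stem",   # a form of the ka-irregular verb
    "た": "aux_ta",
    "ばかり": "particle",
    "だ": "copula",
}
# Hypothetical connection rule: "た" cannot follow the ka-irregular form "こ".
FORBIDDEN = {("ka_verb_stem", "aux_ta")}


def parse_with_skips(text):
    """Greedy parse; characters that cannot connect are skipped (pos is None)."""
    tokens, prev, i = [], None, 0
    while i < len(text):
        hit = None
        for end in range(len(text), i, -1):          # longest candidate first
            pos = LEXICON.get(text[i:end])
            if pos is not None and (prev, pos) not in FORBIDDEN:
                hit = (text[i:end], pos)
                break
        if hit is None:
            # Skip one character; merge it with an immediately preceding skip.
            if tokens and tokens[-1][1] is None:
                tokens[-1] = (tokens[-1][0] + text[i], None)
            else:
                tokens.append((text[i], None))
            prev, i = None, i + 1
        else:
            tokens.append(hit)
            prev, i = hit[1], i + len(hit[0])
    return tokens


def detect_errors(text):
    """Return the skipped substrings, which an error detector would flag."""
    return [surface for surface, pos in parse_with_skips(text) if pos is None]


print(detect_errors("分からないこたばかりだ"))
```

Under these assumptions the call prints `['た']`: the parse reaches "こ", fails to connect "た", skips it, and resumes at "ばかり", so only the single mistyped character is reported.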

[Effects of the invention]

As described above, when a sentence contained a word not in the dictionary, a typographical error, or the like, conventional text analysis had no choice but to treat the entire minimum unit of processing, such as a phrase, as an unparseable character string. The present invention can narrow down and pinpoint the unparseable characters within that unit, which improves the efficiency of Japanese text analysis. Furthermore, the extracted unparseable string can be narrowed down to the point where estimating its correct part of speech as a single part of speech causes no problem, and the string can then be processed with that estimated part of speech. This has the advantage that Japanese text can be analyzed while tolerating sentences that contain words not in the dictionary, typographical errors, and the like.

Ordinary text used in real-world work often includes sentences containing words not in the dictionary or typographical errors, so the effect of the present invention is significant. Applied to a Japanese speech output system, for example, it allows approximate readings that are close to correct even when unparseable character strings are present. In a Japanese error detection system, it allows errors to be detected within a narrowed range.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the present invention; FIG. 2 shows a concrete processing example for FIG. 1; FIG. 3 is a block diagram of another embodiment of the present invention; FIG. 4 shows a concrete processing example for FIG. 3; and FIG. 5 is a block diagram of the conventional method.

10: data input unit; 20: morphological analysis unit; 40: unparseable character string processing unit.

Continuation of the front page

(72) Inventor: Satoru Ikehara, c/o Nippon Telegraph and Telephone Corporation, 1-1-6 Uchisaiwaicho, Chiyoda-ku, Tokyo

(56) References: JP-A-61-134877 (JP, A); JP-A-62-90760 (JP, A); JP-A-62-209659 (JP, A)

Claims

[Claims]

1. A Japanese text processing method for processing Japanese text by computer, comprising: a morphological analysis unit that identifies and divides the Japanese text to be processed into words, phrases, and the like using a grammar dictionary, a word dictionary, and the like; and an unparseable character string processing unit that, for a character string that the morphological analysis unit cannot analyze because the Japanese text contains a word not in the dictionary or an error, detects the run of consecutive characters of the same character type as an unparseable character string when the first character of the string is katakana or alphabetic, and, when the first character of the unparseable string is of a type other than katakana or alphabetic, excludes that first character from analysis and analyzes again with the following character as the head of the string, repeatedly takes the next character as the head and analyzes again if analysis still fails, and, once analysis succeeds, detects the characters excluded from analysis as an unparseable character string.