JP3204517B2

JP3204517B2 - Unknown word recognition method

Info

Publication number: JP3204517B2
Application number: JP14312391A
Authority: JP
Inventors: 泰之沼田
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1991-06-14
Filing date: 1991-06-14
Publication date: 2001-09-04
Anticipated expiration: 2016-09-04
Also published as: JPH04367072A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Industrial Field of Application] The present invention relates to an unknown word recognition method for a machine translation apparatus, for example one that translates between English and Japanese.

【０００２】[0002]

[Prior Art] In a conventional machine translation apparatus, the character string of an English sentence is input from the keyboard of an input device, and character string data consisting of character codes is formed. The string is split into tokens, using delimiter character codes such as spaces, commas, and periods as the basis for token identification, and a dictionary lookup is attempted for the character string of every token. A token that exists in the dictionary is judged to be a registered token, that is, a word that is not an unknown word.

[0003] Conversely, a token whose character string does not exist in the dictionary is judged to be an unregistered token, that is, an unknown word. A translator then intervenes to decide whether it is a typo, a variable name, an abbreviation, or the like, and corrects it or handles it in some other way.

【０００４】[0004]

[Problems to Be Solved by the Invention] Technical documents such as computer manuals contain many tokens that denote command names, variable names, and the like. Specialists encounter these tokens frequently, but they do not exist in general usage and are not registered in dictionaries. Even when it is known in advance that such tokens are not in the dictionary, a dictionary lookup is still attempted for every token, and each one goes through the determination process described above.

[0005] Meanwhile, in technical documents, ordinary words are printed in a normal typeface, that is, a standard typeface, whereas tokens denoting command names, variable names, and the like are often printed in a special typeface that is not the normal one, that is, a non-standard typeface such as bold or italic.

[0006] Therefore, if tokens printed in the standard typeface are looked up in the dictionary and tokens printed in a non-standard typeface are excluded from dictionary lookup, the number of dictionary lookups for unknown words can be greatly reduced.

[0007] The present invention has been made in view of the above, and its object is to provide an unknown word recognition method that omits the unknown-word dictionary lookup for tokens printed in a non-standard typeface, thereby greatly reducing the number of dictionary lookups for unknown words.

【０００８】[0008]

[Means for Solving the Problems] The present invention is an unknown word recognition method, used in a machine translation apparatus, that prompts an operator to intervene for words in the source text judged to be unknown words. A document image of the source text is input as image data. The image data is cut into character images one character at a time. The character image data of each character image is recognized and converted into text data of character codes, and each character image is also classified as standard typeface or non-standard typeface. The text data is split into words at delimiter characters. When a word is composed of non-standard typeface characters, no dictionary lookup is performed for that word and it is excluded from unknown word determination. When a word is composed of standard typeface characters and contains no non-standard typeface characters, a dictionary lookup is performed for that word, and a word not registered in the dictionary is judged to be an unknown word and is presented to the operator.

【０００９】[0009]

[Operation] According to the present invention, when a dictionary lookup is performed for unknown word recognition, it is determined whether the token to be looked up is written in the standard typeface in the document. Tokens written in a non-standard typeface are excluded from unknown word recognition, which greatly reduces the number of dictionary lookups for unknown words.

【００１０】[0010]

[Embodiment] FIG. 1 is a functional block diagram of one embodiment for carrying out the present invention.

[0011] In the character recognition device 13 of FIG. 1, reference numeral 1 denotes a document image input means, for example a scanner, that reads the document to be subjected to unknown word recognition into a computer as image data. Reference numeral 2 denotes an image data storage means that stores the document image data read by the document image input means 1. Reference numeral 3 denotes a character image identification means that accurately cuts character images out of the document image input by the document image input means 1, in the order in which they form sentences. Reference numeral 4 denotes a character recognition means that recognizes each character image cut out by the character image identification means 3 as a character and outputs the corresponding character code. Reference numeral 5 denotes a character typeface determination means that determines the typeface of the character in each character image.

[0012] Reference numeral 6 denotes a text data storage means that stores the character codes resulting from character recognition together with their typeface information.

[0013] In the token identification device 14, reference numeral 7 denotes a token identification means that identifies tokens in the character string stored in the text data storage means 6, based on delimiter characters such as periods, commas, and spaces. Reference numeral 8 denotes a token data storage means that stores information on the tokens identified by the token identification means 7 together with their typeface information. Reference numeral 9 denotes a token typeface determination means that determines the typeface of each token identified by the token identification means 7.

[0014] In the dictionary device 15, reference numeral 10 denotes a dictionary used to find out whether a token identified by the token identification means 7 is an unknown word, and reference numeral 11 denotes a dictionary search means that searches the dictionary 10 using a token as the key. Reference numeral 12 denotes an overall control device that controls the devices 13, 14, and 15 and the text data storage means 6.

[0015] The operation of FIG. 1 is described below with reference to the flowchart of FIG. 2, taking translation from English into Japanese as an example.

[0016] Document image input processing (S1): The English characters of the English document input through the document image input means (scanner) 1 are stored as image data in the image data storage means 2.

[0017] Here, suppose that the source text of Table 1 is input to the character recognition device 13. Looking at the fonts of the English characters, the two occurrences of argv and the one occurrence of exec are in a bold font (the underlined portions), so their typeface differs from the standard font of the other tokens. The former typeface is called the "non-standard typeface", and the typeface of the other characters is called the "standard typeface".

【００１８】[0018]

【表１】 [Table 1]

[0019] Character recognition processing (S2): In the character recognition device 13, the character image identification means 3 separates and cuts out image data one character at a time from the image data in the image data storage means 2. The cut-out character image data is passed to the character recognition means 4. The character recognition means 4 performs character recognition on the image data and stores the resulting character code as character data in the text data storage means 6.

[0020] During character recognition, the character typeface determination means 5 also determines the typeface of the character image and stores that typeface information in the text data storage means 6.

[0021] As a result, the typeface used in the source text is set as attribute information on each item of text data. Once this character recognition processing has been applied to every character in the input image data, the text data storage means 6 holds the input document converted into a sequence of text codes.

[0022] After character recognition has been applied to the input sentence above, the text data storage means 6 logically holds the information shown in Table 2.

【００２３】[0023]

【表２】 [Table 2]

[0024] Here, the typeface attribute column distinguishes the standard typeface from the non-standard typeface: 0 is the standard typeface and 1 is the non-standard typeface. Typeface information exists for each character code.
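Since Table 2 is not reproduced here, the following is a minimal sketch of how per-character typeface attributes could be held alongside character codes. The names `CharRecord` and `store_text` and the sample input are hypothetical; only the 0/1 encoding comes from the text.

```python
from typing import NamedTuple

class CharRecord(NamedTuple):
    code: str      # character code produced by character recognition
    typeface: int  # 0 = standard typeface, 1 = non-standard typeface

def store_text(recognized):
    """Build the text data store from (character, is_non_standard) pairs."""
    return [CharRecord(ch, 1 if non_std else 0) for ch, non_std in recognized]

# Hypothetical recognition result for the string "exec a", with "exec" in bold.
store = store_text([("e", True), ("x", True), ("e", True), ("c", True),
                    (" ", False), ("a", False)])
print([r.typeface for r in store])  # [1, 1, 1, 1, 0, 0]
```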

[0025] Token identification processing (S3): In the token identification device 14, the token identification means 7 performs token identification on the text information stored in the text data storage means 6. Delimiter characters such as spaces, commas, and periods serve as the basis for token identification. The result of token identification is stored in the token data storage means 8. When token identification is applied to the example above, the token data storage means 8 logically holds the result shown in Table 3.

【００２６】[0026]

【表３】 [Table 3]

[0027] In token identification processing, the token typeface determination means 9 also determines the typeface of each individual token in the token identification result held in the token data storage means 8. The typeface information attached to each character code is used for this determination.

[0028] As a result of this token typeface determination, the per-token typeface information shown in Table 4 is added to the token information in the token data storage means 8. A token whose typeface information is 0 is a standard typeface token, and one whose typeface information is 1 is a non-standard typeface token.
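One way to derive a token-level typeface from per-character attributes is sketched below. The text does not say how a token mixing both typefaces is handled, so this sketch assumes that any non-standard character makes the whole token non-standard, which matches the claim wording that a looked-up word "contains no non-standard typeface characters". The function names are hypothetical.

```python
def token_typeface(token):
    """Return 1 if any character of the token is non-standard, else 0.

    `token` is a list of (character, typeface) pairs. Treating mixed
    tokens as non-standard is an assumption of this sketch.
    """
    return 1 if any(tf == 1 for _, tf in token) else 0

def annotate(tokens):
    """Attach a token-level typeface flag to each token's text."""
    return [("".join(ch for ch, _ in tok), token_typeface(tok)) for tok in tokens]

# Hypothetical tokens: "exec" in bold, "is" in the standard typeface.
tokens = [[("e", 1), ("x", 1), ("e", 1), ("c", 1)], [("i", 0), ("s", 0)]]
print(annotate(tokens))  # [('exec', 1), ('is', 0)]
```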

【００２９】[0029]

【表４】 [Table 4]

[0030] Unknown word recognition processing (S4): FIG. 3 is a detailed flowchart of the unknown word recognition processing shown in FIG. 2. Unknown words are recognized by having the dictionary search means 11 perform dictionary lookups according to each token's typeface. First, the first token is taken from the tokens in the token data storage means 8 (S5). If the token's typeface is non-standard (S6: N), the token is excluded from unknown word recognition. If it is the standard typeface (S6: Y), the token is subject to dictionary lookup, and the dictionary 10 is searched using the token as the key (S7). A token found to be unregistered as a result of the lookup (S8) is judged to be an unknown word, and information about it is registered as an unknown word in the unknown word data management means of the text data storage means 6 (S9).

[0031] Next, if tokens remain in the token data storage means 8 (S10), processing returns to (S5) at the top of the loop and the dictionary lookup is repeated. When no tokens remain, processing exits the loop and translation word selection is performed (S11).
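The loop of steps S5 to S10 can be sketched as below. The function name, the use of a Python set as the dictionary, and the sample tokens are assumptions for illustration. The lookup counter is added only to show how many dictionary searches are actually performed.

```python
def recognize_unknown_words(tokens, dictionary):
    """Return (unknown_words, lookup_count) for (text, typeface) tokens.

    typeface: 0 = standard, 1 = non-standard.
    """
    unknown = []
    lookups = 0
    for text, typeface in tokens:    # S5 / S10: take tokens one by one
        if typeface == 1:            # S6: N -> excluded, no lookup
            continue
        lookups += 1                 # S7: search the dictionary
        if text not in dictionary:   # S8: not registered?
            unknown.append(text)     # S9: register as an unknown word
    return unknown, lookups

# Hypothetical token list and dictionary.
tokens = [("argv", 1), ("is", 0), ("an", 0), ("arary", 0), ("for", 0), ("exec", 1)]
dictionary = {"is", "an", "for", "array"}
print(recognize_unknown_words(tokens, dictionary))  # (['arary'], 4)
```

Six tokens yield four lookups here, because the two non-standard tokens are skipped before the dictionary is consulted.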

[0032] With the method of the present invention, no unknown words are found in the example sentence above. With a conventional unknown word recognition method, however, the result is as shown in Table 5. The underlined tokens in Table 5 are those recognized as unknown words. None of these flagged tokens is an unknown word that the user needs to correct.

【００３３】[0033]

【表５】 [Table 5]

[0034] Application example: Next, suppose that the source text is that of Table 6.

【００３５】[0035]

【表６】 [Table 6]

[0036] Here, the following misspelled words have been included intentionally.

[0037]
1. arary (incorrect) ... array (correct)
2. paremeter (incorrect) ... parameter (correct)

Because the conventional method does not take differences in typeface into account, it produces the unknown word recognition result shown in Table 7. The five underlined tokens are flagged as unknown words, but apart from the three written in a non-standard typeface, the only tokens the user should correct are the two misspelled words above.

【００３８】[0038]

【表７】 [Table 7]

[0039] With the unknown word recognition method of the present invention, however, tokens printed in a different typeface in the source text are processed with that fact taken into account, so only the genuine unknown words are flagged, as shown in Table 8.
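The contrast between the two methods can be illustrated with a small sketch. Tables 6 to 8 are not reproduced here, so the token list below is a hypothetical stand-in: only the misspellings arary and paremeter and the counts (five flagged conventionally, three of them non-standard) come from the text. The function names and dictionary contents are assumptions.

```python
def conventional(tokens, dictionary):
    """Flag every token missing from the dictionary, ignoring typeface."""
    return [w for w, _ in tokens if w not in dictionary]

def typeface_aware(tokens, dictionary):
    """Flag only standard-typeface (0) tokens missing from the dictionary."""
    return [w for w, tf in tokens if tf == 0 and w not in dictionary]

# Hypothetical stand-in for Table 6; 1 marks a non-standard typeface token.
tokens = [("argv", 1), ("is", 0), ("an", 0), ("arary", 0), ("and", 0),
          ("argv", 1), ("is", 0), ("a", 0), ("paremeter", 0), ("of", 0),
          ("exec", 1)]
dictionary = {"is", "an", "and", "a", "of", "array", "parameter"}

print(conventional(tokens, dictionary))
# ['argv', 'arary', 'argv', 'paremeter', 'exec']
print(typeface_aware(tokens, dictionary))
# ['arary', 'paremeter']
```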

【００４０】[0040]

【表８】 [Table 8]

【００４１】[0041]

[Effects of the Invention] As described above, in the unknown word recognition method of the present invention, tokens whose typeface in the source document is italic or otherwise non-standard are excluded from unknown word recognition processing, so the generation of unknown word candidates can be suppressed. Tokens in a non-standard typeface that are given special status only within that document, such as command names and variable names, are therefore not treated as unknown word candidates, and dictionary lookup is performed only for tokens in the standard typeface. This makes it possible to flag unknown words accurately. The translator is presented with only the minimum necessary unknown word candidates, such as typos, which reduces unnecessary work in the correction process.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram of one embodiment for carrying out the present invention.

FIG. 2 is a flowchart showing the operation of FIG. 1.

FIG. 3 is a detailed flowchart of the unknown word recognition processing of FIG. 2.

[Explanation of symbols]

1: document image input means
2: image data storage means
3: character image identification means
4: character recognition means
5: character typeface determination means
6: text data storage means
7: token identification means
8: token data storage means
9: token typeface determination means
10: dictionary
11: dictionary search means
12: overall control device
13: character recognition device
14: token identification device
15: dictionary device

Claims

(57) [Claims]

An unknown word recognition method for a machine translation apparatus that prompts an operator to intervene for a word in a source text judged to be an unknown word, the method comprising: inputting a document image of the source text as image data; cutting the image data into character images one character at a time; recognizing the character image data of each character image and converting it into text data of character codes; classifying each character image as standard typeface or non-standard typeface; splitting the text data into words at delimiter characters; when a word is composed of non-standard typeface characters, excluding the word from unknown word determination without performing a dictionary lookup for the word; and when a word is composed of standard typeface characters and contains no non-standard typeface characters, performing a dictionary lookup for the word, judging a word not registered in the dictionary to be an unknown word, and presenting it to the operator.