JP4829685B2

JP4829685B2 - Translation phrase pair generation apparatus, statistical machine translation apparatus, translation phrase pair generation method, statistical machine translation method, translation phrase pair generation program, statistical machine translation program, and storage medium

Info

Publication number: JP4829685B2
Application number: JP2006158083A
Authority: JP
Inventors: 元塚田; 太郎渡辺; 秀樹磯崎
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2006-06-07
Filing date: 2006-06-07
Publication date: 2011-12-07
Anticipated expiration: 2026-06-07
Also published as: JP2007328483A

Description

The present invention relates to a translation phrase pair generation apparatus, a statistical machine translation apparatus, a translation phrase pair generation method, a statistical machine translation method, a translation phrase pair generation program, a statistical machine translation program, and a storage medium.

Statistical machine translation has been proposed as a technique for automatically constructing a machine translation system from parallel data, that is, a large collection of sentence pairs in a source language and a target language.

FIG. 3 is an explanatory diagram showing phrase translation. Phrase-based statistical machine translation uses a translation model over word sequences rather than single words. First, every possible way of segmenting the source-language sentence at the top of FIG. 3, 「日本の首相は小泉です」, into substrings is considered. One such possibility, shown in FIG. 3, is the segmentation into the three substrings 「日本の」, 「首相は」, and 「小泉です」.

Next, each substring is translated with the translation model to produce the target-language strings "The prime minister", "of Japan", and "is Koizumi". Every reordering of these target-language strings is then considered as a candidate translation.

From the enormous number of candidates generated in this way, the most likely translation, "The prime minister of Japan is Koizumi", is searched for and output. In practice, however, exhaustively searching all of these possibilities for the optimal solution is not feasible, so it is common to narrow the candidates with various constraints and find a suboptimal solution.

The machine translation problem is formulated as finding, for a source word sequence f, the optimal target word sequence ê that satisfies (Formula 1) below (see Non-Patent Document 1). To make the explanation easier to follow, this specification uses a Japanese string as an example of the source word sequence f and an English string as an example of the target word sequence e.

Here h_m(e, f) is a feature function that scores the pair of source word sequence f and target word sequence e, λ_m (m = 1, ..., M) are the corresponding scaling factors, and E is the set of all possible target word sequences (that is, the target language).
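Formula 1 itself is not reproduced in this text. Assuming it is the usual log-linear decision rule of Non-Patent Document 1, which is what the definitions of h_m, λ_m, and E suggest, it would read as follows (a reconstruction, not the patent's own typesetting):

```latex
\hat{e} = \operatorname*{argmax}_{e \in E} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```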

In Non-Patent Document 2, the logarithms of the following seven quantities are used as the feature functions h_m(e, f):
・ Phrase translation probability: φ(e|f)
・ Phrase translation probability: φ(f|e)
・ Lexical weight: lex(e|f)
・ Lexical weight: lex(f|e)
・ Phrase penalty: ω^{length(e)} (where ω is a constant and length() is a function that returns the length of a word sequence)
・ N-gram language model: P_LM(e)
・ Distortion model

The phrase translation probabilities and lexical weights are functions that evaluate likelihood as a translation, and are also called translation models. In the formula above, it is impossible in practice to treat every element of E as a candidate, so translation candidates are generated as follows to approximate E.

FIG. 4 is an explanatory diagram showing sentence segmentation and translation candidates. First, the source sentence is segmented into word sequences, and the translation model is used to generate candidate target word sequences for each source word sequence. A translation candidate is any reordering of all the elements of a set of pairs of source and target word sequences (hereinafter, phrase pairs) that covers the source sentence without overlap or gap (in FIG. 4, a set of phrase pairs obtained by following the directed graph from the beginning of the sentence to its end).

In this specification, a phrase pair is enclosed in corner brackets and written in the form 「target phrase e translated from source word sequence f（source phrase f）」. A sentence consists of one or more phrase pairs.

For example, in FIG. 4 the following sets are valid as sets that form translation candidates:
・ {「Japanese（日本の）」, 「The prime minister（首相は）」, 「Koizumi（小泉）」, 「is（です）」}
・ {「Japanese（日本の）」, 「the prime minister（首相は）」, 「is Koizumi（小泉です）」}

Reordering the elements of each set generates an enormous number of candidate translations, such as:
・ 「Japanese（日本の）」「The prime minister（首相は）」「is Koizumi（小泉です）」
・ 「the prime minister（首相は）」「Japanese（日本の）」「is Koizumi（小泉です）」
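As a rough illustration of how reordering multiplies the candidates, the sketch below enumerates every ordering of one covering set from FIG. 4. The variable and function names are illustrative only, and a practical decoder would not enumerate orderings exhaustively.

```python
from itertools import permutations

# One set of phrase pairs covering the source sentence, as (target phrase, source phrase).
phrase_pairs = [
    ("Japanese", "日本の"),
    ("the prime minister", "首相は"),
    ("is Koizumi", "小泉です"),
]

def candidate_sentences(pairs):
    """Return every ordering of the target phrases as a candidate sentence."""
    return [" ".join(e for e, _ in order) for order in permutations(pairs)]

candidates = candidate_sentences(phrase_pairs)
for sentence in candidates:
    print(sentence)
```

Even this three-element set yields 3! = 6 orderings, and each alternative segmentation and each alternative translation of a phrase multiplies the count further.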

However, even with the number of candidates limited in this way, it is not practical to find the optimal solution among all of them. A suboptimal solution is therefore obtained by the method of Non-Patent Document 3.

Each scaling factor λ_m can be set automatically by the method of Non-Patent Document 1 so that translation accuracy on a parallel corpus for scaling-factor training is maximized.

The four translation models (φ(e|f), φ(f|e), lex(e|f), and lex(f|e)) are represented as a phrase table such as the following.

These values for a candidate translation are computed by accumulating the scores of the phrase pairs that make up the candidate. For example, for the candidate
・ 「of Japan（日本の）」「The prime minister（首相は）」「is Koizumi（小泉です）」
φ(e|f) = 0.3 × 0.6 × 0.2 = 0.036. The value of the corresponding feature function is the logarithm of this product (base 10, given the value stated), so log φ(e|f) ≈ −1.44.

Suppose the scaling factors of the feature functions for the four translation models are 0.1, 0.2, 0.3, and 0.4, respectively. The overall translation-model score of the following candidate translation is then computed.
・ 「The prime minister（首相は）」「of Japan（日本の）」「is Koizumi（小泉です）」
The calculation is as follows:
0.1 × log(0.6 × 0.3 × 0.2)
+ 0.2 × log(0.5 × 0.35 × 0.1)
+ 0.3 × log(0.4 × 0.2 × 0.3)
+ 0.4 × log(0.3 × 0.25 × 0.4)
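The arithmetic above can be checked with a short sketch. The per-phrase scores below are read off the factors in the worked calculation, since the phrase table itself is not reproduced in this text. The base-10 logarithm is an assumption, made because log 0.036 ≈ −1.44 only holds in base 10. The function and variable names are illustrative.

```python
import math

# Scores per phrase pair, in the order (phi(e|f), phi(f|e), lex(e|f), lex(f|e)),
# taken from the factors that appear in the worked calculation.
scores = {
    ("The prime minister", "首相は"): (0.6, 0.5, 0.4, 0.3),
    ("of Japan", "日本の"): (0.3, 0.35, 0.2, 0.25),
    ("is Koizumi", "小泉です"): (0.2, 0.1, 0.3, 0.4),
}
scaling_factors = (0.1, 0.2, 0.3, 0.4)

def translation_model_score(candidate, scores, scaling_factors):
    """Weighted sum, over the four models, of log10 of the product of phrase-pair scores."""
    total = 0.0
    for m, weight in enumerate(scaling_factors):
        product = math.prod(scores[pair][m] for pair in candidate)
        total += weight * math.log10(product)
    return total

candidate = list(scores)  # the three phrase pairs, in the order listed above
score = translation_model_score(candidate, scores, scaling_factors)
print(round(score, 3))
```

Because each model's value is a product over phrase pairs, the translation-model part of the score does not depend on phrase order; ordering affects only other features such as the language model and the distortion model.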

The score of a candidate translation is computed in the same way, taking into account not only the overall translation-model score but also the other feature functions. The candidate with the highest score is selected as the translation result (again, in practice the optimal solution cannot be found, and a suboptimal solution is obtained by the method of Non-Patent Document 3).

FIG. 5 is a configuration diagram showing the translation phrase pair generation apparatus 1a. The phrase table 19a is created from parallel data, a large collection of sentence pairs in the source and target languages, by steps (1) to (3) below.
(1) The word alignment unit 14a performs word alignment by creating the word alignment 16a from the parallel corpus 11a.
(2) The phrase pair extraction unit 15a extracts the corresponding phrase pairs 17a from the parallel corpus 11a and the word alignment 16a.
(3) The score adding unit 18a scores the phrase pairs 17a and creates the phrase table 19a.

FIG. 6 is an explanatory diagram showing phrase extraction. Phrase extraction corresponds to steps (1) and (2) above. First, words are aligned using, for example, the IBM models (Non-Patent Document 4). The filled rectangles in FIG. 6 represent word alignments. Using word positions as indices, the word alignment in this example is expressed as the set of "Japanese word position - English word position" pairs {1-1, 2-2, 3-3, 3-4, 4-0} (word position 0 means that the word is aligned to nothing). Next, based on this word alignment, phrase pairs 17a are extracted such that the filled rectangles for every source-language word in the pair are closed within the pair, and the filled rectangles for every target-language word in the pair are likewise closed. In this example, the following nine pairs are extracted:
・ This（これ）
・ is（は）
・ a pen（ペン）
・ a pen（ペンです）
・ This is（これは）
・ This is a pen（これはペン）
・ This is a pen（これはペンです）
・ is a pen（はペン）
・ is a pen（はペンです）
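The "closed rectangle" condition can be sketched as a brute-force check over all source and target spans. The sketch assumes the condition is the usual consistency requirement: a span pair must contain at least one alignment link, and no link may have only one end inside it. The function name and data layout are illustrative, not taken from the patent.

```python
def extract_phrase_pairs(src, tgt, alignment):
    """Return (target phrase, source phrase) pairs consistent with the alignment.

    alignment holds 1-based (source position, target position) links;
    a target position of 0 marks an unaligned source word and is ignored.
    """
    links = {(i, j) for i, j in alignment if j != 0}
    pairs = []
    for i1 in range(1, len(src) + 1):
        for i2 in range(i1, len(src) + 1):
            for j1 in range(1, len(tgt) + 1):
                for j2 in range(j1, len(tgt) + 1):
                    # At least one link must fall inside the box ...
                    inside = any(i1 <= i <= i2 and j1 <= j <= j2 for i, j in links)
                    # ... and no link may have only one end inside it.
                    crossing = any((i1 <= i <= i2) != (j1 <= j <= j2) for i, j in links)
                    if inside and not crossing:
                        pairs.append((" ".join(tgt[j1 - 1:j2]), "".join(src[i1 - 1:i2])))
    return pairs

src = ["これ", "は", "ペン", "です"]
tgt = ["This", "is", "a", "pen"]
alignment = {(1, 1), (2, 2), (3, 3), (3, 4), (4, 0)}

pairs = extract_phrase_pairs(src, tgt, alignment)
for e, f in pairs:
    print(f"{e}（{f}）")
```

On the FIG. 6 example this produces the nine pairs listed above. The unaligned word 「です」 never forms a pair by itself but may be attached to a neighbouring phrase, which is why both 「a pen（ペン）」 and 「a pen（ペンです）」 appear.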

The score adding unit 18a computes a score for each phrase pair 17a from the list of phrase pairs 17a extracted from all sentence pairs in the parallel data. Each score is computed according to Non-Patent Document 2.
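Non-Patent Document 2 estimates the phrase translation probability by relative frequency: φ(e|f) is the count of the pair (e, f) divided by the count of all extracted pairs whose source side is f. A minimal sketch of that one score follows; the extracted list is made up for illustration, and the lexical weights are omitted.

```python
from collections import Counter

def phrase_translation_probabilities(extracted_pairs):
    """Relative-frequency estimate of phi(e|f) from a list of (e, f) phrase pairs."""
    pair_counts = Counter(extracted_pairs)
    source_counts = Counter(f for _, f in extracted_pairs)
    return {(e, f): n / source_counts[f] for (e, f), n in pair_counts.items()}

# Made-up extraction result, pooled over all sentence pairs.
extracted = [
    ("of Japan", "日本の"),
    ("Japanese", "日本の"),
    ("Japanese", "日本の"),
    ("is Koizumi", "小泉です"),
]
phi = phrase_translation_probabilities(extracted)
print(phi[("Japanese", "日本の")])
```

Swapping the roles of e and f in the same function gives the reverse direction, φ(f|e).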

FIG. 7 is a configuration diagram showing the translation phrase pair generation apparatus 1b. The apparatus 1b normalizes words as preprocessing for (1) word alignment and then obtains the phrase table 19b. Specifically, the word expression normalization unit 13b creates the normalized parallel corpus 12b from the parallel corpus 11b, and the word alignment unit 14b creates the word alignment 16b from the normalized parallel corpus 12b. The rest of the configuration is the same as in FIG. 5.

Several word normalization methods are possible for the word expression normalization unit 13b. For English words, two common choices are lowercasing everything and stemming, which removes inflectional endings. The normalized parallel corpus is used only for word alignment; the phrase pairs 17b are extracted from the original parallel corpus, so the word sequences that appear in the phrase table are not normalized.

Non-Patent Document 1: F. J. Och, "Minimum error rate training in statistical machine translation", in Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 160-167, 2003
Non-Patent Document 2: P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation", in Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 127-133, 2003
Non-Patent Document 3: P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models", in Proc. of the 6th Conference of the Association for Machine Translation in the Americas (AMTA), pp. 115-124, 2004
Non-Patent Document 4: P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer, "The Mathematics of Statistical Machine Translation: Parameter Estimation", Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993

A translation model over word sequences is represented as a set of triples, each consisting of a source-language string, a target-language string, and a score. The translation model is built from the result of word alignment, which is performed using co-occurrence statistics of the words in the two languages.

However, when little training data is available for the translation model, the co-occurrence statistics used for word alignment are unreliable, and as a result an accurate translation model cannot be obtained.

It is therefore common practice to normalize word forms in advance, for example by lowercasing English words or by removing English inflectional endings (stemming), so that co-occurrence statistics can be estimated more stably. Until now, word alignments have been generated only from the normalized parallel corpus.

Normalization, however, has a side effect: it becomes difficult to obtain stable, good performance across training data of every kind. Because normalizing word forms discards some information, a translation model built from word alignments obtained after normalization becomes less accurate than one built without normalization as the training data grows.

For example, the Japanese word 「日本」 tends to be aligned with the English words "Japan" and "Japanese". When training data is scarce, stemming the English words so that both "Japan" and "Japanese" are treated as "Japan" yields more stable co-occurrence statistics. However, when aligning words between the two sentences 「日本人は日本が好き」 and "Japanese people like Japan", it is necessary to distinguish which 「日本」 corresponds to "Japan" and which to "Japanese", so this kind of normalization also has a side effect.
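The loss of information can be seen with a toy normalizer. The sketch below lowercases each token and strips a suffix from a small hand-picked list; it stands in for a real stemmer and is not the normalization method of the patent.

```python
def normalize(token, suffixes=("ese", "s")):
    """Toy normalizer: lowercase, then strip the first matching suffix."""
    token = token.lower()
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

sentence = ["Japanese", "people", "like", "Japan"]
normalized = [normalize(t) for t in sentence]
print(normalized)
```

After normalization the first and last tokens are identical, so an aligner working on the normalized sentence can no longer use the word form to tell which one corresponds to which 「日本」.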

The main object of the present invention is therefore to solve the above problems and to create phrase pairs that enable accurate translation with any training data.

To solve the above problems, the present invention is a translation phrase pair generation apparatus that generates phrase pairs used for statistical machine translation, comprising: a word alignment unit that creates an unnormalized first word alignment from an unnormalized first parallel corpus and creates a normalized second word alignment from a normalized second parallel corpus; a phrase pair extraction unit that extracts first phrase pairs from the first parallel corpus and the first word alignment, second phrase pairs from the first parallel corpus and the second word alignment, third phrase pairs from the second parallel corpus and the first word alignment, and fourth phrase pairs from the second parallel corpus and the second word alignment; and a score adding unit that scores each of the first, second, third, and fourth phrase pairs and stores them in a phrase table.

This makes it possible to generate more translation phrase pairs than conventional methods.

The present invention is characterized by receiving a source-language sentence as input and outputting a target-language sentence based on the first, second, third, and fourth phrase pairs stored in the phrase table by the translation phrase pair generation apparatus.

The present invention is a translation phrase pair generation method for generating phrase pairs used for statistical machine translation, in which a computer executes: a step of creating an unnormalized first word alignment from an unnormalized first parallel corpus and creating a normalized second word alignment from a normalized second parallel corpus; a step of extracting first phrase pairs from the first parallel corpus and the first word alignment, second phrase pairs from the first parallel corpus and the second word alignment, third phrase pairs from the second parallel corpus and the first word alignment, and fourth phrase pairs from the second parallel corpus and the second word alignment; and a step of scoring each of the first, second, third, and fourth phrase pairs and storing them in a phrase table.

The present invention is characterized in that a computer receives a source-language sentence as input and outputs a target-language sentence based on the first, second, third, and fourth phrase pairs stored in the phrase table by the translation phrase pair generation method.

The present invention is a translation phrase pair generation program for causing a computer to execute the translation phrase pair generation method.

The present invention is a statistical machine translation program for causing a computer to execute the statistical machine translation method.

The present invention is characterized by storing the above program.

The present invention is a phrase pair generation apparatus that generates phrase pairs used for statistical machine translation, comprising: a word alignment unit that takes one unnormalized parallel corpus together with N−1 parallel corpora, each created by applying one of N−1 word normalization methods once to that corpus, for a total of N parallel corpora, and creates N groups of word alignments from the N parallel corpora; a phrase pair extraction unit that extracts N² sets of phrase pairs from the combinations of the N parallel corpora and the N word alignments; and a score adding unit that scores each of the N² sets of phrase pairs and stores them in a phrase table.

The present invention is a phrase pair generation apparatus that generates phrase pairs used for statistical machine translation, comprising: a word alignment unit that creates, from one unnormalized parallel corpus, N groups of word alignments, each obtained by applying one of N word normalization methods once; a phrase pair extraction unit that extracts N sets of phrase pairs from the combinations of the one parallel corpus and the N word alignments; and a score adding unit that scores each of the N sets of phrase pairs and stores them in a phrase table.

Because the present invention also generates phrase pairs from the unnormalized parallel corpus, it can generate more translation phrase pairs than conventional methods. Consequently, for any training data, it achieves more accurate translation on average, and more stable performance, than using only a translation model built without word normalization or only one built with word normalization.

In addition, the user no longer needs to decide a priori, based on the training data, whether to normalize the parallel corpus or which normalization method to use, which reduces the burden of that decision.

An embodiment of a translation system to which the present invention is applied is described in detail below with reference to the drawings.

FIG. 1 is a configuration diagram showing the translation phrase pair generation apparatus 1c. It differs from the configuration of FIG. 7 in the following respects.

First, the word alignment unit 14c obtains the unnormalized word alignment 16d (first word alignment) and the normalized word alignment 16c (second word alignment) from the unnormalized parallel corpus 11c (first parallel corpus) and the normalized parallel corpus 12c (second parallel corpus), respectively.

Next, the phrase pair extraction unit 15c extracts the following four kinds of phrase pairs (multisets of phrase pairs).
・ As the first phrase pairs 17c, phrase pairs are extracted from the unnormalized parallel corpus 11c using the unnormalized word alignment 16d.
・ As the second phrase pairs 17d, phrase pairs are extracted from the unnormalized parallel corpus 11c using the normalized word alignment 16c.
・ As the third phrase pairs 17e, phrase pairs are extracted from the normalized parallel corpus 12c using the unnormalized word alignment 16d.
・ As the fourth phrase pairs 17f, phrase pairs are extracted from the normalized parallel corpus 12c using the normalized word alignment 16c.
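The four extractions are the Cartesian product of corpus variants and alignment variants. The sketch below makes that explicit on a single made-up sentence pair. It assumes that normalization leaves token positions unchanged, so an alignment computed on one variant can be applied to the other. The `extract` helper is a brute-force consistency check written for this illustration, and the alignments are invented.

```python
from itertools import product

def extract(src, tgt, alignment):
    """Phrase pairs (target, source) consistent with a set of 1-based (i, j) links."""
    found = set()
    spans_s = [(a, b) for a in range(1, len(src) + 1) for b in range(a, len(src) + 1)]
    spans_t = [(a, b) for a in range(1, len(tgt) + 1) for b in range(a, len(tgt) + 1)]
    for (i1, i2), (j1, j2) in product(spans_s, spans_t):
        inside = any(i1 <= i <= i2 and j1 <= j <= j2 for i, j in alignment)
        crossing = any((i1 <= i <= i2) != (j1 <= j <= j2) for i, j in alignment)
        if inside and not crossing:
            found.add((" ".join(tgt[j1 - 1:j2]), " ".join(src[i1 - 1:i2])))
    return found

# Hypothetical data: one sentence pair in two surface forms, and two alignments of it.
corpora = {
    "raw": (["日本", "が", "好き"], ["like", "Japan"]),
    "normalized": (["日本", "が", "好き"], ["like", "japan"]),
}
alignments = {
    "raw": {(1, 2), (3, 1)},
    "normalized": {(1, 2), (2, 2), (3, 1)},
}

# Every corpus variant is paired with every alignment variant.
phrase_pair_sets = {
    (c, a): extract(*corpora[c], alignments[a])
    for c, a in product(corpora, alignments)
}
for key, pairs in phrase_pair_sets.items():
    print(key, len(pairs))
```

Here the keys ("raw", "raw"), ("raw", "normalized"), ("normalized", "raw"), and ("normalized", "normalized") play the roles of the first to fourth phrase pairs. With N corpus variants and N alignment variants, the same product yields N² sets.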

An example of (part of) the phrase table 19 is shown below.

In the phrase table 19, the score of a phrase pair that was not extracted is set to a very small value, 0.001. In the phrase table 19 example, phrase pair "#1" was not included in the first phrase pairs 17c, so its score there is 0.001.
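One way to picture this floor value is as the default used when per-extraction scores are merged into a single table. The sketch below assumes one score column per extraction, which is a guess at the table layout since the example table is not reproduced in this text. The scores and the function name are made up.

```python
FLOOR = 0.001  # score given to a phrase pair that a particular extraction did not produce

def merge_phrase_tables(tables, floor=FLOOR):
    """Combine per-extraction score dicts into one row of scores per phrase pair."""
    all_pairs = sorted(set().union(*tables))
    return {pair: [t.get(pair, floor) for t in tables] for pair in all_pairs}

# Hypothetical scores from two of the four extractions.
first = {("of Japan", "日本の"): 0.3}
second = {("of Japan", "日本の"): 0.25, ("Japanese", "日本の"): 0.4}

table = merge_phrase_tables([first, second])
print(table[("Japanese", "日本の")])
```

A small positive floor rather than zero keeps the logarithm of the score finite when it is used as a feature function.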

FIG. 2 is a configuration diagram showing the statistical machine translation apparatus 2, using as an example a Japanese-English machine translation system that translates from Japanese into English. The statistical machine translation apparatus 2 has parallel data 21 for scaling-factor training, a scaling factor learning unit 22, scaling factors 23, a solution search unit 24, a phrase table 25, and a language model 26. These components can be realized by conventional methods, for example those of Non-Patent Documents 1 and 3.

The contents stored in the phrase table 19 (see FIG. 1) and the contents stored in the phrase table 25 are the same. To keep them identical, the data may, for example, be copied from the phrase table 19 to the phrase table 25, or a link from the phrase table 25 to the phrase table 19 may be set up.

From the set of triples consisting of the individual phrase pairs stored in the phrase table 25 and their scores, the functions h_m(e, f) that evaluate candidate translations are constructed. The scaling factor learning unit 22 obtains the scaling factors 23, including those for feature functions other than the translation models (such as the language model 26), using for example the method of Non-Patent Document 1, so that translation accuracy on the parallel data 21 for scaling-factor training is maximized.

The solution search unit 24 weights the individual feature functions by the scaling factors 23.

In the embodiment above, the phrase pairs stored in the phrase table 25 were obtained using two kinds of corpus, the "parallel corpus" and the "normalized parallel corpus", which yields 2² = 4 kinds of phrase pair variations. This can be extended further: the word expression normalization unit 13c creates N−1 parallel corpora, each by applying one of N−1 word normalization methods once, so that together with the one unnormalized parallel corpus a total of N parallel corpora are used. Likewise, N−1 word alignments are created, one for each of the N−1 word normalization methods, so that together with the one unnormalized word alignment a total of N word alignments are used. Here N is a natural number of 2 or more.

In this case, with N parallel corpora to extract from and N word alignments to use, a total of N² multisets of phrase pairs can be obtained. Alternatively, with one unnormalized parallel corpus and N word alignments, a total of N multisets of phrase pairs can be obtained. In this way, the present embodiment exploits variations of the parallel corpus combinatorially to increase the variations of the translation model.

The present invention described above uses, in weighted combination, a translation model obtained without word normalization and a translation model obtained with word normalization. The weights of the translation models are set using weight-determination training data that is separate from the training data used to learn the translation models, so that translation accuracy on the weight-determination training data is highest.

The present invention described above can be modified in a wide variety of ways without departing from its spirit, for example as follows.

For example, the source word string f and the target word string e in this specification may belong to any language; the invention is applicable to translation from any language to any other language.

The translation phrase pair generation apparatus 1c and the statistical machine translation apparatus 2 are each configured as a computer that includes at least a memory serving as main storage used during arithmetic processing, an arithmetic processing unit that performs the arithmetic processing, and auxiliary storage such as an HDD (Hard Disk Drive) that stores each table. The memory is constituted by a RAM (Random Access Memory) or the like. The arithmetic processing is realized by the arithmetic processing unit, constituted by a CPU (Central Processing Unit), executing a program held in the memory. In addition to the apparatuses themselves, the present embodiment includes a program for causing each apparatus to execute the arithmetic processing and a computer-readable storage medium storing that program.

FIG. 1 is a block diagram showing a translation phrase pair generation apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a statistical machine translation apparatus according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram showing phrase translation according to an embodiment of the present invention.
FIG. 4 is an explanatory diagram showing sentence division and translation candidates according to an embodiment of the present invention.
FIG. 5 is a block diagram showing a translation phrase pair generation apparatus (conventional method 1) according to an embodiment of the present invention.
FIG. 6 is an explanatory diagram showing phrase extraction according to an embodiment of the present invention.
FIG. 7 is a block diagram showing a translation phrase pair generation apparatus (conventional method 2) according to an embodiment of the present invention.

Explanation of symbols

1 Translation phrase pair generation apparatus
2 Statistical machine translation apparatus
14 Word association unit
15 Phrase pair extraction unit
18 Score adding unit
19 Phrase table

Claims

A translation phrase pair generation apparatus for generating phrase pairs used for statistical machine translation, comprising:
a word association unit that creates an unnormalized first word correspondence from an unnormalized first bilingual corpus and creates a normalized second word correspondence from a normalized second bilingual corpus;
a phrase pair extraction unit that extracts a first phrase pair from the first bilingual corpus and the first word correspondence, a second phrase pair from the first bilingual corpus and the second word correspondence, a third phrase pair from the second bilingual corpus and the first word correspondence, and a fourth phrase pair from the second bilingual corpus and the second word correspondence; and
a score adding unit that scores each of the first phrase pair, the second phrase pair, the third phrase pair, and the fourth phrase pair and stores them in a phrase table.

A statistical machine translation apparatus that accepts input of a source language sentence and outputs a target language sentence based on the first phrase pair, the second phrase pair, the third phrase pair, and the fourth phrase pair stored in the phrase table by the translation phrase pair generation apparatus according to claim 1.

A translation phrase pair generation method for generating phrase pairs used for statistical machine translation, wherein a computer executes the steps of:
creating an unnormalized first word correspondence from an unnormalized first bilingual corpus, and creating a normalized second word correspondence from a normalized second bilingual corpus;
extracting a first phrase pair from the first bilingual corpus and the first word correspondence, a second phrase pair from the first bilingual corpus and the second word correspondence, a third phrase pair from the second bilingual corpus and the first word correspondence, and a fourth phrase pair from the second bilingual corpus and the second word correspondence; and
scoring each of the first phrase pair, the second phrase pair, the third phrase pair, and the fourth phrase pair and storing them in a phrase table.

A statistical machine translation method wherein a computer accepts input of a source language sentence and outputs a target language sentence based on the first phrase pair, the second phrase pair, the third phrase pair, and the fourth phrase pair stored in the phrase table by the translation phrase pair generation method according to claim 3.

A translation phrase pair generation program for causing a computer to execute the translation phrase pair generation method according to claim 3.

A statistical machine translation program for causing a computer to execute the statistical machine translation method according to claim 4.

A computer-readable storage medium storing the program according to claim 5 or 6.

A translation phrase pair generation apparatus for generating phrase pairs used for statistical machine translation, comprising:
a word association unit that generates a total of N bilingual corpora, consisting of one unnormalized bilingual corpus and N-1 bilingual corpora obtained by applying each of N-1 types of word normalization methods once to that bilingual corpus, and creates N word correspondences from the N bilingual corpora;
a phrase pair extraction unit that extracts N² phrase pairs by combining the N bilingual corpora with the N word correspondences; and
a score adding unit that scores each of the N² phrase pairs and stores them in a phrase table.

A translation phrase pair generation apparatus for generating phrase pairs used for statistical machine translation, comprising:
a word association unit that creates, from one unnormalized bilingual corpus, N word correspondences to each of which one of N types of word normalization methods has been applied once;
a phrase pair extraction unit that extracts N phrase pairs by combining the one bilingual corpus with the N word correspondences; and
a score adding unit that scores each of the N phrase pairs and stores them in a phrase table.