JP5410334B2

JP5410334B2 - Word order conversion device, machine translation statistical model creation device, machine translation device, word order conversion method, machine translation statistical model creation method, machine translation method, program

Info

Publication number: JP5410334B2
Application number: JP2010039656A
Authority: JP
Inventors: 秀樹磯崎; 克仁須藤; 元塚田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2010-02-25
Filing date: 2010-02-25
Publication date: 2014-02-05
Anticipated expiration: 2030-02-25
Also published as: JP2011175500A

Description

The present invention relates to a language translation technique for translating a sentence written in a non-head-final language (a language that does not often place the head last), such as English (the source language), into a head-final language (a language that often places the head last), typified by Japanese (the target language).

Many Asian languages, such as Japanese, tend to place the head (the modified word) after its modifiers and are called head-final languages. For example, the English sentence “I respect you.” translates into Japanese as 「私はあなたを尊敬している。」. Here, “respect” (尊敬している) is the head of “you” (あなた). Looking at the word order, “respect” comes before “you” in English, whereas 尊敬している comes after あなた in Japanese.

In recent years, as the Internet has developed, Internet users have had more opportunities to read documents in foreign languages. Demand for translation has grown accordingly, and machine translation services are available on the Internet either free of charge or for a fee. By technique, these services fall broadly into rule-based machine translation (RBMT) and statistical machine translation (SMT). Some services can translate among several dozen languages.

RBMT is realized by having people program translation dictionaries and translation rules. The programmer must be well versed in the grammars of both the source and target languages, devise rules for reordering words and replacing them with corresponding expressions, and write those rules down in a form that can be executed as a program.

In RBMT, however, the program becomes harder to maintain as the number of rules grows. Moreover, even when only Japanese is considered as the target language, programmers well versed in the grammars of both English and Japanese may be numerous, but programmers well versed in the grammars of both Japanese and a source language other than English are few. For these reasons, SMT, which automatically acquires translation rules by statistical processing from a large amount of actually translated document data (a parallel corpus), has recently been studied intensively.

In SMT, people do not have to manage translation rules directly, but good translation rules cannot be obtained unless a large number of actually translated examples (a parallel corpus) is prepared. Even when a large parallel corpus is available, an enormous amount of computation is required.

Each translation technique thus has its strengths and weaknesses. Improvements have therefore been proposed that extend the SMT framework, which simply treats words as the unit of translation, so that it can handle sequences of multiple words (phrases) (phrase-based SMT) and, further, organize those phrases hierarchically (hierarchical phrase-based SMT).

For example, Patent Document 1 discloses an improved technique for automatically acquiring translation rules in phrase-based SMT. Patent Document 2 discloses an improved technique concerning the amount of computation in hierarchical phrase-based SMT. In particular, the technique of Patent Document 2 improves the efficiency of the solution search by constraining the form of the translation rules generated in hierarchical phrase-based SMT.

Meanwhile, attempts have also been made to parse the source language or the target language before statistical processing, and Non-Patent Documents 1 and 2 describe trends in this field. In particular, Non-Patent Document 2 describes techniques for reducing the size of the solution search space that SMT must handle, including a search space reduction technique called ITG (Inversion Transduction Grammar) and its extension, IST (Imposing Source Tree)-ITG.

Many people have also proposed various heuristics for how to reorder words. For example, Non-Patent Documents 3 and 4 present several heuristics that guide word order changes in English-to-Korean translation, including the concrete rule that the head of a verb phrase or adjective phrase is moved to the end.

Patent Document 1: JP 2007-328483 A
Patent Document 2: JP 2008-15844 A

Non-Patent Document 1: Akira Ushioda, “The role of syntax in the fusion of rule-based translation and statistical translation”, Japio 2009 YEARBOOK, Japan Patent Information Organization, 2009, pp. 286-289.
Non-Patent Document 2: Eiichiro Sumita, “A New Approach to Introduce Structural Constraints in Statistical Translation”, Japio 2008 YEARBOOK, Japan Patent Information Organization, 2008, pp. 266-271.
Non-Patent Document 3: Gumwon Hong, Seung-Wook Lee and Hae-Chang Rim, “Bridging Morpho-Syntactic Gap between Source and Target Sentences for English-Korean Statistical Machine Translation”, Proceedings of the ACL-IJCNLP-2009, Singapore, pp. 233-236, 2009.
Non-Patent Document 4: Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och, “Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages”, Proceedings of the NAACL-HLT-2009, pp. 245-253, 2009.

However, none of the conventional techniques described above achieves translation of sufficient quality. One reason is that translation from English to Japanese, for example, requires the word order to be changed drastically. In RBMT, a person must compare English and Japanese word order, find the cases in which the order changes, and devise and specify rules for when and how to reorder, which makes high accuracy difficult to achieve.

In SMT, on the other hand, which words should be exchanged with which is decided statistically. However, if a sentence contains 30 words, for example, the number of possible orderings is theoretically as large as 30! ≈ 10^32.4. Carrying out such an enormous search to translate a single sentence takes far too long and is impractical.
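The size of this search space can be checked with a few lines of Python. This is only an arithmetic illustration of the figure quoted above; the variable names are chosen here for the example.

```python
import math

n_words = 30
# Number of distinct orderings (permutations) of 30 words.
orderings = math.factorial(n_words)
# Base-10 exponent of that count, i.e. orderings == 10 ** exponent.
exponent = math.log10(orderings)

print(orderings)
print(f"30! is about 10^{exponent:.1f}")
```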

The techniques disclosed in Non-Patent Documents 1 and 2, Patent Documents 1 and 2, and elsewhere have gradually reduced the solution search space, but the decision as to which word order is best is made by a statistical language model (a monolingual model) of the target language (Japanese, in the case of English-to-Japanese translation). That is, the probability of which word follows a given word is computed, and word orders with the highest possible probability are preferred. With word order determined by such a language model, a good word order may be discarded when the sentence contains a new word.

The heuristics disclosed in Non-Patent Document 3 apply only to English-to-Korean translation and cannot be applied as they are to other translations such as English-to-Japanese or German-to-Korean. The technique disclosed in Non-Patent Document 4 has problems such as that, when applied to English-to-Japanese translation, it sometimes cannot put a complex sentence containing multiple clauses into Japanese word order.

The present invention has been made in view of this state of machine translation, and its object is to reorder the words of a sentence in a source language that is a non-head-final language (for example, English) into the word order of a target language that is a head-final language (for example, Japanese), using the result of syntactic analysis.

To solve the above problem, the invention according to claim 1 is a word order conversion device that, with respect to syntactic heads, reorders the words of a sentence in a source language that is a non-head-final language into the word order of a target language that is a head-final language, the device comprising: a storage unit that stores, for the source-language sentence, syntax tree information that is created by syntactic analysis performed in advance and indicates the hierarchical structure of the nodes corresponding to the words constituting the sentence, head node information indicating which of the child nodes of each node is the head node, and word-order-invariant node information indicating the nodes whose word order should not be changed, including nodes that are parallel structures and nodes that are mathematical expressions; and a processing unit that, for the sentence, refers to the syntax tree information, the head node information, and the word-order-invariant node information and, starting from the root node of the tree indicated by the syntax tree information, repeats for every node of that tree a process of moving the head node to the last position among the node's child nodes, for child nodes other than nodes whose word order should not be changed, thereby reordering the words of the source-language sentence into the word order of the target language.

According to this invention, with respect to syntactic heads, the words of a sentence in a source language that is a non-head-final language can be reordered into the word order of a head-final target language by starting from the root node of the tree and repeating, for every node, the process of moving the head node to the last position among the node's child nodes, for child nodes other than nodes whose word order should not be changed.

The invention according to claim 2 is a machine translation statistical model creation device that creates a statistical model for statistical machine translation used to translate, with respect to syntactic heads, a sentence in a source language that is a non-head-final language into a sentence in a target language that is a head-final language, the device comprising: a storage unit that stores a parallel corpus of source-language sentences and target-language sentences and a monolingual corpus of target-language sentences; a syntax analysis unit that parses the source-language sentences of the parallel corpus; the word order conversion device according to claim 1; a head-final parallel corpus creation unit that creates a head-final parallel corpus, which is a parallel corpus consisting of the source-language sentences whose word order has been converted by the word order conversion device after parsing by the syntax analysis unit, and the target-language sentences; and a statistical model creation unit that creates the statistical model using the head-final parallel corpus and the monolingual corpus of the target language.

According to this invention, the statistical model can be created with well-known statistical machine translation techniques, the only difference being that the words of the source-language sentences, which are in a non-head-final language, are reordered by the word order conversion device according to claim 1.

The invention according to claim 3 is a machine translation device that translates, with respect to syntactic heads, a sentence in a source language that is a non-head-final language into a sentence in a target language that is a head-final language, the device comprising: a storage unit that stores the statistical model according to claim 2; a syntax analysis unit that parses the source-language sentence to be translated; the word order conversion device according to claim 1, which converts the word order of the parsed source-language sentence; and a statistical machine translation unit that translates the converted source-language sentence into a target-language sentence by applying the statistical model stored in the storage unit.

According to this invention, the statistical machine translation unit refers to the statistical model according to claim 2 for the source-language sentence whose word order has been converted by the word order conversion device according to claim 1, and can therefore translate it into a target-language sentence without searching over an enormous number of word orders.

The present invention is also a program for causing a computer to execute the methods performed by each of the devices described above. According to this invention, installing the program on a computer allows the computer to execute those methods.

According to the present invention, the words of a sentence in a source language that is a non-head-final language can be reordered into the word order of a head-final target language using the result of syntactic analysis.

FIG. 1 is a block diagram of the word order conversion device of the present embodiment.
FIG. 2 is a diagram showing an example of converting an English sentence into a head-final English sentence.
FIG. 3 is a diagram showing an example of converting a German sentence into a head-final German sentence.
FIG. 4 is a flowchart showing the processing flow of the word order conversion device.
FIG. 5A is a diagram showing an example of word order conversion.
FIG. 5B is a diagram showing an example of word order conversion.
FIG. 6 is a block diagram of the machine translation statistical model creation device of the present embodiment.
FIG. 7 is a block diagram of the machine translation device of the present embodiment.
FIG. 8 is a diagram showing an example of the output of the analysis result of an English sentence.
FIG. 9 is a diagram showing an example of the output of the analysis result of an English sentence.

Modes for carrying out the present invention (hereinafter, “embodiments”) are described in detail below as a first, a second, and a third embodiment, with reference to the drawings (including, where appropriate, drawings other than those explicitly cited).

<First Embodiment>
As shown in FIG. 1, the word order conversion device 100 of the first embodiment is a computer device comprising a processing unit 101 and a storage unit 195. Although not shown, the word order conversion device 100 also has an input/output interface.

The processing unit 101 is realized by, for example, a CPU (Central Processing Unit) and memory, and comprises an initial setting unit 110, a terminal node determination unit 120, a child node enumeration unit 130, a word order exchange unit 140, a verb determination unit 150, an argument label assignment unit 160, a word replacement unit 170, a node selection unit 180, and a particle symbol addition unit 190. Their operation is described later.

The storage unit 195 is realized by, for example, a ROM (Read Only Memory) or an HDD (Hard Disk Drive). It stores the source text to be processed, the operating program of the processing unit 101, various setting values, the syntax tree information created by syntactic analysis performed in advance on the source-language sentence to be processed, the head node information indicating which of the child nodes of each node in the syntax tree is the head node, the word-order-invariant node information indicating the nodes in the syntax tree whose word order should not be changed (including parallel structures and mathematical expressions), the sentence after word order conversion, and so on.

The following description takes the case where the source language is English as an example. In outline, the word order conversion device 100 takes an input English sentence and outputs a head-final English sentence, that is, the sentence reordered into the word order of a head-final language such as Japanese or Korean.

The input sentence is assumed to have been parsed in advance, and the input is the parse result expressed as a syntax tree (syntax tree information). If the input sentence has not been parsed, it must be parsed into a syntax tree by a syntax analysis unit (drawn with a broken line in FIG. 1); this syntax analysis unit may be part of the word order conversion device 100 or of another device.

English sentences can be parsed with, for example, enju, which was proposed in “Feature forest models for probabilistic HPSG parsing” (Y. Miyao and J. Tsujii, Computational Linguistics, vol. 34, no. 1, pp. 35-80, 2008; hereinafter, the “English parsing document”) and is publicly available on the web.

For example, when the English sentence “John saw a beautiful girl yesterday.” is input, the English syntax tree output by the syntax analysis unit is as shown in FIG. 2(a). The head-final English sentence that the word order conversion device 100 of the present embodiment outputs after converting the word order of that syntax tree is as shown in FIG. 2(b).

Each node of the syntax tree is assigned a node ID (IDentifier). The numbers written inside the nodes (drawn as circles) in FIG. 2 are the node IDs. A node drawn in gray is the head node of the node connected immediately above it. A node drawn in white is a non-head node. For example, the node whose node ID is 3 (hereinafter, node 3; other nodes are referred to likewise) has node 4 and node 11 as child nodes, and its head is node 4.

Although not shown, each node of the syntax tree obtained by syntactic analysis is assumed to carry labels with the following meanings.
- ID of the node that is its head
- ID of the node that is its subject (only for nodes corresponding to verbs)
- ID of the node that is its object (only for nodes corresponding to transitive verbs)

FIG. 8 shows the result of analyzing the English sentence “John saw a beautiful girl yesterday.” with enju, output in XML (eXtensible Markup Language) format. The node IDs of the syntax tree in FIG. 2 appear in the id attributes of the <cons> tags in FIG. 8 (lines 2 to 4, 8 to 10, 14, 15, 18, 19, 22, 28, and 29). In FIG. 8 the ids are c0 to c12, whereas FIG. 2 omits the c and shows only the numbers. In the enju analysis result, each node carries the following information, which is used to label the nodes.

(1) ID of the head node among the child nodes (head)
The information “head” is written in the <cons> tag as the head attribute. For example, for node 3 in FIG. 2(a), line 8 of FIG. 8 contains head="c4", so node 4 is the head node of node 3. Similarly, from line 14 of FIG. 8, node 8 is the head node of node 6.

(2) ID of the node that is the first argument of the node (arg1)
For a verb, the information “arg1” points to its subject. For example, in FIG. 2(a), the XML tag of saw attached to node 5 (c5) (lines 11 and 12 of FIG. 8) has c1 as its arg1. Following the head of node 1 (c1) (see line 3 of FIG. 8) leads to node 2 (John), so node 2 is given the label “subject”.

(3) ID of the node that is the second argument of the node (arg2)
For a transitive verb, the information “arg2” points to its object; for the verb be, it points to the complement. For example, in FIG. 2(a), the XML tag of saw attached to node 5 (c5) (lines 11 and 12 of FIG. 8) has c6 as its arg2. Following the head of node 6 (c6) (see lines 14 and 18 of FIG. 8) leads to node 10 (girl), so node 10 is given the label “object”.

The word order conversion device 100 of the present embodiment starts processing from the topmost node (root node) of the input syntax tree, finds the head node among the child nodes connected to the node being processed (hereinafter, the “processing target node”), and moves that node to the end of the child nodes. The word-order-invariant node information in the storage unit 195 is consulted, and the head node is not moved when the child nodes form a parallel structure or a mathematical expression. Word order conversion is performed by repeating this for all child nodes.
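The procedure can be sketched in Python as follows. This is a minimal illustration and not the device's actual implementation: the dictionary layout, the function name `head_finalize`, and the `invariant` parameter are assumptions made for the example, and the tree is reconstructed from the node relationships described for FIG. 2(a) (nodes 3, 4, 6, 8, and 10 as stated in the text; the remaining links are inferred). Preterminal nodes are merged with their words, and punctuation is omitted.

```python
# Hypothetical in-memory tree for "John saw a beautiful girl yesterday".
# Inner nodes list their children in source order and name their head child.
TREE = {
    0: {"children": [1, 3], "head": 3},
    1: {"children": [2], "head": 2},
    2: {"word": "John"},
    3: {"children": [4, 11], "head": 4},
    4: {"children": [5, 6], "head": 5},
    5: {"word": "saw"},
    6: {"children": [7, 8], "head": 8},
    7: {"word": "a"},
    8: {"children": [9, 10], "head": 10},
    9: {"word": "beautiful"},
    10: {"word": "girl"},
    11: {"children": [12], "head": 12},
    12: {"word": "yesterday"},
}


def head_finalize(tree, node_id, invariant=frozenset()):
    """Return the words under node_id with each head child moved last.

    Nodes listed in `invariant` (e.g. parallel structures or formulas)
    keep the original order of their children.
    """
    node = tree[node_id]
    if "word" in node:  # terminal node: nothing to reorder
        return [node["word"]]
    children = list(node["children"])
    if node_id not in invariant:
        children.remove(node["head"])
        children.append(node["head"])  # head goes to the last position
    words = []
    for child in children:
        words.extend(head_finalize(tree, child, invariant))
    return words


result = " ".join(head_finalize(TREE, 0))
print(result)
```

With this tree, node 3 and node 4 are the only nodes whose children change order, so the verb ends up at the end of the sentence.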

The word order conversion process is described below with reference to FIGS. 1, 4, and 8. A concrete example of word order conversion is then described with reference to FIGS. 5A and 5B.

<Initial setting unit 110>
The initial setting unit 110 receives (reads) a parsed English sentence (syntax tree) from the syntax analysis unit (step S100), selects the root node of that syntax tree as the processing target node, and records the processing target node in the node list (step S110).

<Terminal node determination unit 120>
The terminal node determination unit 120 determines whether the processing target node is a terminal node (a node with no child nodes) (step S120); if Yes, processing proceeds to step S150, and if No, to step S130. When XML output is selected in enju, elements with a <tok> tag are terminal nodes (lines 5, 11, 16, 20, 23, and 30 of FIG. 8), and elements with a <cons> tag are not.

<Child node enumeration unit 130>
The child node enumeration unit 130 creates a “child node list”, which is the list of child nodes of the processing target node (step S130). For example, when the XML output of enju is read with Python's xml.dom.minidom library, the child nodes can be enumerated with childNodes.
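Steps S120 and S130 might look like the following with xml.dom.minidom. The XML string is a hand-written, simplified stand-in for the subtree under node 3, modeled on the description of FIG. 8; it is not actual enju output, and the helper names `is_terminal` and `child_node_list` are assumptions for this sketch. Note that childNodes also returns whitespace text nodes, so the sketch keeps element nodes only.

```python
from xml.dom import minidom

# Simplified, hand-written fragment (assumption): only id and head attributes.
XML = """<cons id="c3" head="c4">
<cons id="c4" head="c5">
<cons id="c5" head="t1"><tok id="t1">saw</tok></cons>
<cons id="c6" head="c8">
<cons id="c7" head="t2"><tok id="t2">a</tok></cons>
<cons id="c8" head="c10">
<cons id="c9" head="t3"><tok id="t3">beautiful</tok></cons>
<cons id="c10" head="t4"><tok id="t4">girl</tok></cons>
</cons></cons></cons>
<cons id="c11" head="c12"><cons id="c12" head="t5"><tok id="t5">yesterday</tok></cons></cons>
</cons>"""


def is_terminal(node):
    """Step S120: <tok> elements are terminal nodes, <cons> elements are not."""
    return node.tagName == "tok"


def child_node_list(node):
    """Step S130: list the element children, skipping whitespace text nodes."""
    return [c for c in node.childNodes
            if c.nodeType == minidom.Node.ELEMENT_NODE]


root = minidom.parseString(XML).documentElement
child_ids = [c.getAttribute("id") for c in child_node_list(root)]
first_tok = root.getElementsByTagName("tok")[0]
print(child_ids, is_terminal(root), is_terminal(first_tok))
```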

<Word order exchange unit 140>
If the processing target node is neither a parallel structure nor a mathematical expression, the word order exchange unit 140 moves the node ID corresponding to the head node in the child node list to the end of the child node list (step S140). In the XML output of enju, the ID of the head node is recorded in the head attribute of the processing target node. For a parallel structure, the cat or xcat attribute of the <cons> tag corresponding to the processing target node is “COOD”. For example, in the English sentence “Mary saw John and Bob.”, the part “John and Bob” is a parallel structure.

FIG. 9 shows the result of analyzing the English sentence “Mary saw John and Bob.” with enju; line 20 contains xcat="COOD". Following the head of this node (id="c19") leads to “John” (see lines 20 to 23 of FIG. 9). Whether a node is a mathematical expression can be judged by criteria such as whether it contains an equal sign or an inequality sign. The node ID of the processing target node in the node list is then replaced with the rewritten child node list (step S141).
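Steps S140 and S141 can be sketched with an explicit node list, as follows. The tree for “Mary saw John and Bob” is hypothetical: the node numbers, the nesting of “and Bob”, and the `surface` field used for the formula check are assumptions made for this example and do not reproduce FIG. 9.

```python
# Hypothetical tree; "cat"/"xcat" == "COOD" marks the parallel structure.
TREE = {
    0: {"children": [1, 2], "head": 2},
    1: {"word": "Mary"},
    2: {"children": [3, 4], "head": 3},
    3: {"word": "saw"},
    4: {"children": [5, 6], "head": 5, "xcat": "COOD"},
    5: {"word": "John"},
    6: {"children": [7, 8], "head": 7, "cat": "COOD"},
    7: {"word": "and"},
    8: {"word": "Bob"},
}


def contains_relation_sign(text):
    """Rough formula test: look for an equal or inequality sign."""
    return any(sign in text for sign in "=<>≤≥≠")


def is_invariant(node):
    """True if the children of this node must keep their order."""
    if node.get("cat") == "COOD" or node.get("xcat") == "COOD":
        return True
    return contains_relation_sign(node.get("surface", ""))


def reordered_children(node):
    """Step S140: move the head to the end unless the node is invariant."""
    children = list(node["children"])
    if not is_invariant(node):
        children.remove(node["head"])
        children.append(node["head"])
    return children


def convert(tree, root_id):
    node_list = [root_id]  # step S110: start from the root node
    i = 0
    while i < len(node_list):
        node = tree[node_list[i]]
        if "word" in node:
            i += 1  # terminal node: move on to the next entry
        else:
            # Step S141: replace the node ID with its child node list.
            node_list[i:i + 1] = reordered_children(node)
    return [tree[n]["word"] for n in node_list]


result = " ".join(convert(TREE, 0))
print(result)
```

Because nodes 4 and 6 are marked “COOD”, “John and Bob” keeps its order while the verb still moves to the end.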

The reason for taking the structure of the child nodes into account is as follows. Some well-known parsers treat A as the head in a parallel structure (coordination) such as “A and B” or “A or B”. If the node corresponding to the head node were simply moved to the end of the child node list without regard to the parallel structure, the word order would be converted to “B and A” or “B or A”. Even though A and B are on an equal footing, such a swap cannot be called a faithful translation. By not reordering when the child nodes form a parallel structure, this parser-induced problem is avoided. Likewise, a mathematical expression such as m<n+1 changes its meaning if its order is changed, so it is not reordered.

<Verb determination unit 150>
The verb determination unit 150 determines whether the processing target node is a verb, so that the argument label assigning unit 160 can be applied (step S150). If Yes, the process proceeds to step S160; if No, it proceeds to step S170.

<Argument label assigning unit 160>
The argument label assigning unit 160 performs preprocessing for generating particles, which do not exist in English. When the processing target node is a verb, the unit labels the words that are its subject and object. In enju output, the arg1 and arg2 attributes attached to the verb give the corresponding node IDs. These node IDs do not correspond directly to words, however, so the head links must additionally be followed.

That is, the terminal nodes (words) can be reached as follows.
・Starting from the arg1 node of the processing target node (the verb) and following head leads to the terminal node (word) that is the subject.
・Starting from the arg2 node of the processing target node and following head leads to the terminal node (word) that is the object.

Labels va1 and va2, indicating subject and object respectively, are then attached to these terminal node IDs (step S160). For example, for node 5 (saw) in FIG. 2, arg1=c1 and arg2=c6 (see line 12 of FIG. 8), so the label va1 is given to John, the head of node 1, and the label va2 is given to girl, which is the head (node 10) of the head (node 8) of node 6.
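The head-following step can be sketched as below. This is a simplified illustration that assumes a plain dictionary per node; the names `NODES`, `follow_head`, and `label_arguments` are hypothetical, and the excerpt keeps only the nodes of FIG. 2 that the example above mentions, without reproducing the actual enju XML.

```python
# Hypothetical excerpt of the tree in FIG. 2: each non-terminal node records
# its head child; terminal nodes carry a word instead.
NODES = {
    "c1": {"head": "c2"},
    "c2": {"word": "John"},
    "c5": {"word": "saw", "arg1": "c1", "arg2": "c6"},
    "c6": {"head": "c8"},
    "c8": {"head": "c10"},
    "c10": {"word": "girl"},
}


def follow_head(nodes, node_id):
    """Follow head links from node_id until a terminal (word) node is reached."""
    while "word" not in nodes[node_id]:
        node_id = nodes[node_id]["head"]
    return node_id


def label_arguments(nodes, verb_id):
    """Map the terminal nodes behind the verb's arg1/arg2 to va1/va2 (step S160)."""
    labels = {}
    for attr, label in (("arg1", "va1"), ("arg2", "va2")):
        target = nodes[verb_id].get(attr)
        if target in nodes:
            labels[follow_head(nodes, target)] = label
    return labels


print(label_arguments(NODES, "c5"))  # {'c2': 'va1', 'c10': 'va2'}
```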

When attaching labels, the device may also determine whether the input sentence is in the passive voice and, if so, correct the parser's result before assigning the subject and object labels. For example, for a sentence such as “A was stolen by B”, some parsers determine A to be the object and B to be the subject; assigning the labels simply according to the syntactic information, with A as the subject and B as the object, yields a translation that is more faithful to the original.

When the input sentence is “John bought a toy that was popular in Japan”, toy is both the object of bought and the subject of was. In such a case, a noun modified by a relative clause (toy) is labeled only according to its case outside the relative clause. That is, because toy is the object of the verb bought, which lies outside the relative clause, it is labeled as an object.

<Word replacement unit 170>
The word replacement unit 170 replaces the processing target node in the node list with the corresponding word (step S170). If the node list already contains the labels va1 and va2, those labels remain attached to the words that have already been replaced (see (14) in FIG. 5B).

<Node selection unit 180>
The node selection unit 180 determines whether the node list still contains a node ID (step S180). If Yes, it selects the node ID nearest the beginning of the node list as the processing target node (step S181), and the process returns to step S120. If No in step S180, the process proceeds to step S190.

<Particle symbol addition unit 190>
The particle symbol addition unit 190 adds the particle-equivalent words “_va1” and “_va2”, which mark positions where a particle may be inserted, immediately after the words labeled va1 and va2 in the node list (step S190), and then outputs the words in the node list in order from the beginning (step S200). Because the subject of a main clause in Japanese often reads more naturally with “は” (wa) than with “が” (ga), the “_va1” of the main clause may be replaced with a different word such as “_va0” (an example is given later). However, choosing between “が” and “は”, or between “を” (wo) and “に” (ni), is difficult, so these particle-equivalent words are not replaced with fixed Japanese words; which particle to use in which context is left to the statistical machine translation process.
The above processing outputs an English sentence whose words have been rearranged into head-final order.
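Step S190 can be illustrated with a short sketch. The function name `add_particle_tokens`, the `(word, label)` pair representation, and the `main_subject_index` parameter are assumptions made for this example; in particular, how the main-clause subject is identified is not specified here and is simply passed in by the caller.

```python
def add_particle_tokens(words, main_subject_index=None):
    """Append "_va1"/"_va2" after labelled words (step S190).

    words is a list of (word, label) pairs, where label is "va1", "va2" or None.
    If main_subject_index points at a word labelled va1, "_va0" is emitted
    for it instead of "_va1".
    """
    output = []
    for index, (word, label) in enumerate(words):
        output.append(word)
        if label is None:
            continue
        if label == "va1" and index == main_subject_index:
            output.append("_va0")
        else:
            output.append("_" + label)
    return output


labelled = [("John", "va1"), ("yesterday", None), ("a", None),
            ("beautiful", None), ("girl", "va2"), ("saw", None)]

print(" ".join(add_particle_tokens(labelled)))
# John _va1 yesterday a beautiful girl _va2 saw
print(" ".join(add_particle_tokens(labelled, main_subject_index=0)))
# John _va0 yesterday a beautiful girl _va2 saw
```

The tokens stay symbolic on purpose: as the description notes, choosing the actual Japanese particle is left to the statistical translation step.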

Next, the operation of the word order conversion device 100 is described concretely, assuming that the syntax tree obtained by the parsing unit is the one shown in FIG. 2(a). FIGS. 5A and 5B show how the node list is updated. The parenthesized numbers below ((1), (2), ...) correspond to the parenthesized numbers in FIGS. 5A and 5B. In FIGS. 5A and 5B, the portions of the node list that changed relative to the previous step are underlined.

(1) First, node 0 is selected as the processing target node (step S110 in FIG. 4). The node list is [node ID:0].

(2) Node 0 has child nodes (it is not a terminal node), so processing moves from the terminal node determination unit 120 (step S120 in FIG. 4) to the child node enumeration unit 130 (step S130 in FIG. 4). The child node enumeration unit 130 sets the list of node 0's child nodes, [node ID:1, node ID:3], as the child node list (step S130 in FIG. 4). Because the child nodes do not form a parallel structure or a mathematical expression, the word order exchange unit 140 moves the head node (node 3) to the end of the child node list (step S140 in FIG. 4). In this case the head node is already last, so no movement is needed. The node list is therefore replaced with [node ID:1, node ID:3] (step S141 in FIG. 4).

(3) Next, the node selection unit 180 selects node 1 as the processing target node (step S181 in FIG. 4), and the same processing is repeated. Node 1 has only one child node, node 2. The word order exchange unit 140 therefore has nothing to exchange, and the node list becomes [node ID:2, node ID:3].

(4) Next, node 2 is selected as the processing target node. Node 2 has no child nodes, so processing moves to the verb determination unit 150 (step S150 in FIG. 4). Node 2 is not a verb (No in step S150 in FIG. 4), so processing moves to the word replacement unit 170 (step S170 in FIG. 4). The word corresponding to node 2 is John, so the node list becomes [word:“John”, node ID:3] (step S170 in FIG. 4).

(5) Next, node 3 is selected as the processing target node. Its child node list is [node ID:4, node ID:11]. The word order exchange unit 140 moves the head node (node 4) to the end of the child node list. The node list reflecting this child node list is therefore [word:“John”, node ID:11, node ID:4].

(6) Next, node 11 is selected as the processing target node. Its only child node is node 12, so node 11 is replaced with node 12 and the node list becomes [word:“John”, node ID:12, node ID:4].

(7) Next, node 12 is selected as the processing target node. Node 12 has no child nodes, so it is replaced with its word, and the node list becomes [word:“John”, word:“yesterday”, node ID:4].

(8) Node 4 is selected as the processing target node and the same processing is repeated, so that the node list becomes [word:“John”, word:“yesterday”, node ID:6, node ID:5].

(9) Node 6 is selected as the processing target node. As a result, the node list becomes [word:“John”, word:“yesterday”, node ID:7, node ID:8, node ID:5].

(10) Node 7 is selected as the processing target node. Node 7 has no child nodes, so it is replaced with its word, and the node list becomes [word:“John”, word:“yesterday”, word:“a”, node ID:8, node ID:5].

(11) Node 8 is selected as the processing target node, and the node list becomes [word:“John”, word:“yesterday”, word:“a”, node ID:9, node ID:10, node ID:5].

(12) Similarly, node 9 is selected as the processing target node and replaced with its word, so the node list becomes [word:“John”, word:“yesterday”, word:“a”, word:“beautiful”, node ID:10, node ID:5].

(13) Node 10 is selected as the processing target node and replaced with its word, so the node list becomes [word:“John”, word:“yesterday”, word:“a”, word:“beautiful”, word:“girl”, node ID:5].

(14) Next, node 5 is selected as the processing target node. Node 5 is a verb (Yes in step S150 in FIG. 4), so the argument label assigning unit 160 examines its attributes and finds arg1=c1 and arg2=c6 (see lines 10 to 12 of FIG. 8). The attributes of node 1 give head=c2 (see line 3 of FIG. 8), which leads to “John” at node 2 (see lines 4 and 5 of FIG. 8). The label “va1” is therefore given to “John” (step S160 in FIG. 4). Likewise, node 6 has head=c8 (see line 14 of FIG. 8) and node 8 has head=c10 (see line 18 of FIG. 8), which leads to “girl” (see lines 22 and 23 of FIG. 8), so the label “va2” is given to “girl” (step S160 in FIG. 4). The node list thus becomes [word:“John”:va1, word:“yesterday”, word:“a”, word:“beautiful”, word:“girl”:va2, word:“saw”]. Here va1 is used as the label for the subject and va2 as the label for the object.

(15) Finally, the particle symbol addition unit 190 adds the particle-equivalent words “_va1 (が)” and “_va2 (を)” immediately after the words labeled va1 and va2, and the node list becomes [“John”, “_va1”, “yesterday”, “a”, “beautiful”, “girl”, “_va2”, “saw”] (step S190 in FIG. 4). If “_va1 (が)” is further replaced with “_va0 (は)”, the node list becomes [“John”, “_va0”, “yesterday”, “a”, “beautiful”, “girl”, “_va2”, “saw”] (step S190 in FIG. 4).

Through the above processing, the English sentence “John saw a beautiful girl yesterday.” is converted into the head-final English sentence “John _va0 yesterday a beautiful girl _va2 saw.”. It can then be translated into Japanese word by word, in order, as 「ジョンは昨日美しい少女を見た。」 (“John saw a beautiful girl yesterday.”).
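The whole walk-through, steps (1) to (15), can be reproduced with a compact sketch. Several things here are assumptions: the `TREE` dictionary is transcribed by hand from the node lists in the walk-through, integer IDs stand in for the enju “c…” IDs, the head of node 11 is taken to be its only child (node 12), and the check for order-invariant nodes is omitted because this sentence contains no coordination or formula. The names `terminal_of` and `to_head_final` are hypothetical.

```python
TREE = {
    0: {"children": [1, 3], "head": 3},
    1: {"children": [2], "head": 2},
    2: {"word": "John"},
    3: {"children": [4, 11], "head": 4},
    4: {"children": [5, 6], "head": 5},
    5: {"word": "saw", "verb": True, "arg1": 1, "arg2": 6},
    6: {"children": [7, 8], "head": 8},
    7: {"word": "a"},
    8: {"children": [9, 10], "head": 10},
    9: {"word": "beautiful"},
    10: {"word": "girl"},
    11: {"children": [12], "head": 12},
    12: {"word": "yesterday"},
}


def terminal_of(tree, node_id):
    """Follow head links down to a terminal node."""
    while "word" not in tree[node_id]:
        node_id = tree[node_id]["head"]
    return node_id


def to_head_final(tree, root=0):
    node_list = [(root, False)]  # (node ID, already replaced by its word?)
    labels = {}                  # terminal node ID -> "va1" / "va2"
    while True:
        # S180/S181: pick the unprocessed node ID nearest the list start.
        position = next(
            (i for i, (_, done) in enumerate(node_list) if not done), None)
        if position is None:
            break
        node_id = node_list[position][0]
        node = tree[node_id]
        if "children" in node:
            # S130-S141: enumerate children, move the head last, splice in.
            children = [c for c in node["children"] if c != node["head"]]
            children.append(node["head"])
            node_list[position:position + 1] = [(c, False) for c in children]
        else:
            # S150-S170: label the verb's arguments, then mark as a word.
            if node.get("verb"):
                for attr, label in (("arg1", "va1"), ("arg2", "va2")):
                    if attr in node:
                        labels[terminal_of(tree, node[attr])] = label
            node_list[position] = (node_id, True)
    output = []
    for node_id, _ in node_list:  # S190-S200
        output.append(tree[node_id]["word"])
        if node_id in labels:
            output.append("_" + labels[node_id])
    return output


print(" ".join(to_head_final(TREE)))
# John _va1 yesterday a beautiful girl _va2 saw
```

Because the head is always moved last, the verb is the final node to be processed, which is why its subject and object have already been replaced by words when the labels are attached.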

Even more complex English sentences can be converted into a word order close to Japanese. For example, “John went to the police because Mary lost his wallet.” is converted into “John _va0 Mary _va1 his wallet _va2 lost because the police to went.”. Translating this sequentially into Japanese gives 「ジョンはメアリが彼の財布をなくしたので警察に行った。」 (“John went to the police because Mary lost his wallet.”), which shows that the word order conversion is highly accurate.

In the same way, many English sentences can be put into a word order close to Japanese. This is possible because Japanese is a language with the regular property of placing the head last (a head-final language).

As described above, the word order conversion device 100 of this embodiment examines the syntax tree of a sentence in the source language (such as English) and moves each head node (the modified word) behind its non-head nodes (the modifiers), thereby rearranging the words of the source language sentence into the word order of the target language, which is a head-final language. Because nodes whose word order should not change, such as parallel structures and mathematical expressions, are left unreordered, the word order conversion is more accurate.

The word order conversion device 100 can be applied not only to English but also to other languages. German is described here as an example. FIG. 3 is an explanatory diagram similar to FIG. 2. FIG. 3(a) shows the input syntax tree when the source language is German and the target language is Japanese, and FIG. 3(b) shows the result of converting that syntax tree into head-final order.

In German, when a sentence contains an auxiliary verb, the main verb moves to the end of the sentence. Applying word order conversion to “Albert muste jeden Tag Arbeiten.” shown in FIG. 3(a) gives “Albert (_va0) jeden Tag Arbeiten muste.” shown in FIG. 3(b). Translating this sequentially into Japanese gives 「アルバートは毎日働かなければならなかった。」 (“Albert had to work every day.”), which shows that the word order conversion is highly accurate.

<Second Embodiment>
Next, a statistical model creation device for machine translation that uses the word order conversion device 100 of the first embodiment is described. The second embodiment is described for the case where the source language is English and the target language is Japanese, but the target language may instead be another language with head-final grammar, such as Korean. The source language may likewise be a language such as German instead of English.

As shown in FIG. 6, the statistical model creation device 200 for machine translation of the second embodiment is a computer device comprising an English-Japanese parallel corpus 210, a Japanese monolingual corpus 220, a parsing unit 230, a word order conversion unit 100a (corresponding to the word order conversion device 100 of the first embodiment, and also serving as the “head-final parallel corpus creation unit” in the claims), a head-final parallel corpus 240, and a statistical model creation unit 250.

In the statistical model creation device 200, of the training data used to create a statistical model (translation rules) in conventional statistical machine translation (SMT), the English side of the English-Japanese parallel corpus 210 is processed as follows: the parsing unit 230 parses each sentence using a method such as that of the English parsing literature cited above, and the word order conversion unit 100a converts its word order. This produces the head-final parallel corpus 240, in which the English side has been converted into head-final order.

The statistical model creation unit 250 then creates (trains) the statistical model 260 with well-known statistical machine translation (SMT) techniques (see Patent Documents 1 and 2, etc.), using the head-final parallel corpus 240 and the Japanese monolingual corpus 220 as training data.
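The corpus preparation described for the second embodiment amounts to rewriting only the source side of each sentence pair. The sketch below shows that data flow under loud assumptions: `build_head_final_corpus` is a hypothetical name, and `fake_parse` and `fake_reorder` are stand-ins that handle only a three-word subject-verb-object toy sentence. They do not represent the parsing unit 230 or the word order conversion unit 100a, and training the statistical model itself is out of scope.

```python
def build_head_final_corpus(parallel_corpus, parse, reorder):
    """Return (head-final source, target) pairs; the target side is untouched."""
    return [(" ".join(reorder(parse(source))), target)
            for source, target in parallel_corpus]


def fake_parse(sentence):
    # Stand-in for a real parser: just split the toy sentence into words.
    return sentence.rstrip(".").split()


def fake_reorder(tokens):
    # Stand-in for word order conversion: only valid for "S V O" toy input.
    subject, verb, obj = tokens
    return [subject, "_va1", obj, "_va2", verb]


corpus = [("I respect you.", "私はあなたを尊敬している。")]
print(build_head_final_corpus(corpus, fake_parse, fake_reorder))
# [('I _va1 you _va2 respect', '私はあなたを尊敬している。')]
```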

<Third Embodiment>
Next, a machine translation device that uses the word order conversion device 100 of the first embodiment and the statistical model 260 of the second embodiment is described. As shown in FIG. 7, the machine translation device 300 of the third embodiment is a computer device comprising a parsing unit 230, a word order conversion unit 100a (corresponding to the word order conversion device 100 of the first embodiment), a statistical machine translation unit 320, and the statistical model 260 (created in the second embodiment).

In the machine translation device 300, the input English sentence is parsed by the parsing unit 230, and the word order conversion unit 100a then converts its word order to produce a head-final English sentence. The statistical machine translation unit 320 then translates the head-final English sentence into Japanese using the statistical model 260 created (trained) by the statistical model creation device 200 of the second embodiment. The statistical machine translation unit 320 can use well-known rule-based machine translation (RBMT) or statistical machine translation (SMT) techniques as they are.

As described above, when translating between languages whose word orders differ greatly, such as English to Japanese, and the target language is a head-final language, the machine translation device 300 can bring each sentence close to the target language word order in advance. Translation can then proceed almost word for word without searching through a vast number of possible word orders, so that a translation with natural word order is obtained quickly.
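To make the “almost word for word” point concrete, the toy sketch below translates the head-final sentence from the first embodiment strictly left to right. The hand-written `TOY_LEXICON` and the function `translate_monotone` are assumptions for illustration only: they stand in for the statistical model 260, which in the described device chooses words and particles statistically rather than from a fixed table.

```python
# Hypothetical one-to-one lexicon; the article "a" maps to nothing.
TOY_LEXICON = {
    "John": "ジョン",
    "_va0": "は",
    "yesterday": "昨日",
    "a": "",
    "beautiful": "美しい",
    "girl": "少女",
    "_va2": "を",
    "saw": "見た",
}


def translate_monotone(tokens, lexicon):
    """Translate token by token without any reordering."""
    return "".join(lexicon.get(token, token) for token in tokens)


head_final = "John _va0 yesterday a beautiful girl _va2 saw".split()
print(translate_monotone(head_final, TOY_LEXICON))
# ジョンは昨日美しい少女を見た
```

Applied to the original order “John saw a beautiful girl yesterday”, the same table would put 見た in the middle of the sentence, which is the reordering search that the conversion step removes.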

Furthermore, with this method the creator of the source language parsing unit only needs knowledge of the source language grammar and does not need to know the grammar of the target language. Once parsing is possible, simply applying the parsed sentence (syntax tree) to the word order conversion unit 100a brings it close to the target language word order.

The techniques disclosed in Non-Patent Documents 3 and 4 basically only move predicates such as verbs and adjectives to the end, and therefore cannot put complex sentences containing, for example, a because clause into Japanese word order. Non-Patent Document 4, for instance, shows because ending up at the beginning of the sentence, which is far removed from the word order of Japanese or Korean. In this embodiment, by contrast, the head-final processing is applied without restricting the part of speech, so that because comes last within the because clause and is placed in the same position as the Japanese “ので” (node).

This concludes the description of the embodiments, but aspects of the present invention are not limited to them.
For example, in addition to parallel structures and mathematical expressions, the word order invariant node information may be set freely by an administrator of the word order conversion device 100 or the like.
In addition, the configuration of each device and the specific details of the processing may be changed as appropriate without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS
100 Word order conversion device
100a Word order conversion unit
101 Processing unit
110 Initial setting unit
120 Terminal node determination unit
130 Child node enumeration unit
140 Word order exchange unit
150 Verb determination unit
160 Argument label assigning unit
170 Word replacement unit
180 Node selection unit
190 Particle symbol addition unit
195 Storage unit
200 Statistical model creation device for machine translation
210 English-Japanese parallel corpus
220 Japanese monolingual corpus
230 Parsing unit
240 Head-final parallel corpus
250 Statistical model creation unit for translation
260 Statistical model
300 Machine translation device
320 Statistical machine translation unit

Claims

A word order conversion device that rearranges the words of a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into the word order of a target language, which is a head-final language, the device comprising:
a storage unit that stores, for the source language sentence, syntax tree information created by parsing performed in advance and indicating the hierarchical structure of nodes corresponding to the words constituting the sentence, head node information indicating the head node among the child nodes of each node, and word order invariant node information indicating nodes whose word order should not be changed, including parallel structures and mathematical expressions; and
a processing unit that, for the sentence, refers to the syntax tree information, the head node information, and the word order invariant node information and, starting from the root node of the syntax tree indicated by the syntax tree information, repeats, for every node of the syntax tree other than the nodes whose word order should not be changed, a process of moving the head node to the last position among the child nodes, thereby rearranging the words of the source language sentence into the word order of the target language.

A statistical model creation device for machine translation that creates a statistical model for statistical machine translation used to translate a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into a sentence in a target language, which is a head-final language, the device comprising:
a storage unit that stores a parallel corpus of source language sentences and target language sentences, and a monolingual corpus of target language sentences;
a parsing unit that parses the source language sentences of the parallel corpus;
the word order conversion device according to claim 1;
a head-final parallel corpus creation unit that creates a head-final parallel corpus, which is a parallel corpus consisting of the source language sentences whose word order has been converted by the word order conversion device after parsing by the parsing unit, and the target language sentences; and
a statistical model creation unit that creates the statistical model using the head-final parallel corpus and the monolingual corpus of the target language.

A machine translation device that translates a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into a sentence in a target language, which is a head-final language, the device comprising:
a storage unit that stores the statistical model according to claim 2;
a parsing unit that parses the source language sentence to be translated;
the word order conversion device according to claim 1, which converts the word order of the words in the parsed source language sentence; and
a statistical machine translation unit that translates the sentence into the target language by applying the statistical model stored in the storage unit to the converted source language sentence.

A word order conversion method performed by a word order conversion device that rearranges the words of a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into the word order of a target language, which is a head-final language, wherein
the word order conversion device comprises:
a storage unit that stores, for the source language sentence, syntax tree information created by parsing performed in advance and indicating the hierarchical structure of nodes corresponding to the words constituting the sentence, head node information indicating the head node among the child nodes of each node, and word order invariant node information indicating nodes whose word order should not be changed, including parallel structures and mathematical expressions; and a processing unit, and
the processing unit,
for the sentence, refers to the syntax tree information, the head node information, and the word order invariant node information and, starting from the root node of the syntax tree indicated by the syntax tree information, repeats, for every node of the syntax tree other than the nodes whose word order should not be changed, a process of moving the head node to the last position among the child nodes, thereby rearranging the words of the source language sentence into the word order of the target language.

A statistical model creation method for machine translation performed by a statistical model creation device for machine translation that creates a statistical model for statistical machine translation used to translate a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into a sentence in a target language, which is a head-final language, wherein
the statistical model creation device for machine translation comprises:
a storage unit that stores a parallel corpus of source language sentences and target language sentences and a monolingual corpus of target language sentences, the word order conversion device according to claim 4, a parsing unit, a head-final parallel corpus creation unit, and a statistical model creation unit,
the parsing unit parses the source language sentences of the parallel corpus,
the head-final parallel corpus creation unit creates a head-final parallel corpus, which is a parallel corpus consisting of the source language sentences whose word order has been converted by the word order conversion device after parsing by the parsing unit, and the target language sentences, and
the statistical model creation unit creates the statistical model using the head-final parallel corpus and the monolingual corpus of the target language.

A machine translation method performed by a machine translation device that translates a sentence in a source language, which is a non-head-final language with respect to syntactic heads, into a sentence in a target language, which is a head-final language, wherein
the machine translation device comprises:
a storage unit that stores the statistical model according to claim 5, a parsing unit, the word order conversion device according to claim 4, and a statistical machine translation unit,
the parsing unit parses the source language sentence to be translated,
the word order conversion device converts the word order of the words in the parsed source language sentence, and
the statistical machine translation unit translates the sentence into the target language by applying the statistical model stored in the storage unit to the converted source language sentence.

A program for causing a computer to execute the method according to any one of claims 4 to 6.