JP6678087B2

JP6678087B2 - Bilingual sentence extraction device, bilingual sentence extraction method and program

Info

Publication number: JP6678087B2
Application number: JP2016165873A
Authority: JP
Inventors: 松永　務; 佐藤　大輔
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2020-04-08
Anticipated expiration: 2036-08-26
Also published as: JP2018032324A

Description

The present invention relates to a technique for creating a bilingual corpus.

In recent years, the importance of creating large, high-quality bilingual corpora for use in statistical machine translation and text mining has come to be recognized. Because creating a bilingual corpus is generally very costly, how to create one efficiently is a technical problem. One known method of creating a bilingual corpus is to translate the document in one language of a bilingual document into the other language, and then to align sentences by counting the number of matching words between each resulting translated sentence and the sentences in the other language (see Non-Patent Document 1).

Non-Patent Document 1: Tatsuya Ishizaka, Masao Uchiyama, Eiichiro Sumita, Kazuhide Yamamoto, "Building a Large-Scale Open Source Japanese-English Parallel Corpus", IPSJ SIG Technical Report, 2009-NL-191, p.1-6, May 2009

However, because the conventional corpus creation method aligns sentences by focusing only on the number of matching words, it sometimes pairs sentences that, taken as whole sentences, are not translations of each other.

The present invention has been made in view of these circumstances, and its object is to create a higher-quality bilingual corpus than is obtained when sentences are aligned based only on the number of matching words.

To solve the above problem, the present invention provides a bilingual sentence extraction device comprising: a bilingual document acquisition unit that acquires a bilingual document in a first language and a second language; a bilingual sentence acquisition unit that matches the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, to acquire one or more bilingual sentences in the first language and the second language; a translation model generation unit that generates a translation model based on the acquired one or more bilingual sentences; a translation unit that, for each of the acquired one or more bilingual sentences, translates the first-language sentence constituting that bilingual sentence into the second language using the generated translation model; an edit distance calculation unit that, for each of the acquired one or more bilingual sentences, calculates an edit distance between the first-language sentence translated into the second language and the second-language sentence corresponding to that sentence; and a bilingual sentence selection unit that selects, from among the acquired one or more bilingual sentences, the bilingual sentences whose calculated edit distance is greater than a threshold.

The present invention also provides a bilingual sentence extraction method executed by one or more computers, the method comprising the steps of: acquiring a bilingual document in a first language and a second language; matching the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, to acquire one or more bilingual sentences in the first language and the second language; generating a translation model based on the acquired one or more bilingual sentences; for each of the acquired one or more bilingual sentences, translating the first-language sentence constituting that bilingual sentence into the second language using the generated translation model; for each of the acquired one or more bilingual sentences, calculating an edit distance between the first-language sentence translated into the second language and the second-language sentence corresponding to that sentence; and selecting, from among the acquired one or more bilingual sentences, the bilingual sentences whose calculated edit distance is greater than a threshold.

The present invention also provides a program for causing a computer to execute the steps of: acquiring a bilingual document in a first language and a second language; matching the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, to acquire one or more bilingual sentences in the first language and the second language; generating a translation model based on the acquired one or more bilingual sentences; for each of the acquired one or more bilingual sentences, translating the first-language sentence constituting that bilingual sentence into the second language using the generated translation model; for each of the acquired one or more bilingual sentences, calculating an edit distance between the first-language sentence translated into the second language and the second-language sentence corresponding to that sentence; and selecting, from among the acquired one or more bilingual sentences, the bilingual sentences whose calculated edit distance is greater than a threshold.

According to the present invention, a higher-quality bilingual corpus can be created than when sentences are aligned based only on the number of matching words.

FIG. 1 is a block diagram illustrating an example of the configuration of the bilingual sentence extraction device 1.
FIG. 2 is a flowchart illustrating an example of the bilingual sentence extraction process.
FIG. 3 is a diagram illustrating an example of a bilingual document.
FIG. 4 is a diagram illustrating an example of data in the bilingual sentence storage unit 106.
FIG. 5 is a diagram illustrating an example of data in the bilingual sentence storage unit 106.
FIG. 6 is a diagram illustrating an example of data in the bilingual sentence storage unit 106.

1. Embodiment
1-1. Configuration
FIG. 1 is a block diagram illustrating an example of the configuration of the bilingual sentence extraction device 1 according to the present embodiment. The bilingual sentence extraction device 1 is a computer including an arithmetic processing device such as a CPU and a storage device such as an HDD. The bilingual sentence extraction device 1 has the following functions: a bilingual document storage unit 101, a bilingual document acquisition unit 102, a word division unit 103, a bilingual dictionary storage unit 104, a bilingual sentence acquisition unit 105, a bilingual sentence storage unit 106, a translation model generation unit 107, a translation unit 108, an edit distance calculation unit 109, a bilingual sentence selection unit 110, and a bilingual sentence editing unit 111. Of these functions, those of the bilingual document storage unit 101, the bilingual dictionary storage unit 104 and the bilingual sentence storage unit 106 are implemented by the storage device. The other functions are implemented by the arithmetic processing device executing a program stored in the storage device.

The bilingual document storage unit 101 stores bilingual documents in a first language and a second language. Here, the first language is Japanese and the second language is English. A bilingual document is a pair of a Japanese document and an English document created by translating that document into English. A bilingual document is, for example, a pair of the patent publication of a Japanese patent application and the patent publication of a US patent application belonging to the same patent family. Alternatively, it is a pair of a Japanese newspaper article and the English version of that article, or a pair of the English version of a manual for open source software and the Japanese translation of that manual.

The bilingual document acquisition unit 102 acquires a bilingual document from the bilingual document storage unit 101.

The word division unit 103 divides the bilingual document acquired by the bilingual document acquisition unit 102 into sentences, and divides each sentence into words. For the Japanese document, it performs morphological analysis, divides the document into sentences using the Japanese full stop (句点) as a clue, and divides each sentence into words. In doing so, it may convert inflected words to their base forms. For the English document, it divides the document into sentences using periods as clues, and divides each sentence into words using spaces as clues. In doing so, it may analyze word endings to convert inflected words to their base forms. It may also convert uppercase letters to lowercase and plural forms to singular forms.
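The splitting rules described above can be sketched roughly as follows. This is a minimal illustration, not the device's implementation: the function names are hypothetical, the English splitter breaks on every period (so abbreviations and decimals would be split incorrectly), and Japanese word segmentation is left out because it requires a morphological analyzer.

```python
def split_english(document: str) -> list[list[str]]:
    """Split an English document into sentences on periods, then into
    lowercased words on whitespace (hypothetical helper)."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [[word.lower() for word in s.split()] for s in sentences]


def split_japanese_sentences(document: str) -> list[str]:
    """Split a Japanese document into sentences on the full stop '。'.
    Dividing each sentence into words is assumed to be done separately
    by a morphological analyzer, which this sketch does not include."""
    return [s.strip() for s in document.split("。") if s.strip()]
```

The optional normalization steps (base forms, singular forms) are omitted here; they would be applied to each word after splitting.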

The bilingual dictionary storage unit 104 stores a bilingual dictionary. Here, a bilingual dictionary is a set of pairs, each consisting of a Japanese word and an English word having the same meaning as that word.

The bilingual sentence acquisition unit 105 matches the Japanese sentences and the English sentences extracted by the word division unit 103, using the bilingual dictionary stored in the bilingual dictionary storage unit 104, to acquire one or more Japanese-English bilingual sentences. Specifically, the bilingual sentence acquisition unit 105 translates a Japanese sentence extracted by the word division unit 103 into English, calculates the similarity between the resulting translated sentence and each English sentence, and acquires, as a bilingual sentence, the pair consisting of that Japanese sentence and the English sentence with the highest similarity. Here, the similarity is a value calculated based on the number of words that match between the translated sentence and the English sentence. More specifically, it is the ratio of the number of content words (independent words) that match between the two to the total number of content words contained in the translated sentence and the English sentence. For example, the bilingual sentence acquisition unit 105 acquires bilingual sentences using DP (Dynamic Programming) matching, as in the bilingual corpus creation method described in Non-Patent Document 1 above. As another example, the bilingual sentence acquisition unit 105 may acquire bilingual sentences using DP matching as described in Takehito Utsuro, et al. "Bilingual Text Matching using Bilingual Dictionary and Statistics," COLING, p.1076-1082, 1994. Here, a bilingual sentence is a pair of a Japanese sentence and an English sentence created by translating that sentence into English; in other words, a pair of a Japanese sentence and an English sentence having the same meaning.
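As a rough illustration of the similarity described above, the following sketch looks up each Japanese word in a dictionary, then scores the result against each English sentence. Several things here are assumptions rather than details from the text: the stop-word list used to approximate "content words", the reading of the ratio as twice the number of shared content words divided by the content words of both sentences, and the plain best-score selection in place of DP matching. All names are hypothetical.

```python
# Hypothetical function-word list used to approximate "content words".
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and"}


def content_words(words):
    """Return the set of words that are not in the stop-word list."""
    return {w for w in words if w not in STOPWORDS}


def translate_with_dictionary(ja_words, dictionary):
    """Word-by-word lookup; words missing from the dictionary are dropped."""
    translated = []
    for word in ja_words:
        translated.extend(dictionary.get(word, []))
    return translated


def similarity(translated_words, en_words):
    """Shared content words relative to the content words of both sentences
    (one possible reading of the ratio; an assumption of this sketch)."""
    a, b = content_words(translated_words), content_words(en_words)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def best_match(ja_words, en_sentences, dictionary):
    """Return (index, score) of the most similar English sentence."""
    translated = translate_with_dictionary(ja_words, dictionary)
    scores = [similarity(translated, en) for en in en_sentences]
    best = max(range(len(en_sentences)), key=scores.__getitem__)
    return best, scores[best]
```

A DP-based aligner would additionally constrain the matches to respect sentence order across the two documents, which this per-sentence selection does not do.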

The bilingual sentence storage unit 106 stores the one or more bilingual sentences (in other words, the bilingual corpus) acquired by the bilingual sentence acquisition unit 105. It stores each bilingual sentence in association with a bilingual sentence ID that identifies that bilingual sentence.

The translation model generation unit 107 generates a translation model based on the one or more bilingual sentences stored in the bilingual sentence storage unit 106. In doing so, the translation model generation unit 107 uses, for example, the Moses decoder (http://www.statmt.org/moses/). For the Moses decoder, see, for example, Philipp Koehn, et al. "Moses: Open Source Toolkit for Statistical Machine Translation," Annual Meeting of the Association for Computational Linguistics, demonstration session, Prague, Czech Republic, June 2007.

For each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the translation unit 108 translates the Japanese sentence constituting that bilingual sentence into English, using the translation model generated by the translation model generation unit 107. The translation unit 108 stores the resulting translated sentence in the bilingual sentence storage unit 106 in association with the original Japanese sentence.

For each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the edit distance calculation unit 109 calculates the edit distance between the Japanese sentence translated into English by the translation unit 108 and the English sentence corresponding to that sentence. Here, the edit distance is a value calculated based on the number of edit operations required to change the Japanese sentence translated into English into the corresponding English sentence. Specifically, the edit distance calculation unit 109 calculates TER (Translation Error Rate) as the edit distance. The edit operations are, specifically, the four operations of insertion, deletion, substitution and reordering. For TER, see, for example, Matthew Snover, et al. "A study of translation edit rate with targeted human annotation," Proceedings of Association for Machine Translation in the Americas, p.223-231, 2006. After calculating the TER, the edit distance calculation unit 109 stores it in the bilingual sentence storage unit 106 in association with the corresponding bilingual sentence.
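To make the idea of a word-level edit distance concrete, here is a simplified sketch. It counts only insertions, deletions and substitutions; the reordering (shift) operation that TER also counts is deliberately left out, so this is not a TER implementation. The function names are hypothetical, and dividing by the reference length follows the usual way an edit rate is normalized.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the word list `hyp` into the word list `ref`."""
    prev = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning i hypothesis words into nothing takes i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


def edit_rate(hyp, ref):
    """Edit count divided by the reference length (shifts are not modelled)."""
    if not ref:
        raise ValueError("reference sentence must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

Because shifts are ignored, a sentence whose phrases are merely reordered scores worse here than it would under TER.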

The bilingual sentence selection unit 110 selects, from among the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the bilingual sentences whose TER calculated by the edit distance calculation unit 109 is greater than a threshold. Here, the threshold is set, for example, so that a predetermined proportion of the one or more bilingual sentences stored in the bilingual sentence storage unit 106 is selected. After selecting a bilingual sentence, the bilingual sentence selection unit 110 notifies the bilingual sentence editing unit 111 of the bilingual sentence ID of that bilingual sentence as a deletion target.
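One way to set a threshold so that a given proportion of the stored bilingual sentences is selected is sketched below. The function names, the example IDs and the scores are all hypothetical, and the handling of ties (tied scores at the cut-off are not selected) is an assumption of this sketch.

```python
def threshold_for_fraction(scores, fraction):
    """Return a threshold such that roughly `fraction` of `scores`
    lie strictly above it (fewer if scores tie at the cut-off)."""
    if not scores:
        raise ValueError("no scores given")
    ranked = sorted(scores, reverse=True)
    k = int(len(ranked) * fraction)  # number of pairs to select
    if k >= len(ranked):
        return float("-inf")         # everything is selected
    return ranked[k]                 # scores strictly above this are selected


def select_ids(score_by_id, threshold):
    """IDs of the bilingual sentences whose score exceeds the threshold."""
    return sorted(i for i, s in score_by_id.items() if s > threshold)


# Hypothetical data: five pairs, select the worst 40 percent.
scores = {"001": 1, "002": 7, "003": 3, "004": 9, "005": 2}
cutoff = threshold_for_fraction(list(scores.values()), 0.4)
deletion_targets = select_ids(scores, cutoff)
```

The selected IDs correspond to the deletion targets that are passed on for removal from the corpus.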

The bilingual sentence editing unit 111 deletes, from the bilingual sentence storage unit 106, the bilingual sentence identified by the bilingual sentence ID notified from the bilingual sentence selection unit 110.

1-2. Operation
The operation of the bilingual sentence extraction device 1 will now be described. FIG. 2 is a flowchart illustrating an example of the bilingual sentence extraction process executed by the bilingual sentence extraction device 1.

In step S1 of the bilingual sentence extraction process, the bilingual document acquisition unit 102 of the bilingual sentence extraction device 1 acquires a bilingual document from the bilingual document storage unit 101. FIG. 3 is a diagram illustrating an example of a bilingual document.

When the bilingual document acquisition unit 102 has acquired the bilingual document, the word division unit 103 divides the acquired bilingual document into sentences and divides each sentence into words (step S2). For the Japanese document, it performs morphological analysis, divides the document into sentences using the Japanese full stop as a clue, and divides each sentence into words. For the English document, it divides the document into sentences using periods as clues, and divides each sentence into words using spaces as clues.

When the word division unit 103 has divided the bilingual document into sentences and each sentence into words, the bilingual sentence acquisition unit 105 matches the Japanese sentences and the English sentences extracted by the word division unit 103, using the bilingual dictionary stored in the bilingual dictionary storage unit 104, to acquire one or more Japanese-English bilingual sentences (step S3). Having acquired the one or more bilingual sentences, the bilingual sentence acquisition unit 105 stores each bilingual sentence in the bilingual sentence storage unit 106 in association with a bilingual sentence ID (step S4). FIG. 4 is a diagram illustrating an example of data in the bilingual sentence storage unit 106 after the bilingual sentences have been stored by the bilingual sentence acquisition unit 105.

When the bilingual sentence acquisition unit 105 has stored the one or more bilingual sentences in the bilingual sentence storage unit 106, the translation model generation unit 107 generates a translation model based on the stored bilingual sentences (step S5).

When the translation model generation unit 107 has generated the translation model, the translation unit 108 translates, for each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the Japanese sentence constituting that bilingual sentence into English using the generated translation model (step S6). The translation unit 108 stores the resulting translated sentence in the bilingual sentence storage unit 106 in association with the original Japanese sentence. FIG. 5 is a diagram illustrating an example of data in the bilingual sentence storage unit 106 after the translated sentences have been stored by the translation unit 108.

When the translation unit 108 has translated the Japanese sentences constituting the bilingual sentences into English, the edit distance calculation unit 109 calculates, for each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the TER between the Japanese sentence translated into English and the English sentence corresponding to that sentence (step S7). After calculating the TER, the edit distance calculation unit 109 stores it in the bilingual sentence storage unit 106 in association with the corresponding bilingual sentence. FIG. 6 is a diagram illustrating an example of data in the bilingual sentence storage unit 106 after the TER has been stored by the edit distance calculation unit 109. The "total" TER shown in FIG. 6 is the sum of the numbers of insertion, deletion, substitution and reordering operations.

When the edit distance calculation unit 109 has calculated the TER, the bilingual sentence selection unit 110 selects, from among the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the bilingual sentences whose TER is greater than the threshold (step S8). After selecting a bilingual sentence, the bilingual sentence selection unit 110 notifies the bilingual sentence editing unit 111 of the bilingual sentence ID of that bilingual sentence as a deletion target. For example, if the threshold is set to "5", the bilingual sentence selection unit 110 notifies the bilingual sentence editing unit 111 of the bilingual sentence ID "032", among the bilingual sentences shown in FIG. 6, as a deletion target.

When notified of a bilingual sentence ID by the bilingual sentence selection unit 110, the bilingual sentence editing unit 111 deletes the bilingual sentence identified by that ID from the bilingual sentence storage unit 106 (step S9).
The above is the description of the bilingual sentence extraction process.
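Steps S5 to S9 can be summarized as a short sketch. The function `filter_corpus` and its three callable parameters are hypothetical stand-ins: `train_model` for translation model generation, `translate` for the translation step, and `distance` for the edit distance calculation. None of them corresponds to a specific library API.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (first-language sentence, second-language sentence)


def filter_corpus(
    pairs: Dict[str, Pair],
    train_model: Callable[[List[Pair]], object],
    translate: Callable[[object, str], str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, Pair]:
    """Return the pairs that remain after deleting those whose distance
    between the machine translation and the paired sentence exceeds
    `threshold`."""
    model = train_model(list(pairs.values()))             # step S5
    scores = {}
    for pair_id, (source, target) in pairs.items():
        hypothesis = translate(model, source)             # step S6
        scores[pair_id] = distance(hypothesis, target)    # step S7
    to_delete = {pid for pid, score in scores.items()
                 if score > threshold}                    # step S8
    return {pid: pair for pid, pair in pairs.items()
            if pid not in to_delete}                      # step S9
```

Passing the three stages in as callables keeps the sketch independent of any particular translation toolkit or distance measure.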

According to the bilingual sentence extraction device 1 described above, after the bilingual sentence acquisition unit 105 has acquired bilingual sentences from a bilingual document using DP matching, the bilingual sentence selection unit 110 screens the acquired bilingual sentences based on TER. Therefore, the bilingual sentence extraction device 1 can create a higher-quality bilingual corpus than when sentences are aligned based only on the number of matching words.

2. Modifications
The above embodiment may be modified as described below. One or more of the modifications described below may be combined with one another.

2-1. Modification 1
The bilingual sentence extraction device 1 according to the above embodiment may be a computer system made up of a plurality of computers. The storage device of the bilingual sentence extraction device 1 according to the above embodiment may be connected to the bilingual sentence extraction device 1 via a communication line such as the Internet.

2-2. Modification 2
In the above embodiment, the first language may be English and the second language may be Japanese. The combination of the first language and the second language may also be chosen freely from natural languages including, in addition to Japanese and English, German, French, Chinese and Korean.

2-3. Modification 3
The translation model generation unit 107 may generate the translation model using a decoder other than the Moses decoder. For example, it may use the Pharaoh decoder. For the Pharaoh decoder, see, for example, Philipp Koehn, "Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models," Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, p.115-124, 2004.

2-4. Modification 4
In the above embodiment, the translation model generation unit 107 may be omitted, and the translation unit 108 may use a predetermined translation model to translate, for each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the Japanese sentence constituting that bilingual sentence into English.

2-5. Modification 5
The edit distance calculation unit 109 according to the above embodiment may calculate a value other than TER as the edit distance. For example, it may calculate the Levenshtein distance, the Damerau-Levenshtein distance, or the Jaro-Winkler distance.
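As an illustration of one of these alternatives, the sketch below computes the restricted form of the Damerau-Levenshtein distance (often called optimal string alignment), which adds a swap of two adjacent items to the insert, delete and substitute operations. The function name is hypothetical, and applying the distance to word lists rather than characters is an assumption of this sketch.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two sequences:
    insertions, deletions, substitutions and swaps of adjacent items,
    where no item takes part in more than one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]
```

With word lists as input, two adjacent words in the wrong order cost one edit here, whereas the plain Levenshtein distance would count two substitutions.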

As another example, the edit distance calculation unit 109 may calculate, for each of the one or more bilingual sentences stored in the bilingual sentence storage unit 106, the BLEU or RIBES score between the Japanese sentence translated into English by the translation unit 108 and the English sentence corresponding to that sentence. For BLEU, see, for example, Kishore Papineni, et al. "BLEU: a method for automatic evaluation of machine translation," Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, p.311-318, July 2002. For RIBES, see, for example, Tsutomu Hirao, et al., "RIBES: An Automatic Evaluation Method for Translation Based on Rank Correlation" (in Japanese), Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, p.1115-1118, March 2011.

2-6. Modification 6
After selecting the bilingual sentences whose TER is greater than the threshold, the bilingual sentence selection unit 110 according to the above embodiment may instead notify the bilingual sentence editing unit 111 of the bilingual sentence IDs of the bilingual sentences that are stored in the bilingual sentence storage unit 106 and were not selected, as bilingual sentences to be kept rather than deleted. In this case, the bilingual sentence editing unit 111 deletes from the bilingual sentence storage unit 106 every bilingual sentence other than those identified by the bilingual sentence IDs notified from the bilingual sentence selection unit 110.

2-7. Modification 7
The program for implementing the functions of the bilingual sentence extraction device 1 may be provided via a recording medium readable by a computer. Here, the recording medium is, for example, a magnetic recording medium such as a magnetic tape or magnetic disk, an optical recording medium such as an optical disc, a magneto-optical recording medium, or a semiconductor memory. The program may also be provided via a network such as the Internet.

DESCRIPTION OF SYMBOLS 1 ... bilingual sentence extraction device, 101 ... bilingual document storage unit, 102 ... bilingual document acquisition unit, 103 ... word division unit, 104 ... bilingual dictionary storage unit, 105 ... bilingual sentence acquisition unit, 106 ... bilingual sentence storage unit, 107 ... translation model generation unit, 108 ... translation unit, 109 ... edit distance calculation unit, 110 ... bilingual sentence selection unit, 111 ... bilingual sentence editing unit

Claims

A bilingual sentence extraction device comprising:
a bilingual document acquisition unit that acquires a bilingual document in a first language and a second language;
a bilingual sentence acquisition unit that acquires one or more bilingual sentences in the first language and the second language by matching the sentences of the first language and the sentences of the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language;
a translation model generation unit that generates a translation model based on the acquired one or more bilingual sentences;
a translation unit that, for each of the acquired one or more bilingual sentences, translates the sentence of the first language constituting the bilingual sentence into the second language using the generated translation model;
an edit distance calculation unit that, for each of the acquired one or more bilingual sentences, calculates an edit distance between the sentence of the first language translated into the second language and the sentence of the second language corresponding to that sentence; and
a bilingual sentence selection unit that selects, from the acquired one or more bilingual sentences, a bilingual sentence whose calculated edit distance is greater than a threshold value.

A bilingual sentence extraction method executed by one or more computers, the method comprising:
acquiring a bilingual document in a first language and a second language;
acquiring one or more bilingual sentences in the first language and the second language by matching the sentences of the first language and the sentences of the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language;
generating a translation model based on the acquired one or more bilingual sentences;
translating, for each of the acquired one or more bilingual sentences, the sentence of the first language constituting the bilingual sentence into the second language using the generated translation model;
calculating, for each of the acquired one or more bilingual sentences, an edit distance between the sentence of the first language translated into the second language and the sentence of the second language corresponding to that sentence; and
selecting, from the acquired one or more bilingual sentences, a bilingual sentence whose calculated edit distance is greater than a threshold value.

A program for causing a computer to execute:
acquiring a bilingual document in a first language and a second language;
acquiring one or more bilingual sentences in the first language and the second language by matching the sentences of the first language and the sentences of the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language;
generating a translation model based on the acquired one or more bilingual sentences;
translating, for each of the acquired one or more bilingual sentences, the sentence of the first language constituting the bilingual sentence into the second language using the generated translation model;
calculating, for each of the acquired one or more bilingual sentences, an edit distance between the sentence of the first language translated into the second language and the sentence of the second language corresponding to that sentence; and
selecting, from the acquired one or more bilingual sentences, a bilingual sentence whose calculated edit distance is greater than a threshold value.
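
For illustration only, the translating, calculating, and selecting steps recited above can be sketched as follows. The sketch rests on several assumptions that are not part of the claims: the edit distance is a word-level Levenshtein distance normalized by the length of the second-language sentence (TER, as used in the description, additionally counts phrase shifts, which are omitted here), and the generated translation model is replaced by a hypothetical lookup table named `toy_model`. All function names are likewise hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(hyp, ref):
    """Edit distance divided by reference length (no shift operation)."""
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


def select_pairs(pairs, translate, threshold):
    """Return IDs of pairs whose translated source is far from the paired target."""
    selected = []
    for pair_id, (src, tgt) in pairs.items():
        hyp = translate(src).split()
        if normalized_distance(hyp, tgt.split()) > threshold:
            selected.append(pair_id)
    return selected


# Hypothetical stand-in for the translation model generated from the pairs.
toy_model = {"猫 が 寝る": "the cat sleeps", "犬 が 走る": "the dog runs"}

# Hypothetical bilingual sentences: ID -> (first-language, second-language)
pairs = {
    1: ("猫 が 寝る", "the cat sleeps"),
    2: ("犬 が 走る", "prices rose sharply last year"),
}
selected = select_pairs(pairs, toy_model.get, threshold=0.5)
print(selected)  # IDs of pairs judged not to be translations of each other
```

In this toy data, pair 2 is selected because none of the translated words match the paired second-language sentence, while pair 1 has distance zero and is left unselected.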