JP6345638B2

JP6345638B2 - System and method for maintaining speech-to-speech translation in the field

Info

Publication number: JP6345638B2
Application number: JP2015218066A
Authority: JP
Inventors: Ian R. Lane; Alexander Waibel
Original assignee: Facebook Inc
Current assignee: Meta Platforms Inc
Priority date: 2008-04-15
Filing date: 2015-11-06
Publication date: 2018-06-20
Anticipated expiration: 2029-04-15
Also published as: CN102084417B; JP2011524991A; CN102084417A; KR101445904B1; JP2016053726A; EP2274742A1; KR20110031274A; BRPI0910706A2

Description

(Cross-reference to related applications)
This application claims priority to U.S. Provisional Patent Application No. 61/045,079, filed April 15, 2008; U.S. Provisional Application No. 61/092,581, filed August 28, 2008; and U.S. Provisional Patent Application No. 61/093,898, filed September 3, 2008.

The present invention is directed generally to speech-to-speech translation systems for cross-lingual communication and, more particularly, to a method and apparatus for field maintenance that allows users to add new vocabulary items and to improve and modify the content and usage of their system in the field, without requiring linguistic or technical knowledge or expertise.

Automatic speech recognition (ASR) and machine translation (MT) technologies have matured to the point where it has become feasible to develop practical speech translation systems on laptops or mobile devices for limited and unrestricted domains. Domain-limited speech-to-speech translation systems, in particular, have been developed in the research field and in research laboratories for a variety of application domains, including tourism, medical deployment, and military applications. Such systems have been described previously, for example in A. Waibel, C. Fugen, "Spoken language translation," Signal Processing Magazine, IEEE, May 2008; 25(3):70-79, In Proc. HLT, 2003, and in Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Thilo Kohler, Sebastian Stuker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Stephan Vogel, Tanja Schultz and Alan W. Black, "The CMU TransTac 2007 eyes-free and hands-free two-way speech-to-speech translation system," In Proc. of the IWSLT, Trento, Italy, October 2007. However, these systems are limited to operating with a restricted vocabulary that is defined in advance by the system developers and determined by the application domain and the envisioned place of use. Vocabulary and language usage are therefore largely determined by example scenarios and by data collected or presumed in such scenarios.

In field situations, however, actual words and language usage deviate from the scenarios anticipated in the laboratory. Even in simple domains such as tourism, language usage varies dramatically in the field as users travel to different locations, interact with different people, and pursue different goals and needs. New words and new expressions therefore arise constantly. Such new words, called "out-of-vocabulary" (OOV) words in speech recognition terminology, are misrecognized as in-vocabulary words and then translated incorrectly. The user may attempt a paraphrase, but if a critical word or concept (such as a person's name or a city name) cannot be entered or communicated, the missing word or expression can cause communication to break down.

Despite the need for user-modifiable speech-to-speech translation systems, no practical solution has been proposed so far. While adding a word to the system may seem easy, making such modifications has proven extremely difficult. Appropriate modifications must be made to many component modules throughout the system, and most modules must be retrained to restore the balance and integrated functioning of the components. Indeed, about 20 different modules have to be modified or re-optimized to learn a new word. Such modifications require expertise and experience with the components of a speech translation system. As a result, to the inventors' understanding, such modifications have so far been made only in laboratories by experts, requiring human expertise, time, and cost.

For example, suppose a system designed for users in Europe does not contain the name "Hong Kong" in its vocabulary. When a speaker says the sentence "Let's go to Hong Kong," the system recognizes the closest-sounding similar word in the dictionary and produces "Let's go to home call." At this point it is not clear whether the error is the result of a recognition error or the result of the word's absence from the entire speech-to-speech translation system. The user therefore proceeds to correct the system. This can be done by one of several correction techniques. The simplest is to speak or type again, but it can alternatively be done more effectively by the cross-modal error correction techniques described in other disclosures and prior art (Waibel et al., U.S. Pat. No. 5,855,000). Once the correct spelling of the desired word sequence has been established ("Let's go to Hong Kong"), the system performs a translation. If "Hong Kong" is in the dictionary, the system proceeds normally from there, performing translation and synthesis. If, however, it is absent from the recognition and translation dictionaries, the system will need to establish whether the word is a named entity. Finally, and most importantly, even if a name or word can be translated properly into the output language through user intervention, unless it is learned the system will fail again the next time the user speaks the same word.

Unfortunately, learning a new word cannot be addressed merely by typing the new word into a word list; it requires changes at about 20 different points and at all levels of a speech translation system. At present, it involves manual tagging and editing of entries, collection of extensive databases containing the required word, retraining of language model and translation model probabilities, and re-optimization of the entire system, in order to re-establish consistency among all components and their dictionaries and to restore the statistical balance among the words, phrases, and concepts in the system (probabilities must sum to 1, so all words are affected by the addition of a single word).
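The remark that probabilities must sum to 1 can be made concrete with a minimal sketch. It assumes a toy unigram distribution and a hypothetical helper `add_word`; real language models are far richer, and the numbers here are illustrative only.

```python
def add_word(unigram, new_word, new_prob):
    """Return a copy of `unigram` that gives `new_word` probability `new_prob`.

    Every existing probability is scaled by (1 - new_prob) so the
    distribution still sums to 1. This is an illustrative assumption,
    not the retraining procedure of any particular system.
    """
    if not 0.0 < new_prob < 1.0:
        raise ValueError("new_prob must lie strictly between 0 and 1")
    if new_word in unigram:
        raise ValueError("word is already in the vocabulary")
    scale = 1.0 - new_prob
    updated = {word: prob * scale for word, prob in unigram.items()}
    updated[new_word] = new_prob
    return updated


# Toy vocabulary (hypothetical values).
lm = {"home": 0.5, "call": 0.3, "go": 0.2}
lm2 = add_word(lm, "hongkong", 0.1)
```

Adding the single entry changes the probability of every other word, which is why one new word touches the statistical balance of the whole model.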

As a result, even small modifications to existing speech translation systems have generally required the advanced computing tools and linguistic resources found in research laboratories. For actual field use, however, it is unacceptable to require every modification to be made in the laboratory, because doing so takes too much time, effort, and cost. What is needed instead is a learning and customization module that hides all the complexity from the user, performs all critical operations and language processing steps semi-autonomously or autonomously behind the scenes, and interacts with the human user in the least disruptive manner possible through a simple, intuitive interface, thereby eliminating the need for linguistic or technical expertise in the field altogether. The present invention provides a detailed description of a learning and customization module that satisfies these needs.

Unfortunately, translation systems are often highly complex, so that user access is not feasible or is not taken advantage of. There is therefore a need for systems and methods that use machine translation technology and enable user modification capabilities to provide cross-lingual communication without requiring linguistic or technical knowledge or expertise, thereby overcoming language barriers and bringing people closer together.

In various embodiments, the present invention solves the foregoing problems by providing a method and apparatus for updating the vocabulary of a speech translation system. In various embodiments, a method is provided for updating the vocabulary of a speech translation system that translates a first language, including written and spoken words, into a second language. The method includes adding a new word in the first language to a first recognition lexicon of the first language and associating a description with the new word, the description including pronunciation and word class information. The new word and description are then updated in a first machine translation module associated with the first language. The first machine translation module includes a first tagging module, a first translation model, and a first language module, and is configured to translate the new word into a corresponding translated word in the second language.

Optionally, for bidirectional translation, the method additionally includes translating the translated word from the second language back into the new word of the first language, associating the new word with the corresponding translated word of the second language, and adding the translated word and a description of the translated word to a second recognition lexicon of the second language. A second machine translation module associated with the second language is then updated with the translated word and description. The second machine translation module includes a second tagging module, a second translation model, and a second language module.
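The update steps described in the two paragraphs above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `LanguageSide`, the functions `register` and `add_new_word`, the dictionary layouts, and the phoneme strings are all hypothetical and are not taken from the patent or from any real lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageSide:
    """Per-language resources named in the text; the field layout is assumed."""
    recognition_lexicon: dict = field(default_factory=dict)  # word -> (pronunciation, class)
    tagger: dict = field(default_factory=dict)               # word -> word class
    translation_model: dict = field(default_factory=dict)    # word -> translated word
    class_members: dict = field(default_factory=dict)        # word class -> set of words


def register(side, word, pronunciation, word_class, translation):
    """Add one word to the lexicon, tagger, translation model and class list."""
    side.recognition_lexicon[word] = (pronunciation, word_class)
    side.tagger[word] = word_class
    side.translation_model[word] = translation
    side.class_members.setdefault(word_class, set()).add(word)


def add_new_word(first, second, word, pron, translation, translation_pron, word_class):
    """Bidirectional update: the new word on one side, its translation on the other."""
    register(first, word, pron, word_class, translation)
    register(second, translation, translation_pron, word_class, word)


english, japanese = LanguageSide(), LanguageSide()
# Pronunciation strings below are made up for illustration.
add_new_word(english, japanese, "Hong Kong", "HH AO NG K AO NG",
             "香港", "ho N ko N", "@PLACE")
```

Keeping the word class on both sides is what lets the same entry serve recognition, tagging, and translation without separate bookkeeping per component.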

In embodiments, the method further includes entering a first word into a text-to-speech pronunciation lexicon associated with the first language and a second word into a text-to-speech pronunciation lexicon associated with the second language. The input signals may be of different modalities (for example, speech and non-verbal spelling, speech and verbal spelling, writing and speech, and so on), referred to herein as "cross-modal," or of the same modality (speech and respeaking, writing and rewriting, and so on).

One embodiment of the present invention is directed to a field-maintainable, class-based speech-to-speech translation system for communicating between a first language and a second language. The system includes two speech recognition units, each configured to accept speech comprising spoken words of the first or second language and to produce text corresponding to the spoken language, and two corresponding machine translation units, each configured to receive text from one of the speech recognition units and to output a translation of that text into text of the other language. The system also includes a user field customization module that enables the system to learn new words in cooperation with the user. The user field customization module is configured to accept user-selected input comprising speech or text corresponding to one or both of the languages, and updates the machine translation units appropriately with the user-selected input.

In one embodiment, four primary features equip the system to provide a field-maintainable, class-based speech-to-speech translation system. The first feature is a speech translation framework that enables new words to be added to the active system vocabulary, or switching between location- or task-specific vocabularies. The framework provides for dynamic addition of words to the speech recognition module without requiring the module to be restarted. The system uses a multilingual system dictionary and language-independent word classes across all system components in the speech-to-speech translation device; class-based machine translation (phrase-based statistical MT, syntactic MT, example-based MT, and so on); multilingual word-class tagging during model training, based on a combination of monolingual taggers; and word-class tagging in new languages by way of alignment via a parallel corpus from a known tagged language. The second feature is a multimodal interactive interface that enables non-experts to add new words to the system. The third feature is that the system is designed to accommodate ASR and SMT model adaptation using multimodal feedback provided by the user. The fourth feature is that the system has networking capability that enables sharing of corrections or words.

In another embodiment, a multimodal interactive interface is disclosed that enables a user to add new words to a speech-to-speech translation device in the field and without technical expertise. Examples include: (1) a method to automatically classify the class of a word or phrase to be added to the system and to automatically generate pronunciations and translations of the word; (2) a method for entering new words cross-modally by one or more of speaking, typing, spelling, handwriting, browsing, and paraphrasing; (3) multimodal feedback to help a linguistically untrained user determine whether the phonetic transliteration and translation are adequate, including multiple textual forms (that is, romanized form and written form in the other language's script) and an acoustic form via text-to-speech (TTS; that is, whether it sounds right); (4) a method for setting language model and translation probabilities for new words; and (5) boosting or discounting language model and translation probabilities for learned new words based on relevance to user activities, interests, and history of use.

In another embodiment, an online system is disclosed that corrects via multimodal user feedback in the field. Examples include: (1) an interface and methods that enable users to correct automatic speech recognition results and use this feedback to adapt the speech recognition components; (2) an interface and methods that enable users to correct machine translation hypotheses and use this feedback to improve the machine translation components; and (3) a method for automatically adjusting (enhancing or decreasing) language model, dictionary, and translation model probabilities for correct or corrected words based on user correction.
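Item (3) above, adjusting probabilities after a user correction, might look like the following minimal sketch. The function name `apply_correction`, the multiplicative `factor`, and the toy probabilities are assumptions made for illustration; the text does not specify how the adjustment is computed.

```python
def apply_correction(probs, wrong, right, factor=2.0):
    """Boost the corrected word, discount the misrecognized one, renormalize.

    `factor` is an assumed tuning constant. Both words must already be
    in `probs`; the returned distribution sums to 1.
    """
    adjusted = dict(probs)
    adjusted[right] *= factor   # enhance the word the user confirmed
    adjusted[wrong] /= factor   # decrease the word that was output in error
    total = sum(adjusted.values())
    return {word: p / total for word, p in adjusted.items()}


# Toy distribution (hypothetical values).
before = {"home": 0.6, "hong": 0.1, "go": 0.3}
after = apply_correction(before, wrong="home", right="hong")
```

Renormalizing after the boost and discount keeps the model a valid distribution, at the cost of slightly shifting words that were not involved in the correction.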

In another embodiment, an Internet application is disclosed that allows users to share corrections or new word additions made in the field across devices. Examples include: (1) methods to upload, download, and edit models for use in speech-to-speech translation devices via the World Wide Web; (2) methods to collate in-the-field new word additions and corrections across the entire community of users; and (3) methods to upload, download, and edit location- or task-specific vocabularies for use in speech-to-speech translation devices.
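One way to picture the collation in item (2) is a simple agreement count over user submissions. The sketch below is an assumption for illustration: the function `collate`, the `min_users` threshold, and the tuple layout are hypothetical, and the text does not say how entries are actually reconciled.

```python
from collections import Counter


def collate(submissions, min_users=2):
    """Build a shared dictionary from (user_id, word, translation) tuples.

    An entry is accepted once at least `min_users` distinct users have
    submitted the same word/translation pair. If several translations of
    one word qualify, the most frequently submitted one is kept.
    """
    votes = Counter()
    seen = set()
    for user_id, word, translation in submissions:
        key = (user_id, word, translation)
        if key in seen:          # count each user once per pair
            continue
        seen.add(key)
        votes[(word, translation)] += 1

    shared = {}
    for (word, translation), count in votes.most_common():
        if count >= min_users:
            shared.setdefault(word, translation)
    return shared


submissions = [
    ("u1", "Hong Kong", "香港"),
    ("u2", "Hong Kong", "香港"),
    ("u2", "Hong Kong", "香港"),      # duplicate from the same user
    ("u3", "Hong Kong", "ホンコン"),
    ("u1", "Waibel", "ワイベル"),
]
shared_dictionary = collate(submissions)
```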

The accompanying drawings illustrate examples of embodiments of the present invention. A brief description of the drawings follows.

FIG. 1 is a block diagram illustrating a speech-to-speech translation system constructed according to an embodiment of the present invention.

FIG. 2 illustrates an example of a graphical user interface displayed to the user via a tablet interface.

FIG. 3 is a flow chart illustrating the steps of speech-to-speech translation performed according to the embodiment of the present invention shown in FIG. 1.

FIG. 4 is a flow chart illustrating the steps by which the system learns from corrections made by the user (correction and repair module).

FIG. 5 is a flow chart illustrating the steps by which the user can add new words to the system (user field customization module).

FIG. 6 is a flow chart illustrating one example of a method by which the apparatus automatically generates the translation and pronunciation for a new word that the user wishes to add to the system.

FIG. 7 is a flow chart illustrating one example of a method for verifying a new word entered via a multimodal interface.

FIG. 8 illustrates an example of a visual interface that displays automatically generated word information.

FIG. 9 is a flow chart illustrating the steps required to train class-based MT models.

FIG. 10 is a flow chart illustrating the steps of applying class-based MT to an input sentence.

FIG. 11 is a diagram illustrating features that may be used during word-class tagging by statistical or machine learning approaches.

Various embodiments of the present invention describe methods and systems for speech-to-speech translation. Embodiments may be used to adapt to the user's voice and speaking style via model adaptation. In further embodiments, the user can correct recognition errors, and the system can explicitly learn from errors that the user has corrected, thereby making it less likely that these errors occur again in the future. The present invention enables the user to customize the vocabulary to his or her individual needs and environment, either by adding new words to the system or by selecting predefined dictionaries optimized for a specific location or task. When adding new words, a multimodal interface allows the user to correct and verify automatically generated translations and pronunciations. This allows the user to add new words to the system even when the user has no knowledge of the other language. In one embodiment, the system is further configured to transmit any new vocabulary entered by a user to a community of users. This data is collated, and dictionaries are automatically generated, which can then be downloaded by any user.

FIG. 1 illustrates a block diagram overview of an example of a field-maintainable speech-to-speech translation system according to the present invention. In this example, the system operates between two languages, L_a and L_b. This is the typical implementation of a speech-to-speech dialog system, involving speech-to-speech translation in both directions, from L_a to L_b and from L_b to L_a. The bidirectionality of this configuration is not a prerequisite of the invention, however. A unidirectional system from L_a to L_b, or a multidirectional system involving several languages L_1 ... L_n, could equally benefit from the present invention. The system has two ASR modules 2 and 9, which recognize speech in L_a and L_b, respectively, and produce text corresponding to L_a and L_b, respectively, using an acoustic model 18, an ASR class-based language model 19, and a recognition lexicon model 20 (shown in FIG. 3). In this example, the "Ninja" speech recognition system developed at Mobile Technologies, LLC was used. Other types of ASR modules that may be used include speech recognizers developed by IBM Corporation, SRI, or BBN, or at Cambridge or Aachen.

The system also includes two machine translation modules 3 and 8, which translate text from L_a to L_b and from L_b to L_a, respectively (module 11). The MT module used in this example was the "PanDoRA" system developed at Mobile Technologies, LLC. Other MT modules could be used, such as those developed by IBM Corporation, SRI, or BBN, or at Aachen University.

Two text-to-speech engines 4 and 7, each corresponding to one of the machine translation modules 3 and 8, are configured to receive text produced by a corresponding ASR unit. The output text is transferred to the respective MT module 3 or 8, which translate text from L_a to L_b and from L_b to L_a, respectively. The TTS modules generate audio output, converting at least one text word in L_a to speech via an output device 5, such as a loudspeaker, and at least one text word in L_b to speech via device 5 or another output device, such as a loudspeaker 6, respectively. In this example, a Cepstral TTS module was used. Any TTS module that supports the Windows® SAPI (speech application programming interface) conventions could be employed as well.

A correction and repair module 11 allows the user to correct the system output via multiple modalities, including speech, gesture, writing, tactile, touch-sensitive, and keyboard interfaces, and enables the system to learn from the user's corrections. The correction and repair module may be of the type disclosed in U.S. Pat. No. 5,855,000. A user field customization module 12 provides an interface for users to add new vocabulary to the system, and can also select an appropriate system vocabulary for the current situation, triggered for example by a change in location, as determined by GPS coordinates indicating the current location of the device, or by an explicit selection of task or location by the user.
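The location-triggered vocabulary selection described above can be sketched as a nearest-region lookup. Everything named here is an assumption for illustration: the `VOCABULARIES` table, its approximate coordinates, and the function `select_vocabulary` are hypothetical, and the text does not state how GPS coordinates are mapped to a vocabulary.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical predefined vocabularies keyed by an approximate centre point.
VOCABULARIES = {
    "hong_kong": (22.32, 114.17),
    "pittsburgh": (40.44, -79.99),
}


def select_vocabulary(lat, lon, explicit_choice=None):
    """Prefer the user's explicit selection; otherwise pick the nearest region."""
    if explicit_choice is not None:
        return explicit_choice
    return min(VOCABULARIES,
               key=lambda name: haversine_km(lat, lon, *VOCABULARIES[name]))
```

For instance, `select_vocabulary(22.28, 114.16)` picks the `hong_kong` entry, while passing `explicit_choice` overrides the GPS-based choice, mirroring the two triggers named in the text.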

The user can access the user field customization module 12 and interact with the system via a graphical user interface displayed on the screen (or active touch screen) of the device 13 and a pointing device 14, such as a mouse or pen. An example of the graphical user interface is shown in FIG. 2. In this example, the device 13 displays the text of audio input in L_a and the corresponding text in a window 15. The machine translation of the text L_a into the second language L_b is displayed in a window 16.

In one embodiment, the same microphone and loudspeaker can be used for both languages. Microphones 1 and 10 can thus be a single physical device, and speakers 5 and 6 can be a single physical device.

A flow chart illustrating an example of the operation of the method of the present invention is shown in FIG. 3. First, in step 15b, the speech recognition system is activated by the user, for example by selecting a button on the graphical user interface (FIG. 2) or an external physical button (not shown). The user's speech (item 25) is then recognized in step 27 by one of the ASR modules: module 2 if the user is speaking L_a, and module 9 if the user is speaking L_b. The ASR modules 2 and 9 apply three models: the acoustic model 18, the ASR class-based language model 19, and the recognition lexicon model 20. These models are language-specific, and each ASR module contains its own set of models. In step 28, the resulting text of the user's speech is displayed via the GUI on the device screen 13.

Translation is then applied via MT module 3 or 8, depending on the input language (step 29). MT modules 3 and 8 apply three main models: a tagging or parsing [Collins02] model to identify word classes (model 22), a class-based translation model (model 23), and a class-based language model (model 24). The tagging model 22 may be any suitable type of tagging or parsing model, such as the types described in J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," In Proceedings of 18th International Conference on Machine Learning, pages 282-289, 2001 ("Lafferty01") or Michael Collins, "Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods" (2004), in Harry Bunt, John Carroll, and Giorgio Satta, editors, New Developments in Parsing Technology, Kluwer. Other models applied during machine translation include distortion models, which constrain how words are reordered during translation, and sentence length models. A detailed description of class-based machine translation is given below. The resulting translation is displayed via the GUI on device 13, as shown in step 30.

To help the user determine whether the translation output is adequate, the automatically generated translation (FIG. 2, item 16) is translated back into the input language via MT module 3 or 8 and displayed in parentheses directly beneath the original input, as illustrated for example in FIG. 2, item 15a. If the confidence of both speech recognition and translation is high (step 31), as determined by ASR module 2 or 9 and MT module 3 or 8, speech output (item 26) is generated through loudspeaker 5 or 6 via TTS module 4 or 7 (step 33). Otherwise, the system indicates via the GUI, audio, and/or careful feedback that the translation may be wrong. The specific TTS module used in step 33 is selected based on the output language.
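The confidence check of step 31 can be pictured with a minimal sketch. The `Hypothesis` type, the `choose_output` function, and the 0.8 threshold are hypothetical names and values chosen for illustration. The description does not specify how the confidences are computed or compared.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def choose_output(asr: Hypothesis, mt: Hypothesis, threshold: float = 0.8):
    """Return ("speak", text) when both confidences clear the threshold,
    otherwise ("warn", text) so the caller can flag a possible error."""
    if min(asr.confidence, mt.confidence) >= threshold:
        return ("speak", mt.text)
    return ("warn", mt.text)
```

Taking the minimum of the two scores is one simple way to require that both recognition and translation be reliable before speech output is generated.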

Thereafter, if the user is dissatisfied with the generated translation, the user may intervene during the speech-to-speech translation process at any of steps 27 through 33, or after the process has completed. This invokes the correction and repair module 11 (step 35). The correction and repair module 11 records and logs any corrections the user makes. These corrections can later be used to update ASR modules 2 and 9 and MT modules 3 and 8, as described in detail further below. The user field customization module 12 is invoked if the correction contains a new vocabulary item (step 36), if the user enters field customization mode at step 15c to explicitly add a new word to the system, or if a new word is automatically detected in the input speech at step 15d using confidence measures or a new-word model, such as the method described in Thomas Schaaf, "Detection of OOV words using generalized word models and a semantic class language model," in Proc. of Eurospeech, 2001. Module 12 provides a multimodal interface that enables the user to add new words to the active system vocabulary. When the user adds a new word or phrase, the ASR, MT, and TTS models (items 17, 21, and 33a) are updated as required. The functionality of this module is described further below for both languages.

A common set of classes (for example, person names, place names, and organization names) is used in both ASR and MT for both languages. This provides a system-wide set of semantic slots through which new words can be added to the system. The names, special terms, and expressions that occur within these classes are the words that vary most with each user's deployment, location, behavior, customs, and tasks, and they are therefore the ones most in need of user customization.

In a preferred example, the specific classes used depend on the application domain of the system. The classes may include semantic classes for named entities (person, place, and organization names), task-specific noun phrases (for example, names of foods, illnesses, or medicines), and a further open class for words or phrases that do not fit any predefined class. Syntactic classes or word-equivalence classes, such as synonyms, can be used as well. Examples of application domains include, but are not limited to, tourism, medical care, and peacekeeping. In one example, the classes needed in the tourist domain include names of people, cities, foods, and the like. In another example, the classes needed by a medical professional include names of diseases, names of medications, anatomical names, and the like. In another example, the classes needed in a peacekeeping application include names of weapons, vehicles, and the like. To enable field-customizable speech translation, the system allows errors to be corrected and later learned from, through the operation of the correction and repair module 11 in combination with the user field customization module 12.

[Correction and repair module]
The correction and repair module 11 enables the user to intervene at any time during the speech-to-speech translation process. The user may either identify and log an error or, if he/she wishes, correct an error in the speech recognition or translation output. Such user intervention is of considerable value, because it provides immediate correction in the human-to-human communication process and gives the system the opportunity to adapt to the user's needs and interests and to learn from its mistakes. A flow diagram illustrating this error feedback functionality is shown in FIG. 4. If the user is dissatisfied with the translation of an utterance (that is, an error has occurred), the user can log the current input (step 40). The system saves the audio of the current utterance, together with other information, to a log file, which the user can access and correct at a later time. Alternatively, the user can upload the log to a community database, where expert users can identify and correct the errors.

The user can also correct the speech recognition or machine translation output via a number of modalities. The user can correct the entire utterance by re-speaking it or by entering the sentence via a keyboard or handwriting interface. Alternatively, the user can highlight the erroneous portion of the output hypothesis via the touch screen, mouse, or cursor keys and correct only that phrase or word, using the keyboard, handwriting, or speech, or by spelling the word out letter by letter. The user can also select the erroneous portion of the output hypothesis via the touch screen and correct it by choosing a competing hypothesis from an automatically generated drop-down list, by re-entering it by speech, or by any other complementary modality (for example, handwriting, spelling, or paraphrasing). These methods, and the way complementary repair actions are combined, build on the methods proposed by Waibel et al. in U.S. Pat. No. 5,855,000 for multimodal speech recognition correction and repair. Here, they are applied to the speech recognition and translation modules of an interactive speech translation system.

If the user corrects the speech recognition output (step 43), the system first determines whether the correction contains a new word (step 44). This determination is made by checking each word against the recognition lexicon model 20 associated with each language, L_a and L_b. If a word is not found, the system prompts the user to add the new word to the active system vocabulary if desired (FIG. 5, step 50). Otherwise, the probabilities in the ASR models (FIG. 3, item 17) are updated to reduce the likelihood of the same error occurring again. This can be performed in a discriminative manner, in which the probability of the corrected word sequence is increased and that of close competing hypotheses is reduced.
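The decision of steps 43 and 44 can be illustrated with a minimal sketch. It assumes the recognized and corrected word sequences are already aligned one to one and that model scores are kept in a plain dictionary. The function name `handle_asr_correction`, the `weights` store, and the fixed `step` size are hypothetical. An actual discriminative update would act on the ASR model parameters themselves.

```python
def handle_asr_correction(recognized, corrected, lexicon, weights, step=0.1):
    """recognized / corrected: lists of words, assumed aligned one to one.
    lexicon: set of known words. weights: dict word -> score, updated in place.
    Returns the words missing from the lexicon (empty list if none)."""
    new_words = [w for w in corrected if w not in lexicon]
    if new_words:
        return new_words  # step 50: the caller asks the user whether to add them
    for wrong, right in zip(recognized, corrected):
        if wrong != right:
            weights[right] = weights.get(right, 0.0) + step  # favour the correction
            weights[wrong] = weights.get(wrong, 0.0) - step  # penalise the competitor
    return []
```

A non-empty return value corresponds to the new-word branch. An empty one means the scores were adjusted in favour of the corrected words.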

A user who has sufficient expertise in the language can also correct the machine translation output. The same modalities used for ASR are available. If the machine translation output is corrected by the user (step 45) and the correction contains a new word, the user is prompted by a dialog to add the new word to the active system vocabulary (FIG. 5, step 50). If the correction contains only words already in the active system vocabulary, the machine translation models (FIG. 3, item 21) are updated. Specifically, this can be implemented by extracting phrases from the corrected sentence pair and adding them to the translation models. The target language model used can be updated in a manner similar to the ASR case.

[User field customization module]
The user field customization module 12 enables the system to learn new words in cooperation with the user. Prior systems do not allow users to modify the vocabulary of a speech-to-speech translation system. In contrast, the user field customization module 12 lets the user make incremental modifications to a running system, and these modifications are relatively easy to perform for a non-expert with minimal or no knowledge of computer speech and language processing technology or of linguistics. Module 12 offers such field customization by providing and accepting easily understood feedback from the user and by autonomously deriving all necessary parameters and system configuration from that feedback. The field customization module 12 accomplishes this through: 1) an intuitive interface for user customization, and 2) internal tools that automatically estimate all the internal parameters and settings needed for user customization, thereby relieving the user of this burden.

For unidirectional translation, the system processes a minimum of four pieces of information about a word or phrase in order to add it to the active system vocabulary. These are:
- the class (that is, the semantic or syntactic class of the new entry)
- the word in language L_a (that is, its written form in L_a)
- the pronunciation of the word in L_a
- the translation of the word in L_b (that is, its written form in L_b)
For bidirectional translation, the system also requires the pronunciation of the new word in L_b. This in turn enables the TTS to generate audio output and the ASR module for L_b to recognize the new word.
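The pieces of information listed above can be pictured as one record per vocabulary item. The following is a minimal sketch. The class name `VocabEntry`, its field names, and the `ready_for` check are hypothetical and serve only to show which fields are needed for each translation direction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VocabEntry:
    word_class: str               # semantic or syntactic class
    word_a: str                   # written form in L_a
    pron_a: str                   # pronunciation in L_a
    word_b: str                   # translation, written form in L_b
    pron_b: Optional[str] = None  # pronunciation in L_b (bidirectional only)

    def ready_for(self, bidirectional: bool) -> bool:
        """True when every field needed for the requested direction is filled."""
        required = [self.word_class, self.word_a, self.pron_a, self.word_b]
        if bidirectional:
            required.append(self.pron_b)
        return all(required)
```

An entry with four fields filled is sufficient for L_a to L_b translation. The fifth field is needed before the L_b recognizer and TTS can use the word.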

A flowchart illustrating the operating steps of the user field customization module 12 is shown, for example, in FIG. 5. When the system encounters a new word, based on a corrective intervention made via the correction and repair module 11 described in the previous section, it prompts the user (FIG. 5, step 50) to decide whether this word should be "learned", that is, added to the active system vocabulary. If so, word learning mode is activated and the field customization module 12 begins to act. Note that field customization, or new-word learning, need not result only from an error correction dialog. The user may also explicitly enter word learning mode from a pull-down menu and choose to add a new word or a list of new words a priori. New-word learning can likewise be triggered by external events that create a sudden need for different words, such as special terms, names, or locations. In all such cases, however, the system has to collect the information described above.

After the user indicates a wish to add a new word to the system vocabulary (step 50), the system first consults a large external dictionary, which is either held locally on the device, provided as a dictionary service accessible over the Internet, or a combination of both. The external dictionary consists of entries of word translation pairs. Each entry contains pronunciation and word-class information, which allows the new word to be added easily to the active system vocabulary. Each entry also contains a description of each word pair in both languages, so that the user can select the appropriate translation of the word even without knowledge of the target language. If the new word is contained in the external dictionary (step 51), the system displays a list of alternative translations of the word, each with its description (step 52). If the user selects one of the predefined translations from the dictionary (step 53), the user can verify the pronunciation and other information provided by the dictionary (step 53a) and edit it if necessary. The new word is then added to the active system vocabulary.

Adding a new word to the active system vocabulary requires three steps (steps 59, 59a, 59b). First, the word and its translation are added to the ASR recognition lexicons of modules 2 and 9 (step 59). The word is added to recognition lexicon 20 together with the pronunciation(s) given by the external dictionary. Because the user has just entered the word, its probability of occurrence is set higher than that of competing members of the same class in the ASR class-based language model 19, which makes words the user has added more likely to be recognized. Next, the word and its translation are added to the MT models (FIG. 3, item 21), enabling the system to translate the new word in both translation directions. Finally, the word is registered with the TTS pronunciation model (FIG. 3, model 33a), enabling the system to pronounce it correctly in both languages.
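The three registration steps can be sketched as follows. The `system` dictionary with its four stores (`asr_lexicon`, `class_lm`, `mt`, `tts`) and the function `register_word` are hypothetical stand-ins for the ASR, MT, and TTS models. The class weights are unnormalised scores rather than true probabilities.

```python
def register_word(system, word, pron, translation, translation_pron, word_class):
    """Register a verified entry in the ASR, MT and TTS stores of `system`."""
    # Step 59: recognition lexicon plus class-based language model weight.
    system["asr_lexicon"][word] = pron
    system["asr_lexicon"][translation] = translation_pron
    members = system["class_lm"].setdefault(word_class, {})
    # Unnormalised weight set above every competing member of the class.
    members[word] = max(members.values(), default=0.0) + 1.0
    # Step 59a: translation in both directions.
    system["mt"][word] = translation
    system["mt"][translation] = word
    # Step 59b: TTS pronunciations for both languages.
    system["tts"][word] = pron
    system["tts"][translation] = translation_pron
```

Setting the new weight one unit above the current maximum is only one way to rank the new word ahead of its competitors. The description leaves the exact amount open.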

If the new word entered by the user is not found in the external dictionary, the system automatically generates the information needed to register the word in the active system vocabulary and verifies this information with the user. First, the class of the new word is estimated via the tagging model (FIG. 3, model 22), using the surrounding word context if it is available (step 54). Next, the pronunciation and translation of the new word are generated automatically via either rule-based or statistical models (step 55). The resulting information is then shown to the user via a multimodal interface (step 58). The system prompts the user to verify (step 58) or correct (step 57) the automatically generated translation or pronunciation. Finally, after the user has verified this information, the new word is added to the active system vocabulary (steps 59, 59a, 59b). To add a new word (specifically, "word + pronunciation + word class") to the ASR vocabulary dynamically (step 59), the recognition lexicon 20, which is typically stored as a tree structure within ASR module 2 or 9, is searched and then updated to include the new word. This allows the new word to be added to the recognition vocabulary dynamically, so that it can be recognized immediately if spoken in the next utterance. The ASR system does not need to be re-initialized or restarted, as it does in prior systems.
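One way to picture a tree-structured recognition lexicon that accepts new words at run time is a prefix tree over phoneme sequences. This is a minimal sketch under that assumption. The classes `LexiconNode` and `RecognitionLexicon` are hypothetical, and the description does not state which tree layout is actually used.

```python
class LexiconNode:
    def __init__(self):
        self.children = {}  # phoneme -> LexiconNode
        self.words = []     # (word, word_class) pairs whose pronunciation ends here


class RecognitionLexicon:
    def __init__(self):
        self.root = LexiconNode()

    def add(self, word, phones, word_class):
        """Walk the tree along `phones`, creating nodes as needed, then attach the word."""
        node = self.root
        for p in phones:
            node = node.children.setdefault(p, LexiconNode())
        if (word, word_class) not in node.words:
            node.words.append((word, word_class))

    def lookup(self, phones):
        """Return the words whose pronunciation is exactly `phones`."""
        node = self.root
        for p in phones:
            node = node.children.get(p)
            if node is None:
                return []
        return list(node.words)
```

Because `add` only touches the nodes along one path, a word can be inserted while the structure stays in use, which matches the stated goal of avoiding re-initialization.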

Similarly, a new word (specifically, "word + translation + word class") can be appended to the MT translation model (step 59a): the translation model 23, which can be stored as a hash map within MT module 3 and/or 8, is searched, and a new translation pair containing the new word, its translation, and its word class is appended. This allows the new word to be added to MT module 3 and/or 8 dynamically, so that it will be translated correctly in subsequent utterances. The MT system does not need to be re-initialized or restarted, as in prior work.
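The hash-map update of step 59a might look like the following minimal sketch. The layout (source phrase mapped to a list of target and class pairs) and the name `add_translation_pair` are assumptions made for illustration. The description says only that the model can be stored as a hash map.

```python
from collections import defaultdict


def add_translation_pair(model, source, target, word_class):
    """Append (target, word_class) under `source` unless it is already there."""
    pair = (target, word_class)
    if pair not in model[source]:
        model[source].append(pair)
    return model


translation_model = defaultdict(list)  # hash map: source phrase -> candidate pairs
```

Lookup and insertion in a hash map take constant time on average, so a new pair becomes available to the next utterance without rebuilding the model.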

Estimating all of this information automatically is essential if non-expert users in the field are to perform the customization task. The following describes in detail how this critical information about a word is estimated automatically, and then how it can be obtained from, or verified intuitively by, the user.

[Generation of pronunciations and translations of new words]
Because users of speech-to-speech translation systems usually have limited or no knowledge of phonetics, linguistics, or language technology, and often have no knowledge even of the words and their usage in the other language, they cannot be expected to provide a translation and all the pertinent information (pronunciation, orthography, word usage, and so on) for each new word they wish to add to the system. Therefore, when the user enters a new word, the system estimates the word class and automatically generates the translation and pronunciation information for the word in both languages.

Registering a new word in the active system vocabulary requires the translation of the word and the pronunciations of both the word and its translation. This information can be generated in a three-step process, as shown for example in FIG. 6. First, the pronunciation of the word is generated (step 60). A translation is then generated based on the character sequence of the word and its pronunciation (step 61). Next, the pronunciation of the new word in the target language is generated using the information produced in the preceding steps (step 62). Two examples that generate this information using different techniques in a Japanese-English field-maintainable S2S translation system are shown on the left-hand side of FIG. 6. To add the new English word "Wheeling" (item 61) to the system, the English pronunciation is first generated via machine learning (step 65). The machine learning may be carried out by any suitable technique, such as those described in Damper, R. I. (Ed.), Data-Driven Techniques in Speech Synthesis, Dordrecht, The Netherlands: Kluwer Academic Publishers (2001). Next, the transliteration of the word into Japanese is generated automatically via statistical machine transliteration (step 66), and the Japanese pronunciation is then generated via manually defined rules (step 67). The transliteration may be accomplished using any suitable statistical machine transliteration engine. Examples are described in K. Knight and J. Graehl, "Machine transliteration," Computational Linguistics 24(4) (1998), pp. 599-612, and in Bing Zhao, Nguyen Bach, Ian Lane, and Stephan Vogel, "A Log-linear Block Transliteration Model based on Bi-Stream HMMs," in HLT/NAACL-2007. The resulting information (item 68) is then verified by the user, via acoustic playback and the phonetic string, before the word is registered in the active system vocabulary.

Similarly, to add the new Japanese word "Wakayama" (ワカヤマ, item 70) to the system, the Japanese pronunciation is first generated via manually defined rules (step 71). Next, the transliteration of this Japanese word is generated automatically via rule-based transliteration (step 72), and the English pronunciation is then generated via manually defined rules (step 73). The rule-based transliteration may be performed using the methods of Mansur Arbabi, Scott M. Fischthal, Vincent C. Cheng, and Elizabeth Bar, "Algorithms for Arabic name transliteration," IBM Journal of Research and Development, 38(2):183-193, 1994. The resulting information (item 74) is then verified by the user before the word is registered in the active system vocabulary.
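A toy illustration of rule-based processing for the Japanese example is sketched below. It covers only the first two stages (pronunciation from kana, then Roman-script transliteration) and uses a hand-written table limited to the four kana of the example word. The table `KANA_TO_ROMAJI` and both function names are hypothetical. A usable rule set would cover the full syllabary along with long vowels, geminates, and similar phenomena.

```python
# Hand-written rule table, deliberately restricted to the kana of the example word.
KANA_TO_ROMAJI = {"ワ": "wa", "カ": "ka", "ヤ": "ya", "マ": "ma"}


def japanese_pronunciation(word):
    """Map each kana to one mora; raises KeyError for kana outside the toy table."""
    return [KANA_TO_ROMAJI[ch] for ch in word]


def transliterate(morae):
    """Join the morae into a capitalised Roman-script form."""
    return "".join(morae).capitalize()


print(transliterate(japanese_pronunciation("ワカヤマ")))  # prints "Wakayama"
```

The third stage, deriving an English pronunciation from the transliteration, is omitted here because it depends on an English phone set and rule inventory that the description does not give.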

The user can verify the generated translation and pronunciation via audio output. Alternatively, a written form may be used if it is better suited to the user's native language (for example, Hanyu Pinyin for Chinese or Romaji for Japanese, if the user is an English speaker). The user may edit the translation and/or pronunciation if required. Once approved by the user, the word and its characteristics are added to the multilingual system dictionary.

The system also eliminates the need to supply a translation for each new word added to the dictionary, by automatically generating the required information with the assistance of interactive user input. An example of the user interface is shown in FIGS. 3 and 8.

[Interactive user interface]
The system then consults the user to confirm and verify the estimated linguistic information. This is done in an intuitive manner, so as not to presume any special linguistic or technical knowledge. A suitable interface is therefore used. The following describes the interaction with the user during new-word learning.

In this interface, the user may select a "new word" mode from the menu, or the new-word learning mode can be invoked after a user correction has yielded a new or unknown word. In the window pane that appears, he/she can type the desired new word, name, special term, concept, or expression. Input is based on the orthography of the user's language (which may use a character set different from English, for example Chinese, Japanese, or Russian). The system then generates a transliteration in Roman script and the predicted pronunciation of the word. This is done by conversion rules that are either hand-written, extracted from existing phonetic dictionaries, or learned from transliterated speech data. The user can then inspect the automatic conversion and play back the sound of the generated pronunciation via TTS. The user may iterate and modify any of these representations (the script in either language, the Romanized transliteration, the phonetic transcription, and its sound), and the other corresponding entries are regenerated accordingly (thus a modified transcription in one language may modify the transcription in the other).

The system further selects automatically the most likely word class to which the new word belongs, based on co-occurrence statistics of other words (with known class) in similar contexts. The new-word window pane also allows manual selection (and/or correction) of this class identity, so that the user can override any class assignment estimated in this way.
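Class selection from co-occurrence statistics can be illustrated with a minimal sketch. It assumes a precomputed table that records, for each context word, how often words of each known class appeared next to it. The function `guess_class` and the table layout are hypothetical, and the fallback to "unknown" is a choice made for this sketch.

```python
from collections import Counter


def guess_class(context_words, cooccurrence):
    """cooccurrence: dict mapping a context word to {class: count}.
    Sums the class counts over the context and returns the best-scoring class."""
    votes = Counter()
    for w in context_words:
        votes.update(cooccurrence.get(w, {}))
    return votes.most_common(1)[0][0] if votes else "unknown"
```

The result is only a proposal. As stated above, the user can override it in the new-word window pane.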

In summary, given a new word or phrase from the user, the system operates as follows:
- It automatically classifies the semantic class of the entry (used by the ASR and MT components).
- It automatically generates the pronunciation of the word (used by the ASR and TTS for L_1).
- It automatically generates the translation of the word (used by both MT components).
- It automatically generates the pronunciation of the translation (used by the ASR and TTS for L_2).
- It allows the user to correct or edit the automatically generated data as required.
- It provides other modalities by which the user can verify that the automatically generated translation is adequate (that is, by listening to the pronunciation of the word via TTS).

If the user enters a word that does not match any of the classes predefined in the system, the user can assign it to the "unknown" class. For ASR, the "unknown" class is defined by words that occur in the training data but not in the recognition lexicon. For SMT, bilingual entries that do not occur in the translation lexicon are set to the unknown tag in the target language model.

[Boosting in-class probability and relevance]
None of these input methods requires linguistic training, and each gives the user an intuitive way to judge whether a new word has been represented suitably. The user may then accept the new word entry by adding it to a "multilingual system dictionary", that is, the user's individual lexicon. The overall system merges the standard lexicons and the customized lexicons into the user's runtime dictionary.

上述した５つのエントリに加えて、クラス内確率Ｐ（ｗ｜Ｃ）も定義される。このやり方では、システムが、同一のクラスに属する複数の語を識別可能となる。それ故に、ユーザのタスク、好み及び習慣により近接した語が、好ましくてより高いクラス内確率が割り当てられる。ユーザへの関連性に基づいて、クラス内確率をより高くするこの引き上げが決定され、この場合に、以下の点を観察して関連性が評価される：
− 新語エントリ及びその最新性
＊新語を入力することによってユーザが新語を望んでいたことを彼／彼女が示したので、入力された新語は、当然ながら近い将来使用される可能性が高く、それ故にクラス内確率が、代わりの既存のクラスエントリを超えて引き上げられる（増加する）
− 新語と、ユーザ行動、関心事及びタスクとの間の相関関係、以下を含む
＊都市名、ランドマーク、興味深い場所などの場所に対する距離
＊過去の使用履歴
＊共起統計（スシはボゴタよりもトウキョウとより関連している）
− 新語の一般的顕著性、以下を含む。
＊都市人口
＊メディアにおける最近の言及 In addition to the five entries described above, an in-class probability P(w|C) is also defined. This lets the system differentiate between words belonging to the same class. Words that are closer to the user's tasks, preferences, and habits are therefore preferred and assigned a higher in-class probability. This boost to the in-class probability is determined by relevance to the user, where relevance is assessed by observing the following:
- The new-word entry and its recency
* By entering a new word, the user has indicated that he/she wants it, so the word is likely to be used in the near future; its in-class probability is therefore boosted (increased) above those of the alternative existing class entries.
- Correlation between the new word and the user's activities, interests, and tasks, including:
* Distance to locations such as city names, landmarks, and places of interest
* Past usage history
* Co-occurrence statistics (sushi correlates more strongly with Tokyo than with Bogotá)
- General saliency of the new word, including:
* City population
* Recent mentions in the media

このような観察及び関連性の統計は、ユーザの観察した場所、履歴若しくは行動に基づいて、及び／又は代わりにインターネットなどの広いバックグラウンドの言語資源においてシステムの新語の発生を観察することによって、収集される。このような統計は、データが豊富な言語において単一言語で収集され、翻訳辞書及び翻訳言語モデルへと適用されてもよい。 Such observations and relevance statistics are collected based on the user's observed location, history, or activity, and/or alternatively by observing how often the system's new words occur in a large background language resource such as the Internet. Such statistics may be collected monolingually in a data-rich language and then applied to the translation dictionary and the translation language model.

ユーザの新たな行動及びタスクが、時間をかけてこのような複数の語の類似性を少なくするように、及び／又は（種々の都市で出現するような）新たな情報が複数の語のサブクラスの関連性を少なくする場合、引き上げられた複数の語の関連性は、時間をかけて減衰する場合もある。 The relevance of boosted words may also decay over time, as the user's new activities and tasks make such words less likely and/or as new information (such as arrival in a different city) makes a subclass of words less relevant.
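The boosting and decay of in-class probabilities described above can be illustrated with a minimal sketch. The function names, the `relevance` score, and the `base_mass` and `factor` parameters are assumptions made for illustration; the text does not specify how the boost is computed, only that P(w|C) is raised for relevant words and may decay later.

```python
def boost_in_class(probs, word, relevance, base_mass=0.05):
    """Add `word` to a class distribution P(w|C), or raise its mass, then renormalize.

    `relevance` is a hypothetical score (recency, usage history, saliency, ...).
    """
    boosted = dict(probs)
    boosted[word] = boosted.get(word, 0.0) + base_mass * relevance
    total = sum(boosted.values())
    return {w: p / total for w, p in boosted.items()}


def decay_in_class(probs, word, factor=0.5):
    """Shrink the mass of a previously boosted word and renormalize."""
    decayed = dict(probs)
    decayed[word] = decayed[word] * factor
    total = sum(decayed.values())
    return {w: p / total for w, p in decayed.items()}


# Toy class distribution for a city class; the numbers are invented.
city = {"Tokyo": 0.6, "Bogota": 0.4}
boosted = boost_in_class(city, "Wheeling", relevance=2.0)
decayed = decay_in_class(boosted, "Wheeling")
```

Renormalizing after every change keeps P(w|C) a proper distribution, so a boost for one word implicitly lowers the others.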

[クロスモーダルのエントリ]
選択的に、新語は、次のうちの１つによって入力される：
− スピーキング：ユーザは新語を話す。発音及び字訳などの全ての情報は、音声入力に基づく以外、上述と同様に、新語モデル、翻訳モデル、バックグラウンド辞書によって推定される。システムは、言葉による対話を行い、クラスの同一性及び他の関連情報を選択してもよい。
− スペリング：ユーザは新語を音声でつづる。この入力方法は、一般に話している間にわたって正確な字訳の尤度を改善する。スペリングは、話すやり方及び他の入力モダリティに対して補完的に使用されてもよい。
− ハンドライティング：ユーザは新語を手書きで入力する。この入力方法は、一般に話している間にわたって正確な字訳の尤度を改善する。手書きは、スピーキング、スペリング、又は他の入力モダリティに対して補完的に使用されてもよい。
− ブラウジング：新語は、インタラクティブなブラウジングによって選択されてもよい。ここでシステムは、ユーザの最近の使用法の履歴及び／又は最近、選択され入力された新語と同様な統計的プロファイルを含むテキストを求めてインターネットを検索することによって、関連した、関係のある新語を提案してもよい。 [Crossmodal entry]
Optionally, a new word is entered by one of the following:
- Speaking: The user speaks the new word. All information, such as pronunciation and transliteration, is estimated by the new-word model, the translation model, and the background dictionary as described above, but based on the acoustic input. The system may engage in a spoken dialog to select the class identity and other relevant information.
- Spelling: The user spells the new word by voice. Compared with speaking the word, this input method generally improves the likelihood of a correct transliteration. Spelling may also be used to complement speaking and other input modalities.
- Handwriting: The user enters the new word by handwriting. Compared with speaking the word, this input method generally improves the likelihood of a correct transliteration. Handwriting may also be used to complement speaking, spelling, or other input modalities.
- Browsing: New words may also be selected by interactive browsing. Here the system may propose related, relevant new words by searching the Internet for texts whose statistical profiles are similar to the user's recent usage history and/or recently selected and entered new words.

[インターネットを介する遠隔型新語学習及び共有型辞書開発]
前節において説明された方法の全ては、個々のユーザが音声翻訳システムを現場で彼／彼女のニーズ及びタスクにカスタマイズ可能にすることを目指している。しかしながら、このようなユーザカスタマイズの多くは、他のユーザにも同様に役立つ可能性がある。ある実施形態では、ユーザカスタマイズはコミュニティの大きなデータベースへとアップロードされ、名前、特別用語、又は言い回しは、利害関係者間で共有される。語彙エントリ、翻訳及びクラスタグは収集され、類似の利害関係があるコミュニティへと関連付けられる。引き続いてユーザは、これらの共有されたコミュニティ資源をダウンロードし、ユーザ独自のシステムへの資源として加えることができる。 [Remote new word learning and shared dictionary development via the Internet]
All of the methods described in the previous sections aim to let an individual user customize the speech translation system to his/her needs and tasks in the field. Many such user customizations, however, could be useful to other users as well. In one embodiment, user customizations are uploaded to a community-wide database, where names, special terms, or expressions are shared among interested parties. Vocabulary entries, translations, and class tags are collected and associated with communities of similar interest. Users can subsequently download these shared community resources and add them as resources to their own systems.

代わりの構成としては、ユーザは、翻訳文を不完全な状態でアップロードするだけであり、手動翻訳をコミュニティに要求することを選択してもよい。このような不正確な若しくは不完全な翻訳前の語又は文、及びこれらの不足している若しくは不正確な翻訳に対して、他のユーザはボランティア（又は有料）基準でオンライン修正及び翻訳を提供することができる。その結果としてもたらされる修正及び翻訳は、更新され共有されるコミュニティ翻訳データベースへともう一度提出される。 Alternatively, users may choose to upload only poorly translated sentences and request manual translation from the community. For such incorrect or incomplete source words or sentences, and their missing or incorrect translations, other users can provide online corrections and translations on a volunteer (or paid) basis. The resulting corrections and translations are then resubmitted to the updated, shared community translation database.

[管理されていない適応]
修正、修理及び新語学習の後、最後に、我々は修正された仮説、及びそれ故に口頭文についての真の筆記録又は翻訳を得る。音声−音声翻訳デバイス又はシステムは、このようなグランドトルース（ground truth）がもたらされ、ＡＳＲモジュール（図１、モジュール２又は９）をこのデバイスの主要なユーザに更に適応させる、という事実を自動的に使用できる。このような適応が、デバイスの正確さ及び有用性を改善するように設計される。適応についての２つの固有の方法が実行される。第１は、ユーザの音声をより良好に認識するシステムの適応；音声モデル及び発音モデル適応、並びに第２は、言語モデル適応のためにユーザの音声スタイルへの適応。特定のユーザ用の適応データを記憶するために、プロファイルが使用され、現場で切り替えることができる。 [Unsupervised adaptation]
After correction, repair, and new-word learning, we finally obtain a corrected hypothesis, and thus a true transcript or translation of the spoken sentence. The speech-to-speech translation device or system can automatically exploit the fact that such ground truth has been provided to further adapt the ASR modules (FIG. 1, module 2 or 9) to the primary user of the device. Such adaptation is designed to improve the accuracy and usability of the device. Two specific methods of adaptation are performed: first, adaptation of the system to better recognize the user's voice (acoustic model and pronunciation model adaptation); and second, adaptation to the user's style of speech by way of language model adaptation. Profiles are used to store adaptation data for specific users and can be switched in the field.

[クラスベース型機械翻訳]
前節において、我々はエラー修理及び新語学習を説明した。これらのモジュールでは、基準はクラスベース型機械翻訳としていた。次に我々は、このようなクラスベース型機械翻訳の詳細な機能を説明する。 [Class-based machine translation]
In the previous sections, we described error repair and new-word learning. These modules rely on class-based machine translation. In the following, we describe in detail how such class-based machine translation works.

[アプローチ]
最先端の機械翻訳システムは、語−レベル上で翻訳を実行する。このことは、次の３つのドキュメントにおいて記載されたシステムを含む先行の翻訳システムから明らかである;(1) P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, 'Moses: Open source toolkit for statistical machine translation', In Proc. ACL, 2007 ("[Koehn07"); (2) D. Chiang, A. Lopez, N. Madnani, C.Monz, P. Resnik and M. Subotin, "The Hiero machine translation system: extensions, evaluation, and analysis,", In Proc. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 779-786, 2005 ("Chiang05")；及び (3) K. Yamada and K. Knight" A decoder for syntax-based statistical MT". In Proc. Association for Computational Linguistics, 2002 ("Yamada02")。アラインメントはワードツーワード（word-to-word）で実行される；翻訳例、即ち句ペアは語レベルで一致処理が行われる；及び語ベース型言語モデルが適用される。Ｃｈｉａｎｇ０５などに記載の階層的翻訳モジュール、及びＹａｍａｄａ０２などに記載の構文ベース型翻訳モデルは、中間の構造を導入することによってこれを拡張する。しかしながら、これらのアプローチは、依然として語の正確な一致処理を必要としている。各語が別々のエンティティとして処理されるので、これらのモデルは初見の語に対して一般化されていない。 [approach]
State-of-the-art machine translation systems perform translation at the word level. This is evident from prior translation systems, including those described in the following three documents: (1) P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," In Proc. ACL, 2007 ("Koehn07"); (2) D. Chiang, A. Lopez, N. Madnani, C. Monz, P. Resnik and M. Subotin, "The Hiero machine translation system: extensions, evaluation, and analysis," In Proc. Human Language Technology and Empirical Methods in Natural Language Processing, pp. 779-786, 2005 ("Chiang05"); and (3) K. Yamada and K. Knight, "A decoder for syntax-based statistical MT," In Proc. Association for Computational Linguistics, 2002 ("Yamada02"). Alignment is performed word-to-word; translation examples, i.e., phrase pairs, are matched at the word level; and word-based language models are applied. Hierarchical translation models such as those in Chiang05 and syntax-based translation models such as those in Yamada02 extend this by introducing intermediate structure. However, these approaches still require exact word matches. Because each word is treated as a separate entity, these models do not generalize to unseen words.

クラスベース型機械翻訳の一実施形態は、次のようなクラスベース型統計的機械翻訳であり、外国語文ｆ^Ｊ _１＝ｆ_１、ｆ_２、．．．、ｆ_Ｊは、最大尤度を有する仮説であって次式で与えられる仮説＾ｅ^Ｉ _１を検索することによって別の言語ｅ^Ｉ _１＝ｅ_１、ｅ_２、．．．、ｅ_Ｉへと翻訳される：
＾ｅ^Ｉ _１＝ａｒｇｍａｘＰ（ｅ^Ｉ _１｜ｆ^Ｊ _１）＝ａｒｇｍａｘＰ（ｆ^Ｊ _１｜ｅ^Ｉ _１）・Ｐ（ｅ^Ｉ _１）
クラスは、固有表現、統語的クラス又は等価な語又は語句で構成されるクラスなどの意味クラスであり得る。一例として我々は、固有表現クラスがシステムへと組み込まれる事例を説明する。 One embodiment of class-based machine translation is class-based statistical machine translation, in which a foreign-language sentence $f_1^J = f_1, f_2, \ldots, f_J$ is translated into another language, $e_1^I = e_1, e_2, \ldots, e_I$, by searching for the hypothesis $\hat{e}_1^I$ with maximum likelihood, given by:
$$\hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(f_1^J \mid e_1^I) \cdot P(e_1^I)$$
A class may be a semantic class such as a named entity, a syntactic class, or a class consisting of equivalent words or phrases. As an example, we describe the case in which named-entity classes are incorporated into the system.

翻訳の間に適用される最も大きな２つの情報モデルは、目的言語モデルＰ（ｅ^Ｉ _１）及び翻訳モデルＰ（ｆ^Ｊ _１｜ｅ^Ｉ _１）である。クラスベース型統計的機械翻訳フレームワークにおいて、Ｐ（ｆ^Ｊ _１｜ｅ^Ｉ _１）はクラスベース型翻訳モデル（図３、モデル２３）、及びＰ（ｅ^Ｉ _１）はクラスベース型言語モデル（図３、モデル２４）である。 The two principal models applied during translation are the target language model $P(e_1^I)$ and the translation model $P(f_1^J \mid e_1^I)$. In the class-based statistical machine translation framework, $P(f_1^J \mid e_1^I)$ is the class-based translation model (FIG. 3, model 23) and $P(e_1^I)$ is the class-based language model (FIG. 3, model 24).
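As a rough illustration of how the two models combine, the sketch below picks, from a small candidate list, the hypothesis that maximizes the product of translation-model and language-model probabilities (computed as a sum of log-probabilities). The candidate strings and all probability values are invented for the example; a real decoder searches a far larger hypothesis space.

```python
import math


def best_hypothesis(candidates, tm_logprob, lm_logprob):
    """Return the candidate e maximizing log P(f|e) + log P(e)."""
    return max(candidates, key=lambda e: tm_logprob[e] + lm_logprob[e])


# Hypothetical scores for two candidate translations of one source sentence.
tm_logprob = {"hyp_a": math.log(0.2), "hyp_b": math.log(0.5)}  # log P(f|e)
lm_logprob = {"hyp_a": math.log(0.6), "hyp_b": math.log(0.1)}  # log P(e)

best = best_hypothesis(["hyp_a", "hyp_b"], tm_logprob, lm_logprob)
```

Here `hyp_b` has the higher translation-model score, but `hyp_a` wins once the language model is taken into account (0.2 × 0.6 versus 0.5 × 0.1).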

統計的機械翻訳フレームワーク用のクラスベース型モデルは、図１０に示されるやり方を用いてトレーニングできる。最初に、文ペアのコーパスをトレーニングすることが正規化され（ステップ１００）、コーパスにタグ付けするためにタグ付けモデル（図３、モデル２２）が使用される（ステップ１０１）。これを実行する１つのアプローチがＬａｆｆｅｒｔｙ０１に記載されている。このステップにおいて、トレーニングペアを形成するように組み合わせる文は、独立してタグ付けすることができ、一緒にタグ付けすることができ、又は一方の言語からのタグが他方の言語へと推定できる。トレーニングコーパス全体がタグ付けされた後、文ペア内の複数の語がアライメントされる（ステップ１０２）。アラインメントは、以下の先行技術文献などに記載される現在のアプローチを用いて成し遂げられる：Franz Josef Och, Christoph Tillmann, Hermann Ney: "Improved Alignment Models for Statistical Machine Translation" ;pp. 20-28; Proc. of the Joint Conf. of Empirical Methods in Natural Language Processing and Very Large Corpora; University of Maryland, College Park, MD, June 1999、及び Brown, Peter R, Stephen A. Delia Pietra, Vincent J. Delia Pietra, and R. L. Mercer. 1993." The mathematics of statistical machine translation: Parameter estimation, "Computational Linguistics, vol 19(2):263-311。このステップにおいて、タグ付きエンティティ（即ち「ニューヨーク」）内にある連語の句は、単一のトークンとして処理される。次に、Ｋｏｅｈｎ０７などの方法を用いて句が抽出され（ステップ１０３）、クラスベース型翻訳モデルを生成する（図３、モデル２３）。タグ付きコーパスは、クラスベース型目的言語モデルをトレーニングするためにも使用される（図３、モデル２４）。トレーニングは、以下の先行技術文献などに記載のやり方を用いて成し遂げられてもよい:B. Suhm and W. Waibel, "Towards better language models 'for spontaneous speech' in Proc. ICSLP-1994, 1994 (Suhm94)（ステップ１０４）。 A class-based model for the statistical machine translation framework can be trained using the procedure shown in FIG. 10. First, the training corpus of sentence pairs is normalized (step 100), and a tagging model (FIG. 3, model 22) is used to tag the corpus (step 101). One approach to doing this is described in Lafferty01. In this step, the sentences that combine to form a training pair can be tagged independently, tagged jointly, or the tags from one language can be projected onto the other. After the entire training corpus is tagged, the words within each sentence pair are aligned (step 102). Alignment is accomplished using current approaches such as those described in: Franz Josef Och, Christoph Tillmann, Hermann Ney, "Improved Alignment Models for Statistical Machine Translation," pp. 20-28, Proc. of the Joint Conf. of Empirical Methods in Natural Language Processing and Very Large Corpora, University of Maryland, College Park, MD, June 1999; and Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, vol. 19(2):263-311, 1993. In this step, multi-word phrases within a tagged entity (e.g., "New York") are treated as a single token. Next, phrases are extracted (step 103) using methods such as Koehn07 to generate the class-based translation model (FIG. 3, model 23). The tagged corpus is also used to train the class-based target language model (FIG. 3, model 24) (step 104). Training may be accomplished using a procedure such as that described in: B. Suhm and A. Waibel, "Towards better language models for spontaneous speech," in Proc. ICSLP-1994, 1994 ("Suhm94").
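The treatment of a tagged multi-word entity as a single token can be sketched as follows. The tag syntax `@CLASS{text}` follows the tagged-sentence example used elsewhere in this description; replacing the entity by its class label for training is an assumption made for illustration, as is the function name.

```python
import re

# Matches tags of the form @CLASS{surface text}, e.g. @PLACE.city{New York}.
TAG = re.compile(r"@([A-Za-z.]+)\{([^}]*)\}")


def to_training_tokens(tagged_sentence):
    """Tokenize a tagged sentence so that each tagged entity becomes one class token."""
    tokens = []
    pos = 0
    for m in TAG.finditer(tagged_sentence):
        tokens.extend(tagged_sentence[pos:m.start()].split())  # plain words
        tokens.append("@" + m.group(1))                        # whole entity -> one token
        pos = m.end()
    tokens.extend(tagged_sentence[pos:].split())
    return tokens


tokens = to_training_tokens("the flight to @PLACE.city{New York} leaves at @TIME{4:30}")
```

With this tokenization, "New York" occupies a single position during word alignment, so phrase extraction never splits it.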

入力文を翻訳するために、図１１において図示される方法が適用される。最初に、入力文が正規化され（ステップ１０５）、トレーニングコーパスへと適用されたようなやり方を用いてタグ付けされる（ステップ１０６）。入力文は、単一言語タガーを用いてタグ付けされる（図３、モデル２２）。次に、入力文は、クラスベース型ＭＴモデルを用いて復号化される（図３、モデル２３及び２４）。クラスベース型統計的機械翻訳に対して、標準統計的機械翻訳において使用されるやり方と同じやり方を用いて、復号化が実行されるが、しかしながら、句ペアは、以下の例に示されるように、語ではなくクラス−レベルで一致され得る。 To translate an input sentence, the method illustrated in FIG. 11 is applied. First, the input sentence is normalized (step 105) and tagged (step 106) using a procedure similar to that applied to the training corpus. The input sentence is tagged using a monolingual tagger (FIG. 3, model 22). Next, the input sentence is decoded using the class-based MT models (FIG. 3, models 23 and 24). For class-based statistical machine translation, decoding is performed in the same way as in standard statistical machine translation, except that phrase pairs can be matched at the class level rather than at the word level, as shown in the following example.

タグ付き入力文を以下のように仮定する：
the train to @PLACE.city{Wheeling} leaves at @TIME{4:30}
次の句は、一致され得る：

クラス内の語又は句（即ち：@PLACE.city{Wheeling}、 @TIME{4:30}）は、数／時間の事例の場合、直接通過するか、翻訳が翻訳モデルから決定される。ユーザは、「ユーザ現場カスタマイズモジュール」を介して翻訳モデルに新語を加えることができる（図１、モジュール１２）。次いで、（図６の例で詳述しているように）ユーザがあらかじめ都市名「Ｗｈｅｅｌｉｎｇ」を加えた場合、翻訳モデルは次の句も含むことになる：

Assume the following tagged input sentence:
the train to @PLACE.city{Wheeling} leaves at @TIME{4:30}
The following phrases can be matched:

The words or phrases within a class (i.e., @PLACE.city{Wheeling}, @TIME{4:30}) are either passed through directly, as in the case of numbers and times, or translated as determined by the translation model. The user can add new words to the translation model via the "user field customization module" (FIG. 1, module 12). If the user has previously added the city name "Wheeling" (as detailed in the example of FIG. 6), the translation model will also contain the following phrase:

翻訳モデル確率Ｐ（ｆ^Ｊ _１｜ｅ^Ｉ _１）（図３、モデル２３）及びＭＴクラスベース型言語モデル確率Ｐ（ｅ^Ｉ _１）（図３、モデル２４）を仮定すると、検索は、最大尤度Ｐ（ｆ^Ｊ _１｜ｅ^Ｉ _１）・Ｐ（ｅ^Ｉ _１）を有する翻訳仮説を見つけるように実行される。 Given the translation model probability $P(f_1^J \mid e_1^I)$ (FIG. 3, model 23) and the MT class-based language model probability $P(e_1^I)$ (FIG. 3, model 24), search is performed to find the translation hypothesis with maximum likelihood $P(f_1^J \mid e_1^I) \cdot P(e_1^I)$.

上述した入力文及び句を仮定すると、結果的にもたらされる翻訳は次のようになる：

これは入力文の正確な翻訳である。 Assuming the input sentence and phrase above, the resulting translation is as follows:

This is an accurate translation of the input sentence.

この例では、語「Ｗｈｅｅｌｉｎｇ」はトレーニングコーパスに現れなかったにもかかわらず、「ユーザ現場カスタマイズモジュール」（図１、モジュール１２）を介してユーザが語を入力した後、システムは、語を正確に翻訳できる。更にその上、語−クラスが既知であるので（この例では「＠ＰＬＡＣＥ．ｃｉｔｙ」）、システムは、周囲の複数の語に対してより良好な翻訳を選択でき、翻訳出力の複数の語を正確に順に並べることになる。 In this example, even though the word "Wheeling" did not appear in the training corpus, the system can translate it correctly once the user has entered it via the "user field customization module" (FIG. 1, module 12). Moreover, because the word class is known ("@PLACE.city" in this example), the system can select better translations for the surrounding words and will order the words of the translation output correctly.
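The class-level handling in this example can be sketched in a few lines. The sketch reduces the tagged input to a class-level skeleton (which is what class-level phrase pairs would be matched against) and resolves each filler separately: time expressions are passed through, and the city name is looked up in a user dictionary. The dictionary layout, the function name, and the placeholder target string `WHEELING_JA` are hypothetical; the actual phrase-table entries are not reproduced in this text.

```python
import re

TAG = re.compile(r"@([A-Za-z.]+)\{([^}]*)\}")


def split_classes(tagged, user_dict, pass_through=("TIME",)):
    """Return (class-level skeleton, resolved fillers) for a tagged sentence."""
    fillers = []

    def repl(m):
        cls, text = m.group(1), m.group(2)
        if cls in pass_through:
            fillers.append((cls, text))                  # numbers/times pass through
        else:
            fillers.append((cls, user_dict[cls][text]))  # user-added translation
        return "@" + cls

    skeleton = TAG.sub(repl, tagged)
    return skeleton, fillers


# Hypothetical user dictionary entry added via field customization.
user_dict = {"PLACE.city": {"Wheeling": "WHEELING_JA"}}

skeleton, fillers = split_classes(
    "the train to @PLACE.city{Wheeling} leaves at @TIME{4:30}", user_dict
)
```

Because the skeleton contains only `@PLACE.city` and `@TIME`, the same class-level phrase pairs apply to any city name, including one that never occurred in the training corpus.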

[多言語コーパスのパラレルタグ付け]
ある実施形態では、単一言語タガーを有するトレーニングコーパスの各側を独立にタグ付けし、次いで各文ペアから一貫性がないラベルを取り除くことで、ラベリングされたパラレルコーパスが得られる。このアプローチにおいて、各文ペア（Ｓａ，Ｓｂ）に対して、最大条件付きの確率Ｐ（Ｔａ，Ｓａ）及びＰ（Ｔｂ，Ｓｂ）を有するラベル−系列ペア（Ｔａ，Ｔｂ）が選択される。任意のクラス−タグが発生する総数が、Ｐ（Ｔａ，Ｓａ）とＰ（Ｔｂ，Ｓｂ）との間で異なる場合、そのクラス−タグはラベル−系列ペア（Ｔａ，Ｔｂ）から取り除かれる。Ｐ（Ｔａ，Ｓａ）及びＰ（Ｔｂ，Ｓｂ）を推定する１つの方法は、条件付き確率場ベース（conditional random field-based）のタグ付けモデルＬａｆｆｅｒｔｙ０１を適用することによる。単一言語タグ付けの間に使用される機能セット（feature set）の一例が、図１１に示される。 [Parallel tagging of multilingual corpus]
In one embodiment, a labeled parallel corpus is obtained by tagging each side of the training corpus independently with monolingual taggers and then removing inconsistent labels from each sentence pair. In this approach, for each sentence pair (Sa, Sb), the label-sequence pair (Ta, Tb) with the maximum conditional probabilities P(Ta, Sa) and P(Tb, Sb) is selected. If the number of occurrences of any class tag differs between the two sides, that class tag is removed from the label-sequence pair (Ta, Tb). One way to estimate P(Ta, Sa) and P(Tb, Sb) is to apply a conditional random field-based tagging model (Lafferty01). An example of a feature set used during monolingual tagging is shown in FIG. 11.
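The consistency filter described here can be sketched as below: tags are counted per class on each side of a sentence pair, and any class whose counts differ is dropped from both sides. Representing a tag as a `(class, text)` tuple, the function name, and the target-side placeholder are assumptions made for illustration.

```python
from collections import Counter


def drop_inconsistent(tags_a, tags_b):
    """Keep only tags whose class occurs equally often on both sides of a sentence pair."""
    count_a = Counter(cls for cls, _ in tags_a)
    count_b = Counter(cls for cls, _ in tags_b)
    keep = {cls for cls in count_a if count_a[cls] == count_b[cls]}
    return (
        [t for t in tags_a if t[0] in keep],
        [t for t in tags_b if t[0] in keep],
    )


# The source side has a TIME tag that the target-side tagger missed.
tags_a = [("PLACE.city", "Wheeling"), ("TIME", "4:30")]
tags_b = [("PLACE.city", "TARGET_CITY")]  # placeholder target-side text
clean_a, clean_b = drop_inconsistent(tags_a, tags_b)
```

This trades recall for consistency: a tag that only one tagger found is discarded rather than trusted.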

ある実施形態では、単一言語の特徴に加えて、語−アラインメント（図１１内のｗｂ，ｊ）から抽出される目標語を用いることによって、複数の文ペア全部にわたるラベリングの一貫性を、更に改善することができる。 In some embodiments, in addition to single language features, the use of target words extracted from word-alignment (wb, j in FIG. 11) further increases the consistency of labeling across multiple sentence pairs. Can be improved.

別の実施形態では、クラス−タグの集合が等価でなければならないという制約を適用している間、翻訳ペア内の両方の文が一緒にラベリングされる。具体的には、文ペア（Ｓａ，Ｓｂ）に対して、我々は結合最大条件付き確率を最大化するラベル−系列ペア（Ｔａ，Ｔｂ）を検索する。

単一言語モデルの性能が著しく異なる場合、λａ及びλｂを最適化して、二言語タグ付け性能を改善できる。 In another embodiment, both sentences in a translation pair are labeled together while applying the constraint that the set of class-tags must be equivalent. Specifically, for a sentence pair (Sa, Sb), we search for a label-sequence pair (Ta, Tb) that maximizes the combined maximum conditional probability.

If the performance of the two monolingual models differs markedly, λa and λb can be tuned to improve bilingual tagging performance.
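One way such a weighted joint search could look is sketched below. It assumes that each monolingual tagger supplies an n-best list of (tag list, log-probability) pairs and that the two scores are combined log-linearly with weights `lam_a` and `lam_b`; this combination is an assumption for illustration, since the formula itself is not reproduced in this text. Only pairs whose class tags agree are considered.

```python
def joint_best(nbest_a, nbest_b, lam_a=1.0, lam_b=1.0):
    """Pick the (Ta, Tb) pair with equal class tags and the highest weighted score."""
    best, best_score = None, float("-inf")
    for tags_a, logp_a in nbest_a:
        for tags_b, logp_b in nbest_b:
            if sorted(tags_a) != sorted(tags_b):  # class-tag sets must match
                continue
            score = lam_a * logp_a + lam_b * logp_b
            if score > best_score:
                best, best_score = (tags_a, tags_b), score
    return best


# Invented n-best lists: class tags per hypothesis with log-probabilities.
nbest_a = [(["PLACE.city", "TIME"], -0.2), (["TIME"], -1.5)]
nbest_b = [(["TIME"], -0.3), (["TIME", "PLACE.city"], -0.9)]

equal_weights = joint_best(nbest_a, nbest_b)
trust_b_more = joint_best(nbest_a, nbest_b, lam_a=1.0, lam_b=5.0)
```

With equal weights the pair containing both tags wins; raising `lam_b` lets the second tagger's preference for the TIME-only labeling dominate.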

ある実施形態では、手動で注釈を付けられたコーパスが、特定の言語に対して入手できない場合、トレーニングコーパス内の複数の文ペア全部にわたって、ラベルが既知の第１言語から注釈が付けられていない言語へとラベルを投影すること（projecting）によって、ラベルを生成できる。これを実行する１つのアプローチが、以下の先行技術文献に記載される：D. Yarowsky, G. Ngai and R. Wicentowski, "Inducting Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora," In Proc. HLT, pages 161-168, 2001 ("Yarowsky01")。 In one embodiment, if no manually annotated corpus is available for a particular language, labels can be generated by projecting them, across the sentence pairs of the training corpus, from a first language in which the labels are known onto the unannotated language. One approach to doing this is described in: D. Yarowsky, G. Ngai and R. Wicentowski, "Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora," In Proc. HLT, pages 161-168, 2001 ("Yarowsky01").
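A minimal sketch of label projection through a word alignment follows. It assumes one label per source token, an alignment given as `(source_index, target_index)` pairs, and the label `"O"` for untagged tokens; these conventions are illustrative and are not taken from Yarowsky01, which describes a more robust procedure.

```python
def project_labels(src_labels, alignment, tgt_len, outside="O"):
    """Copy each non-outside source label to the target tokens aligned to it."""
    tgt_labels = [outside] * tgt_len
    for src_idx, tgt_idx in alignment:
        if src_labels[src_idx] != outside:
            tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels


# Source: 4 tokens, the last one tagged; it aligns to the first target token.
src_labels = ["O", "O", "O", "PLACE.city"]
alignment = [(0, 2), (3, 0)]
projected = project_labels(src_labels, alignment, tgt_len=3)
```

Unaligned target tokens keep the outside label, so alignment gaps reduce coverage rather than introduce spurious tags.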

[クラスベース型機械翻訳のシステム及び評価の例]
実験的評価を通じて、詳細を説明したクラスベース型機械翻訳が、従来のアプローチと比較して翻訳性能を改善することを示す。更に、セクション２．２．２に記載のパラレルタグ付けアプローチを用いることにより、翻訳の正確さが更に改善されることを示す。 [Class-based machine translation system and evaluation example]
Through experimental evaluation, we show that the class-based machine translation described above improves translation performance compared with conventional approaches. We further show that using the parallel tagging approach described in section 2.2.2 improves translation accuracy still more.

旅行者領域用に開発された日本語と英語の間の翻訳用システムが評価された。トレーニングデータ及びテストデータの説明は、表１に示される。

The system for translation between Japanese and English developed for the tourist domain was evaluated. A description of the training data and test data is shown in Table 1.

効果的なクラスベース型ＳＭＴを実現するために、複数の文ペアにわたる正確な及び一貫性のあるタグ付けが極めて重要である。我々はタグ付け品質を改善する以下の２つのアプローチを研究した；第１は、語−アラインメントからの二言語の特徴の導入；及び第２は、文ペアの両方の側が一緒にタグ付けされる二言語タグ付け。パラレルトレーニングコーパスからの１４，０００文ペアが、表２に示される１６クラスラベルを用いて手動でタグ付けされた。

Accurate and consistent tagging across multiple sentence pairs is extremely important to achieve an effective class-based SMT. We studied the following two approaches to improve tagging quality; first, the introduction of bilingual features from word-alignment; and second, both sides of a sentence pair are tagged together Bilingual tagging. 14,000 sentence pairs from the parallel training corpus were manually tagged using the 16 class labels shown in Table 2.

手動でラベリングされたこの集合から、我々は、タグ付けの正確さを評価するヘルドアウト（held-out）データとして、１つ以上のタグを収容する１０％（１４００文ペア）を選択した。 From this manually labeled set, we selected 10% (1400 sentence pairs) containing one or more tags as held-out data to assess tagging accuracy.

最初に、ベースラインの単一言語ＣＲＦベースのタガーの性能が評価された。ヘルドアウト集合の各側は、言語依存性モデルを用いて独立にラベリングされた。次いで出力は、手動の基準と比較された。種々の測定基準に対するタグ付けの正確さを表３に示す。

First, the performance of the baseline monolingual CRF-based taggers was evaluated. Each side of the held-out set was labeled independently using a language-dependent model. The output was then compared with the manual reference. The tagging accuracy for the various metrics is shown in Table 3.

二言語のタグ付けでは、エンティティがコーパスの両方の側に正確にラベリングされる場合、タグは正確であると考えられる。右側列は、両方の側が正確にタグ付けされた文ペアのパーセンテージを示す。Ｆの得点が個々の言語に対して０．９０を上回っているにもかかわらず、二言語タグ付けの正確さは０．８４に著しく低下し、８０％の文ペアだけが正確にタグ付けされたにすぎない。アラインメント特徴を単一言語タガーへと組み込むことにより、両方の言語に対する精度が改善され、日本語側に対する再現率が著しく改善されたが、しかしながら正確にタグ付けされた文ペアのパーセンテージは僅かだけ増加したにすぎない。複数の文ペア全部にわたって一貫性がないタグを取り除くことにより、精度が改善されたが、しかし正確にタグ付けされた文ペアの数は改善されなかった。 In bilingual tagging, a tag is considered correct if the entity is correctly labeled on both sides of the corpus. The right-hand column shows the percentage of sentence pairs in which both sides were tagged correctly. Although the F-score is above 0.90 for each individual language, the bilingual tagging accuracy is markedly lower at 0.84, and only 80% of the sentence pairs were tagged correctly. Incorporating alignment features into the monolingual taggers improved precision for both languages and markedly improved recall for the Japanese side; however, the percentage of correctly tagged sentence pairs increased only slightly. Removing tags that were inconsistent across sentence pairs improved precision, but the number of correctly tagged sentence pairs did not improve.

次に、二言語タグ付けの有効性が、上述したアプローチを用いて評価された。このアプローチのタグ付けの正確さ、及び語アラインメント特徴が組み込まれた場合が、表３の下側の２行に示される。単一言語の事例と比較すると、二言語タグ付けはタグ付けの正確さを著しく改善した。タグ付けの一貫性が改善された（二言語タグ付けに対するＦの得点は、０．８４から０．９５へと増加した）だけでなく、英語及び日本語の両方の側におけるタグ付けの正確さも改善された。語−アラインメント特徴を組み込むことにより、全ての測定に対するタグ付けの正確さにおいて更に小さい改善が得られた。 Next, the effectiveness of bilingual tagging was evaluated using the approach described above. The tagging accuracy of this approach, with and without word-alignment features, is shown in the bottom two rows of Table 3. Compared with the monolingual case, bilingual tagging markedly improved tagging accuracy. Not only did tagging consistency improve (the F-score for bilingual tagging increased from 0.84 to 0.95), but tagging accuracy on both the English and the Japanese side improved as well. Incorporating word-alignment features yielded a further small improvement in tagging accuracy on all measures.

更に、３つのクラスベース型システムとクラスモデルを使用しないベースラインシステムとの性能を比較することによって、システムの有効性が評価された。 In addition, the effectiveness of the system was evaluated by comparing the performance of three class-based systems with a baseline system that did not use a class model.

ベースラインシステムに対して、句−ベースの翻訳モデルが、Ｋｏｅｈｎ０５及びＧＩＺＡ＋＋（これは以下の先行技術文献などに使用されている）などに記載のＭｏｓｅｓツールキットを用いてトレーニングされた：Franz Josef Och, Hermann Ney. "A Systematic Comparison of Various Statistical Alignment Models", Computational Linguistics, volume 29, number 1, pp.19-51 March 2003。３−グラム言語モデルは、以下の先行技術文献に記載のＳＲＩＬＭツールキットを用いてトレーニングされた:A. Stolcke "SRILM - an extensible language modeling toolkit", In Proc. of ICSLP, pp. 901-904, 2002。復号化は、ＰａｎＤｏＲＡ復号化器を用いて実行された。この復号化器は、以下の先行技術文献に記載されている:Ying Zhang, Stephan Vogel, "PanDoRA: A Large-scale Two-way Statistical Machine Translation System for Hand-held Devices, "In the Proceedings of MT Summit XI, Copenhagen, Denmark, Sep. 10-14 2007。システムは、両方の翻訳方向Ｊ→Ｅ（日本語から英語へ）及びＥ→Ｊ（英語から日本語へ）に対して、表１に記載のトレーニング集合を用いて作成された。目的言語モデルをトレーニングするために使用されるデータは、このコーパスに限定された。ベースラインシステムの翻訳品質は、６００文のテスト集合上で評価された。評価の間、１つの基準が使用された。Ｊ→Ｅ及びＥ→Ｊシステムに対するＢＬＥＵ得点は、それぞれ０．４３８１及び０．３９４７であった。ＢＬＥＵ得点は、以下の先行技術文献に記載されている:Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu "BLEU: a Method for Automatic Evaluation of Machine Translation," In Proc. Association for Computational Linguistics, pp. 311-318, 2002。異なる３つのタグ付け法を用いる翻訳品質が評価された：
＋ｎｕｍ：数、時間に関連した８クラス
＋ＮＥ−クラス：上述クラス＋固有表現用の８クラス
＋Ｂｉ−タグ付け：上述の１６クラス、二言語でタグ付けされたトレーニングコーパス For the baseline system, phrase-based translation models were trained using the Moses toolkit, as described in Koehn05, and GIZA++ (as used in, e.g., Franz Josef Och, Hermann Ney, "A Systematic Comparison of Various Statistical Alignment Models," Computational Linguistics, volume 29, number 1, pp. 19-51, March 2003). 3-gram language models were trained using the SRILM toolkit, described in: A. Stolcke, "SRILM - an extensible language modeling toolkit," In Proc. of ICSLP, pp. 901-904, 2002. Decoding was performed using the PanDoRA decoder, described in: Ying Zhang, Stephan Vogel, "PanDoRA: A Large-scale Two-way Statistical Machine Translation System for Hand-held Devices," In the Proceedings of MT Summit XI, Copenhagen, Denmark, Sep. 10-14, 2007. Systems were built for both translation directions, J→E (Japanese to English) and E→J (English to Japanese), using the training set described in Table 1. The data used to train the target language models was limited to this corpus. The translation quality of the baseline system was evaluated on a test set of 600 sentences. One reference was used during evaluation. The BLEU scores for the J→E and E→J systems were 0.4381 and 0.3947, respectively. The BLEU score is described in: Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," In Proc. Association for Computational Linguistics, pp. 311-318, 2002. Translation quality was evaluated using three different tagging schemes:
+num: 8 classes related to numbers and times
+NE-class: the above classes plus 8 classes for named entities
+Bi-tagging: the above 16 classes, with the training corpus tagged bilingually

単一言語タグ付けは、＋ｎｕｍ及び＋ＮＥ−クラスの事例に対して適用され、文ペアにわたって一貫性がないタグは取り除いた。＋Ｂｉ−タグ付けの事例では、語アラインメント特徴を組み込む二言語タグ付けが使用された。タグ付け法のそれぞれに対して、トレーニングコーパス全体が、クラス−ラベルの適切な集合でタグ付けされた。次いでクラスベース型翻訳モデル及び言語モデルは、ベースラインシステムにおいて使用されるのと等価なやり方を用いてトレーニングされた。テストの間、入力文は単一言語タガーを用いてタグ付けされた。テスト集合内の全ての固有表現は、翻訳の間に使用するために、ユーザ辞書へと入力された。 Monolingual tagging was applied to the + num and + NE-class cases, removing inconsistent tags across sentence pairs. In the + Bi-tagging case, bilingual tagging that incorporates word alignment features was used. For each of the tagging methods, the entire training corpus was tagged with an appropriate set of class-labels. The class-based translation model and language model were then trained using a method equivalent to that used in the baseline system. During the test, input sentences were tagged using a monolingual tagger. All named expressions in the test set were entered into the user dictionary for use during translation.

Ｊ→Ｅ及びＥ→ＪシステムのＢＬＥＵ得点について、ベースライン及びクラスベース型システムに対する６００文のテスト集合における性能を表４に示す。

Table 4 shows the BLEU scores of the baseline and class-based systems on the 600-sentence test set for the J→E and E→J directions.

数及び時間タグ（＋ｎｕｍ）を用いるクラスベース型ＳＭＴシステムが、ベースラインシステムと比較して、両方の翻訳方向に対して改善された翻訳品質を得た。これらのモデルについて、０．４４４１及び０．４１０４というＢＬＥＵ得点が得られた。数及び時間タグに加えて固有表現を用いるクラスベース型システムが適用される場合、翻訳品質は著しく改善された。Ｊ→Ｅシステムについて０．５０１４、及びＥ→Ｊの事例について０．４４６４というＢＬＥＵ得点が得られた。トレーニングコーパスをタグ付けするために二言語タグ付けが使用されたとき（＋Ｂｉ−タグ付け）、ＢＬＥＵにおいて両方の翻訳方向に対して更に０．８点の増加が得られた。１つ以上の固有表現を含むテスト集合内の１４％の文において、（＋Ｂｉ−タグ付け）システムは、単一言語でタグ付けされたシステム（「＋ＮＥ−クラス」）を、最大３．５ＢＬＥＵ点まで優れていた。 The class-based SMT system using number and time tags (+num) achieved better translation quality than the baseline system in both translation directions; BLEU scores of 0.4441 and 0.4104 were obtained for these models. When a class-based system using named-entity classes in addition to number and time tags was applied, translation quality improved markedly: BLEU scores of 0.5014 for the J→E system and 0.4464 for the E→J case were obtained. When bilingual tagging was used to tag the training corpus (+Bi-tagging), a further gain of 0.8 BLEU points was obtained for both translation directions. On the 14% of sentences in the test set that contain one or more named entities, the +Bi-tagging system outperformed the monolingually tagged system (+NE-class) by up to 3.5 BLEU points.
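For readers who want the gains in BLEU points rather than raw scores, the short calculation below derives them from the figures quoted in this paragraph and in the baseline description. It uses only those stated scores; the variable names are illustrative, and the +Bi-tagging scores are not computed because only the size of that gain is given.

```python
# BLEU scores quoted in the text, as (J->E, E->J).
scores = {
    "baseline": (0.4381, 0.3947),
    "+num": (0.4441, 0.4104),
    "+NE-class": (0.5014, 0.4464),
}


def gain_in_points(system, reference="baseline"):
    """Absolute BLEU gain over the reference system, in points (score x 100)."""
    return tuple(
        round((s - r) * 100, 2) for s, r in zip(scores[system], scores[reference])
    )


num_gain = gain_in_points("+num")
ne_gain = gain_in_points("+NE-class")
```

The calculation shows that the number and time classes alone account for only a small part of the improvement; most of it arrives once named-entity classes are added.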

上述したように非常に詳細な内容が説明されたが、図面及び詳細な実施形態は、説明するために提供されたものであり、限定するためではないことが理解されるべきである。設計及び構成の変形が、本発明の原理内でなされる場合もある。当業者は、本発明のこのような変更若しくは修正若しくは要素の組み合わせ、変形、均等物、又は改善が、添付の請求項において定められるような本発明の範囲内に、依然として存在することを理解することになる。 Although considerable detail has been described above, it should be understood that the drawings and detailed embodiments are provided for purposes of illustration and not limitation. Variations in design and configuration may be made within the principles of the invention. Those skilled in the art will appreciate that such changes, modifications, combinations of elements, variations, equivalents, or improvements remain within the scope of the invention as defined in the appended claims.

Claims

1. A method comprising:
receiving from a user an utterance in a first language that includes a first term associated with a field;
translating the utterance from the first language into a second language in a speech translation system;
receiving an instruction to add the first term associated with the field to a first recognition dictionary of a first automatic speech recognition module of the speech translation system;
adding the first term associated with the field and the translation of the utterance into the second language to a first machine translation module associated with the first language in the speech translation system; and
adding the first term associated with the field and the translation of the utterance into the second language to a shared database for a community associated with the field of the first term.

2. The method of claim 1, comprising:
receiving a second term associated with the field from the shared database; and
adding the second term associated with the field from the shared database to the first recognition dictionary and to the first machine translation module associated with the first language in the speech translation system.

3. The method of claim 1, comprising, if the translation into the second language added to the shared database includes an inaccurate or incomplete translation, requesting a corrected translation for the first term associated with the field that was added to the shared database.

4. The method of claim 1, further comprising:
obtaining, from the shared database, a corrected translation of the first term related to the field; and
adding the corrected translation from the shared database to the first machine translation module associated with the first language in the speech translation system.

5. The method of claim 1, wherein a corrected translation of the first term associated with the field in the shared database is provided by a second user of the community.

6. The method of claim 5, wherein the shared database stores terms related to one or more fields, including one or more of names, wordings, and special terms.

7. The method of claim 1, further comprising simultaneously displaying, as text on a user interface display of the speech translation system, at least (i) the received utterance in the first language and (ii) the translation of the utterance into the second language.

8. The method of claim 1, wherein translating the utterance from the first language into the second language includes generating the translation into the second language via either a rule-based model or a statistical model.

9. The method of claim 1, further comprising:
when the instruction to add the first term is received by a processor, determining, for the first term, word class information, a pronunciation in the first language, a translation into the second language, and a pronunciation in the second language; and
adding the determined word class information of the first term and the determined pronunciation in the first language to the first recognition dictionary of the first language in the first automatic speech recognition module.

10. A computer program product comprising a non-transitory computer-readable recording medium containing computer program code, the computer program code being for executing the steps of:
receiving from a user an utterance in a first language that includes a first term related to a field;
translating the utterance from the first language into a second language in a speech translation system;
receiving an instruction to add the first term related to the field, in the first language, to a first recognition dictionary for the first language of a first automatic speech recognition module of the speech translation system;
adding the first term related to the field and the translation of the utterance into the second language to a first machine translation module associated with the first language in the speech translation system; and
adding the first term related to the field and the translation of the utterance into the second language to a shared database for a community associated with the field of the first term,
wherein the first term related to the field added to the shared database is accessible by the community.

11. The computer program product of claim 10, further comprising computer program code for executing the steps of:
receiving a second term related to the field from the shared database; and
adding the second term related to the field from the shared database to the first recognition dictionary and to the first machine translation module associated with the first language in the speech translation system.

12. The computer program product of claim 10, further comprising computer program code for executing the step of, if the translation into the second language added to the shared database includes an insufficient translation, requesting that the translation be corrected for the first term related to the field that was added to the shared database.

13. The computer program product of claim 10, further comprising computer program code for executing the steps of:
obtaining, from the shared database, a corrected translation of the first term related to the field; and
adding the corrected translation from the shared database to the first machine translation module associated with the first language in the speech translation system.

14. The computer program product of claim 10, wherein a corrected translation of the first term associated with the field in the shared database is provided by a second user of the community.

15. The computer program product of claim 14, wherein the shared database stores terms related to one or more fields, including one or more of names, wordings, and special terms.

16. The computer program product of claim 10, further comprising computer program code for simultaneously displaying, as text on a user interface display of the speech translation system, at least (i) the received utterance in the first language and (ii) the translation of the utterance into the second language.

17. The computer program product of claim 10, wherein translating the utterance from the first language into the second language includes generating the translation into the second language via either a rule-based model or a statistical model.

18. The computer program product of claim 10, further comprising computer program code for executing the steps of:
when the instruction to add the first term is received by a processor, determining, for the first term, word class information, a pronunciation in the first language, a translation into the second language, and a pronunciation in the second language; and
adding the determined word class information of the first term and the determined pronunciation in the first language to the first recognition dictionary of the first language in the first automatic speech recognition module.

19. A speech translation system comprising:
a computer processor;
a memory;
an automatic speech recognition module that receives from a user an utterance in a first language that includes a first term related to a field;
a translation module that translates the utterance from the first language into a second language; and
a user site customization module that receives an instruction to add the first term related to the field to a first recognition dictionary of a first automatic speech recognition module of the speech translation system, and that adds the first term related to the field and the translation of the utterance into the second language to a first machine translation module associated with the first language in the speech translation system,
wherein the user site customization module is further configured to add the first term related to the field and the translation of the utterance into the second language to a shared database for a community associated with the field of the first term.
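
For illustration only, the term-sharing workflow recited in the claims above (adding a term locally, publishing it to a community database, and later pulling new or corrected entries) might be sketched as follows. This is a minimal sketch and not the patented implementation: the class names `SharedDatabase` and `SpeechTranslationSystem`, their methods, and the sample term and translations are all hypothetical, and the recognition dictionary and machine translation module are reduced to a plain set and dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDatabase:
    """Hypothetical community store: field name -> {term: translation}."""
    entries: dict = field(default_factory=dict)

    def add(self, field_name, term, translation):
        # Adding an existing term again overwrites its translation,
        # which is how a correction by another user is modeled here.
        self.entries.setdefault(field_name, {})[term] = translation

    def terms_for(self, field_name):
        # Return a copy so callers cannot mutate the shared store.
        return dict(self.entries.get(field_name, {}))


@dataclass
class SpeechTranslationSystem:
    """Toy stand-in for one user's recognition dictionary and MT module."""
    field_name: str
    shared_db: SharedDatabase
    recognition_dictionary: set = field(default_factory=set)
    mt_entries: dict = field(default_factory=dict)

    def add_term(self, term, translation):
        # Add the term locally, then publish it for the community.
        self.recognition_dictionary.add(term)
        self.mt_entries[term] = translation
        self.shared_db.add(self.field_name, term, translation)

    def sync_from_shared(self):
        # Pull terms and corrected translations contributed by other users.
        for term, translation in self.shared_db.terms_for(self.field_name).items():
            self.recognition_dictionary.add(term)
            self.mt_entries[term] = translation


if __name__ == "__main__":
    db = SharedDatabase()
    user_a = SpeechTranslationSystem("medical", db)
    user_b = SpeechTranslationSystem("medical", db)

    user_a.add_term("triage", "draft translation")      # placeholder translation
    user_b.sync_from_shared()                           # second user receives the term
    db.add("medical", "triage", "corrected translation")  # correction by a second user
    user_a.sync_from_shared()                           # first user obtains the correction
    print(user_a.mt_entries)
```

In this sketch a correction simply overwrites the shared entry; a real system would also need the word class and pronunciation data recited in claims 9 and 18, which are omitted here.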