JP3708753B2

JP3708753B2 - Translation word selection dictionary automatic creation device and automatic translation device

Info

Publication number: JP3708753B2
Application number: JP15344199A
Authority: JP
Inventors: 玄一郎菊井
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1999-06-01
Filing date: 1999-06-01
Publication date: 2005-10-19
Anticipated expiration: 2019-06-01
Also published as: JP2000348031A

Description

【０００１】
本発明は、自動翻訳装置、翻訳支援装置等、テキスト中の単語の正しい訳語を選択する処理を含む自然言語処理装置において利用可能な訳語選択辞書自動作成装置及び自動翻訳装置に関するものである。
【０００２】
【従来の技術】
従来より、ある単語に対して選択可能な複数の訳語の中から、その単語の出現する文脈において正しい意味に対応する訳語を選択する「訳語選択」処理の一手法として、次のような訳語選択辞書を使う方法が提案され、実用化されている。
【０００３】
即ち、予め単語の各用例に対して最も適切な訳語を対応付けた「訳語選択辞書」を作成しておき、この訳語選択辞書の中から、訳出対象単語の出現する文脈に最も適合する用例を検索し、その用例に対応する訳語を出力する方法である。
【０００４】
ここで、用例とは訳すべき単語の周囲に出現する語句であり、計算機内部においては単語の列、あるいは単語を品詞等のクラスに置き換えた列で表現される。また、訳出対象単語の出現する文脈も同様に単語等の列で表現され、文脈と用例の適合の度合いは、これら単語列の一致の度合い、あるいは一方に含まれる単語がもう一方に含まれる単語と同一文脈に出現する確率等によって定義される。
【０００５】
このような処理において、本質的な役割を持つ訳語選択辞書は人手で作成される場合の外、大きく分けて次の二種類の自動作成手法が提案されている。
【０００６】
第１の方法は、大量の二言語対照テキストを統計的に処理するものであり、第２の方法は、大量の翻訳元言語のテキスト中の対象単語に予め人手で選ばれた正しい訳語を付与しておき、これを統計的に処理するものである。ここで、二言語対照テキストとは対訳関係にあるテキスト対のことであり、しばしば一方のテキストのどの文が、他方のテキストのどの文と対訳関係にあるかが明示されている。
【０００７】
一方、ある概念、あるいはテキストを特徴づけるような単語のリストを入力とし、翻訳先言語の大量のテキストデータにおける訳語候補の出現分布を用いて、訳語選択辞書を使うことなく最適な訳語を得る「単語リスト翻訳」の手法が既に提案されている。翻訳しようとする単語とその文脈中の単語を組み合わせて単語リストを作ることによって、この手法を前記「訳語選択」の問題に適用することが考えられる。
【０００８】
【発明が解決しようとする課題】
しかしながら、前記第１の方法では、必要とする分野、言語の組に対して大量の対訳テキストが必要であり、これを入手あるいは作成するのには多大なコストがかかるという問題があった。また、第２の方法では、テキスト中の単語に対して正しい訳語を人手で付与しなければならず、同様に多大なコストがかかるという問題があった。
【０００９】
また、前記「単語リスト翻訳」を訳語選択に適用する場合、翻訳対象単語の意味を決定づけるような入力文脈中の単語の訳語が、翻訳先言語のテキストデータに存在するとは限らないため、必ずしも正しい結果が得られないという問題があった。
【００１０】
本発明の目的は、大量の対訳テキストや最適訳語を付与した用例の集合を必要とすることなく、翻訳対象単語に最も適切な訳語を対応づけた訳語選択辞書が作成できる訳語選択辞書自動作成装置を提供することにある。
【００１１】
本発明の他の目的は、前記訳語選択辞書自動作成装置によって作成された訳語選択辞書を用いて、最も適切な訳語を選択し得る自動翻訳装置を提供することにある。
【００１２】
【課題を解決するための手段】
本発明では、前記目的を達成するため、任意の単語Ｗの用例を単語Ｗの用法に基づいて分類し、それぞれの分類に最も適合する単語Ｗの訳語を対応づけた訳語選択辞書を自動的に作成する装置において、翻訳元言語の大量のテキストを蓄積している翻訳元言語コーパスから単語Ｗと一致する単語を全て検出し、その前後のｎ単語からなる文字列である用例を作業用メモリに書き込み、用例中の単語をベクトルで表し、１つの用例をその用例に含まれる単語に対するベクトルの荷重和ベクトルで表し、任意の二つの用例間の距離を当該任意の二つの用例にそれぞれ対応する前記荷重和ベクトルのなす角のコサイン値で定義して、クラスタリングにより前記作業用メモリに書き込まれた用例を複数の部分集合に分割し、用例集合ＤＢに書き込む用例集合生成部と、前記部分集合を構成する全ての用例に含まれる各異なり単語に対して、その単語が当該部分集合を構成する用例中に出現する回数が多いほど且つその単語を含む用例の個数が少ないほど高い値となるように定義付けたスコアを計算し、スコアの大きいものからｍ個の単語をその部分集合を特徴付ける単語の集合として抽出し、これを前記用例集合ＤＢに書き込まれた各部分集合に対して実行する特徴語抽出部と、用例集合生成部で得られた各部分集合に対して、特徴語抽出部によって抽出されたｍ個の特徴単語を１〜ｍ番目の要素とし、辞書作成対象単語Ｗをｍ＋１番目の要素とした入力単語リストを生成し、各入力単語リストについて構成する各単語の訳語を対訳辞書から取得し、構成する各単語の訳語の組み合わせの数の訳語リストを作成し、訳語リスト内の各単語について翻訳先言語の大量のテキストを蓄積している翻訳先言語コーパスを用いて単語のベクトルを求め、訳語リスト内の単語の平均ベクトルを求め、各単語ベクトルと平均ベクトルとのコサイン値の平均を各訳語リストの関連性の値として求め、関連性が最大の訳語リストを選択し、該訳語リスト中の最後の単語を各部分集合の単語Ｗの訳語として出力する単語リスト翻訳部とを備えたことを特徴とする。
【００１３】
前記構成によれば、まず、用例集合生成部によって、大量のテキスト中から抽出された単語Ｗの用例の集合が用例同士の類似性に基づいて複数の部分集合に自動的に分類される。ここで、用例とは単語Ｗの文脈の別名であること、及び「単語の意味はその周囲の文脈によって決定される」という経験則から、用例集合生成部によって作成される部分集合は単語Ｗの一つの意味に対応する。
【００１４】
次に、特徴語抽出部によって、用例集合生成部を用いて分類された用例の部分集合からその部分集合を特徴付ける単語の集合が抽出される。
【００１５】
最後に、単語リスト翻訳部によって、用例集合生成部で得られた各部分集合に対して、対訳辞書から得られる単語Ｗの訳語の中より、前記特徴語抽出部によって抽出された集合と最も関連性の高い訳語が対応づけられる。
【００１６】
これによって、大量の対訳テキストや最適訳語を付与した用例の集合を用いることなく、翻訳元言語及び翻訳先言語それぞれの大量のテキスト、並びに対訳辞書のみから、単語Ｗの用例をその意味によって分類し、各分類に対して単語Ｗの訳語を対応づけた「訳語選択辞書」が自動的に作成される。
【００１７】
また、本発明では、前記目的を達成するため、単語Ｗとその文脈の文字列である入力用例が与えられた時、この入力用例に最も適合する単語Ｗの訳語を選択する装置において、請求項１記載の訳語選択辞書自動作成装置によって作成された訳語選択辞書の中から、単語Ｗに対する全ての訳語とこれらの各訳語に対する用例の部分集合を読み込み、入力用例のベクトルと各部分集合を構成する各用例のベクトルの平均とのコサイン値を関連度として求め、関連度が最大の部分集合を選択し、これと対応づけられた訳語を出力する用例集合検索部を備えたことを特徴とする。
【００１８】
前記構成によれば、前記装置によって作成された訳語選択辞書中の訳語のうちで、入力用例に最も適した訳語を自動的に選択できる。
【００１９】
【発明の実施の形態】
【００２０】
【実施の形態１】
図１は本発明の訳語選択辞書自動作成装置の実施の形態の一例を示すもので、図中、１１は用例集合生成部、１２は翻訳元言語コーパス、１３は用例集合データベース（ＤＢ）、１４は特徴語抽出部、１５は単語リスト翻訳部、１６は対訳辞書、１７は翻訳先言語コーパス、１８は出力部、１９は制御部、２０は訳語選択辞書である。
【００２１】
用例集合生成部１１は、源言語単語（辞書作成対象単語）３０に対して、この単語を含む用例を、翻訳元言語の大量のテキストを蓄積している翻訳元言語コーパス１２から取得し、類似性の高いもの同士を「用例集合」と呼ばれる部分集合にまとめ、用例集合ＤＢ１３に登録する。
【００２２】
特徴語抽出部１４は、用例集合ＤＢ１３の各用例集合に対して、その集合を特徴づける単語のリスト（特徴単語リスト）を抽出する。
【００２３】
単語リスト翻訳部１５は、辞書作成対象単語３０に対して対訳辞書１６に存在する全ての訳語のうち、特徴語抽出部１４から出力される特徴単語リストと最も関連性の高いものを、翻訳先言語の大量のテキストを蓄積している翻訳先言語コーパス１７の情報をもとに一つ選択して出力する。
【００２４】
出力部１８は、単語リスト翻訳部１５から出力された辞書作成対象単語３０の訳語と用例集合ＤＢ１３の対応する用例集合とを対にして訳語選択辞書２０に書き込む。
【００２５】
制御部１９は、各構成要素の動作とこれらの間のデータの流れを制御する。
【００２６】
以下、図１の各部の動作について例を用いて詳細に説明する。
【００２７】
なお、ここでは翻訳元言語を英語、翻訳先言語を日本語、辞書作成対象単語を”ｓｕｉｔ”とし、対訳辞書１６中に規定されている”ｓｕｉｔ”に対する訳語は「スーツ」と「裁判」の２語であるとする。
【００２８】
図２は制御部１９の動作を示すフローチャートである。
【００２９】
まず、ステップ０において配列をクリヤする等の必要な初期化を行った後、ステップ１で辞書作成対象単語３０を読み込む。次に、ステップ２で対訳辞書１６を参照して辞書作成対象単語３０の訳語の多義数を取得し、その値に「０」以上の整数ｐ（例えば、「２」）を加えた値を変数ｍに代入する。
【００３０】
ここで、ｍは辞書作成対象単語３０の用例を分類する際の分類数であり、ｐが「０」以上であることから、この数は訳語の多義数以上のある整数となる。例えば、対訳辞書１６に規定されている”ｓｕｉｔ”の訳語数は２であるとすると、クラスタ（分類）の数はそれにｐ＝２だけ加えた「４」である。
【００３１】
ステップ３では辞書作成対象単語３０と変数ｍを入力として用例集合生成部１１を動作させる。その後、ステップ３で作成された用例集合ＤＢ１３中の各用例集合に対して、ステップ４からステップ７の処理を実行し、訳語選択辞書２０を作成する。
【００３２】
即ち、ステップ４では処理対象の用例集合を入力として特徴語抽出部１４を起動する。ステップ５では特徴語抽出部１４から得られた単語の集合に辞書作成対象単語３０を加え、ステップ６ではこの単語集合を入力として単語リスト翻訳部１５を動作させる。ステップ７では出力部１８において辞書作成対象単語３０に対するステップ４で得られた訳語と用例集合生成部１１との内容を結合して訳語選択辞書２０に書き込む。
【００３３】
図３は用例集合生成部１１の動作を示すフローチャートである。
【００３４】
まず、ステップ１０で初期化を行い、次のステップ１１で辞書作成対象単語３０と分類クラスタ数を読み込む。ステップ１２は翻訳元言語コーパス１２から辞書作成対象単語３０と一致する単語を検出し、その前後のｎ単語からなる文字列を作業用メモリに書き込む処理である。この文字列を「用例」と呼ぶ。ここで、記憶媒体中のテキストから特定の文字列を検出する効率的な方法については既にアルゴリズムの教科書等で詳述されているので、説明を省略する。
【００３５】
ステップ１３ではステップ１２で得られた用例の集合をｍ個の部分集合に、クラスタリングと呼ばれる手法を用いて分割する。この部分については後述する。ステップ１４では得られたクラスタを用例集合ＤＢ１３に書き込む。
【００３６】
ここで、クラスタリングの動作について説明する。クラスタリングにおいては与えられた集合中の任意の二つの要素の間の「距離」を予め定義し、同じ部分集合に属する要素間では距離が平均的に短く、異なった部分集合に属する要素間では距離が平均的に長くなるように部分集合を構成する。
【００３７】
本実施の形態において、要素とは用例、即ち単語のリストを表す文字列であり、要素間の距離は用例の間の類似性を数量化した値である。２つの用例の間の類似性を測定する最適な手法は用例の性質によって異なり得るが、一つの実施の形態として、ここでは次のような方法を用いる。
【００３８】
まず、用例中の単語ｗをベクトルｖ（ｗ）（なお、本明細書ではベクトルを表す記号をアンダーラインで代用するものとする。）で表し、次に、１つの用例をその用例に含まれる単語に対するベクトルの荷重和ベクトルで表す。即ち、用例ｕに含まれる単語の集合をＷ_uで表すと、この用例に対応するベクトルｖ（ｕ）は
【００３９】
【数１】

【００４０】
で定義される。ここで、重みｃ_wは各単語ｗに対して処理の精度を勘案して与えられる定数である。最後に、２つの用例ベクトルの間の距離ｄ（ｖ（ｕ₁），ｖ（ｕ₂））はこれら２つのベクトルのなす角のコサイン値で与えられる。
【００４１】
【数２】

【００４２】
単語ｗをベクトルｗで表現する方法として、ここではベクトルの第ｉ成分を１つの単語ｗ_iに対応づけ、ｗの第ｉ成分の値を翻訳元言語コーパス１２においてｗとｗ_iとが近傍に出現する個数とする方法を用いる。一般に、単語をベクトルで表現する手法はこの他にも様々なものが既に提案されており、それらを用いても良い。
【００４３】
なお、要素間の距離の定義が与えられた集合に対するクラスタリング手法は、従来から様々なアルゴリズムが知られているので、ここでは説明を省略する。
【００４４】
図４は辞書作成対象単語に対する用例集合ＤＢ１３の一例を示すもので、同図（ａ）は用例と用例の識別番号を対応づけたテーブル、同図（ｂ）は用例の識別番号を部分集合（クラスタ）別に分類したテーブルである。
【００４５】
次に、特徴語抽出部１４について説明する。特徴語抽出部１４は用例の集合に対してその集合を特徴づけるような単語のリストを出力する機能をもつ。特徴単語を抽出する最適な手法は対象とするテキストによって変わり得る。本実施の形態においては用例集合中の各用例を連結してできる文字列を一つのテキストとみなして、既存のテキストからのキーワード抽出手法を用いて特徴語を抽出する。
【００４６】
具体的には、用例集合中の各異なり単語ｗに対して次の式によってスコアｓ（ｗ）を計算し、スコアの大きいものからｍ語の特徴語を抽出する。
【００４７】
【数３】

【００４８】
図５は図４中の文脈文字列Ｃ（ｉ）に対する特徴単語のリストの一例を示す。
【００４９】
図６は単語リスト翻訳部１５の動作を示すフローチャートである。この処理は大きく訳語候補生成部分（ブロック１）と多義解消部分（ブロック２）の２つのブロックから構成される。
【００５０】
訳語候補生成部分では対訳辞書１６を参照することによって、入力装置から与えられた単語リストの各単語をそれぞれ選択可能な訳語と置き換えることによって訳語候補を生成する。
【００５１】
具体的には、ステップ２０でＳＬ［ｉ］に入力単語リストの各単語を読み込んだ後、ステップ２１において各単語に対して対訳辞書１６から選択可能な訳語を全て取得し、配列ｄｉｃに代入する。次のステップ２２では、各入力単語に対してｄｉｃに存在する訳語の中から一つを選んで訳語リストを作る。ここで、もし各入力単語に対して複数の訳語が存在する場合は、それらの組み合わせの数だけ訳語リストを作成する。
【００５２】
多義解消部分では各訳語候補リストの中から、リスト内の単語関連性が翻訳先言語において最も高いものを選ぶ処理を行う。
【００５３】
具体的には、ステップ２３において、各訳語リストＴに対して次の式で定義される意味的関連性の値ｒｅｌ（Ｔ）を計算する。
【００５４】
【数４】

【００５５】
ここで、ｖ（ｔ）は式（１）によって単語ｔをベクトルに対応づけたもの、ｃ（Ｔ）は単語集合Ｔの各単語に対応するベクトルの重心を表すベクトル、ｄは式（２）で定義されるベクトル間の距離である。
【００５６】
次に、ステップ２４において、関連性の値が最大の訳語リストを選び、最後にステップ２５において選ばれた訳語リストの最後の単語、即ち辞書作成対象単語３０に対する訳語を出力する。
【００５７】
出力部１８は用例集合ＤＢ１３の各用例集合とこれに対応する訳語を結合して訳語選択辞書２０に出力する。最終的に生成される訳語選択辞書の一例を図７に示す。
【００５８】
【実施の形態２】
図８は本発明の自動翻訳装置の実施の形態の一例を示すもので、図中、４１は翻訳元言語単語リスト、４２は実施の形態１で説明した訳語選択辞書自動作成装置、４３は訳語選択辞書、４４は用例集合検索部である。
【００５９】
訳語選択辞書自動作成装置４２は、翻訳対象単語を収集したデータベースである翻訳元言語単語リスト４１中の各単語を入力として訳語選択辞書、即ちその訳語と対応する用例集合との対を自動作成し、訳語選択辞書４３に出力する。
【００６０】
用例集合検索部４４は、与えられた文脈中に出現する単語Ｗに対して、訳語選択辞書４３を参照してこの文脈における単語Ｗの訳語として最も適切なものを選択し、出力する。
【００６１】
各部の動作の順序及び内容は次の通りである。
【００６２】
まず、翻訳元言語単語リスト４１の各単語に対して訳語選択辞書自動作成装置４２を適用して訳語選択辞書４３の作成を行う。この処理は与えられた翻訳元言語単語リスト４１に対して一回だけ行われる。処理の内容は既に実施の形態１で説明したので、ここでは省略する。
【００６３】
訳語選択辞書４３が作成された後、用例集合検索部４４を用いて翻訳処理を行う。
【００６４】
図９は用例集合検索部４４の動作を示すフローチャートである。
【００６５】
まず、ステップ３０において配列等の初期化を行った後、ステップ３１において入力単語と文脈の文字列を読み込む。次のステップ３２では訳語選択辞書４３の入力単語に対する全ての訳語とこれらの各訳語に対する用例の集合を読み込む。ステップ３３では各訳語に対する用例の集合（ｄｃｓｅｔ［ｉ］）の中で入力文脈と最も関連性（ｄ２）の高いものを選ぶ。この関連性計算の一例は後述する。最後のステップ３４では選ばれた用例集合に対応する訳語を出力する。
【００６６】
ここで、用例集合検索部４４で用いる関連度について説明する。なお、本発明において「文脈」とは単語の列から構成される文字列であり、「用例」と等価であるから、以下では「入力文脈」を「入力用例」と呼ぶ。
【００６７】
用例は単語の列であるから、入力用例と訳語選択辞書４３から与えられる用例集合との間の関連度として、ここでは前者と後者をベクトルに変換し、これらの間のコサインによって定義する。前者のベクトル表現としては、実施の形態１の式（１）で示した用例のベクトル表現をそのまま用い、後者のベクトル表現としては、集合内の各文脈を同じく実施の形態１の式（１）によってベクトルに変換したものの重心（平均）を用いる。
【００６８】
即ち、入力用例をｃｓｔｒ、訳語選択辞書４３から得られるｉ番目の用例（単語列）の集合をｄｃｓｅｔ［ｉ］で表現すると、これらの間の関連度ｄ２は次の式で与えられる。
【００６９】
【数５】

【００７０】
ここで、ｄは実施の形態１の式（２）で与えられるベクトル間の「距離」の定義、ｎはｄｃｓｅｔ［ｉ］中の要素の数である。
【００７１】
【発明の効果】
以上説明したように、本発明の訳語選択辞書自動作成装置によれば、単語Ｗの用例の集合が単語Ｗの意味によって自動的に分類され、各分類に対してその分類を特徴付ける単語の集合が分類内の用例から抽出され、前記得られた各分類に対して、単語Ｗの可能な訳語の中から前記抽出された単語の集合を訳したものと最も関連性の高い訳語が選ばれるので、大量の対訳テキストや最適訳語を付与した用例の集合を用いることなく、単語Ｗの用例の集合と翻訳先言語における単語との間の関連性を示す情報のみによって、翻訳元言語の用例に対して最も適切な訳語を対応づけた訳語選択辞書が作成できる。
【００７２】
また、本発明の自動翻訳装置によれば、前記装置によって作成された訳語選択辞書中の訳語のうちで、入力用例に最も適した訳語を自動的に選択できる。
【図面の簡単な説明】
【図１】本発明の訳語選択辞書自動作成装置の実施の形態の一例を示すブロック図
【図２】制御部の動作を示すフローチャート
【図３】用例集合生成部の動作を示すフローチャート
【図４】用例集合ＤＢの一例を示す図
【図５】特徴単語のリストの一例を示す図
【図６】単語リスト翻訳部の動作を示すフローチャート
【図７】訳語選択辞書の一例を示す図
【図８】本発明の自動翻訳装置の実施の形態の一例を示すブロック図
【図９】用例集合検索部の動作を示すフローチャート
【符号の説明】
１１：用例集合生成部、１２：翻訳元言語コーパス、１３：用例集合データベース（ＤＢ）、１４：特徴語抽出部、１５：単語リスト翻訳部、１６：対訳辞書、１７：翻訳先言語コーパス、１８：出力部、１９：制御部、２０：訳語選択辞書、３０：辞書作成対象単語、４１：翻訳元言語単語リスト、４２：訳語選択辞書自動作成装置、４３：訳語選択辞書、４４：用例集合検索部。
[0001]
The present invention relates to a translation word selection dictionary automatic creation device and an automatic translation device that can be used in natural language processing apparatuses, such as automatic translation devices and translation support devices, that include a process of selecting the correct translation of a word in a text.
[0002]
[Prior art]
Conventionally, the following method using a translation word selection dictionary has been proposed and put into practical use as one technique for “translation word selection”, that is, the process of selecting, from among the multiple translations available for a word, the one that corresponds to the correct meaning in the context in which the word appears.
[0003]
That is, a “translation word selection dictionary” in which the most appropriate translation is associated with each usage example of a word is created in advance; the example that best matches the context in which the word to be translated appears is retrieved from this dictionary, and the translation corresponding to that example is output.
[0004]
Here, an example is a phrase that appears around the word to be translated; inside the computer it is represented as a sequence of words, or as a sequence in which words are replaced by classes such as parts of speech. The context in which the word to be translated appears is likewise represented as a sequence of words or the like, and the degree of match between a context and an example is defined by the degree of agreement between these word sequences, or by the probability that a word contained in one appears in the same context as a word contained in the other.
[0005]
The translation word selection dictionary plays an essential role in this processing. Besides being created manually, it can be created automatically, and two broad types of automatic creation method have been proposed.
[0006]
The first method statistically processes a large amount of bilingual parallel text. The second method manually assigns the correct translation, chosen in advance, to each target word in a large amount of source-language text, and then processes this statistically. Here, bilingual parallel text means a pair of texts that are translations of each other; often it is explicitly indicated which sentence in one text is a translation of which sentence in the other.
[0007]
Separately, a “word list translation” method has already been proposed that takes as input a list of words characterizing a concept or a text and obtains the optimal translations, without using a translation word selection dictionary, from the distribution of occurrences of the translation candidates in a large amount of target-language text data. This method could conceivably be applied to the “translation word selection” problem by building a word list from the word to be translated together with the words in its context.
[0008]
[Problems to be solved by the invention]
However, the first method requires a large amount of parallel text for the required domain and language pair, and obtaining or creating such text is very costly. The second method requires correct translations to be manually assigned to the words in the text, which is similarly costly.
[0009]
In addition, when the “word list translation” method is applied to translation word selection, translations of the words in the input context that determine the meaning of the word to be translated do not necessarily exist in the target-language text data, so correct results are not always obtained.
[0010]
An object of the present invention is to provide a translation word selection dictionary automatic creation device that can create a translation word selection dictionary in which the most appropriate translation is associated with the word to be translated, without requiring a large amount of parallel text or a set of examples annotated with optimal translations.
[0011]
Another object of the present invention is to provide an automatic translation device that can select the most appropriate translation by using the translation word selection dictionary created by the translation word selection dictionary automatic creation device.
[0012]
[Means for Solving the Problems]
To achieve the above object, the present invention provides a device that classifies the examples of an arbitrary word W based on the usage of W and automatically creates a translation word selection dictionary in which each class is associated with the best-matching translation of W, the device comprising: an example set generation unit that detects all words matching W in a source language corpus storing a large amount of source-language text, writes to a working memory the examples, each being the character string consisting of the n words before and after a match, represents each word in an example as a vector, represents each example as the weighted sum of the vectors of the words it contains, defines the distance between any two examples as the cosine of the angle between their weighted sum vectors, divides the examples written in the working memory into a plurality of subsets by clustering, and writes the subsets to an example set DB; a feature word extraction unit that, for each distinct word contained in the examples of a subset, calculates a score defined to be higher the more often the word appears in the examples of that subset and the fewer the examples that contain the word, extracts the m highest-scoring words as the set of words characterizing the subset, and performs this for each subset written in the example set DB; and a word list translation unit that, for each subset obtained by the example set generation unit, generates an input word list whose 1st to m-th elements are the m feature words extracted by the feature word extraction unit and whose (m+1)-th element is the dictionary creation target word W, obtains from a bilingual dictionary the translations of each word in the input word list, creates as many translation lists as there are combinations of the translations of the constituent words, obtains a vector for each word in each translation list using a target language corpus storing a large amount of target-language text, obtains the average vector of the words in the translation list, obtains the mean of the cosines between each word vector and the average vector as the relevance value of the translation list, selects the translation list with the maximum relevance, and outputs the last word of that list as the translation of W for the subset.
[0013]
According to the above configuration, the example set generation unit first automatically classifies the set of examples of the word W extracted from a large amount of text into a plurality of subsets based on the similarity between examples. Since an example is simply another name for the context of W, and given the rule of thumb that “the meaning of a word is determined by its surrounding context”, each subset created by the example set generation unit corresponds to one meaning of W.
[0014]
Next, the feature word extraction unit extracts, from each subset of examples classified by the example set generation unit, a set of words characterizing that subset.
[0015]
Finally, the word list translation unit associates each subset obtained by the example set generation unit with the translation, among the translations of W obtained from the bilingual dictionary, that is most closely related to the set extracted by the feature word extraction unit.
[0016]
As a result, a “translation word selection dictionary”, in which the examples of the word W are classified by meaning and each class is associated with a translation of W, is created automatically from only large amounts of text in the source and target languages and a bilingual dictionary, without using a large amount of parallel text or a set of examples annotated with optimal translations.
[0017]
Further, to achieve the above object, the present invention provides a device that, given an input example consisting of a word W and the character string of its context, selects the translation of W that best matches the input example, the device comprising an example set search unit that reads, from the translation word selection dictionary created by the translation word selection dictionary automatic creation device of claim 1, all translations of W and the subset of examples for each translation, obtains as the degree of relevance the cosine between the vector of the input example and the average of the vectors of the examples in each subset, selects the subset with the maximum relevance, and outputs the translation associated with it.
[0018]
According to this configuration, the translation best suited to the input example can be automatically selected from among the translations in the translation word selection dictionary created by the above device.
[0019]
[Embodiments of the Invention]
[0020]
[Embodiment 1]
FIG. 1 shows an example of an embodiment of the translation word selection dictionary automatic creation device of the present invention. In the figure, 11 is an example set generation unit, 12 is a source language corpus, 13 is an example set database (DB), 14 is a feature word extraction unit, 15 is a word list translation unit, 16 is a bilingual dictionary, 17 is a target language corpus, 18 is an output unit, 19 is a control unit, and 20 is a translation word selection dictionary.
[0021]
For a source language word (dictionary creation target word) 30, the example set generation unit 11 obtains examples containing this word from the source language corpus 12, which stores a large amount of source-language text, groups highly similar examples into subsets called “example sets”, and registers them in the example set DB 13.
[0022]
The feature word extraction unit 14 extracts, for each example set in the example set DB 13, a list of words characterizing that set (a feature word list).
[0023]
The word list translation unit 15 selects and outputs, from among all translations of the dictionary creation target word 30 found in the bilingual dictionary 16, the one most closely related to the feature word list output by the feature word extraction unit 14, based on information from the target language corpus 17, which stores a large amount of target-language text.
[0024]
The output unit 18 writes the translation of the dictionary creation target word 30 output from the word list translation unit 15 and the corresponding example set in the example set DB 13 in pairs into the translation word selection dictionary 20.
[0025]
The control unit 19 controls the operation of each component and the flow of data between them.
[0026]
Hereinafter, the operation of each unit in FIG. 1 will be described in detail using an example.
[0027]
Here, the source language is English, the target language is Japanese, and the dictionary creation target word is “suit”; the bilingual dictionary 16 is assumed to define two translations for “suit”: 「スーツ」 (a suit of clothes) and 「裁判」 (a lawsuit or trial).
[0028]
FIG. 2 is a flowchart showing the operation of the control unit 19.
[0029]
First, necessary initialization such as clearing arrays is performed in step 0, and the dictionary creation target word 30 is read in step 1. Next, in step 2, the bilingual dictionary 16 is consulted to obtain the number of distinct translations (the degree of polysemy) of the dictionary creation target word 30, and this number plus an integer p of 0 or more (for example, 2) is assigned to the variable m.
[0030]
Here, m is the number of classes into which the examples of the dictionary creation target word 30 are classified; since p is 0 or more, m is an integer greater than or equal to the number of translations. For example, if the bilingual dictionary 16 defines two translations for “suit”, the number of clusters (classes) is that number plus p = 2, that is, 4.
[0031]
In step 3, the example set generation unit 11 is operated with the dictionary creation target word 30 and the variable m as inputs. Thereafter, the processing from step 4 to step 7 is executed for each example set in the example set DB 13 created in step 3 to create a translation word selection dictionary 20.
[0032]
That is, in step 4, the feature word extraction unit 14 is started with the example set being processed as input. In step 5, the dictionary creation target word 30 is added to the set of words obtained from the feature word extraction unit 14, and in step 6, the word list translation unit 15 is run with this word set as input. In step 7, the output unit 18 combines the translation of the dictionary creation target word 30 obtained in step 6 with the corresponding example set produced by the example set generation unit 11 and writes the result to the translation word selection dictionary 20.
[0033]
FIG. 3 is a flowchart showing the operation of the example set generation unit 11.
[0034]
First, initialization is performed in step 10, and the dictionary creation target word 30 and the number of clusters are read in step 11. In step 12, words matching the dictionary creation target word 30 are detected in the source language corpus 12, and the character string consisting of the n words before and after each match is written to the working memory. This character string is called an “example”. Efficient methods for detecting a specific character string in text held on a storage medium are already described in detail in algorithm textbooks, so their description is omitted here.
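As an illustration of step 12, the following is a minimal sketch, assuming the corpus has already been split into a list of word tokens and that each example keeps the target word together with up to n words on either side. The function name `extract_examples` and the toy corpus are hypothetical and are not taken from the patent.

```python
def extract_examples(tokens, target, n):
    """Collect, for every occurrence of `target`, the n words before and after it.

    Assumption: the example includes the target word itself; near the corpus
    boundaries the window is simply truncated.
    """
    examples = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - n):i]       # up to n preceding words
            right = tokens[i + 1:i + 1 + n]      # up to n following words
            examples.append(left + [tok] + right)
    return examples


# Toy corpus (hypothetical) containing both senses of "suit".
corpus = "he wore a dark suit to the court where the suit was filed".split()
examples = extract_examples(corpus, "suit", 2)
```

A linear scan is enough for a sketch; for a large corpus an index or a dedicated string search algorithm would normally be used, as the text notes.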
[0035]
In step 13, the set of examples obtained in step 12 is divided into m subsets using a technique called clustering. This part will be described later. In step 14, the obtained cluster is written in the example set DB 13.
[0036]
Here, the clustering operation is described. In clustering, a “distance” between any two elements of a given set is defined in advance, and the subsets are formed so that the distance between elements belonging to the same subset is short on average and the distance between elements belonging to different subsets is long on average.
[0037]
In the present embodiment, an element is an example, that is, a character string representing a list of words, and the distance between elements is a value obtained by quantifying the similarity between examples. The optimum method for measuring the similarity between two examples may vary depending on the nature of the example, but as one embodiment, the following method is used here.
[0038]
First, each word w in an example is represented by a vector v(w) (in this specification, underlining is used in place of the usual vector notation), and each example is then represented by the weighted sum of the vectors of the words it contains. That is, when the set of words contained in an example u is denoted W_u, the vector v(u) corresponding to this example is defined by
[0039]
[Expression 1]
\[ v(u) = \sum_{w \in W_u} c_w \, v(w) \]
[0040]
Here, the weight c_w is a constant assigned to each word w with the accuracy of the processing in mind. Finally, the distance d(v(u_1), v(u_2)) between two example vectors is given by the cosine of the angle between the two vectors.
[0041]
[Expression 2]
\[ d(v(u_1), v(u_2)) = \frac{v(u_1) \cdot v(u_2)}{\lVert v(u_1) \rVert \, \lVert v(u_2) \rVert} \]

[0042]
To represent a word w as a vector, the method used here associates the i-th component of the vector with one word w_i and sets the value of that component to the number of times w and w_i appear near each other in the source language corpus 12. Various other methods of representing words as vectors have already been proposed, and these may be used instead.
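The vector representation described above can be sketched as follows. This is only an illustration under stated assumptions: “near each other” is taken to mean within a fixed window of `window` tokens, vectors are stored sparsely as `Counter` objects, and every weight c_w defaults to 1.0 when no weight table is given. The names `cooccurrence_vectors`, `example_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=2):
    """Map each word w to a sparse vector whose component for w_i counts
    how often w_i occurs within `window` tokens of w (assumed neighborhood)."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[w][tokens[j]] += 1
    return vectors


def example_vector(example, vectors, weights=None):
    """Weighted sum of the word vectors over the set of words in the example."""
    total = Counter()
    for w in set(example):
        c_w = 1.0 if weights is None else weights.get(w, 1.0)
        for dim, count in vectors.get(w, {}).items():
            total[dim] += c_w * count
    return total


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note that the quantity the text calls a “distance” is a cosine, so larger values mean the two examples are more alike.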
[0043]
Note that various clustering algorithms for sets on which a distance between elements is defined are already well known, so their description is omitted here.
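The text does not prescribe a particular clustering algorithm. Purely as an illustration, the sketch below uses average-link agglomerative clustering, which is one of the well-known options and is an assumption of this example, not the patent's stated method. Because the “distance” here is a cosine, the sketch treats it as a similarity and repeatedly merges the two clusters with the highest average value until m clusters remain. The names `cosine` and `cluster_examples` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine between two sparse vectors given as dicts."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_examples(vectors, m, sim=cosine):
    """Average-link agglomerative clustering into m clusters.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > m:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                avg = sum(sim(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters
```

This version recomputes all pairwise averages on every merge, which is fine for a handful of examples but would need caching for a realistic corpus.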
[0044]
FIG. 4 shows an example of the example set DB 13 for a dictionary creation target word. FIG. 4(a) is a table that associates each example with its identification number, and FIG. 4(b) is a table in which the example identification numbers are grouped by subset (cluster).
[0045]
Next, the feature word extraction unit 14 is described. The feature word extraction unit 14 takes a set of examples and outputs a list of words that characterize that set. The best method of extracting feature words may vary with the text concerned. In the present embodiment, the character string formed by concatenating the examples in an example set is regarded as a single text, and feature words are extracted from it using an existing method for extracting keywords from text.
[0046]
Specifically, for each distinct word w in the example set, a score s(w) is calculated by the following formula, and the m highest-scoring words are extracted as feature words.
[0047]
[Expression 3]

[0048]
FIG. 5 shows an example of a list of feature words for the context character string C(i) in FIG. 4.
[0049]
FIG. 6 is a flowchart showing the operation of the word list translation unit 15. This process consists broadly of two blocks: a translation candidate generation part (block 1) and a disambiguation part (block 2).
[0050]
In the translation candidate generation part, by referring to the bilingual dictionary 16, each word in the word list given from the input device is replaced with a selectable translation word to generate a translation word candidate.
[0051]
Specifically, each word of the input word list is read into SL[i] in step 20, and in step 21 all selectable translations of each word are obtained from the bilingual dictionary 16 and assigned to the array dic. In the next step 22, a translation list is created by choosing, for each input word, one of the translations in dic. If multiple translations exist for the input words, as many translation lists are created as there are combinations.
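Steps 20 to 22 amount to taking the Cartesian product of the per-word translation sets. A minimal sketch, assuming the bilingual dictionary is available as a mapping from a source word to a list of translations; the function name `candidate_translation_lists` and the dictionary entries for “court” are hypothetical illustrations.

```python
from itertools import product


def candidate_translation_lists(word_list, bilingual_dict):
    """Build one translation list per combination of translations.

    word_list: the input word list (feature words followed by the target word).
    bilingual_dict: maps each source word to its selectable translations.
    """
    dic = [bilingual_dict[w] for w in word_list]      # step 21
    return [list(combo) for combo in product(*dic)]   # step 22


# Hypothetical dictionary entries; the target word "suit" comes last.
bilingual = {"court": ["裁判所", "コート"], "suit": ["スーツ", "裁判"]}
lists = candidate_translation_lists(["court", "suit"], bilingual)
```

The number of lists grows multiplicatively with the number of translations per word, so m (the number of feature words) has a direct effect on cost.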
[0052]
In the disambiguation part, the translation candidate list whose words are most closely related to one another in the target language is selected from among the translation candidate lists.
[0053]
Specifically, in step 23, a semantic relevance value rel (T) defined by the following equation is calculated for each translated word list T.
[0054]
[Expression 4]
\[ \mathrm{rel}(T) = \frac{1}{|T|} \sum_{t \in T} d\bigl(v(t), c(T)\bigr) \]
[0055]
Here, v(t) is the vector associated with the word t by equation (1), c(T) is the centroid of the vectors corresponding to the words in the word set T, and d is the distance between vectors defined by equation (2).
[0056]
Next, in step 24, the translation list with the maximum relevance value is selected, and finally, in step 25, the last word of the selected translation list, that is, the translation of the dictionary creation target word 30, is output.
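Steps 23 to 25 can be sketched as below, assuming target-language word vectors are already available as sparse dicts. The relevance of a list is taken as the mean cosine between each word vector and the centroid of the list, as the summary of the invention describes. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine between two sparse vectors given as dicts."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    dims = set().union(*vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in dims}


def relevance(translation_list, word_vectors):
    """Mean cosine between each word vector and the centroid of the list."""
    vs = [word_vectors[t] for t in translation_list]
    c = centroid(vs)
    return sum(cosine(v, c) for v in vs) / len(vs)


def select_translation(translation_lists, word_vectors):
    """Pick the most coherent list (step 24) and return its last word (step 25)."""
    best = max(translation_lists, key=lambda T: relevance(T, word_vectors))
    return best[-1]


# Toy target-language vectors (hypothetical).
toy_vectors = {
    "裁判所": {"law": 1.0}, "コート": {"sport": 1.0},
    "スーツ": {"clothes": 1.0}, "裁判": {"law": 1.0},
}
toy_lists = [["裁判所", "スーツ"], ["裁判所", "裁判"], ["コート", "スーツ"], ["コート", "裁判"]]
chosen = select_translation(toy_lists, toy_vectors)
```

In the toy data only the pair 「裁判所」 and 「裁判」 shares a dimension, so that list is the most coherent and its last word is returned.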
[0057]
The output unit 18 combines each example set in the example set DB 13 with its corresponding translation and outputs the result to the translation word selection dictionary 20. An example of the translation word selection dictionary finally generated is shown in FIG. 7.
[0058]
[Embodiment 2]
FIG. 8 shows an example of an embodiment of the automatic translation device of the present invention. In the figure, 41 is a source language word list, 42 is the translation word selection dictionary automatic creation device described in Embodiment 1, 43 is a translation word selection dictionary, and 44 is an example set search unit.
[0059]
The translation word selection dictionary automatic creation device 42 takes as input each word in the source language word list 41, which is a database collecting the words to be translated, automatically creates a translation word selection dictionary, that is, pairs of a translation and its corresponding example set, and outputs it to the translation word selection dictionary 43.
[0060]
The example set search unit 44 refers to the translation word selection dictionary 43 for the word W appearing in the given context, selects the most appropriate translation word of the word W in this context, and outputs it.
[0061]
The order and contents of the operation of each part are as follows.
[0062]
First, the translation word selection dictionary 43 is created by applying the translation word selection dictionary automatic creation device 42 to each word in the translation source language word list 41. This processing is performed only once for the given translation source language word list 41. Since the contents of the processing have already been described in Embodiment 1, they are omitted here.
[0063]
After the translation word selection dictionary 43 has been created, translation is performed using the example set search unit 44.
[0064]
FIG. 9 is a flowchart showing the operation of the example set search unit 44.
[0065]
First, arrays and the like are initialized in step 30, and the input word and its context character string are read in step 31. In the next step 32, all translations of the input word and the set of examples for each translation are read from the translation word selection dictionary 43. In step 33, among the example sets (dcset[i]) for the translations, the one with the highest relevance (d2) to the input context is selected. An example of this relevance calculation is described later. In the final step 34, the translation corresponding to the selected example set is output.
[0066]
Here, the relevance used in the example set search unit 44 will be described. In the present invention, “context” is a character string composed of a string of words and is equivalent to “example”, and therefore, “input context” will be referred to as “input example” below.
[0067]
Since an example is a sequence of words, the relevance between the input example and an example set taken from the translation word selection dictionary 43 is defined here by converting both into vectors and taking the cosine between them. The vector for the former is the example vector of equation (1) in Embodiment 1, used as is; the vector for the latter is the centroid (mean) of the vectors obtained by converting each context in the set with the same equation (1).
[0068]
That is, when the input example is denoted by cstr and the i-th set of examples (word strings) obtained from the translation word selection dictionary 43 is denoted by dcset [i], the relevance d2 between them is given by the following equation.
[0069]
[Equation 5]

[0070]
Here, d is the “distance” between vectors defined by expression (2) of the first embodiment, and n is the number of elements in dcset [i].
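Equation 5 itself is not reproduced in this text. Based only on the definitions in paragraphs [0067] and [0070], it can plausibly be reconstructed as below. The notation v(x) for the vector obtained from an example x by expression (1) is an assumption introduced here, and the original typography may differ.

```latex
d_2(\mathit{cstr}, \mathit{dcset}[i])
  = d\!\left( v(\mathit{cstr}),\; \frac{1}{n} \sum_{c \,\in\, \mathit{dcset}[i]} v(c) \right)
```

The second argument of d is the center of gravity of the n example vectors in dcset [i].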
[0071]
[Effect of the invention]
As described above, according to the translation word selection dictionary automatic creation device of the present invention, the set of examples of a word W is automatically classified according to the meanings of the word W; for each resulting classification, a set of words characterizing that classification is extracted from the examples in it; and, from the possible translations of the word W, the translation most relevant to the translations of the extracted set of words is selected. Therefore, without using a large amount of bilingual text or a set of examples to which optimal translations have been assigned, a translation word selection dictionary that associates the most appropriate translation with each source language example can be created using only the set of examples of the word W and information indicating the relationships between words in the target language.
[0072]
Further, according to the automatic translation device of the present invention, the translation most suitable for the input example can be automatically selected from the translations in the translation word selection dictionary created by the above device.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of an embodiment of an automatic translation dictionary creation apparatus of the present invention.
FIG. 2 is a flowchart showing the operation of a control unit.
FIG. 3 is a flowchart showing the operation of an example set generation unit.
FIG. 4 is a diagram illustrating an example of an example set DB.
FIG. 5 is a diagram illustrating an example of a list of feature words.
FIG. 6 is a flowchart illustrating an operation of a word list translation unit.
FIG. 7 is a diagram illustrating an example of a translation word selection dictionary.
FIG. 8 is a block diagram showing an example of an embodiment of an automatic translation apparatus according to the present invention.
FIG. 9 is a flowchart showing the operation of an example set search unit.
11: Example set generation unit, 12: Source language corpus, 13: Example set database (DB), 14: Feature word extraction unit, 15: Word list translation unit, 16: Bilingual dictionary, 17: Destination language corpus, 18: Output unit, 19: Control unit, 20: Translation word selection dictionary, 30: Dictionary creation target word, 41: Translation source language word list, 42: Translation word selection dictionary automatic creation device, 43: Translation word selection dictionary, 44: Example set search unit.

Claims

In a device that classifies examples of an arbitrary word W based on the usage of the word W and automatically creates a translation word selection dictionary in which each classification is associated with the translation of the word W that best fits it,
an example set generation unit that detects all words matching the word W in a source language corpus storing a large amount of text in the translation source language, writes to a working memory examples each being a character string consisting of the n words before and after the detected word, represents each word in an example by a vector, represents each example by the weighted sum vector of the vectors of the words it contains, defines the distance between any two examples as the cosine value of the angle between the weighted sum vectors corresponding to those two examples, divides the examples written to the working memory into a plurality of subsets by clustering, and writes the subsets to an example set DB;
a feature word extraction unit that, for each distinct word contained in the examples constituting a subset, calculates a score defined so as to be higher the more often the word appears in the examples constituting that subset and the fewer the examples that contain the word, extracts the m words with the highest scores as the set of words characterizing the subset, and performs this for each subset written to the example set DB; and
a word list translation unit that, for each subset obtained by the example set generation unit, generates an input word list whose 1st to m-th elements are the m feature words extracted by the feature word extraction unit and whose (m+1)-th element is the dictionary creation target word W, obtains from a bilingual dictionary the translations of each word constituting each input word list, creates as many translation lists as there are combinations of the translations of the constituent words, obtains a vector for each word in each translation list using a target language corpus storing a large amount of text in the translation target language, obtains the average vector of the words in the translation list, obtains the average of the cosine values between each word vector and the average vector as the relevance value of the translation list, selects the translation list with the maximum relevance, and outputs the last word in that translation list as the translation of the word W for the subset; a translation word selection dictionary automatic creation device characterized by comprising these units.

In a device that, given an input example consisting of a word W and the character string of its context, selects the translation of the word W that best fits the input example,
an automatic translation device characterized by comprising an example set search unit that reads, from the translation word selection dictionary created by the translation word selection dictionary automatic creation device according to claim 1, all translations of the word W and the subset of examples for each translation, obtains as the relevance the cosine value between the vector of the input example and the average of the vectors of the examples constituting each subset, selects the subset with the maximum relevance, and outputs the translation corresponding to that subset.