JP6535607B2

JP6535607B2 - Preprocessing model learning device, method and program

Info

Publication number: JP6535607B2
Application number: JP2016008294A
Authority: JP
Inventors: いつみ斉藤; 九月貞光; 久子浅野; 義博松尾
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2016-01-19
Filing date: 2016-01-19
Publication date: 2019-06-26
Anticipated expiration: 2036-01-19
Also published as: JP2017129995A

Description

The present invention relates to a preprocessing model learning device, method, and program, and more particularly to a preprocessing model learning device, method, and program for rewriting sentences for language processing.

In the prior art, several methods that perform rewriting based on rewriting rules have been proposed (see Non-Patent Document 1 and Non-Patent Document 2).

Techniques for learning a model using text of a specific domain are also known. In model learning for translation, for example, the specific domain text is the Japanese-side target corpus used when training the translation model. It may also be a Japanese corpus used as the model training corpus when building a system for syntactic parsing or information extraction or, for processing that converts text into the notation of a specific domain (for example, newspaper style or colloquial style), text of that specific domain.

Non-Patent Document 1: 吉見毅彦，佐田いち子，福持陽士，"頑健な英日機械翻訳システム実現のための原文自動前編集"，自然言語処理，2000 (Takehiko Yoshimi, Ichiko Sata, Yoji Fukumochi, "Automatic Pre-editing of Source Text for a Robust English-to-Japanese Machine Translation System", Journal of Natural Language Processing, 2000)

Non-Patent Document 2: 坂本明子，田中浩之，"話し言葉機械翻訳のための日本語前編集"，言語処理学会第21回年次大会，2015 (Akiko Sakamoto, Hiroyuki Tanaka, "Japanese Pre-editing for Spoken Language Machine Translation", The 21st Annual Meeting of the Association for Natural Language Processing, 2015)

However, in language processing such as translation, processing sometimes fails because the linguistic expressions of the input sentence do not match those of the target corpus used as the model training corpus.

For example, in translation, when the input sentence 「これおいしーい」 (a colloquially lengthened form of "this is delicious") is translated, the result is "This Oishi-I": 「おいしーい」 cannot be analyzed correctly and the translation is wrong. If the input sentence is instead rewritten to 「これおいしい」 and then translated, the result is "It tastes great": 「おいしい」 is analyzed correctly and the translation makes sense.

The present invention has been made to solve the above problem, and its object is to provide a preprocessing model learning device, method, and program capable of learning a preprocessing model for performing highly accurate language processing.

To achieve the above object, the preprocessing model learning device according to the first invention includes: an evaluation selection unit that, for each of the rewrite candidates generated for an input character string, evaluates the likelihood of the result of performing predetermined language processing on the rewrite candidate, based on that result, and selects the pair of the character string and the highest-rated rewrite candidate as a correct pair; and a preprocessing model learning unit that learns a preprocessing model for selecting a rewrite candidate in preprocessing for the language processing, based on the correct pair selected by the evaluation selection unit and on features extracted from each of the rewrite candidates.

In the preprocessing model learning device according to the first invention, the preprocessing model learning unit may generate a lattice, which is a graph structure consisting of nodes corresponding to the substrings that represent the character string and each of the rewrite candidates, and edges connecting the nodes of substrings to be concatenated, and may learn the preprocessing model so that, among the paths of the lattice, the character string represented by the path with the maximum score calculated using the preprocessing model is the rewrite candidate of the correct pair.

The preprocessing model learning method according to the second invention includes: a step in which an evaluation selection unit, for each of the rewrite candidates generated for an input character string, evaluates the likelihood of the result of performing predetermined language processing on the rewrite candidate, based on that result, and selects the pair of the character string and the highest-rated rewrite candidate as a correct pair; and a step in which a preprocessing model learning unit learns a preprocessing model for selecting a rewrite candidate in preprocessing for the language processing, based on the correct pair selected by the evaluation selection unit and on features extracted from each of the rewrite candidates.

In the preprocessing model learning method according to the second invention, the step of learning by the preprocessing model learning unit may generate a lattice, which is a graph structure consisting of nodes corresponding to the substrings that represent the character string and each of the rewrite candidates, and edges connecting the nodes of substrings to be concatenated, and may learn the preprocessing model so that, among the paths of the lattice, the character string represented by the path with the maximum score calculated using the preprocessing model is the rewrite candidate of the correct pair.

The program according to the third invention is a program for causing a computer to function as each unit of the preprocessing model learning device according to the first invention.

According to the preprocessing model learning device, method, and program of the present invention, for each rewrite candidate, the likelihood of the result of performing language processing on the rewrite candidate is evaluated based on the result of performing predetermined language processing on the rewrite candidate and on predetermined correct-answer data; the pair of the character string and the highest-rated rewrite candidate is selected as a correct pair; and a preprocessing model for selecting a rewrite candidate in preprocessing for the language processing is learned based on the selected correct pair and on the language model score of each substring in a language model created from the target corpus used in training for the language processing. This yields the effect that a preprocessing model for performing highly accurate language processing can be learned.

FIG. 1: A diagram showing an example of a language model built from the target corpus.
FIG. 2: A block diagram showing the configuration of the preprocessing model learning device according to the embodiment of the present invention.
FIG. 3: A diagram showing an example of the headword set of the lexeme 「御早う」.
FIG. 4: A diagram showing an example of the rewrite candidates with the highest semantic similarity to 「はらへったー」, together with their semantic similarity values.
FIG. 5: A diagram showing an example of the rewrite candidate table.
FIG. 6: A diagram showing an example of lattice generation.
FIG. 7: A diagram showing an example of feature score calculation.
FIG. 8: A flowchart showing the rewrite candidate table creation routine in the preprocessing model learning device according to the embodiment of the present invention.
FIG. 9: A flowchart showing the preprocessing model learning routine in the preprocessing model learning device according to the embodiment of the present invention.
FIG. 10: A flowchart showing the lattice generation routine in the preprocessing model learning device according to the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Overview of the Embodiment of the Present Invention>

First, an overview of the embodiment of the present invention will be given.

The technique proposed in this embodiment acquires multiple notation variants and paraphrased expressions in advance as rewrite candidates and expands them into a single lattice. Using the scores of a language model created from the target corpus, which is the training corpus of target sentences, together with the semantic similarity scores of the rewrite candidates, it outputs the top N rewrite candidates suited to the language processing. For the target corpus, text that is the target of rewriting is prepared and morphologically analyzed. A language model combining notation, part of speech, and score is also assumed to have been created in advance from the target corpus. FIG. 1 shows an example of a language model built from the target corpus. The optimal rewrite candidate is the one closest to the linguistic expressions of the target corpus. The top N rewrite candidates are then translated, and the preprocessing model for selecting a rewrite candidate in preprocessing is learned using the correct pair of the pre-rewrite character string and a rewrite candidate, determined using correct-answer data.

This embodiment describes the case of learning a preprocessing model for machine translation, but the invention is not limited to this and can be applied to any language processing, such as syntactic parsing and automatic summarization.

<Configuration of the Preprocessing Model Learning Device According to the Embodiment of the Present Invention>

Next, the configuration of the preprocessing model learning device according to the embodiment of the present invention will be described. As shown in FIG. 2, the preprocessing model learning device 100 can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and the programs for executing the rewrite candidate table creation routine and the preprocessing model learning routine described later. Functionally, the preprocessing model learning device 100 includes an input unit 10, an arithmetic unit 20, and an output unit 90, as shown in FIG. 2.

The input unit 10 accepts, as text sets, Japanese dictionaries such as UniDic (https://osdn.jp/projects/unidic/) and JUMAN (http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN), and large-scale text collected from social networking services such as Twitter (R), and outputs them to the rewrite candidate acquisition unit 22. The input unit 10 also accepts the character string to be rewritten.

The arithmetic unit 20 includes a rewrite candidate acquisition unit 22, a preprocessing unit 24, a language model 26, a main processing model 28, a main processing unit 60, an evaluation selection unit 62, and a preprocessing model learning unit 64.

The rewrite candidate acquisition unit 22 creates, from the dictionaries and large-scale text accepted by the input unit 10, a rewrite candidate table that stores rewrite candidates for each input notation.

The rewrite candidate acquisition unit 22 includes a dictionary candidate acquisition unit 30, a synonymous phrase acquisition unit 32, a synonymous predicate acquisition unit 34, and a similarity setting unit 36. This embodiment describes, as an example, the case where rewrite candidates are acquired by each of the dictionary candidate acquisition unit 30, the synonymous phrase acquisition unit 32, and the synonymous predicate acquisition unit 34, but the invention is not limited to this. Rewrite candidates may also be acquired by other means, for example manually created rewrite candidates or rewrite candidates based on similarity of reading.

The dictionary candidate acquisition unit 30 uses the dictionaries accepted by the input unit 10 to acquire rewrite candidates at multiple levels for each input notation. Specifically, using dictionary lexemes (from UniDic), representative notations (from JUMAN), and the like as keys, it defines each set of headwords registered in the dictionary with the same lexeme or the same representative notation as a rewrite candidate group. FIG. 3 shows an example of the headword set of the lexeme 「御早う」 in UniDic. For each rewrite candidate group, each dictionary headword is treated as an input notation, and every other headword in the same group is treated as a rewrite candidate for it. The rewrite candidates obtained by the dictionary candidate acquisition unit 30 make it possible to absorb word-level notation variation.
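The grouping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_dictionary_candidates`, the `(surface, lexeme)` input format, and the sample entries are all assumptions made for the example, and no real UniDic or JUMAN API is used.

```python
from collections import defaultdict

def build_dictionary_candidates(entries):
    """Map each headword to the other headwords that share its lexeme.

    entries: iterable of (surface, lexeme) pairs (hypothetical format).
    """
    # Group headwords by lexeme: each group is one rewrite candidate group.
    groups = defaultdict(set)
    for surface, lexeme in entries:
        groups[lexeme].add(surface)

    # Within a group, every other headword is a rewrite candidate.
    candidates = defaultdict(set)
    for surfaces in groups.values():
        for surface in surfaces:
            candidates[surface] |= surfaces - {surface}
    return {surface: sorted(others) for surface, others in candidates.items()}

# Illustrative entries only; not taken from an actual dictionary dump.
entries = [
    ("おはよう", "御早う"),
    ("お早う", "御早う"),
    ("おはよー", "御早う"),
    ("見る", "見る"),
]
table = build_dictionary_candidates(entries)
print(table["おはよう"])
```

A headword that is alone in its group simply gets an empty candidate list.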

The synonymous phrase acquisition unit 32 uses the large-scale text accepted by the input unit 10 to acquire word-level and phrase-level character string pairs, selected by semantic similarity, as rewrite candidates for an input notation. Specifically, short texts on Twitter (R) (up to n characters) are split at punctuation marks, symbols, and the like; each resulting chunk of 10 characters or fewer is registered in a dictionary; and a semantic vector is obtained for each character string by analysis. word2vec (see Reference 1) or a similar method may be used to compute the semantic vectors. Then, based on the semantic vector of each character string, the semantic similarity of each pair of character strings is estimated using cosine similarity or the like, and rewrite candidates for the input notation are acquired.

Reference 1: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

Here, pairs of character strings whose semantic similarity is within a predetermined threshold are defined as rewrite candidates for the input notation. FIG. 4 shows an example of the rewrite candidates with the highest semantic similarity to the character string 「はらへったー」, together with their semantic similarity values. The rewrite candidates obtained by the synonymous phrase acquisition unit 32 allow rewriting to candidates that are semantically similar even when their notation is not, which widens the semantic space of the rewrite candidates.
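As a rough sketch of the pairing step, the following computes cosine similarity between semantic vectors and keeps the pairs that pass a threshold. The two-dimensional vectors, the threshold value 0.7, and the function names are assumptions for illustration; real vectors would come from a model such as word2vec, and the sketch reads "within the threshold" as "similarity at or above the threshold", which the text does not state explicitly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_pairs(vectors, threshold):
    """Return (a, b, similarity) for every pair at or above the threshold."""
    keys = sorted(vectors)
    pairs = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            sim = cosine_similarity(vectors[a], vectors[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs

# Toy vectors, made up for the example.
vectors = {
    "はらへったー": [1.0, 0.0],
    "おなかすいた": [0.8, 0.6],
    "テレビみた": [0.0, 1.0],
}
pairs = similar_pairs(vectors, threshold=0.7)
for a, b, sim in pairs:
    print(a, b, round(sim, 3))
```

Comparing all pairs is quadratic in the number of strings; a large collection would need an approximate nearest-neighbor search instead.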

The synonymous predicate acquisition unit 34 uses the large-scale text accepted by the input unit 10 to acquire, for an input notation that is a predicate, rewrite candidates in which the function words of the predicate have been rewritten. Specifically, predicate normalization analysis (see Reference 2) is applied to the large-scale text, and candidates with the same semantic label and the same predicate are defined as rewrite candidates.

Reference 2: 泉朋子，今村賢治，菊井玄一郎，藤田篤，佐藤理史，"正規化を指向した機能動詞表現の述部言い換え"，第15回言語処理学会年次大会，2009 (Tomoko Izumi, Kenji Imamura, Genichiro Kikui, Atsushi Fujita, Satoshi Sato, "Predicate Paraphrasing of Functional Verb Expressions Aimed at Normalization", The 15th Annual Meeting of the Association for Natural Language Processing, 2009)

For example, for 「みる＋完了」 (the verb "see" plus the completion label), rewrite candidates that share the same function-word semantic label and the same predicate include 「みちゃった」, 「みた」, 「みたよ」, and 「みちゃいました」.

In Japanese, the function words of predicates are especially redundant and their expressions are diverse. The rewrite candidates acquired by the synonymous predicate acquisition unit 34 make these diverse predicate function words available as rewrite candidates. In addition, fine distinctions between function words often cannot be captured by methods such as semantic similarity, so using a model specialized for predicate function words makes it possible to obtain a variety of predicate rewrites that are reliably identical in meaning.

The similarity setting unit 36 assigns a semantic similarity to the rewrite candidates acquired by each of the dictionary candidate acquisition unit 30, the synonymous phrase acquisition unit 32, and the synonymous predicate acquisition unit 34, and stores in the rewrite candidate DB 38 a rewrite table made up of combinations of an input notation, a rewrite candidate, and the semantic similarity of the rewrite candidate to the input notation. Rewrite candidates acquired by the dictionary candidate acquisition unit 30 and the synonymous predicate acquisition unit 34 are assigned a semantic similarity of 1. Rewrite candidates acquired by the synonymous phrase acquisition unit 32 are assigned the semantic similarity calculated from the semantic vectors. The similarity is set this way because the dictionary candidate acquisition unit 30 and the synonymous predicate acquisition unit 34 acquire only reliable rewrite candidates whose synonymy has already been checked manually, whereas the candidates acquired automatically by the synonymous phrase acquisition unit 32 through the semantic similarity function are not necessarily reliable. Using the semantic similarity itself as a feature reflects the degree of similarity in meaning.
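The assignment rule above can be sketched as a small table builder. The record layout, the source labels `"dictionary"`, `"phrase"`, and `"predicate"`, and the sample rows are assumptions introduced for the example rather than anything specified in the text.

```python
def assign_similarity(candidates):
    """Build rewrite table rows of (input, rewrite, similarity, source).

    Dictionary and predicate candidates are treated as manually verified
    and get similarity 1.0; phrase candidates keep their vector-based value.
    """
    table = []
    for cand in candidates:
        if cand["source"] in ("dictionary", "predicate"):
            similarity = 1.0
        else:
            similarity = cand["vector_similarity"]
        table.append((cand["input"], cand["rewrite"], similarity, cand["source"]))
    return table

# Illustrative records; the similarity value 0.6 is made up.
candidates = [
    {"input": "おはよー", "rewrite": "おはよう", "source": "dictionary"},
    {"input": "はらへったー", "rewrite": "おなかすいた", "source": "phrase",
     "vector_similarity": 0.6},
    {"input": "みちゃった", "rewrite": "みた", "source": "predicate"},
]
rewrite_table = assign_similarity(candidates)
for row in rewrite_table:
    print(row)
```

Keeping the source label in each row mirrors the statement that the table also records where each candidate came from.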

The rewrite candidate DB 38 stores the rewrite candidate table created by the rewrite candidate acquisition unit 22. FIG. 5 shows an example of the rewrite candidate table. Predetermined rules can also be added, as with rule 1 in FIG. 5. In this embodiment, the rewrite candidate table additionally stores the origin from which each rewrite candidate was acquired.

In the rewrite candidate table above, matching uses the notation and part of speech as the key, but the part of speech may be omitted. For simplicity, the examples in the following description omit the part of speech.

The preprocessing unit 24 includes a lattice generation unit 40 and an N-best solution generation unit 50.

The language model 26 is a model created over combinations of notation and part of speech using the specific domain text that is the destination of rewriting (in this embodiment, the target corpus used in training for translation). The language model score of the language model 26 is a score representing how likely a notation is in the target corpus.

Through the processing of the units described below, the lattice generation unit 40 looks up the character string accepted by the input unit 10 in the rewrite candidate table created by the rewrite candidate acquisition unit 22, and generates a lattice: a graph structure consisting of nodes corresponding to substrings, including rewrite candidates, and edges connecting the nodes of substrings to be concatenated.

The lattice generation unit 40 includes a morphological analysis unit 42, a rewrite candidate table reference unit 44, and a rewrite candidate lattice generation unit 46.

The morphological analysis unit 42 morphologically analyzes the character string accepted by the input unit 10 and outputs the resulting input morphemes to the rewrite candidate table reference unit 44.

For each input morpheme obtained by the morphological analysis unit 42, the rewrite candidate table reference unit 44 looks up the rewrite candidate table in the rewrite candidate DB 38, using the input morpheme as the lookup key for the input notation, and obtains a set of rewrite candidates.

The rewrite candidate lattice generation unit 46 uses the rewrite candidate sets obtained by the rewrite candidate table reference unit 44 to expand the rewrite candidates for each input morpheme and generate the lattice.

Specifically, as shown in FIG. 6, the rewrite candidates that match each input morpheme are listed in order from the left. Each input morpheme and each rewrite candidate then becomes a node, and the graph structure in which consecutive nodes are connected by edges is generated as the lattice.
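A simplified sketch of this expansion is shown below. It assumes that every rewrite candidate replaces exactly one input morpheme, so the lattice is a sequence of columns with edges between adjacent columns; the text does not say the lattice is restricted this way. The function name, node fields, and sample table are hypothetical.

```python
def build_lattice(morphemes, rewrite_table):
    """Return (nodes, edges) for a column-structured rewrite lattice.

    morphemes: list of input morpheme surfaces, left to right.
    rewrite_table: {input surface: [(rewrite surface, similarity), ...]}.
    """
    nodes = []
    columns = []
    for position, surface in enumerate(morphemes):
        # The input morpheme itself is always a node (similarity 1.0).
        column = [len(nodes)]
        nodes.append({"pos": position, "surface": surface,
                      "similarity": 1.0, "rewritten": False})
        # Each matching rewrite candidate becomes a parallel node.
        for rewrite, similarity in rewrite_table.get(surface, []):
            column.append(len(nodes))
            nodes.append({"pos": position, "surface": rewrite,
                          "similarity": similarity, "rewritten": True})
        columns.append(column)

    # Connect every node to every node in the next column.
    edges = [(a, b)
             for left, right in zip(columns, columns[1:])
             for a in left for b in right]
    return nodes, edges

# Illustrative input; the similarity values are made up.
morphemes = ["おはよー", "はらへったー"]
rewrite_table = {
    "おはよー": [("おはよう", 1.0)],
    "はらへったー": [("おなかすいた", 0.6), ("腹減った", 0.5)],
}
nodes, edges = build_lattice(morphemes, rewrite_table)
print(len(nodes), len(edges))
```

Rewrites that span several morphemes would need edges keyed by character offsets instead of column indices.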

Through the following processing, the N-best solution generation unit 50 generates, as the rewrite candidates for the input character string, the character strings represented by the N highest-ranked paths among the paths formed by the lattice edges, based on the lattice generated by the lattice generation unit 40, the semantic similarity of the rewrite candidate corresponding to each node in the lattice, and the language model score of each substring in the language model 26. The score of a path is obtained from the semantic similarities and language model scores of the rewrite candidates corresponding to the substrings of the nodes on the path.

The N-best solution generation unit 50 first calculates, for each path in the generated lattice, the score of each feature of the path (language model score, semantic similarity, and rewrite flag), based on the lattice generated by the lattice generation unit 40, the semantic similarity of the rewrite candidate corresponding to each node in the lattice, and the language model score of each substring in the language model 26. Then, based on the calculated feature scores, it computes, for example by dynamic programming, the N paths in the lattice that rank highest by the total score given below.

Here, α, β, and γ are the weights of the features: language model score, semantic similarity, and rewrite flag. The values of α and β are determined experimentally in advance from data. The rewrite flag is a variable that takes the value 1 for a rewritten node and 0 for any other node. The total score is obtained by summing the scores of the features and is calculated by equation (1) below. The values of α, β, and γ may instead be taken from a preprocessing model learned in advance by the preprocessing model learning unit 64 described later.

$$\text{Total score} = \sum_{i} \Bigl( \alpha \times \text{language model score}_i + \beta \times \bigl(-\log(\text{semantic similarity}_i)\bigr) + \gamma \times \text{rewrite flag}_i \Bigr) \qquad (1)$$

Here, i denotes the substring of a node on the path. For example, as shown in FIG. 7, the language model term of each node's score is calculated by multiplying α by the language model score corresponding to the node's substring.

For example, if the path corresponding to the input character string is 「おっはよう/はら/へっ/た/ー」, its total score is 1.2+2.3+2.1+1.9+1.7+5.7 = 14.9. If path 1, which contains rewrite candidates, is 「おはよう/おなか/すい/た/」, its total score is 1.0+2.1+1.9+1.8+3.8+0.5*(-log(1))+0.5*(-log(0.6)) = 10.86, a figure consistent with log being the natural logarithm. Paths containing other rewrite candidates are calculated in the same way. Among the paths in this example, path 1 has the smallest score, so it is the top-ranked path.
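The arithmetic of this worked example can be checked with a short sketch of equation (1). Several things here are assumptions: the printed per-node values are taken to be the language model terms with α = 1.0, β = 0.5 is read off the example, γ = 0 because no rewrite flag term appears in it, and log is taken as the natural logarithm. How the printed terms map onto individual nodes is not stated, so the nodes carry no surface labels.

```python
import math

def total_score(path, alpha, beta, gamma):
    """Sum of per-node feature scores along one lattice path, as in equation (1)."""
    return sum(
        alpha * node["lm"]
        + beta * (-math.log(node["similarity"]))
        + gamma * (1 if node["rewritten"] else 0)
        for node in path
    )

def make_path(lm_scores, similarities, rewritten_flags):
    """Bundle parallel lists into a list of node dicts."""
    return [
        {"lm": lm, "similarity": sim, "rewritten": flag}
        for lm, sim, flag in zip(lm_scores, similarities, rewritten_flags)
    ]

# Path for the unmodified input: six language model terms, no rewrites.
input_path = make_path([1.2, 2.3, 2.1, 1.9, 1.7, 5.7], [1.0] * 6, [False] * 6)

# Path 1: five language model terms, two rewritten nodes with
# similarities 1 and 0.6 (their position among the nodes is assumed).
path_1 = make_path(
    [1.0, 2.1, 1.9, 1.8, 3.8],
    [1.0, 0.6, 1.0, 1.0, 1.0],
    [True, True, False, False, False],
)

score_input = total_score(input_path, alpha=1.0, beta=0.5, gamma=0.0)
score_path_1 = total_score(path_1, alpha=1.0, beta=0.5, gamma=0.0)
print(round(score_input, 2), round(score_path_1, 2))
```

Because a lower total is better in this example, the scores behave as costs, and a node with similarity 1 adds nothing through the β term.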

When several paths have the same total score, the path is selected according to priority. For example, the input notation may be given the highest priority, with the next priority determined by character code order or the like.
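The ranking and tie-breaking rules can be sketched as follows. Note that this sketch enumerates every path exhaustively rather than using the dynamic programming mentioned in the text, which is only practical for tiny lattices. It also assumes a column-structured lattice, per-node costs that are already combined, and lower-is-better ranking as in the worked example. The function name and sample costs are hypothetical.

```python
from itertools import product

def n_best(columns, input_text, n):
    """Return the n lowest-cost paths as (text, score) pairs.

    columns: list of columns, each a list of (surface, cost) alternatives.
    Ties are broken by preferring the unmodified input, then by
    character code order of the resulting string.
    """
    scored = []
    for combo in product(*columns):
        text = "".join(surface for surface, _ in combo)
        score = sum(cost for _, cost in combo)
        # False sorts before True, so the input text wins a tie.
        scored.append((score, text != input_text, text))
    scored.sort()
    return [(text, score) for score, _, text in scored[:n]]

# Illustrative costs, chosen so that ties occur.
columns = [
    [("おはよー", 2.0), ("おはよう", 1.0)],
    [("はらへった", 3.0), ("腹減った", 3.0), ("おなかすいた", 4.0)],
]
result = n_best(columns, input_text="おはよーはらへった", n=3)
for text, score in result:
    print(text, score)
```

In this toy case three paths share the cost 5.0, and the input string is the one that makes it into the top three.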

The main processing model 28 is a pre-trained translation model for translating a sentence in the source language into a sentence in the target language. The translation model is assumed to have been trained with English as the source language and Japanese as the target language. An external translation system or the like may be used as the translation model. The main processing model 28 is not limited to a translation model; depending on what the preprocessing model is being learned for, it can be any language processing model, such as one for syntactic parsing or automatic summarization.

本処理部６０は、前処理部２４のＮｂｅｓｔ解生成部５０によって生成された書き換え候補の各々に対し、本処理モデル２８を用いて翻訳処理を行い、翻訳結果を出力する。例えば、書き換え前の文字列が「おはよーはらへったー」であれば、翻訳結果は「Whoa Hayo belly heh was over」であるが、生成された書き換え候補は次のように翻訳される。書き換え候補「おはようおなかすいた」であれば、翻訳結果は「Good morning hungry」となる。書き換え候補「おはようお腹減った」であれば、翻訳結果は「Good morning I was reduced stomach」となる。書き換え候補「おはよう腹減った」であれば、翻訳結果は「Good morning I was hungry」となる。書き換え候補「お早うおなかすいた」であれば、翻訳結果は「Good morning hungry」となる。また、書き換え前の文字列が「テレビみちゃった」であれば、翻訳結果は「I chat Terebimi」となるが、生成された書き換え候補は次のように翻訳される。書き換え候補「テレビみた」であれば、翻訳結果は「I saw TV」となる。「テレビみちゃいました」であれば、翻訳結果は「It was chai Terebimi」となる。 The main processing unit 60 performs translation processing, using the main processing model 28, on each of the rewrite candidates generated by the Nbest solution generation unit 50 of the preprocessing unit 24, and outputs the translation results. For example, if the character string before rewriting is 「おはよーはらへったー」, its translation result is "Whoa Hayo belly heh was over", whereas the generated rewrite candidates are translated as follows. For the rewrite candidate 「おはようおなかすいた」, the translation result is "Good morning hungry". For the rewrite candidate 「おはようお腹減った」, the translation result is "Good morning I was reduced stomach". For the rewrite candidate 「おはよう腹減った」, the translation result is "Good morning I was hungry". For the rewrite candidate 「お早うおなかすいた」, the translation result is "Good morning hungry". Likewise, if the character string before rewriting is 「テレビみちゃった」, its translation result is "I chat Terebimi", whereas the generated rewrite candidates are translated as follows. For the rewrite candidate 「テレビみた」, the translation result is "I saw TV". For 「テレビみちゃいました」, the translation result is "It was chai Terebimi".

評価選定部６２は、Ｎｂｅｓｔ解生成部５０で生成された書き換え候補の各々について、本処理部６０で書き換え候補に対して予め定められた翻訳処理を行った結果に基づいて、当該書き換え候補に対して翻訳処理を行った結果の尤もらしさを評価し、入力部１０で受け付けた文字列と、最も評価が高い書き換え候補とのペアを、正解ペアとして選定する。 For each of the rewrite candidates generated by the Nbest solution generation unit 50, the evaluation selection unit 62 evaluates the likelihood of the result of translating that rewrite candidate, based on the result of the predetermined translation processing performed on it by the main processing unit 60, and selects the pair of the character string received by the input unit 10 and the rewrite candidate with the highest evaluation as the correct pair.

ここで、翻訳処理の尤もらしさの評価には、入力部１０で受け付けた文字列に対して翻訳処理を行ったときの予め定めた翻訳正解データと、翻訳処理の自動評価システムであるＲＩＢＩＥＳとを用いて、翻訳の評価スコアを付与する。元の書き換え前の文字列の翻訳結果と、書き換え候補の翻訳結果とに対する自動評価における評価スコアを算出し、書き換え前の文字列、及び書き換え候補をランキングする。自動評価における評価スコアが最も高かった書き換え候補を書き換えの正解文とする。元の書き換え前の文字列が最も評価値が高かった場合は、書き換えない文が正解データとなる。なお、ＲＩＢＩＥＳ以外にもＢＬＵＥ等、他の自動評価システムを用いてもよい。 Here, the likelihood of the translation processing is evaluated by assigning a translation evaluation score using predetermined correct translation data for the character string received by the input unit 10 and RIBES, an automatic evaluation system for translation. Evaluation scores under automatic evaluation are calculated for the translation result of the original character string before rewriting and for the translation result of each rewrite candidate, and the character string before rewriting and the rewrite candidates are ranked. The rewrite candidate with the highest evaluation score is taken as the correct rewritten sentence. If the original character string before rewriting has the highest evaluation value, the sentence without rewriting becomes the correct data. Automatic evaluation systems other than RIBES, such as BLEU, may also be used.

翻訳処理の尤もらしさの評価スコアの例を以下に示す。自動評価における評価スコアは以下のようになる。ここで、正解データとなる参照訳は「Good morning. I am hungry」である。 An example of the evaluation score of the likelihood of translation processing is shown below. The evaluation score in the automatic evaluation is as follows. Here, the reference translation which becomes correct data is "Good morning. I am hungry".

書き換え候補「おはよう腹減った」の翻訳結果「Good morning I was hungry」については、評価スコアは４．９である。書き換え候補「おはようおなかすいた」の翻訳結果「Good morning hungry」については、評価スコアは４．１である。書き換え候補「お早うおなかすいた」の翻訳結果「Good morning hungry」については、評価スコアは４．１である。書き換え候補「おはようお腹減った」の翻訳結果「Good morning I was reduced stomach」については、評価スコアは４．０である。また、元の書き換え前の文字列「おっはようはらへったー」の翻訳結果「Whoa Hayo belly heh was over」については、評価スコアは３．３である。 For the translation result "Good morning I was hungry" of the rewrite candidate 「おはよう腹減った」, the evaluation score is 4.9. For the translation result "Good morning hungry" of the rewrite candidate 「おはようおなかすいた」, the evaluation score is 4.1. For the translation result "Good morning hungry" of the rewrite candidate 「お早うおなかすいた」, the evaluation score is 4.1. For the translation result "Good morning I was reduced stomach" of the rewrite candidate 「おはようお腹減った」, the evaluation score is 4.0. For the translation result "Whoa Hayo belly heh was over" of the original character string before rewriting, 「おっはようはらへったー」, the evaluation score is 3.3.

上記の評価スコアをランキングし、１位の書き換え候補「おはよう腹減った」を、書き換え正解データ（正例）と決定し、元の書き換え前の文字列「おっはようはらへったー」とのペアを正解ペアとする。 The above evaluation scores are ranked, the top-ranked rewrite candidate 「おはよう腹減った」 is determined to be the correct rewrite data (positive example), and it is paired with the original character string before rewriting, 「おっはようはらへったー」, to form the correct pair.
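The ranking and selection of the correct pair can be sketched as follows, using the evaluation scores listed above. The function name `select_correct_pair` and the data layout are assumptions for illustration; the automatic evaluation itself is taken as given.

```python
def select_correct_pair(source, scored):
    """Return (source, best_text), where best_text has the highest evaluation score.

    scored: list of (text, evaluation_score) and includes the source itself,
    so the unmodified sentence is returned when it scores highest.
    """
    best_text, _ = max(scored, key=lambda item: item[1])
    return (source, best_text)

source = "おっはようはらへったー"
scored = [
    ("おはよう腹減った", 4.9),
    ("おはようおなかすいた", 4.1),
    ("お早うおなかすいた", 4.1),
    ("おはようお腹減った", 4.0),
    (source, 3.3),
]
correct_pair = select_correct_pair(source, scored)
```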

前処理モデル学習部６４は、評価選定部によって選定された正解ペアと、入力部１０で受け付けた文字列、及び書き換え候補の各々から抽出された素性とに基づいて、言語処理の前処理における書き換え候補を選択するための前処理モデルを学習する。ここで、前処理モデル学習部６４は、文字列と、書き換え候補の各々とを表す各部分文字列に対応するノード及び連結される部分文字列に対応するノードを結んだエッジからなるグラフ構造であるラティスの各経路のうち、前処理モデルを用いて計算したスコアが最大となる経路が表す文字列が、正解ペアの書き換え候補となるように前処理モデルを学習する。 The preprocessing model learning unit 64 learns a preprocessing model for selecting a rewrite candidate in the preprocessing of language processing, based on the correct pair selected by the evaluation selection unit and on features extracted from the character string received by the input unit 10 and from each of the rewrite candidates. Here, the preprocessing model learning unit 64 learns the preprocessing model so that, among the paths of a lattice, the character string represented by the path with the maximum score calculated using the preprocessing model is the rewrite candidate of the correct pair. The lattice is a graph structure consisting of nodes corresponding to the substrings that represent the character string and each of the rewrite candidates, and edges connecting nodes that correspond to substrings to be concatenated.

具体的には、前処理モデル学習部６４は、まず、入力部１０で受け付けた文字列について、上記ラティス生成部４０と同様の手法によって、Ｎｂｅｓｔ解生成部５０で得た書き換え候補の各々を用いて辞書引きを行い、ラティスを生成する。あるいはラティス生成部４０で作成したラティスのうち、上記Ｎｂｅｓｔ解生成部５０によって得られた書き換え候補の各々に係る経路を用いるようにしてもよい。このとき形態素情報は予め付与したものを用いるか、解析しなおす。 Specifically, for the character string accepted by the input unit 10, the preprocessing model learning unit 64 first uses each of the rewrite candidates obtained by the Nbest solution generation unit 50 by the same method as the lattice generation unit 40. Perform dictionary lookup to generate a lattice. Alternatively, among the lattices created by the lattice generation unit 40, paths relating to each of the rewrite candidates obtained by the Nbest solution generation unit 50 may be used. At this time, morpheme information is pre-assigned or reanalyzed.

次に、前処理モデル学習部６４は、評価選定部３６２で得られた正解ペアと、ラティスとに基づいて、パラメータα、β、及びγの学習を行う。前処理モデル学習部３６４で学習する前処理モデルを用いたスコア関数は以下のように定義される。 Next, the preprocessing model learning unit 64 learns the parameters α, β, and γ based on the correct pair obtained by the evaluation selection unit 62 and the lattice. The score function using the preprocessing model learned by the preprocessing model learning unit 64 is defined as follows.

Ｆ＝α＊ｆ１（書き換え候補素性）+β＊ｆ２（言語モデル素性）+γ＊ｆ３(書き換え候補類似度素性) F = α * f1 (rewrite candidate feature) + β * f2 (language model feature) + γ * f3 (rewrite candidate similarity feature)

ここで、ｆ１(書き換え候補素性)は、該当する書き換え候補が用いられているとき１、それ以外の時は０となるバイナリ素性である。f１(はらへったー→腹減った）、f１(はら→腹)など、各書き換え候補に対して定義される素性である。例えば、「おはよう」を通る経路のとき、ｆ１(おっはよう→おはよう)＝１となる。また、「おはよう」を通らない経路のとき、ｆ１(おっはよう→おはよう)＝０となる。ｆ２(言語モデル素性)とは、言語モデル２６から得られる、該当する経路の言語モデルスコアを表す。ｆ３(書き換え候補類似度素性)とは、該当する経路で採用された書き換え候補の元の書き換え前の文字列との類似度関数値を表す。例えば、経路１「おっはようはらへったー」、経路２「おはようお腹減った」の場合、類似度関数値は、 Here, f1 (rewrite candidate feature) is a binary feature that is 1 when the corresponding rewrite candidate is used and 0 otherwise. It is defined for each rewrite candidate, for example f1(はらへったー→腹減った) and f1(はら→腹). For a path that passes through 「おはよう」, f1(おっはよう→おはよう) = 1, and for a path that does not pass through 「おはよう」, f1(おっはよう→おはよう) = 0. f2 (language model feature) represents the language model score of the path, obtained from the language model 26. f3 (rewrite candidate similarity feature) represents the value of the similarity function between the rewrite candidates adopted on the path and the original character string before rewriting. For example, for path 1 「おっはようはらへったー」 and path 2 「おはようお腹減った」, the similarity function value is

ｆ３(経路１→経路２)＝０．７(おっはよう→おはよう)＋０．６(はらへったー→お腹へった)＝１．３ f3(path 1 → path 2) = 0.7 (おっはよう→おはよう) + 0.6 (はらへったー→お腹へった) = 1.3

となる。生成したラティスに対し、上記のスコア関数を計算した際の最適な経路が表す文が、正解ペアの正解文である書き換え候補と一致する（近くなる）ようにパラメータα、β、及びγの学習を行う。パラメータα、β、及びγの学習は、パーセプトロン学習やＣＲＦを用いて行うことができる。 For the generated lattice, the parameters α, β, and γ are learned so that the sentence represented by the optimal path under the above score function matches (or comes close to) the rewrite candidate that is the correct sentence of the correct pair. The parameters α, β, and γ can be learned using perceptron learning or a CRF.

＜本発明の実施の形態に係る前処理モデル学習装置の作用＞ <Operation of the Preprocessing Model Learning Device According to the Embodiment of the Present Invention>

次に、本発明の実施の形態に係る前処理モデル学習装置１００の作用について説明する。入力部１０において日本語の辞書と、大規模テキストとを受け付け、書き換え候補獲得部２２に出力すると、前処理モデル学習装置１００は、図８に示す書き換え候補テーブル作成処理ルーチンを実行する。 Next, the operation of the preprocessing model learning device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a Japanese dictionary and large-scale text and outputs it to the rewrite candidate acquisition unit 22, the preprocessing model learning device 100 executes a rewrite candidate table creation processing routine shown in FIG.

まず、ステップＳ２００では、入力部１０で受け付けた辞書を用いて、入力表記の各々に対して、複数のレベルの書き換え候補を獲得する。 First, in step S200, using the dictionary received by the input unit 10, rewrite candidates at a plurality of levels are acquired for each of the input notations.

ステップＳ２０２では、入力部１０で受け付けた大規模テキストを用いて、意味類似度を用いた単語レベル及びフレーズレベルの文字列のペアを、入力表記に対する書き換え候補として獲得する。 In step S202, using the large-scale text received by the input unit 10, pairs of word-level and phrase-level character strings based on semantic similarity are acquired as rewrite candidates for the input notation.

ステップＳ２０４では、入力部１０で受け付けた大規模テキストを用いて、述語である入力表記に対して、述部の機能語を書き換えた書き換え候補を獲得する。 In step S204, using the large-scale text accepted by the input unit 10, a rewrite candidate obtained by rewriting the function word of the predicate is acquired for the input notation which is a predicate.

ステップＳ２０６では、ステップＳ２００〜Ｓ２０４の各々で獲得した書き換え候補に意味類似度を付与し、入力表記と、書き換え候補と、入力表記に対する書き換え候補の意味類似度との組み合わせの各々からなる書き換えテーブルを作成し、書き換え候補ＤＢ３８に格納し、書き換え候補テーブル作成処理ルーチンを終了する。 In step S206, a semantic similarity is assigned to each of the rewrite candidates acquired in steps S200 to S204, a rewrite table consisting of combinations of an input notation, a rewrite candidate, and the semantic similarity of the rewrite candidate to the input notation is created and stored in the rewrite candidate DB 38, and the rewrite candidate table creation processing routine ends.
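One possible in-memory layout for the rewrite table is a mapping from each input notation to its rewrite candidates and similarities. The sketch below is illustrative only: `build_rewrite_table` is a hypothetical name, the similarity 0.5 for 「お早う」 is an invented value, and storage in the rewrite candidate DB 38 is not modeled.

```python
from collections import defaultdict

def build_rewrite_table(triples):
    """Group (input_notation, rewrite_candidate, similarity) triples by input notation.

    Candidates for each notation are sorted by descending similarity.
    """
    table = defaultdict(list)
    for notation, candidate, similarity in triples:
        table[notation].append((candidate, similarity))
    for candidates in table.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return dict(table)

triples = [
    ("おっはよう", "お早う", 0.5),        # similarity is an illustrative assumption
    ("おっはよう", "おはよう", 0.7),
    ("はらへったー", "お腹へった", 0.6),
]
rewrite_table = build_rewrite_table(triples)
```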

そして、入力部１０において書き換え対象の文字列を受け付けると、前処理モデル学習装置１００は、図９に示す前処理モデル学習処理ルーチンを実行する。 Then, when the character string to be rewritten is received in the input unit 10, the preprocessing model learning device 100 executes a preprocessing model learning processing routine shown in FIG.

まず、ステップＳ１０２では、入力部１０で受け付けた文字列に対して、ステップＳ１００で作成された書き換え候補テーブルを用いて辞書引きを行い、書き換え候補を含む各部分文字列に対応するノード及び連結される部分文字列に対応するノードを結んだエッジからなるグラフ構造であるラティスを生成する。 First, in step S102, a dictionary lookup is performed on the character string received by the input unit 10 using the rewrite candidate table created in step S100, and a lattice is generated. The lattice is a graph structure consisting of nodes corresponding to the substrings, including rewrite candidates, and edges connecting nodes that correspond to substrings to be concatenated.

ステップＳ１０４では、ステップＳ１０２で生成されたラティスと、ラティスにおける各ノードに対応する書き換え候補の意味類似度と、言語モデル２６における部分文字列の各々の言語モデルスコアとに基づいて、ラティスのエッジからなる各経路のうち、スコアが上位Ｎ個となる経路が表す文字列を、入力された文字列の書き換え候補の各々として生成する。 In step S104, based on the lattice generated in step S102, the semantic similarity of the rewrite candidate corresponding to each node in the lattice, and the language model score of each substring in the language model 26, the character strings represented by the N top-ranked paths among the paths formed by the edges of the lattice are generated as the rewrite candidates for the input character string.
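A minimal sketch of N-best generation is given below. It simplifies the lattice to one column of alternatives per input segment and enumerates all paths exhaustively, which is a stand-in for a proper N-best lattice search and is only practical for small lattices. It also assumes a cost-style score in which smaller is better and the natural logarithm is used; `n_best`, `lm_cost`, and the cost values are hypothetical.

```python
import heapq
import itertools
import math

def n_best(lattice, lm_cost, n, weight=0.5):
    """Return the n lowest-cost (cost, text) paths through a column lattice.

    lattice: list of columns; each column is a list of (surface, similarity).
    lm_cost: function mapping a surface string to a language model cost.
    """
    scored = []
    for path in itertools.product(*lattice):
        cost = sum(lm_cost(surface) + weight * -math.log(sim) for surface, sim in path)
        scored.append((cost, "".join(surface for surface, _ in path)))
    return heapq.nsmallest(n, scored)

# Illustrative language model costs (assumed values, not taken from the text).
costs = {"おっはよう": 5.0, "おはよう": 1.0, "はらへったー": 6.0, "腹減った": 2.0}
lattice = [
    [("おっはよう", 1.0), ("おはよう", 0.7)],
    [("はらへったー", 1.0), ("腹減った", 0.6)],
]
top2 = n_best(lattice, costs.__getitem__, n=2)
```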

ステップＳ１０６では、ステップＳ１０４で生成された書き換え候補の各々に対し、本処理モデル２８を用いて翻訳処理を行い、翻訳結果を得る。 In step S106, translation processing is performed on each of the rewrite candidates generated in step S104 using the processing model 28 to obtain a translation result.

ステップＳ１０８では、ステップＳ１０４で生成された書き換え候補の各々について、ステップＳ１０６で書き換え候補に対して予め定められた翻訳処理を行った結果に基づいて、書き換え候補に対して翻訳処理を行った結果の尤もらしさを評価し、文字列と、最も評価が高い書き換え候補とのペアを、正解ペアとして選定する。 In step S108, for each of the rewrite candidates generated in step S104, a translation process is performed on the rewrite candidates based on the result of performing a predetermined translation process on the rewrite candidates in step S106. The likelihood is evaluated, and the pair of the character string and the rewriting candidate with the highest evaluation is selected as the correct pair.

ステップＳ１１０では、入力部１０で受け付けた文字列について、ステップＳ１０４で得た書き換え候補の各々を用いて辞書引きを行い、ラティスを生成する。 In step S110, for the character string received by the input unit 10, a dictionary lookup is performed using each of the rewrite candidates obtained in step S104, and a lattice is generated.

ステップＳ１１２では、ステップＳ１０８で選定された正解ペアと、入力部１０で受け付けた文字列、及び書き換え候補の各々から抽出された素性とに基づいて、ラティスの各経路のうち、前処理モデルを用いて計算したスコアが最大となる経路が表す文字列が、正解ペアの書き換え候補となるように言語処理の前処理における書き換え候補を選択するための前処理モデルを学習する。 In step S112, based on the correct pair selected in step S108, the character string received by the input unit 10, and the features extracted from each of the rewrite candidates, a preprocessing model is used among the lattice paths. The pre-processing model for selecting a rewrite candidate in the pre-processing of language processing is learned so that the character string represented by the path with the largest calculated score becomes the rewrite candidate of the correct pair.
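The learning in step S112 can be illustrated with a perceptron-style update on the score function F = α*f1 + β*f2 + γ*f3. The sketch below is a simplification under stated assumptions: each candidate path is summarized by a single feature vector (f1, f2, f3), f1 is aggregated into one value rather than one binary feature per rewrite candidate, f2 is treated as a log-probability-like value where higher is better, and all feature values are invented. The names `score` and `perceptron_train` are hypothetical.

```python
def score(weights, features):
    """F = alpha*f1 + beta*f2 + gamma*f3 for one candidate path."""
    return sum(w * f for w, f in zip(weights, features))

def perceptron_train(examples, epochs=10, lr=1.0):
    """Learn (alpha, beta, gamma) so the gold path gets the maximum score.

    examples: list of (candidates, gold_index); candidates is a list of
    feature vectors (f1, f2, f3), one per path in the lattice.
    """
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for candidates, gold in examples:
            predicted = max(range(len(candidates)),
                            key=lambda i: score(weights, candidates[i]))
            if predicted != gold:
                # Move weights toward the gold features, away from the prediction.
                weights = [w + lr * (g - p) for w, g, p in
                           zip(weights, candidates[gold], candidates[predicted])]
    return weights

# Path 0: unmodified input; path 1: correct rewrite; path 2: another rewrite.
examples = [([(0.0, -11.0, 0.0), (1.0, -3.0, 1.3), (1.0, -5.0, 1.2)], 1)]
weights = perceptron_train(examples)
```

After training on this single example, the highest-scoring path under the learned weights is the gold rewrite (index 1).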

上記ステップＳ１０２は、図１０に示すラティス生成処理ルーチンによって実現される。 The above step S102 is realized by the lattice generation processing routine shown in FIG.

ステップＳ３００では、入力部１０で受け付けた文字列を形態素解析し、解析により得られた入力形態素を書き換え候補テーブル参照部４４に出力する。 In step S300, the character string received by the input unit 10 is morphologically analyzed, and the input morpheme obtained by the analysis is output to the rewrite candidate table reference unit 44.

ステップＳ３０２では、ステップＳ３００で出力された入力形態素の各々について、入力形態素を入力表記の参照キーとして書き換え候補ＤＢ３８の書き換え候補テーブルを参照し、書き換え候補集合を取得する。 In step S302, for each of the input morphemes output in step S300, the rewrite candidate table of the rewrite candidate DB 38 is referred to using the input morpheme as the reference key of the input notation, and the rewrite candidate set is acquired.

ステップＳ３０４では、ステップＳ３０２で取得した書き換え候補集合を用いて、入力形態素の各々に対して書き換え候補を展開してラティスを生成し、ラティス生成処理ルーチンを終了する。 In step S304, using the rewrite candidate set acquired in step S302, rewrite candidates are expanded for each of the input morphemes to generate a lattice, and the lattice generation processing routine is ended.
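Steps S300 to S304 can be sketched as follows. Morphological analysis is assumed to have been done already (the input is given as a list of morphemes), the lattice is simplified to one column of alternatives per input morpheme, and the similarity 1.0 assigned to the unmodified notation is an assumption. `build_lattice` is a hypothetical name.

```python
def build_lattice(morphemes, rewrite_table):
    """Expand each input morpheme into a column of (surface, similarity) nodes.

    The first node of every column is the input notation itself; the rest
    are rewrite candidates looked up with the morpheme as the key.
    """
    lattice = []
    for morpheme in morphemes:
        column = [(morpheme, 1.0)]  # unmodified notation (similarity 1.0 assumed)
        column.extend(rewrite_table.get(morpheme, []))
        lattice.append(column)
    return lattice

rewrite_table = {
    "おっはよう": [("おはよう", 0.7)],
    "はらへったー": [("お腹へった", 0.6)],
}
lattice = build_lattice(["おっはよう", "はらへったー"], rewrite_table)
```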

以上説明したように、本発明の実施の形態に係る前処理モデル学習装置によれば、書き換え候補の各々について、書き換え候補に対して予め定められた言語処理を行った結果に基づいて、書き換え候補に対して言語処理を行った結果の尤もらしさを評価し、文字列と、最も評価が高い書き換え候補とのペアを、正解ペアとして選定し、選定された正解ペアと、書き換え候補の各々から抽出される素性とに基づいて、言語処理の前処理における書き換え候補を選択するための前処理モデルを学習することにより、高精度な言語処理を行うための前処理モデルを学習することができる。 As described above, the preprocessing model learning device according to the embodiment of the present invention evaluates, for each rewrite candidate, the likelihood of the result of performing predetermined language processing on the rewrite candidate, based on that result; selects the pair of the character string and the rewrite candidate with the highest evaluation as the correct pair; and learns a preprocessing model for selecting a rewrite candidate in the preprocessing of the language processing, based on the selected correct pair and the features extracted from each of the rewrite candidates. This makes it possible to learn a preprocessing model for performing highly accurate language processing.

また、自動獲得した複数の書き換え候補とその組み合わせの中から、自動評価に基づいて最適な候補を選択することができる。例えばルールで設定する場合、「みちゃった」→「見た」などの一意の書き換えを規定し、複数の候補の順序を付ける場合はそれぞれ順序関数を規定する必要があるが、本実施の形態の技術では、「みちゃった」→「見た」「見ちゃった」「見たよ」などの中から目的コーパスの言語モデルという基準に基づき最も適した候補を出力することができる。 In addition, the optimal candidate can be selected, based on automatic evaluation, from among multiple automatically acquired rewrite candidates and their combinations. For example, with rule-based settings, a unique rewrite such as 「みちゃった」→「見た」 must be specified, and an ordering function must be defined for each case in which multiple candidates are to be ordered. With the technique of the present embodiment, the most suitable candidate can be output from among candidates such as 「みちゃった」→「見た」, 「見ちゃった」, and 「見たよ」, based on the criterion of the language model of the target corpus.

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the scope of the present invention.

例えば上述した実施の形態では、前処理モデル学習装置１００は、書き換え候補獲得部２２、前処理部２４の各処理によって書き換え候補を獲得し、本処理部６０で書き換え候補について予め定めた言語処理を行っているが、これに限定されるものではなく、前処理モデル学習装置１００は、書き換え候補獲得部２２、前処理部２４、及び本処理部６０を構成に含まずに、予め入力部１０により書き換え候補の各々、及び言語処理の結果を受け付けて、評価選定部６２、及び前処理モデル学習部６４の各部の処理を行うようにしてもよい。 For example, in the embodiment described above, the preprocessing model learning device 100 acquires rewrite candidates through the processing of the rewrite candidate acquisition unit 22 and the preprocessing unit 24, and the main processing unit 60 performs predetermined language processing on the rewrite candidates. However, the present invention is not limited to this. The preprocessing model learning device 100 may omit the rewrite candidate acquisition unit 22, the preprocessing unit 24, and the main processing unit 60 from its configuration, receive each of the rewrite candidates and the results of the language processing in advance via the input unit 10, and perform the processing of the evaluation selection unit 62 and the preprocessing model learning unit 64.

１０入力部 10 input unit
２０、２２０演算部 20, 220 operation unit
２２書き換え候補獲得部 22 rewrite candidate acquisition unit
２４前処理部 24 preprocessing unit
２６言語モデル 26 language model
２８本処理モデル 28 main processing model
３０辞書候補獲得部 30 dictionary candidate acquisition unit
３２同義フレーズ獲得部 32 synonymous phrase acquisition unit
３４同義述部獲得部 34 synonymous predicate acquisition unit
３６類似度設定部 36 similarity setting unit
４０ラティス生成部 40 lattice generation unit
４２形態素解析部 42 morphological analysis unit
４４書き換え候補テーブル参照部 44 rewrite candidate table reference unit
４６書き換え候補ラティス生成部 46 rewrite candidate lattice generation unit
５０Ｎｂｅｓｔ解生成部 50 Nbest solution generation unit
６０本処理部 60 main processing unit
６２評価選定部 62 evaluation selection unit
６４前処理モデル学習部 64 preprocessing model learning unit
９０出力部 90 output unit
１００前処理モデル学習装置 100 preprocessing model learning device

Claims

A preprocessing model learning device including:
an evaluation selection unit that, for each of rewrite candidates generated for an input character string, the rewrite candidates including a rewrite candidate based on a headword having the same lexical element and representative notation registered in a predetermined dictionary, a rewrite candidate based on a word or phrase whose semantic similarity to the character string satisfies a condition, and a rewrite candidate obtained by rewriting a function word of a predicate of the character string, evaluates the likelihood of the result of performing predetermined language processing on the rewrite candidate, based on that result, and selects the pair of the character string and the rewrite candidate with the highest evaluation as a correct pair; and
a preprocessing model learning unit that learns a preprocessing model for selecting a rewrite candidate in the preprocessing of the language processing, based on the correct pair selected by the evaluation selection unit and the features extracted from each of the rewrite candidates.

The preprocessing model learning device according to claim 1, wherein the preprocessing model learning unit generates a lattice, which is a graph structure consisting of nodes corresponding to the substrings that represent the character string and each of the rewrite candidates and edges connecting nodes that correspond to substrings to be concatenated, and learns the preprocessing model so that, among the paths of the lattice, the character string represented by the path with the maximum score calculated using the preprocessing model is the rewrite candidate of the correct pair.

The preprocessing model learning device according to claim 1 or 2, wherein the preprocessing model learning unit learns the preprocessing model for selecting a rewrite candidate in the preprocessing of the language processing, based on the correct pair selected by the evaluation selection unit, the features extracted from each of the rewrite candidates, a feature related to a language model score representing the likelihood of the notation of each of the rewrite candidates, and a feature related to the similarity between the character string and each of the rewrite candidates.

A preprocessing model learning method including:
a step in which an evaluation selection unit, for each of rewrite candidates generated for an input character string, the rewrite candidates including a rewrite candidate based on a headword having the same lexical element and representative notation registered in a predetermined dictionary, a rewrite candidate based on a word or phrase whose semantic similarity to the character string satisfies a condition, and a rewrite candidate obtained by rewriting a function word of a predicate of the character string, evaluates the likelihood of the result of performing predetermined language processing on the rewrite candidate, based on that result, and selects the pair of the character string and the rewrite candidate with the highest evaluation as a correct pair; and
a step in which a preprocessing model learning unit learns a preprocessing model for selecting a rewrite candidate in the preprocessing of the language processing, based on the correct pair selected by the evaluation selection unit and the features extracted from each of the rewrite candidates.

The preprocessing model learning method according to claim 4, wherein, in the step of learning by the preprocessing model learning unit, a lattice, which is a graph structure consisting of nodes corresponding to the substrings that represent the character string and each of the rewrite candidates and edges connecting nodes that correspond to substrings to be concatenated, is generated, and the preprocessing model is learned so that, among the paths of the lattice, the character string represented by the path with the maximum score calculated using the preprocessing model is the rewrite candidate of the correct pair.

The preprocessing model learning method according to claim 4 or 5, wherein, in the step of learning by the preprocessing model learning unit, the preprocessing model for selecting a rewrite candidate in the preprocessing of the language processing is learned based on the correct pair selected by the evaluation selection unit, the features extracted from each of the rewrite candidates, a feature related to a language model score representing the likelihood of the notation of each of the rewrite candidates, and a feature related to the similarity between the character string and each of the rewrite candidates.

A program for causing a computer to function as each unit of the preprocessing model learning device according to any one of claims 1 to 3 .