JP6312467B2

JP6312467B2 - Information processing apparatus, information processing method, and program

Info

Publication number: JP6312467B2
Application number: JP2014041983A
Authority: JP
Inventors: 塚原　裕史; 裕史塚原; 慶内海
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2014-03-04
Filing date: 2014-03-04
Publication date: 2018-04-18
Anticipated expiration: 2034-03-04
Also published as: JP2015169951A

Description

The present invention relates to an information processing apparatus that performs natural language processing, and more particularly to an information processing apparatus that divides a given character string into words and obtains the semantic similarity of the divided words.

Conventionally, one known method of dividing a given character string into words is to prepare a word dictionary in advance and divide the character string into words based on that dictionary (Non-Patent Document 1). A method of dividing a character string into words by unsupervised learning, without preparing a word dictionary, is also known. In this method, character N-grams and word N-grams are modeled probabilistically by a nonparametric Bayesian method to estimate words (Non-Patent Document 2).

As a method of estimating the semantic similarity between words, it is known to prepare a concept base in advance and estimate the similarity between words by matching against that concept base. In this method, for a word that is not registered in the concept base, the similarity is estimated from the word's relationship to words that appear in the same sentences and are registered in the concept base (Patent Document 1). A method is also known that uses a multi-layer neural network, without a concept base, so that the similarity between words emerges spontaneously in an intermediate layer of the network (Non-Patent Document 3).

Patent Document 1: JP 2010-224887 A

Non-Patent Document 1: "Introduction to Machine Learning for Language Processing", Hiroya Takamura, supervised by Manabu Okumura (Corona Publishing)
Non-Patent Document 2: "Unsupervised Morphological Analysis Using a Bayesian Hierarchical Language Model", Daichi Mochihashi, Takeshi Yamada, Naonori Ueda, Natural Language Processing Society (NL190)
Non-Patent Document 3: "Linguistic Regularities in Continuous Space Word Representations", T. Mikolov, W-T Yih, G. Zweig (INTERSPEECH 2013)

However, the dictionary-based method cannot correctly divide a character string that contains unknown words not registered in the dictionary. Reducing the number of unknown words requires building a large-scale word dictionary. Word division by unsupervised learning overcomes this problem, but it cannot estimate the semantic similarity of the divided words.

To estimate semantic similarity, the prior art requires a concept base to be prepared in advance, as described in Patent Document 1, and building a concept base takes a great deal of manual work. For a word that is not registered in the concept base, the method relies on registered words that appear near it; because these neighbors can include words unrelated to the target word, the meaning cannot always be estimated correctly. Furthermore, the method of Patent Document 1 assumes that words have already been divided correctly, so the word division problem described above applies to it as well.

According to the method of Non-Patent Document 3, words are embedded in a continuous space that represents semantic relationships, without preparing a concept base, and the semantic similarity between words is obtained from distances in that space. However, this method also assumes that words have already been divided correctly, so the word division problem described above applies to it as well.

As described above, the prior art cannot estimate word division and the semantic similarity of the divided words at the same time without preparing teacher data in advance.

Acquiring word division and the semantic similarity of words automatically, without supervision, thus involves the problems described above, and no framework that solves them has existed until now.

Accordingly, an object of the present invention is to provide an information processing apparatus that divides given learning data into words and automatically acquires the semantic similarity of the divided words.

An information processing apparatus of the present invention includes: an input unit that inputs sentence data as learning data; a word division unit that divides the learning data into words using a character N-gram or a word division model; a character N-gram learning unit that learns a character N-gram based on the divided word data and stores the learned character N-gram in a character N-gram storage unit; a word boundary learning unit that learns a word division model by sequence labeling that recognizes word boundaries based on the divided word data, and stores the word division model obtained by the learning in a word division model storage unit; a word N-gram learning unit that learns, using the divided word data as teacher data, a word N-gram represented by a recurrent neural network that has an input layer, an intermediate layer, and an output layer and that also feeds the output of the intermediate layer back to the input layer, and stores the word N-gram in a word N-gram storage unit; a concept data calculation unit that inputs word data to the recurrent neural network stored in the word N-gram storage unit and obtains the data produced in the intermediate layer as concept data; and an output unit that outputs the concept data. Until a predetermined convergence condition is satisfied, the apparatus repeats a process in which the word division unit alternates between word division using the character N-gram learned by the character N-gram learning unit and word division using the word division model learned by the word boundary learning unit, and a process in which the word N-gram learning unit learns the word N-gram using the word data divided by the word division unit.

By repeatedly learning the word N-gram represented by the recurrent neural network, using the word data divided by the word division unit as teacher data, concept data representing the concept of each word can be obtained in the intermediate layer. Therefore, according to the present invention, data representing the concepts of words can be acquired automatically from the given learning data without preparing a word dictionary or a concept dictionary. This promotes the use of natural language processing for text such as blogs and spoken language, to which it has been difficult to apply.

The recurrent neural network may take the data of the first to N-th words of a sentence as input and output the (N+1)-th word. As a result, concept data that reflects contextual factors appears in the intermediate layer.

The information processing apparatus of the present invention may include a clustering unit that, based on the concept data, clusters words whose mutual similarity is greater than a predetermined threshold into the same group, and the output unit may output the clustering result. In that case, the most frequent word among the words in a cluster may be output as the word representing the cluster. The clustering unit may also perform clustering hierarchically.

In the information processing apparatus of the present invention, the word division unit may perform an initial division of the learning data based on character codes when the learning data is given. This allows the initial division of the learning data into words to be performed appropriately even when no character N-gram or word division model is available.

An information processing method of the present invention is a method in which an information processing apparatus divides input learning data into words and obtains the concepts of the divided words. The method includes: a step in which the information processing apparatus inputs sentence data as learning data; a step in which the information processing apparatus repeats, until a predetermined convergence condition is satisfied, a process of alternately performing word division using a character N-gram and word division using a word division model on the learning data and, using the divided word data as teacher data, learning a word N-gram represented by a recurrent neural network that has an input layer, an intermediate layer, and an output layer and that also feeds the output of the intermediate layer back to the input layer; a step in which the information processing apparatus inputs word data to the recurrent neural network stored in a word N-gram storage unit and obtains the data produced in the intermediate layer as concept data; and a step in which the information processing apparatus outputs the concept data. The step of learning the word N-gram includes: a step in which the information processing apparatus divides the learning data into words using the character N-gram; a step in which the information processing apparatus learns the word N-gram represented by the recurrent neural network, using the divided word data as teacher data, and stores it in the word N-gram storage unit; a step in which the information processing apparatus learns a word division model by sequence labeling that recognizes word boundaries based on the divided word data, and stores the word division model obtained by the learning in a word division model storage unit; a step in which the information processing apparatus divides the learning data into words using the word division model; a step in which the information processing apparatus learns the word N-gram represented by the recurrent neural network, using the divided word data as teacher data, and stores it in the word N-gram storage unit; and a step in which the information processing apparatus learns a character N-gram based on the divided word data and stores the learned character N-gram in a character N-gram storage unit.

A program of the present invention causes a computer to execute the information processing method described above.

According to the present invention, the semantic similarity of words can be acquired automatically from given text data without preparing a word dictionary or a concept dictionary. This promotes natural language processing of text such as blogs and spoken language, to which it has conventionally been difficult to apply.

FIG. 1 is a diagram showing the functional blocks of the information processing apparatus of the embodiment.
FIG. 2 is a diagram showing an example of the recurrent neural network.
FIG. 3 is a diagram showing the hardware configuration of the information processing apparatus of the embodiment.
FIG. 4 is a diagram showing an outline of the operation of the information processing apparatus of the embodiment.
FIG. 5 is a diagram showing an example of the word division and word N-gram learning process.
FIG. 6 is a diagram showing another example of the word division and word N-gram learning process.

Hereinafter, an information processing apparatus according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing the functional blocks of the information processing apparatus of the embodiment. The information processing apparatus includes an input unit 10 that inputs learning data 21, a word division unit 11 that divides the learning data 21 into words, and a character N-gram learning unit 12, a word N-gram learning unit 14, and a word boundary learning unit 16 that perform learning based on the divided word data. The word N-gram learning unit 14 learns a word N-gram using the recurrent neural network shown in FIG. 2. The information processing apparatus also includes a concept vector calculation unit 18 that calculates a concept vector 22 of a word using the trained recurrent neural network, a clustering unit 19 that clusters words based on the concept vectors 22, and an output unit 20 that outputs the concept vectors 22.

The information processing apparatus handles word data as vectors. Specifically, with K as the maximum length of one word and L as the number of character types, it defines L^K possible words and assigns each defined word a word vector that uniquely identifies it. When defining words, the maximum number of character-type switches within a word may be limited to M. Concretely, a word vector is a vector in which, of the L^K components, only the component representing the word is "1" and all the others are "0". This representation of word vectors is called 1-of-N coding.
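
As an illustration of 1-of-N coding, the following minimal Python sketch builds such vectors for a toy vocabulary. The vocabulary list and the helper name `one_hot` are assumptions made for this example; the embodiment indexes all L^K possible words rather than a hand-picked list.

```python
def one_hot(word, vocabulary):
    """Return the 1-of-N vector for `word`: 1 at its index, 0 elsewhere."""
    index = vocabulary.index(word)  # raises ValueError for an unknown word
    return [1 if i == index else 0 for i in range(len(vocabulary))]


# Toy vocabulary (hypothetical); each word owns exactly one component.
vocabulary = ["cat", "dog", "runs"]
dog_vector = one_hot("dog", vocabulary)
```

The vector length equals the vocabulary size, which is why the number of possible words has to be fixed (here through K and L) before any learning starts.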

The input unit 10 of the information processing apparatus accepts text data of sentences as the learning data 21. The learning data 21 may be obtained, for example, from blogs on the web.

The word division unit 11 has a function of dividing the learning data 21 into words. It can perform word division by the following three methods: (1) dividing at points where the character code switches; (2) dividing using the character N-gram; and (3) dividing using the word division model.

Method (1), dividing at points where the character code switches, is used for the first word division after the learning data 21 is given. When the learning data 21 is first given, no character N-gram or word division model has yet been obtained for it, so the initial division is performed at character code switches. Although this embodiment uses character code switching for the initial division, the division can also be performed by an existing morphological analyzer (used as a weak classifier).
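
The initial division of method (1) might look like the following Python sketch. It reads "character code switching" as a change of character class (hiragana, katakana, kanji, Latin letters, digits), which is an assumption; the function names `char_type` and `initial_split` and the classification by Unicode character name are likewise illustrative, not taken from the embodiment.

```python
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode character name."""
    if ch.isdigit():
        return "digit"
    name = unicodedata.name(ch, "")
    for script in ("HIRAGANA", "KATAKANA", "CJK", "LATIN"):
        if name.startswith(script):
            return script.lower()
    return "other"


def initial_split(text):
    """Cut `text` wherever the character class changes."""
    words = []
    for ch in text:
        if words and char_type(words[-1][-1]) == char_type(ch):
            words[-1] += ch  # same class: extend the current word
        else:
            words.append(ch)  # class changed: start a new word
    return words
```

Such a split is only a rough starting point; the later iterations with the character N-gram and the word division model are what refine it.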

Word division using the character N-gram in (2) is performed as follows.
A division candidate w of a sentence x is written as follows.

The word division is obtained so as to maximize the following expression.

For word division using the word division model in (3), a known technique can be used.

The character N-gram learning unit 12 inserts virtual characters indicating the start and end of a word before and after each divided word, and then learns the character N-gram. A statistical N-gram model can be used for this learning; for example, the method known as Kneser-Ney smoothing may be used. The character N-gram learning unit 12 stores the character N-gram data obtained by the learning in the character N-gram storage unit 13.
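
The following Python sketch shows one way character N-gram learning and the resulting word division could fit together. It rests on several assumptions that are not from the source: a bigram order, add-one smoothing swapped in for Kneser-Ney to keep the code short, "^" and "$" as the virtual start and end characters, a maximum word length `max_len`, and a segmentation score equal to the sum of per-word log-probabilities (the expression actually maximized is not reproduced in this text).

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # virtual word-start / word-end characters (assumed symbols)


def train_char_bigram(words):
    """Count character bigrams over words padded with BOS/EOS."""
    bigrams, contexts, alphabet = Counter(), Counter(), {EOS}
    for word in words:
        padded = BOS + word + EOS
        alphabet.update(word)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(alphabet)


def word_log_prob(word, model):
    """Add-one smoothed log-probability of one word under the bigram model."""
    bigrams, contexts, vocab_size = model
    padded = BOS + word + EOS
    return sum(
        math.log((bigrams[prev, cur] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def segment(text, model, max_len=4):
    """Dynamic programming: the split of `text` with the highest total score."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + word_log_prob(text[start:end], model)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]
```

Because the start and end markers are part of every training word, the model scores where a word may begin and end, which is what lets it propose boundaries in unsegmented text.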

As described above, the word N-gram learning unit 14 learns the word N-gram using the recurrent neural network shown in FIG. 2. The recurrent neural network has an input layer 30 that receives the word vector w(t) of a divided word, an intermediate layer 32, an input layer 31 that receives the output of the intermediate layer 32 and feeds it back into the intermediate layer 32, and an output layer 33. Within the network, the intermediate layer 32 is updated recursively together with the input data. The output layer 33 outputs, in the form of a probability distribution, the N-th word vector that follows the (N-1) input word vectors. In FIG. 2, the letters written near the arrows are the connection weight vectors of the layers, and the vector that appears in each layer is written above that layer.

Here, the vector s(t) that appears in the intermediate layer 32 when the word vector w(t) is input corresponds to the concept vector of that word. In the recurrent neural network, the output of the intermediate layer 32 is input to the input layer 31 and then input to the intermediate layer 32 again. In other words, the preceding word vector w(t-1) influences the concept vector s(t) of the word vector w(t), so the concept vector s(t) is obtained with the context taken into account.
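
One step of such a network can be sketched in plain Python as follows. The weight names `U`, `W`, and `V`, the sigmoid hidden activation, and the softmax output follow the usual formulation of recurrent language models and are assumptions here; this text does not reproduce the letters of FIG. 2 or fix the activation functions. The toy weights are arbitrary.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(w_t, s_prev, U, W, V):
    """One time step: returns the hidden state s(t) and the output y(t)."""
    pre = [a + b for a, b in zip(matvec(U, w_t), matvec(W, s_prev))]
    s_t = [1.0 / (1.0 + math.exp(-z)) for z in pre]  # sigmoid hidden layer
    logits = matvec(V, s_t)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    y_t = [e / total for e in exps]  # softmax: distribution over next words
    return s_t, y_t


# Toy sizes: 3 words in the vocabulary, 2 hidden units.
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]    # input layer 30 -> intermediate layer
W = [[0.1, 0.4], [-0.3, 0.2]]               # fed-back state -> intermediate layer
V = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.3]]  # intermediate layer -> output layer
s_t, y_t = rnn_step([1, 0, 0], [0.0, 0.0], U, W, V)
```

Calling `rnn_step` with the same word vector but a different `s_prev` yields a different `s_t`, which is the context dependence described above.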

The recurrent neural network is trained by backpropagation, using the divided word data as teacher data, based on the N-th word vector that follows the (N-1) input word vectors. The word N-gram learning unit 14 stores the data of the recurrent neural network updated by the learning in the word N-gram storage unit 15.

The word boundary learning unit 16 learns the word division model by sequence labeling that recognizes word boundaries. As the word division model, a statistical model such as a CRF (conditional random field) may be used, or a neural network such as a structured perceptron may be used. The word boundary learning unit 16 stores the learned word division model in the word division model storage unit 17.
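
To train a sequence labeler on the divided words, the segmentation first has to be expressed as one label per character. The sketch below uses a two-tag B/I scheme ("B" for the first character of a word, "I" for the rest); this tag set and both function names are assumptions for illustration, since the text only states that word boundaries are recognized by sequence labeling.

```python
def boundary_labels(words):
    """One label per character: "B" starts a word, "I" continues it."""
    return ["B" if i == 0 else "I" for word in words for i in range(len(word))]


def words_from_labels(text, labels):
    """Rebuild the word list from a character string and its labels."""
    words = []
    for ch, label in zip(text, labels):
        if label == "B" or not words:
            words.append(ch)  # a boundary: start a new word
        else:
            words[-1] += ch   # no boundary: extend the current word
    return words
```

The pairs of characters and labels produced by `boundary_labels` would serve as training data for a model such as a CRF, and `words_from_labels` turns the labels predicted by that model back into a word division.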

The information processing apparatus repeats the following process: it updates the recurrent neural network representing the word N-gram using the word data divided by the word division unit 11 as teacher data, learns the character N-gram and the word boundaries based on the divided word data, and divides the learning data into words again using the learned character N-gram and word division model. As the convergence test that ends the repetition, the process may, for example, be completed when word division has been performed a predetermined number of times I, or when the perplexity is calculated and its value no longer changes by a predetermined amount or more.
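
The two stopping rules can be combined as in the following Python sketch. The function names, the default iteration budget, and the tolerance are assumptions chosen for illustration; perplexity is computed here from per-word natural-log probabilities in the standard way.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def has_converged(iteration, perplexities, max_iterations=20, tolerance=0.01):
    """True when the iteration budget is used up or perplexity has stalled."""
    if iteration >= max_iterations:
        return True
    if len(perplexities) >= 2:
        # Compare the two most recent perplexity values.
        return abs(perplexities[-1] - perplexities[-2]) < tolerance
    return False
```

Keeping the iteration cap alongside the perplexity test guards against a run in which the perplexity keeps oscillating and never settles.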

The concept vector calculation unit 18 inputs a word vector to the recurrent neural network trained by the word N-gram learning unit 14 and obtains the intermediate-layer vector for that input as the concept vector 22. The concept vector 22 gives the position of the word in the concept space, and the distances and directions between concept vectors 22 represent semantic similarity.
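
Distance and direction between two concept vectors can be measured, for instance, with the Euclidean distance and the cosine similarity. The text does not name a specific measure, so both choices and the function names below are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two concept vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Cosine similarity ignores vector length and compares direction only, whereas the Euclidean distance is sensitive to both.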

The clustering unit 19 has a function of clustering similar words based on the concept vectors 22 calculated by the concept vector calculation unit 18. The clustering unit 19 may perform clustering hierarchically.

The output unit 20 outputs the data of the concept vectors 22 calculated by the concept vector calculation unit 18 and the clustering result. As the word representing a cluster, the output unit 20 may output the most frequent word among the words in that cluster. In addition to these, the output unit 20 may also output the word N-gram data.
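
Threshold-based clustering and the choice of a representative word could be sketched as follows. The sketch assumes cosine similarity as the similarity measure and single-link grouping through a union-find structure; neither is specified in the text, and the names `cluster_words` and `representative` as well as the toy vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold):
    """Group words linked by a pairwise similarity above `threshold`."""
    parent = {word: word for word in vectors}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    words = list(vectors)
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            if cosine(vectors[first], vectors[second]) > threshold:
                parent[find(first)] = find(second)  # merge the two groups
    clusters = {}
    for word in words:
        clusters.setdefault(find(word), []).append(word)
    return list(clusters.values())


def representative(cluster, frequency):
    """The most frequent word of the cluster stands for the whole cluster."""
    return max(cluster, key=lambda word: frequency[word])


# Toy concept vectors and word counts (hypothetical values).
vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "run": [0.0, 1.0]}
clusters = cluster_words(vectors, threshold=0.9)
```

Running the same grouping with a decreasing series of thresholds would give nested clusters, one simple way to obtain the hierarchical clustering mentioned above.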

FIG. 3 is a diagram showing a hardware configuration that implements the functions of the information processing apparatus described above. The hardware of the information processing apparatus is an ordinary computer including a CPU 40, a RAM 41, a ROM 42, a communication interface 44, a hard disk 45, a keyboard 46, and a monitor 47. The information processing apparatus described above is realized by reading and executing a program 43 stored in the ROM 42. Such a program 43 is also included in the scope of the present invention.

Next, the operation of the information processing apparatus of the embodiment will be described. FIG. 4 is a diagram showing an outline of the operation of the information processing apparatus, and FIGS. 5 and 6 are diagrams showing the operations of word division and word N-gram learning. First, the outline of the operation will be described with reference to FIG. 4.

First, the information processing apparatus defines L^K possible words, with K as the maximum length of one word and L as the number of character types, and assigns each defined word a word vector that uniquely identifies it (S10). Next, the information processing apparatus inputs the learning data 21 (S11).

The information processing apparatus divides the input learning data 21 into words and learns the word N-gram based on the divided word data (S12). This process is described later with reference to FIGS. 5 and 6. When the word division and the word N-gram learning have converged, the information processing apparatus obtains the concept vector 22 of each word using the recurrent neural network of the word N-gram (S13) and clusters the words based on the concept vectors 22 (S14). The information processing apparatus then outputs the concept vectors 22 of the words and the clustering result (S15).

The word division and word N-gram processing will be described with reference to FIG. 5. The information processing apparatus first performs the initial word division (S20). In this embodiment, it divides words at character code switches. Next, the information processing apparatus learns the word N-gram using the divided word data (S21). Specifically, the recurrent neural network learns which word appears in the (N+1)-th position following the first to N-th words of a sentence.

The information processing apparatus learns the character N-gram based on the divided word data (S22) and divides the learning data 21 into words using the learned character N-gram (S23).

Next, the information processing apparatus learns the word N-gram using the divided word data (S24) and learns the word division model by sequence labeling (S25). It then divides the learning data 21 into words using the learned word division model (S26).

The information processing apparatus determines whether the convergence condition for the repeated word division and word N-gram learning is satisfied (S27). If it is satisfied (YES in S27), the information processing apparatus ends the word division and word N-gram learning. If it is not satisfied (NO in S27), the information processing apparatus starts the word N-gram learning process again (S21).

In this way, the information processing apparatus of this embodiment alternates between word division using the character N-gram and word division using the word division model, and learns the word N-gram after each word division. The recurrent neural network constituting the word N-gram is thereby trained, and the concept vector 22 of a word can be obtained from the intermediate layer 32 of this recurrent neural network.
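
The flow of steps S20 to S27 in FIG. 5 can be summarized as the control-flow skeleton below. Every learner and splitter is passed in as a callable, and all parameter names are hypothetical; the sketch shows only the order of the steps, not how any of the models is implemented.

```python
def alternate_training(text, initial_split, learn_word_ngram,
                       learn_char_ngram, split_with_char_ngram,
                       learn_boundary_model, split_with_boundary_model,
                       converged):
    """Alternate the two word-division methods until `converged` says stop."""
    words = initial_split(text)                                  # S20
    iteration = 0
    while True:
        word_ngram = learn_word_ngram(words)                     # S21
        char_ngram = learn_char_ngram(words)                     # S22
        words = split_with_char_ngram(text, char_ngram)          # S23
        word_ngram = learn_word_ngram(words)                     # S24
        boundary_model = learn_boundary_model(words)             # S25
        words = split_with_boundary_model(text, boundary_model)  # S26
        iteration += 1
        if converged(iteration, word_ngram):                     # S27
            return words, word_ngram
```

In a real implementation `learn_word_ngram` would keep updating one persistent recurrent network rather than building a new model on each call; the skeleton leaves that detail to the callable.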

FIG. 6 is a diagram showing another example of word division and word N-gram learning by the information processing apparatus. The process shown in FIG. 6 is basically the same as that shown in FIG. 5. The difference is that, whereas the example of FIG. 5 performs character N-gram learning (S22) and word division using the character N-gram (S23) after the initial division (S20), the example of FIG. 6 performs word division model learning (S32) and word division using the word division model (S33) after the initial division (S30). The example of FIG. 6 is the same as that of FIG. 5 in that word division using the character N-gram and word division using the word division model alternate and the word N-gram is learned after each word division. Thus, which of the character N-gram and the word division model is learned first after word division is arbitrary.

The information processing apparatus of the present invention has been described above by way of an embodiment, but the present invention is not limited to that embodiment. For example, the word N-gram need not be learned every time word division is performed; it may instead be learned once both the word division based on the character N-gram and the word division based on the word division model have been performed.

The information processing apparatus of the present invention can automatically acquire the semantic similarity of words from given text data without preparing a word dictionary or a concept dictionary, and is useful as an apparatus that performs natural language processing.

DESCRIPTION OF SYMBOLS
10 Input unit
11 Word division unit
12 Character N-gram learning unit
13 Character N-gram storage unit
14 Word N-gram learning unit
15 Word N-gram storage unit
16 Word boundary learning unit
17 Word division model storage unit
18 Concept vector calculation unit
19 Output unit
20 Learning data
21 Concept vector
30 Input layer
31 Input layer
32 Intermediate layer
33 Output layer
40 CPU
41 RAM
42 ROM
43 Program
44 Communication interface
45 Hard disk
46 Keyboard
47 Monitor

Claims

An information processing apparatus comprising:
an input unit that inputs sentence data as learning data;
a word division unit that divides the learning data into words using a character N-gram or a word division model;
a character N-gram learning unit that learns a character N-gram based on the divided word data and stores the learned character N-gram in a character N-gram storage unit;
a word boundary learning unit that learns, based on the divided word data, a word division model by sequence labeling that recognizes word boundaries, and stores the learned word division model in a word division model storage unit;
a word N-gram learning unit that learns, using the divided word data as teacher data, a word N-gram represented by a recursive neural network that has an input layer, an intermediate layer, and an output layer and that also feeds the output of the intermediate layer to the input layer, and stores the learned word N-gram in a word N-gram storage unit;
a concept data calculation unit that inputs word data to the recursive neural network stored in the word N-gram storage unit and obtains the data produced in the intermediate layer as concept data; and
an output unit that outputs the concept data,
wherein a process in which the word division unit alternately performs word division using the character N-gram learned by the character N-gram learning unit and word division using the word division model learned by the word boundary learning unit, and a process in which the word N-gram learning unit learns the word N-gram using the word data divided by the word division unit, are repeated until a predetermined convergence condition is satisfied.

The information processing apparatus according to claim 1, wherein the recursive neural network receives data of the first to Nth words constituting a sentence and outputs the (N+1)th word.
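
Read literally, this claim describes a next-word predictor: words 1 to N are fed in one at a time, the intermediate layer's output is fed back alongside the next input, and the output layer scores candidates for word N+1. The sketch below is a minimal forward pass of that shape in plain Python. It assumes one-hot word inputs, a tanh intermediate layer, and a softmax output; none of these choices, nor the names `make_rnn`, `step`, and `predict_next`, come from the claims. The weights are random and training is omitted.

```python
import math
import random


def make_rnn(vocab_size, hidden_size, seed=0):
    """Create randomly initialised weights (untrained; for illustration only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W_in": mat(hidden_size, vocab_size),    # input layer -> intermediate
            "W_rec": mat(hidden_size, hidden_size),  # intermediate -> fed back
            "W_out": mat(vocab_size, hidden_size)}   # intermediate -> output


def step(rnn, word_id, hidden):
    """Consume one word; return the new intermediate state and next-word probabilities."""
    # A one-hot input simply selects column `word_id` of W_in.
    new_hidden = [
        math.tanh(rnn["W_in"][j][word_id]
                  + sum(w * h for w, h in zip(rnn["W_rec"][j], hidden)))
        for j in range(len(hidden))
    ]
    scores = [sum(w * h for w, h in zip(row, new_hidden)) for row in rnn["W_out"]]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return new_hidden, [e / total for e in exps]


def predict_next(rnn, word_ids):
    """Feed words 1..N in order; return the final state and P(word N+1)."""
    hidden = [0.0] * len(rnn["W_rec"])
    probs = None
    for word_id in word_ids:
        hidden, probs = step(rnn, word_id, hidden)
    return hidden, probs


rnn = make_rnn(vocab_size=5, hidden_size=3)
hidden, probs = predict_next(rnn, [0, 2, 4])
```

In this sketch `hidden` is the intermediate-layer vector after the last input word and `probs` is a distribution over the vocabulary for the following word.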

The information processing apparatus according to claim 1, further comprising a clustering unit that, based on the concept data, clusters words whose mutual similarity is greater than a predetermined threshold into the same group,
wherein the output unit outputs the result of the clustering.
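
One way to picture threshold-based grouping of concept data is the Python sketch below. It assumes cosine similarity as the similarity measure and merges any two words whose similarity exceeds the threshold (single-link grouping via union-find); both choices, and the names `cosine` and `cluster`, are assumptions made for illustration and are not taken from the claims. The vectors are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold):
    """Group words so that any pair with similarity > threshold shares a group."""
    words = list(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


toy_vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
groups = cluster(toy_vectors, threshold=0.9)
```

Under this single-link rule, lowering the threshold can only merge groups, so running `cluster` at several thresholds yields nested groupings, which is one simple way to obtain a hierarchy.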

The information processing apparatus according to claim 3, wherein the clustering unit performs clustering hierarchically.

The information processing apparatus according to claim 1, wherein the word dividing unit performs initial division of the learning data based on a character code when the learning data is given.
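
One plausible reading of an initial division "based on a character code" is to cut the text wherever the script class of adjacent characters changes. The Python sketch below illustrates that reading only; the class boundaries, the class names, and the functions `char_class` and `initial_split` are assumptions, not taken from the claims.

```python
def char_class(ch):
    """Map a character to a coarse script class using its code point."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def initial_split(text):
    """Split text into runs of characters that share the same script class."""
    chunks = []
    for ch in text:
        if chunks and char_class(chunks[-1][-1]) == char_class(ch):
            chunks[-1] += ch      # same class as the previous character: extend the run
        else:
            chunks.append(ch)     # class changed (or first character): start a new run
    return chunks


chunks = initial_split("東京タワーに行く")
```

Such a split is only a starting point: the runs it produces need not be words, and they would be refined by the later learning steps.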

An information processing method in which an information processing apparatus divides input learning data into words and obtains concepts of the divided words, the method comprising:
a step in which the information processing apparatus inputs sentence data as learning data;
a step in which the information processing apparatus repeats, until a predetermined convergence condition is satisfied, a process of alternately performing, on the learning data, word division using a character N-gram and word division using a word division model, and of learning, using the divided word data as teacher data, a word N-gram represented by a recursive neural network that has an input layer, an intermediate layer, and an output layer and that feeds the output of the intermediate layer to the input layer;
a step in which the information processing apparatus inputs word data to the recursive neural network and obtains the data produced in the intermediate layer as concept data; and
a step in which the information processing apparatus outputs the concept data,
wherein the step of learning the word N-gram includes:
a step in which the information processing apparatus performs word division of the learning data using the character N-gram;
a step in which the information processing apparatus learns the word N-gram represented by the recursive neural network using the divided word data as teacher data, and stores it in a word N-gram storage unit;
a step in which the information processing apparatus learns, based on the divided word data, a word division model by sequence labeling that recognizes word boundaries, and stores the learned word division model in a word division model storage unit;
a step in which the information processing apparatus performs word division of the learning data using the word division model;
a step in which the information processing apparatus learns the word N-gram represented by the recursive neural network using the divided word data as teacher data, and stores it in the word N-gram storage unit; and
a step in which the information processing apparatus learns a character N-gram based on the divided word data, and stores the learned character N-gram in a character N-gram storage unit.

A program for dividing input learning data into words and obtaining concepts of the divided words, the program causing a computer to execute:
a step of inputting sentence data as learning data;
a step of repeating, until a predetermined convergence condition is satisfied, a process of alternately performing, on the learning data, word division using a character N-gram and word division using a word division model, and of learning, using the divided word data as teacher data, a word N-gram represented by a recursive neural network that has an input layer, an intermediate layer, and an output layer and that feeds the output of the intermediate layer to the input layer;
a step of inputting word data to the recursive neural network and obtaining the data produced in the intermediate layer as concept data; and
a step of outputting the concept data,
wherein, in the step of learning the word N-gram, the program causes the computer to repeatedly execute:
a step of performing word division of the learning data using the character N-gram;
a step of learning the word N-gram represented by the recursive neural network using the divided word data as teacher data, and storing it in a word N-gram storage unit;
a step of learning, based on the divided word data, a word division model by sequence labeling that recognizes word boundaries, and storing the learned word division model in a word division model storage unit;
a step of performing word division of the learning data using the word division model;
a step of learning the word N-gram represented by the recursive neural network using the divided word data as teacher data, and storing it in the word N-gram storage unit; and
a step of learning a character N-gram based on the divided word data, and storing the learned character N-gram in a character N-gram storage unit.