JP7378680B2 - Information processing device, update method, and update program

Info

Publication number: JP7378680B2
Application number: JP2023540861A
Authority: JP
Inventors: 誠竹中; 悠介小路; 進也田口
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2023-11-13
Anticipated expiration: 2041-10-27
Also published as: WO2023073818A1; JPWO2023073818A1

Description

The present disclosure relates to an information processing device, an update method, and an update program.

word2vec is a known technique. It is an unsupervised learning method that learns the semantic features of words from a corpus (that is, unlabeled sentences). The word vectors of the learned words are called distributed representations and can be used for document retrieval and similar tasks.

Meanwhile, methods are known that improve the accuracy of distributed representations by using external information, such as knowledge of relations between words and knowledge of word attributes, as supervision. For example, a method has been proposed that uses external information about inter-word relations and category information to learn the relations between words while preserving the semantic information of the words (see Non-Patent Document 1).

Non-Patent Document 1: Chang Xu et al., "RC-NET: A General Framework for Incorporating Knowledge into Word Representations", 2014
Non-Patent Document 2: Tomas Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality"

In Non-Patent Document 1, the word vector of a target word is updated with the same weight applied to every word in the same category. However, applying the same weight even to words that appear infrequently adversely affects the distributed representation of the updated word vector. The method of Non-Patent Document 1 is therefore not desirable.

An object of the present disclosure is to improve distributed representations.

An information processing device according to one aspect of the present disclosure is provided. The information processing device includes: an acquisition unit that acquires a preprocessed corpus, which is a corpus that has undergone preprocessing, and a category dictionary, which is information indicating correspondences between words and categories; a category determination unit that determines, using the category dictionary, the categories of a plurality of words included in the preprocessed corpus; a word vector creation unit that creates a word vector of a target word based on the target word, which is one of the plurality of words, and creates a word vector of a same-category word based on the same-category word, which is one of the plurality of words and belongs to the same category as the target word; an appearance frequency calculation unit that calculates, based on the preprocessed corpus, the appearance frequency of the target word in the preprocessed corpus and the appearance frequency of the same-category word in the preprocessed corpus; a regularization term calculation unit that calculates a regularization term using the word vector of the target word, the word vector of the same-category word, the appearance frequency of the target word, and the appearance frequency of the same-category word; and an update unit that updates the word vector of the target word using the regularization term.

According to the present disclosure, distributed representations can be improved.

FIG. 1 is a block diagram showing the functions of the information processing device according to Embodiment 1.
FIG. 2 is a diagram showing the hardware of the information processing device according to Embodiment 1.
FIG. 3 is a flowchart showing an example of the processing executed by the information processing device according to Embodiment 1.
FIG. 4 is a diagram illustrating the semantic space of Embodiment 1.
FIG. 5 is a block diagram showing the functions of the information processing device according to Embodiment 2.
FIG. 6 is a flowchart showing an example of the processing executed by the information processing device according to Embodiment 2.
FIG. 7 is a diagram showing a specific example of the display and adoption of synonyms in Embodiment 2.

Embodiments are described below with reference to the drawings. The following embodiments are merely examples, and various modifications are possible within the scope of the present disclosure.

Embodiment 1.
FIG. 1 is a block diagram showing the functions of the information processing device according to Embodiment 1. The information processing device 100 is a device that executes the update method. For example, the information processing device 100 is a personal computer, a server, a smartphone, or a tablet device. First, the hardware of the information processing device 100 is described.

FIG. 2 is a diagram showing the hardware of the information processing device according to Embodiment 1. The information processing device 100 includes a processor 101, a volatile storage device 102, a nonvolatile storage device 103, a network IF (Interface) 104, an input IF 105, and a display IF 106.

The processor 101 controls the entire information processing device 100. For example, the processor 101 is a CPU (Central Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array). The processor 101 may be a multiprocessor. The information processing device 100 may also include a processing circuit. Furthermore, the processor 101 may be a microcomputer or an SoC (System on Chip).

The volatile storage device 102 is the main storage device of the information processing device 100. For example, the volatile storage device 102 is a RAM (Random Access Memory). The nonvolatile storage device 103 is an auxiliary storage device of the information processing device 100. For example, the nonvolatile storage device 103 is a ROM (Read Only Memory), an HDD (Hard Disk Drive), or an SSD (Solid State Drive).

The network IF 104 communicates with a network 10. The network 10 is a wired network or a wireless network.
The input IF 105 receives information or signals from a keyboard, a touch panel, a mouse, or the like. The information processing device 100 need not include the input IF 105.
The display IF 106 outputs information to a display. The information processing device 100 need not include the display IF 106.

Returning to FIG. 1, the functions of the information processing device 100 are described.
The information processing device 100 includes a storage unit 110, an acquisition unit 120, a preprocessing unit 130, a category determination unit 140, a word vector creation unit 150, an appearance frequency calculation unit 160, a regularization term calculation unit 170, and an update unit 180.

The storage unit 110 may be realized as a storage area reserved in the volatile storage device 102 or the nonvolatile storage device 103.
Some or all of the acquisition unit 120, the preprocessing unit 130, the category determination unit 140, the word vector creation unit 150, the appearance frequency calculation unit 160, the regularization term calculation unit 170, and the update unit 180 may be realized by a processing circuit. Alternatively, some or all of these units may be realized as modules of a program executed by the processor 101. The program executed by the processor 101 is also called an update program. The update program is recorded, for example, on a recording medium such as a CD or a flash memory. The update program may also be stored in the storage unit 110 or obtained via the network 10.

The storage unit 110 stores a corpus 111 and a category dictionary 112. The corpus 111 may also be called learning data, and may be regarded as a database in which sentences are registered. The category dictionary 112 is information indicating the correspondence between words that are nouns or noun phrases and categories. A category may be an expression of a superordinate concept of a noun or noun phrase, a product category, or a class name of a named entity. Examples of class names are person name and place name.

The acquisition unit 120 acquires the corpus 111 and the category dictionary 112 from the storage unit 110. The acquisition unit 120 may instead acquire the corpus 111 and the category dictionary 112 from an external device. The external device is not illustrated.

The preprocessing unit 130 preprocesses the corpus 111. For example, the preprocessing unit 130 performs morphological analysis and word normalization. A corpus that has undergone preprocessing is called a preprocessed corpus. The storage unit 110 may store the preprocessed corpus. When the preprocessed corpus is stored in the storage unit 110, the acquisition unit 120 acquires it from there, and the information processing device 100 need not include the preprocessing unit 130. The acquisition unit 120 may also acquire the preprocessed corpus from an external device.

The category determination unit 140 uses the category dictionary 112 to determine the categories of a plurality of words included in the preprocessed corpus. Specifically, the category determination unit 140 uses the category dictionary 112 to determine the category of each word in the preprocessed corpus that is a noun or a noun phrase.
The detailed functions of the word vector creation unit 150, the appearance frequency calculation unit 160, the regularization term calculation unit 170, and the update unit 180 are described later.
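The category lookup described above can be pictured with a minimal sketch. It assumes the category dictionary is a plain mapping from a word to a single category name; the dictionary contents and the function names `determine_categories` and `same_category_words` are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Optional

# Hypothetical category dictionary: noun or noun phrase -> category name.
CATEGORY_DICTIONARY: Dict[str, str] = {
    "tokyo": "place_name",
    "osaka": "place_name",
    "suzuki": "person_name",
}


def determine_categories(tokens: List[str],
                         dictionary: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each distinct token to its category, or None if it is not listed."""
    return {token: dictionary.get(token) for token in set(tokens)}


def same_category_words(target: str,
                        categories: Dict[str, Optional[str]]) -> List[str]:
    """Return the other words that share the target word's category."""
    target_category = categories.get(target)
    if target_category is None:
        return []
    return sorted(word for word, category in categories.items()
                  if category == target_category and word != target)
```

Words without a dictionary entry are mapped to `None` here so that they never count as same-category words; how the device treats unlisted words is not stated in the text.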

Next, the processing executed by the information processing device 100 is described using a flowchart.
FIG. 3 is a flowchart showing an example of the processing executed by the information processing device according to Embodiment 1.
(Step S11) The preprocessing unit 130 determines whether a processing request has been received. If a processing request has been received, the processing proceeds to step S12. If not, the preprocessing unit 130 waits.
(Step S12) The preprocessing unit 130 preprocesses the corpus 111, which contains the target word.

(Step S13) The category determination unit 140 uses the category dictionary 112 to determine the category of each word in the preprocessed corpus that is a noun or a noun phrase.
(Step S14) The word vector creation unit 150 creates a word vector of a target word based on the target word, which is one of the plurality of words included in the preprocessed corpus. For example, the word vector creation unit 150 creates the word vector of the target word using the target word and word2vec. The word vector creation unit 150 also creates a word vector of a same-category word based on the same-category word, which is one of the plurality of words and belongs to the same category as the target word.

(Step S15) Based on the preprocessed corpus, the appearance frequency calculation unit 160 calculates the appearance frequency f(w) of the target word in the preprocessed corpus as a weight. Likewise, it calculates the appearance frequency f(w_t′) of the same-category word in the preprocessed corpus as a weight.
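As a rough illustration of step S15, the sketch below counts tokens in an already tokenized corpus. It assumes that "appearance frequency" means relative frequency (count divided by the total number of tokens); the text does not say whether raw counts or relative frequencies are used, and the function name `appearance_frequencies` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def appearance_frequencies(preprocessed_corpus: List[List[str]]) -> Dict[str, float]:
    """Relative appearance frequency of every word in a tokenized corpus.

    Assumption: frequency = count / total tokens. Raw counts would work
    equally well as weights if they are normalized later.
    """
    counts = Counter(token for sentence in preprocessed_corpus for token in sentence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

Computing all frequencies in one pass means the value for the target word and for each same-category word can be looked up from the same table.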

(Step S16) The regularization term calculation unit 170 calculates a regularization term E(w_t, w_t′) using the word vector of the target word, the word vector of the same-category word, the appearance frequency f(w) of the target word, and the appearance frequency f(w_t′) of the same-category word. In other words, the regularization term E(w_t, w_t′) is based on a distance in the semantic space that is weighted according to appearance frequency. Specifically, the regularization term calculation unit 170 calculates E(w_t, w_t′) using equation (1).

Here, V denotes the vocabulary set of the corpus 111, w_t is the word vector of the target word, and w_t′ is the word vector of the same-category word. d(w_t, w_t′) is the distance between the word vector of the target word and the word vector of the same-category word. The Euclidean distance is used as the distance. Alternatively, the reciprocal of the cosine similarity, or "1 - cosine similarity", may be used.
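The distance options named above can be written out directly. The following is a small standard-library sketch of the Euclidean distance and of "1 - cosine similarity"; the function names are illustrative only.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero word vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """The "1 - cosine similarity" variant mentioned in the text."""
    return 1.0 - cosine_similarity(u, v)
```

Unlike the Euclidean distance, the cosine-based variant ignores vector length, so two vectors pointing in the same direction have distance 0 regardless of their norms.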

The distance in the semantic space is now described.
FIG. 4 is a diagram illustrating the semantic space of Embodiment 1. In FIG. 4, the semantic space is drawn in two dimensions. FIG. 4 shows a target word 20, a high-frequency word 21 that appears often, and a low-frequency word 22 that appears rarely. The target word 20, the high-frequency word 21, and the low-frequency word 22 belong to the same category.

FIG. 4 also shows trust regions 23 and 24. The word vector of the high-frequency word 21 tends to have a small variance, so the trust region 23 is small. The word vector of the low-frequency word 22 tends to have a large variance, so the trust region 24 is large. FIG. 4 further shows words 31 and 32, which belong to a category different from that of the target word 20.

When the appearance frequency f(w_t′) of the same-category word is low, the regularization term calculation unit 170 calculates a regularization term E(w_t, w_t′) that reduces the influence of the low-frequency word 22 on the target word 20. When the appearance frequency f(w_t′) of the same-category word is high, the regularization term calculation unit 170 calculates a regularization term E(w_t, w_t′) that increases the influence of the high-frequency word 21 on the target word 20.
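The behavior described above (small influence from rare same-category words, large influence from frequent ones) can be illustrated with a sketch. The weighting used here, f(w_t′) / (f(w) + f(w_t′)) multiplied by the squared Euclidean distance, is purely an assumption chosen to show the qualitative effect; it is not the form of equation (1), and the names `frequency_weight` and `regularization_term` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two word vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def frequency_weight(freq_target: float, freq_other: float) -> float:
    """Hypothetical weight that grows with the same-category word's frequency."""
    denominator = freq_target + freq_other
    return freq_other / denominator if denominator > 0 else 0.0


def regularization_term(target_vec: Vector,
                        target_freq: float,
                        same_category: List[Tuple[Vector, float]]) -> float:
    """Sum of frequency-weighted squared distances to same-category words.

    ASSUMPTION: illustrative form only; the disclosure's equation (1)
    may combine the frequencies and the distance differently.
    """
    return sum(frequency_weight(target_freq, freq) * squared_distance(target_vec, vec)
               for vec, freq in same_category)
```

With this choice, a same-category word at the same distance contributes more to the term when it is frequent than when it is rare, which matches the roles of the high-frequency word 21 and the low-frequency word 22 in FIG. 4.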

(Step S17) The update unit 180 updates the word vector w_t of the target word using the regularization term E(w_t, w_t′). Specifically, the update unit 180 updates w_t based on an objective function that combines the regularization term E(w_t, w_t′) with the objective function of skip-gram negative sampling. More concretely, the update unit 180 updates w_t using equation (2). The first and second terms inside the summation in equation (2) are the same as the skip-gram negative sampling objective described in Non-Patent Document 2, so their explanation is omitted.

Here, J denotes the objective function. σ(x) (= 1 / (1 + exp(-x))) denotes the sigmoid function. k denotes the number of pseudo-negative examples in skip-gram negative sampling. Pn denotes the sampling distribution of the pseudo-negative examples; it is usually the unigram distribution or the unigram distribution raised to the power 0.75. w_n denotes the word vector of a pseudo-negative example. w_c denotes the word vector of a word that co-occurs with w_t. A co-occurring word is a word located within N characters before or after the target word, where N is a predetermined integer.
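For readers unfamiliar with the symbols above, the sketch below evaluates the skip-gram negative sampling part of the objective for a single (target, context) pair, log σ(w_c · w_t) + Σ log σ(-w_n · w_t), and builds the unigram^0.75 noise distribution Pn. It is a simplified illustration: it uses one vector per word, whereas the formulation in Non-Patent Document 2 distinguishes input and output vectors, and it leaves out the regularization term. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid, 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sgns_pair_objective(w_t: Sequence[float],
                        w_c: Sequence[float],
                        negatives: List[Sequence[float]]) -> float:
    """log s(w_c . w_t) + sum over the k pseudo-negative vectors of log s(-w_n . w_t)."""
    positive = math.log(sigmoid(dot(w_c, w_t)))
    negative = sum(math.log(sigmoid(-dot(w_n, w_t))) for w_n in negatives)
    return positive + negative


def noise_distribution(counts: Dict[str, int], power: float = 0.75) -> Dict[str, float]:
    """Pn: unigram counts raised to `power`, then normalized to sum to 1."""
    scaled = {word: count ** power for word, count in counts.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}
```

Raising counts to the power 0.75 flattens the distribution, so rare words are drawn as pseudo-negative examples somewhat more often than their raw frequency would suggest.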

(Step S18) The update unit 180 determines whether a termination requirement is satisfied. One example of a termination requirement is that the word vector w_t of the target word no longer changes. Another example is that the number of times steps S16 and S17 have been executed exceeds a predetermined threshold.
If the termination requirement is not satisfied, the processing returns to step S16. If the termination requirement is satisfied, the processing ends, and the update unit 180 stores the word vector w_t of the target word in the storage unit 110.
Repeating steps S16 to S18 may be referred to as learning.
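The loop formed by steps S16 to S18 can be sketched generically. In the sketch below, `update_step` stands in for one pass of steps S16 and S17 and is supplied by the caller; the tolerance used to decide that the vector "no longer changes" and the default iteration limit are assumptions, as is the function name `train_until_converged`.

```python
from typing import Callable, List, Sequence, Tuple


def train_until_converged(initial: Sequence[float],
                          update_step: Callable[[List[float]], Sequence[float]],
                          tolerance: float = 1e-6,
                          max_iterations: int = 1000) -> Tuple[List[float], int]:
    """Repeat the update until the vector stops changing or the limit is reached.

    Returns the final vector and the number of update steps performed.
    """
    vector = list(initial)
    for iteration in range(1, max_iterations + 1):
        updated = list(update_step(vector))
        # Largest per-component change produced by this update.
        change = max(abs(a - b) for a, b in zip(updated, vector))
        vector = updated
        if change < tolerance:
            return vector, iteration
    return vector, max_iterations
```

Both termination requirements from step S18 appear here: the tolerance check corresponds to the vector no longer changing, and `max_iterations` corresponds to the threshold on the number of executions.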

According to Embodiment 1, the information processing device 100 does not update the word vector of the target word by applying the same weight to every word in the same category. Instead, it calculates a regularization term that depends on appearance frequency and uses that term to update the word vector of the target word. For example, when the appearance frequency of a same-category word is low, the regularization term reduces the influence of that infrequent word on the target word. The information processing device 100 can therefore improve the distributed representation, that is, the word vector of the target word.

Embodiment 2.
Next, Embodiment 2 is described. The description focuses on the points that differ from Embodiment 1, and matters common to Embodiment 1 are omitted.

FIG. 5 is a block diagram showing the functions of the information processing device according to Embodiment 2. Components in FIG. 5 that are the same as those in FIG. 1 are given the same reference signs as in FIG. 1. The preprocessing unit 130, the category determination unit 140, the word vector creation unit 150, the appearance frequency calculation unit 160, the regularization term calculation unit 170, and the update unit 180 are omitted from FIG. 5.

The information processing device 100 further includes a display unit 191, a decision unit 192, a determination unit 193, and a relearning unit 194.
Some or all of the display unit 191, the decision unit 192, the determination unit 193, and the relearning unit 194 may be realized by a processing circuit. Alternatively, some or all of these units may be realized as modules of a program executed by the processor 101.

The storage unit 110 stores a database 113. The database 113 may also be called a trained model. The database 113 indicates the correspondence between the plurality of words included in the corpus 111 and the plurality of word vectors corresponding to those words. Each of these word vectors is a word vector updated as in Embodiment 1.
The storage unit 110 also stores relearning target information 114, which is described later.

The functions of the display unit 191, the decision unit 192, the determination unit 193, and the relearning unit 194 are described later.

Next, the processing executed by the information processing device 100 is described using a flowchart.
FIG. 6 is a flowchart showing an example of the processing executed by the information processing device according to Embodiment 2.
(Step S21) The display unit 191 displays a search user interface screen. For example, the display unit 191 displays the search user interface screen on a display.
(Step S22) The acquisition unit 120 determines whether a search keyword has been entered via the search user interface screen. When the user has entered a search keyword, the acquisition unit 120 acquires it, and the processing proceeds to step S23. When no search keyword has been entered, the acquisition unit 120 waits until one is entered.

(Step S23) The decision unit 192 uses the database 113 to decide on a synonym of the search keyword. Specifically, the decision unit 192 finds, in the database 113, a word vector similar to the word vector of the search keyword, and decides that the word corresponding to the found word vector is a synonym of the search keyword. Whether two word vectors are similar is determined, for example, using a threshold.
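A minimal sketch of step S23 follows. It assumes that the database 113 can be viewed as a mapping from word to word vector, that similarity is measured by cosine similarity, and that a word counts as a synonym when its similarity is at or above a threshold of 0.8; the similarity measure, the threshold value, and the function name `find_synonyms` are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def find_synonyms(keyword: str,
                  database: Dict[str, Sequence[float]],
                  threshold: float = 0.8) -> List[str]:
    """Words whose vectors are similar enough to the keyword's vector, best first."""
    if keyword not in database:
        return []
    keyword_vec = database[keyword]
    scored = [(cosine_similarity(keyword_vec, vec), word)
              for word, vec in database.items() if word != keyword]
    return [word for score, word in sorted(scored, reverse=True) if score >= threshold]
```

For example, with vectors `{"hogehoge": [1.0, 0.0], "fugafuga": [0.9, 0.1], "piyo": [0.0, 1.0]}`, calling `find_synonyms("hogehoge", ...)` returns only `"fugafuga"`, because `"piyo"` is orthogonal to the keyword.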

(Step S24) The display unit 191 displays the synonym of the search keyword. Displaying a synonym of the search keyword is also called a suggestion.
(Step S25) The determination unit 193 determines whether the user has adopted the synonym.

The display and adoption of a synonym are now illustrated with a concrete example.
FIG. 7 is a diagram showing a specific example of the display and adoption of synonyms in Embodiment 2. FIG. 7 shows that "hogehoge" has been entered as the search keyword. The display unit 191 displays "fugafuga" as a synonym of the search keyword "hogehoge". If the user adopts the synonym "fugafuga", the display unit 191 displays both the search keyword "hogehoge" and the synonym "fugafuga". If the user does not adopt the synonym "fugafuga", the display unit 191 displays only the search keyword "hogehoge".
In this way, the user looks at the display screen and decides whether to adopt the synonym.

If the user adopts the synonym, the processing ends. If the user does not adopt the synonym, the processing proceeds to step S26.

When the user does not adopt the synonym, it means that, for this user, the displayed word is not a synonym of the search keyword. In other words, the displayed word should not lie near the search keyword in the semantic space. The relearning unit 194 therefore registers the search keyword and the displayed synonym in the relearning target information 114 as target words for relearning.

(Step S26) The determination unit 193 determines whether a relearning requirement is satisfied. For example, the relearning requirement is that the number of words registered in the relearning target information 114 exceeds a threshold. Alternatively, the determination unit 193 may determine that the relearning requirement is satisfied at the moment two words are registered in the relearning target information 114.
If the relearning requirement is satisfied, the processing proceeds to step S27. If not, the processing ends.
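The bookkeeping around rejected suggestions and the check in step S26 might look like the sketch below. The class name `RelearningTargets`, its methods, and the choice to store each word only once are hypothetical; the text specifies only that the keyword and the rejected synonym are registered and that relearning is triggered when the number of registered words exceeds a threshold.

```python
from typing import List


class RelearningTargets:
    """Hypothetical container for the relearning target information."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.words: List[str] = []

    def register_rejection(self, keyword: str, synonym: str) -> None:
        """Record a search keyword and a suggested synonym the user did not adopt."""
        for word in (keyword, synonym):
            if word not in self.words:  # assumption: each word is stored once
                self.words.append(word)

    def requirement_met(self) -> bool:
        """True when the number of registered words exceeds the threshold."""
        return len(self.words) > self.threshold
```

Setting `threshold=1` reproduces the alternative mentioned in the text, where relearning is triggered as soon as one rejected pair (two words) has been registered.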

(Step S27) The acquisition unit 120 acquires update information for the category dictionary 112. The update information may be created by the user, for example.
(Step S28) The relearning unit 194 updates the category dictionary 112 based on the update information. The category dictionary 112 is updated so that, in the relearning described below, the word vector of the search keyword and the word vector of the synonym are updated to appropriate word vectors.

(Step S29) The relearning unit 194 uses the updated category dictionary 112 and the relearning target information 114 to execute a process for updating the word vector of the search keyword and the word vector of the synonym. Specifically, the relearning unit 194 executes steps S12 to S18, treating the plurality of words registered in the relearning target information 114 as the corpus 111 or the preprocessed corpus. As a result, each of the plurality of words registered in the relearning target information 114 becomes a target word, and the word vector of each of the plurality of words is updated.
The relearning unit 194 registers the plurality of words registered in the relearning target information 114 and the plurality of updated word vectors in the database 113.
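The flow of steps S28 and S29 can be sketched roughly as below. The sketch is hypothetical throughout: the function `relearn`, its parameters, the dictionary-based stand-ins for the category dictionary 112 and the database 113, and the learning rate `lr` are all assumptions. In particular, the update rule used here (moving each target vector toward the mean vector of the other registered words in the same category) is only a placeholder for steps S12 to S18; it is not the frequency-based regularization term of the disclosure.

```python
def relearn(targets, category_dict, update_info, database, lr=0.5):
    """Sketch of steps S28 and S29 with a placeholder vector update.

    targets: words registered as relearning targets (stand-in for 114).
    category_dict: word -> category (stand-in for the category dictionary 112).
    update_info: word -> new category, e.g. supplied by the user (step S27).
    database: word -> word vector as a list of floats (stand-in for 113).
    lr: hypothetical step size for the placeholder update.
    """
    # Step S28: apply the update information to the category dictionary.
    category_dict.update(update_info)

    # Step S29: treat the registered words as the corpus, so that each
    # registered word becomes a target word in turn.
    updated = {}
    for target in targets:
        category = category_dict.get(target)
        peers = [w for w in targets
                 if w != target
                 and category is not None
                 and category_dict.get(w) == category]
        vec = database[target]
        if peers:
            # Placeholder for steps S12 to S18: move the target vector
            # toward the mean of its same-category peers.
            mean = [sum(database[p][i] for p in peers) / len(peers)
                    for i in range(len(vec))]
            vec = [v + lr * (m - v) for v, m in zip(vec, mean)]
        updated[target] = vec

    # Register the words and their updated word vectors in the database.
    database.update(updated)
    return database


db = {"car": [0.0, 0.0], "automobile": [2.0, 2.0], "apple": [10.0, 10.0]}
cats = {"car": "vehicle", "automobile": "fruit", "apple": "fruit"}
relearn(["car", "automobile"], cats, {"automobile": "vehicle"}, db)
print(db["car"], db["automobile"])
```

Collecting the new vectors in `updated` before writing them back means every target is updated against the vectors as they stood before relearning, so the result does not depend on the order of the registered words. Words that are not relearning targets, such as `"apple"` above, are left untouched.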

According to the second embodiment, by relearning the plurality of words registered in the relearning target information 114, the information processing device 100 can display the synonyms that the user wants.

The features of the embodiments described above can be combined with one another as appropriate.

10 network, 20 target word, 21 high-frequency word, 22 low-frequency word, 23 trust region, 24 trust region, 31, 32 words, 100 information processing device, 101 processor, 102 volatile storage device, 103 nonvolatile storage device, 110 storage unit, 111 corpus, 112 category dictionary, 113 database, 114 relearning target information, 120 acquisition unit, 130 preprocessing unit, 140 category determination unit, 150 word vector creation unit, 160 appearance frequency calculation unit, 170 regularization term calculation unit, 180 update unit, 191 display unit, 192 decision unit, 193 determination unit, 194 relearning unit.

Claims

An information processing device comprising:
an acquisition unit that acquires a preprocessed corpus, which is a corpus that has been preprocessed, and a category dictionary, which is information indicating a correspondence between words and categories;
a category determination unit that determines categories of a plurality of words included in the preprocessed corpus using the category dictionary;
a word vector creation unit that creates a word vector of a target word, which is one of the plurality of words, based on the target word, and creates a word vector of a same-category word, which is one of the plurality of words and is in the same category as the target word, based on the same-category word;
an appearance frequency calculation unit that calculates, based on the preprocessed corpus, the appearance frequency of the target word in the preprocessed corpus and the appearance frequency of the same-category word in the preprocessed corpus;
a regularization term calculation unit that calculates a regularization term using the word vector of the target word, the word vector of the same-category word, the appearance frequency of the target word, and the appearance frequency of the same-category word; and
an update unit that updates the word vector of the target word using the regularization term.

The information processing device according to claim 1, further comprising a preprocessing unit, wherein
the acquisition unit acquires a corpus,
the preprocessing unit preprocesses the corpus, and
the preprocessed corpus is the corpus preprocessed by the preprocessing unit.

The information processing device according to claim 1 or 2, further comprising:
a storage unit;
a decision unit;
a display unit;
a determination unit; and
a relearning unit, wherein
the storage unit stores a database indicating a correspondence between the plurality of words and a plurality of word vectors corresponding to the plurality of words,
one word vector among the plurality of word vectors is the updated word vector,
the acquisition unit acquires a search keyword input by a user,
the decision unit decides on a synonym of the search keyword using the database,
the display unit displays the synonym,
the determination unit determines whether the user has adopted the synonym,
if the user does not adopt the synonym, the relearning unit registers the search keyword and the synonym as relearning target words in relearning target information,
the acquisition unit acquires update information for the category dictionary when the user does not adopt the synonym, and
when the update information is acquired, the relearning unit updates the category dictionary based on the update information and, using the updated category dictionary and the relearning target information, executes a process for updating the word vector of the search keyword and the word vector of the synonym.

An update method in which an information processing device:
acquires a preprocessed corpus, which is a corpus that has been preprocessed, and a category dictionary, which is information indicating a correspondence between words and categories;
determines categories of a plurality of words included in the preprocessed corpus using the category dictionary;
creates a word vector of a target word, which is one of the plurality of words, based on the target word, and creates a word vector of a same-category word, which is one of the plurality of words and is in the same category as the target word, based on the same-category word;
calculates, based on the preprocessed corpus, the appearance frequency of the target word in the preprocessed corpus and the appearance frequency of the same-category word in the preprocessed corpus;
calculates a regularization term using the word vector of the target word, the word vector of the same-category word, the appearance frequency of the target word, and the appearance frequency of the same-category word; and
updates the word vector of the target word using the regularization term.

An update program that causes an information processing device to execute a process of:
acquiring a preprocessed corpus, which is a corpus that has been preprocessed, and a category dictionary, which is information indicating a correspondence between words and categories;
determining categories of a plurality of words included in the preprocessed corpus using the category dictionary;
creating a word vector of a target word, which is one of the plurality of words, based on the target word, and creating a word vector of a same-category word, which is one of the plurality of words and is in the same category as the target word, based on the same-category word;
calculating, based on the preprocessed corpus, the appearance frequency of the target word in the preprocessed corpus and the appearance frequency of the same-category word in the preprocessed corpus;
calculating a regularization term using the word vector of the target word, the word vector of the same-category word, the appearance frequency of the target word, and the appearance frequency of the same-category word; and
updating the word vector of the target word using the regularization term.