JP7716627B2

JP7716627B2 - Estimation method, estimation device, and estimation program

Info

Publication number: JP7716627B2
Application number: JP2023544821A
Authority: JP
Inventors: 佑樹北岸; 岳至森; 太一浅見; 歩相名神山
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2025-08-01
Anticipated expiration: 2041-08-30
Also published as: WO2023032016A1; JPWO2023032016A1

Description

The present invention relates to an estimation method, an estimation device, and an estimation program.

Technologies that automatically estimate mental states expressed in non-verbal and paralinguistic information, such as human voice, facial expressions, and gestures, have long been researched and developed. Expected applications include reflecting the conversation partner's mental state when an agent or robot generates its responses in a dialogue, using the estimation results as part of mental health care, and quantifying the state of participants in web conferences so that it is easier to grasp.

Estimating mental states expressed in such non-verbal and paralinguistic information is generally formulated as supervised learning: given an input such as features extracted from speech or video, or the raw data itself, the model outputs, for example, the posterior probability of each label representing a defined mental state (see Non-Patent Document 1).

Here, emotion and facial expression recognition classifies inputs into several classes, such as neutral, joy, sadness, surprise, fear, hatred, anger, and contempt. A specific degree, such as a level of understanding, is classified into an arbitrary number of levels. In supervised learning, labels corresponding to the classes defined in this way are annotated by one or more workers.

However, this annotation work, that is, labeling, becomes harder as the label granularity becomes finer. For example, for the level of understanding, refining the granularity from three levels (not understanding, normal, understanding) to five levels (not understanding, somewhat not understanding, normal, somewhat understanding, understanding) raises the difficulty of labeling.

For such fine-grained classification problems, the workers' annotation results tend to agree globally but not locally. For example, when several workers annotate a five-level understanding scale, all of them may agree that the level of understanding is low and yet be narrowly split between "not understanding" and "somewhat not understanding". In that case, the results may change if biases affecting the workers, such as fatigue, experience, and judgment criteria, change. As a result, noisy, inaccurate correct labels are mixed into supervised learning, which adversely affects both training and evaluation.

To address this, label correction techniques called relabeling and the like are conventionally known (see Non-Patent Documents 2 and 3).

Non-Patent Document 1: D. Rangulov and M. Fahim, “Emotion Recognition on large video dataset based on Convolutional Feature Extractor and Recurrent Neural Network”, 2020 IEEE 4th International Conference on Image Processing, Applications and Systems (IPAS), 2020
Non-Patent Document 2: K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing Uncertainties for Large-Scale Facial Expression Recognition”, 2020
Non-Patent Document 3: B. Zhang, L. Li, S. Wang, Z. Zha, and Q. Huang, “State-Relabeling Adversarial Active Learning”, 2020

However, with the conventional techniques it has been difficult to accurately correct labels representing mental states expressed in non-verbal and paralinguistic information. For example, the conventional techniques retain as the label only a single worker's labeling result or the top-voted class among several workers, so it can hardly be said that they make full use of human knowledge.

The present invention has been made in view of the above, and aims to correct, with high accuracy, labels representing mental states expressed in non-verbal and paralinguistic information.

To solve the above problems and achieve the object, the estimation method according to the present invention is an estimation method executed by an estimation device, and includes: an acquisition step of acquiring training data including non-verbal or paralinguistic information and correct labels, assigned by multiple workers, that represent the mental states expressed in that non-verbal or paralinguistic information; a calculation step of calculating the posterior probability of the mental state for the acquired non-verbal or paralinguistic information; and a learning step of learning, using the training data and the calculated posterior probability of the mental state, the model parameters of a model that estimates the mental state expressed in input non-verbal or paralinguistic information.

According to the present invention, labels representing mental states expressed in non-verbal and paralinguistic information can be corrected with high accuracy.

FIG. 1 is a schematic diagram illustrating the schematic configuration of the estimation device.
FIG. 2 is a diagram for explaining the processing of the estimation device.
FIG. 3 is a diagram illustrating the data structure of the training data.
FIG. 4 is a flowchart showing the estimation process procedure.
FIG. 5 is a flowchart showing the estimation process procedure.
FIG. 6 is a diagram illustrating a computer that executes the estimation program.

One embodiment of the present invention is described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by the same reference numerals.

[Configuration of the estimation device]
FIG. 1 is a schematic diagram illustrating the schematic configuration of the estimation device. FIG. 2 is a diagram for explaining the processing of the estimation device. The estimation device 10 of this embodiment takes a video showing the upper body of a subject as the non-verbal and paralinguistic information and uses a neural network to estimate, on a five-level scale, the level of understanding as the mental state expressed in that information. The level of understanding is defined, for example, as 1. not understanding, 2. somewhat not understanding, 3. normal state, 4. somewhat understanding, and 5. understanding, so that a larger number indicates greater understanding.

First, as illustrated in FIG. 1, the estimation device 10 of this embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a memory unit 14, and a control unit 15.

The input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is implemented by a NIC (Network Interface Card) or the like, and controls communication over a network between the control unit 15 and external devices such as servers and devices that manage training data.

The memory unit 14 is realized by a semiconductor memory element such as RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The memory unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In this embodiment, the memory unit 14 stores, for example, training data 14a used in the estimation process described later and model parameters 14b generated and updated in the estimation process.

Here, FIG. 3 is a diagram illustrating the data structure of the training data. As shown in FIG. 3, the training data 14a includes at least video data showing the subject's upper body as the non-verbal and paralinguistic information, a data ID identifying each video data item, and a correct label representing the mental state, such as the level of understanding, expressed in each video data item. In the example shown in FIG. 3, understanding level labels are included as the correct labels.

The training data 14a may include labels representing attributes of the person, such as a personal ID that can identify the individual, age, and gender. As needed, the training data 14a may also be split into training, development, and evaluation sets, and data augmentation may be applied.

Pre-processing such as contrast normalization and face detection may be performed so that only a certain region of the video data is used. The codec and other properties of the input data (video data) are not particularly limited. The training data 14a also holds the correct understanding-level labels updated by the processing of the update unit 15d, described later.

Specifically, when the level of understanding is estimated from video data in the estimation process described later, video data in H.264 format recorded with a webcam at 30 frames per second may, for example, be resized so that each side is 224 pixels. Each of the X video data items is given a personal ID from among the S subjects, the correct understanding-level label assigned by A workers, and the correct understanding-level label updated by the processing of the update unit 15d. In the example shown in FIG. 3, the updated understanding level label is included as the updated correct label.

Returning to the explanation of FIG. 1, the control unit 15 is implemented using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes processing programs stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, a calculation unit 15b, a learning unit 15c, and an update unit 15d, as illustrated in FIG. 1. These functional units may each be implemented on different hardware. For example, the acquisition unit 15a may be implemented on hardware different from that of the other functional units. The control unit 15 may also include other functional units.

The acquisition unit 15a acquires training data 14a including non-verbal or paralinguistic information and correct labels, assigned by multiple workers, that represent the mental states expressed in that non-verbal or paralinguistic information. Specifically, the acquisition unit 15a acquires, via the input unit 11 or from a device that generates training data or the like via the communication control unit 13, training data 14a including video data showing the subject's upper body as the non-verbal and paralinguistic information, a data ID identifying each video data item, and a correct label representing the mental state, such as the level of understanding, expressed in each video data item.

The acquisition unit 15a stores the training data 14a, acquired in advance of the following processing, in the memory unit 14. Alternatively, the acquisition unit 15a may transfer the acquired training data 14a to the calculation unit 15b described below without storing it in the memory unit 14.

The calculation unit 15b calculates the posterior probability of the mental state for the acquired non-verbal or paralinguistic information. For example, for the video data of the training data 14a, the calculation unit 15b uses a neural network to calculate the posterior probability of the event to be predicted, that is, the mental state, such as the level of understanding, expressed in the video data.

Note that the neural-network processing described below is not limited to this embodiment; for example, elements of well-known techniques such as batch normalization, dropout, and L1/L2 regularization may be added at any point.

Specifically, the calculation unit 15b extracts frame-level features from video data x_{1:T} of frame length T using a 2D CNN (Convolutional Neural Network). Next, the calculation unit 15b calculates the time-direction embedding tensor H_x using an RNN (Recurrent Neural Network) with D output dimensions, as shown in the following equation (1). Here, θ is the parameter set of the CNN, and φ is the parameter set of the RNN.
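Equation (1) itself is not reproduced in this text. The following is only a plausible reconstruction from the definitions in the paragraph above, not the equation as published; in particular, the T × D shape of H_x is an assumption based on the RNN having D output dimensions per frame.

```latex
H_x = \mathrm{RNN}\bigl(\mathrm{CNN}(x_{1:T};\ \theta);\ \phi\bigr),
\qquad H_x \in \mathbb{R}^{T \times D}
```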

Next, as shown in the following equation (2), the calculation unit 15b uses a multi-head self-attention mechanism to determine which time steps to focus on along the time direction, and calculates the time-direction weighted sum vector v.

In equation (2) above, the calculation unit 15b computes attention weights from the query Q_i and the key K_i, applies them to the value V_i, and finally takes the sum over the time direction.

Here, d is the number of attention heads, i indexes each attention head, and W_i^Q, W_i^K, and W_i^V are the weights for the query, key, and value in each attention head, respectively.
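Since equation (2) is not reproduced in this text, the pooling step can only be sketched under assumptions. The pure-Python example below follows the description (attention weights from query and key, applied to the value, then summed over time), but the scaling by the square root of the per-head dimension, the softmax over time, and the concatenation of heads are borrowed from standard scaled dot-product attention and are not confirmed by the source. The names `matmul`, `softmax`, and `attention_pool` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(h, heads):
    """Pool a T x D embedding h into one vector v.

    heads is a list of (W_q, W_k, W_v) triples, one per attention head,
    each a D x d_k matrix. The result has len(heads) * d_k entries.
    """
    v = []
    for w_q, w_k, w_v in heads:
        q, k, val = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
        d_k = len(w_q[0])
        # T x T scores; the 1/sqrt(d_k) scaling is an assumption.
        scores = [[sum(a * b for a, b in zip(q_t, k_s)) / math.sqrt(d_k)
                   for k_s in k] for q_t in q]
        weights = [softmax(r) for r in scores]
        attended = matmul(weights, val)  # T x d_k
        # Sum over the time axis, then concatenate across heads.
        v.extend(sum(col) for col in zip(*attended))
    return v
```

With all-zero query and key weights every time step receives the same attention weight, so the pooled vector reduces to the plain sum of the per-step averages of the values; this is a convenient sanity check for the sketch.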

Finally, as shown in the following equation (3), the calculation unit 15b uses two fully connected layers to calculate the posterior probability p(C | x_{1:T}) for each of the five levels of understanding.

Here, W_1^{FC} and W_2^{FC} are the weights of the two fully connected layers, D^{FC} is the output dimensionality of the first fully connected layer, and C is the number of predicted labels (C = 5 in this embodiment). The ReLU function is used as the activation function of the first fully connected layer.
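The classification head described above can be sketched as follows. This is a minimal illustration rather than the published equation (3), which is not reproduced in this text: the softmax on the second layer is an assumption (the source only says that posterior probabilities are produced), bias terms are omitted, and the function names are hypothetical.

```python
import math

def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]

def linear(weights, vec):
    """Apply a weight matrix stored as out_dim rows of in_dim values."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def softmax(vec):
    """Numerically stable softmax."""
    m = max(vec)
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]

def posterior(v, w1_fc, w2_fc):
    """Two fully connected layers: ReLU after the first, softmax after the second.

    w1_fc has D_FC rows; w2_fc has C rows (C = 5 understanding levels).
    """
    hidden = relu(linear(w1_fc, v))
    return softmax(linear(w2_fc, hidden))
```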

The learning unit 15c uses the training data 14a and the calculated posterior probability of the mental state to learn the model parameters 14b of a model that estimates the mental state expressed in input non-verbal or paralinguistic information.

Specifically, the learning unit 15c updates the model parameter set Ω and obtains the trained model parameter set Ω'. The learning unit 15c can apply well-known loss functions and update methods. For example, the model parameter set Ω may include parameters pre-trained on any other task, its initial values may be generated from arbitrary random numbers, and some of the model parameters may be left without being updated.

For example, the learning unit 15c updates the model parameter set Ω by stochastic gradient descent (SGD), using the cross entropy L shown in the following equation (4) as the loss function. Any values may be used for hyperparameters such as the learning rate.

Here, m^x is the correct distribution for the input video data x_{1:T}. The way the correct distribution is expressed is not particularly limited; for example, it may be expressed as a one-hot vector using the understanding level label L_x illustrated in FIG. 3. Alternatively, the correct distribution may be approximated by a normal distribution centered on the correct class, or the annotation results may be used directly as a soft label.
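The three ways of expressing the correct distribution m^x, together with the cross-entropy loss, can be sketched as follows. The function names, the standard deviation `sigma`, and the small `eps` guard inside the logarithm are assumptions made for illustration; the cross entropy is written in its standard form, since equation (4) is not reproduced in this text.

```python
import math

NUM_CLASSES = 5  # five understanding levels, numbered 1 to 5

def one_hot(label, num_classes=NUM_CLASSES):
    """Correct distribution with all mass on the 1-based class `label`."""
    return [1.0 if c == label else 0.0 for c in range(1, num_classes + 1)]

def gaussian_target(label, sigma=1.0, num_classes=NUM_CLASSES):
    """Discretized normal distribution centered on the correct class."""
    raw = [math.exp(-((c - label) ** 2) / (2 * sigma ** 2))
           for c in range(1, num_classes + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def soft_label(votes, num_classes=NUM_CLASSES):
    """Normalized histogram of the labels assigned by the workers."""
    counts = [votes.count(c) for c in range(1, num_classes + 1)]
    return [n / len(votes) for n in counts]

def cross_entropy(target, predicted, eps=1e-12):
    """Standard cross entropy between target and predicted distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
```

For instance, five workers voting 1, 2, 2, 2, 1 give the soft label (0.4, 0.6, 0, 0, 0), which keeps the disagreement between adjacent levels that a one-hot label would discard.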

The learning unit 15c stores the obtained trained model parameter set Ω' in the memory unit 14 as the model parameters 14b.

Returning to the explanation of FIG. 1, the update unit 15d updates the correct labels of the training data 14a using the learned model parameters 14b. Specifically, the update unit 15d updates a correct label of the training data 14a when the similarity between the posterior probability of the mental state calculated using the learned model parameters 14b and that correct label is equal to or greater than a predetermined threshold.

For example, the update unit 15d updates the correct label L obtained by normalizing the distribution of the labels assigned to the training data 14a by multiple workers. First, the update unit 15d predicts the posterior probability of the level of understanding for the training data 14a using the trained model parameter set Ω'. The update unit 15d then calculates the similarity between the correct label and the posterior probability, and updates the correct label if the calculated similarity is equal to or greater than a predetermined threshold.

When updating the correct understanding-level label L_x of input video data x using the trained model parameter set Ω', the update unit 15d first calculates the posterior probability p(C | x_{1:T}, Ω') for each level of understanding. Next, the update unit 15d calculates the similarity between L_x and p(C | x_{1:T}, Ω'). The similarity calculated by the update unit 15d is not particularly limited; it is calculated using an algorithm capable of computing a distance or similarity between vectors, such as cross entropy, Kullback-Leibler divergence, cosine similarity, or Euclidean distance.

For example, the update unit 15d calculates the cosine similarity c_x (-1 ≤ c_x ≤ 1) as shown in the following equation (5).
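A minimal sketch of the cosine similarity between the label distribution L_x and the posterior, written in its standard form because equation (5) is not reproduced in this text. The function name and the choice to return 0.0 for a zero vector are assumptions.

```python
import math

def cosine_similarity(label_dist, posterior):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(label_dist, posterior))
    norm = (math.sqrt(sum(a * a for a in label_dist))
            * math.sqrt(sum(b * b for b in posterior)))
    # Guard against division by zero; returning 0.0 here is an assumption.
    return dot / norm if norm > 0 else 0.0
```

Because both arguments are probability distributions with non-negative entries, the value falls between 0 and 1 in this use, even though cosine similarity in general ranges from -1 to 1.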

The update unit 15d may also decide whether to update based on whether specific conditions are satisfied. For example, the update unit 15d may set the update judgment for L_x to TRUE when max(p(C | x_{1:T}, Ω')) is equal to or greater than a predetermined threshold.

Alternatively, if the two highest values of L_x belong to adjacent classes and their ratio is within the range of 4:6 to 6:4, the update unit 15d may set to TRUE the update judgment for the case where the correct answer changes within those two classes. For example, in the understanding level label of the data with data ID = 0000002 shown in FIG. 3, the top two values (0.4, 0.6) are adjacent and their ratio is within the range of 4:6 to 6:4, so the update judgment is TRUE. In contrast, in the understanding level label of the data with data ID = 0001459, the top two values (0.8, 0.2) are adjacent but their ratio is outside the range of 4:6 to 6:4, so the update judgment is FALSE.
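The top-two rule can be sketched as below. This is one reading of the condition, not a definitive implementation: "the correct answer changes within those two classes" is interpreted as requiring the model's most probable class to be one of the two top-ranked label classes. The function name, the tolerance `tol`, and the positions of the values within the example vectors are assumptions, since FIG. 3 is not reproduced here.

```python
def top_two_rule(label_dist, posterior, low=0.4, high=0.6, tol=1e-9):
    """Return True when the update judgment should be TRUE under the top-two rule."""
    ranked = sorted(range(len(label_dist)),
                    key=lambda c: label_dist[c], reverse=True)
    first, second = ranked[0], ranked[1]
    # The two most-voted classes must be neighboring levels.
    if abs(first - second) != 1:
        return False
    total = label_dist[first] + label_dist[second]
    if total <= 0:
        return False
    # A ratio between 4:6 and 6:4 means each share lies in [0.4, 0.6].
    share = label_dist[first] / total
    if not (low - tol <= share <= high + tol):
        return False
    # Assumed reading: the predicted class must stay within the two classes.
    predicted = max(range(len(posterior)), key=lambda c: posterior[c])
    return predicted in (first, second)
```

With label values (0.4, 0.6) on adjacent levels the rule can return TRUE, whereas (0.8, 0.2) fails the ratio test and always returns FALSE, matching the two examples in the text.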

Next, the update unit 15d determines whether the calculated similarity is equal to or greater than a predetermined threshold and, if so, updates L_x by assigning p(C | x_{1:T}, Ω') to the updated understanding level label L_x'. In doing so, the update unit 15d may make the update judgment on a single condition, or may combine multiple conditions with AND or OR.

The update unit 15d may also assign p(C | x_{1:T}, Ω') to L_x' after pre-processing such as setting small values, for example those below 0.1, to 0 and normalizing again. L_x' is used in place of L_x in the loss function calculation, as the correct label when training of the model parameters 14b continues.
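The pre-processing and the thresholded update can be sketched as follows. The function names, the similarity threshold of 0.8, and the fallback when every entry would be zeroed are assumptions for illustration; only the 0.1 floor comes from the example in the text.

```python
def prune_and_renormalize(posterior, floor=0.1):
    """Set entries below `floor` to 0 and rescale the rest to sum to 1."""
    kept = [p if p >= floor else 0.0 for p in posterior]
    total = sum(kept)
    if total == 0:
        # Nothing survives the floor; keep the original (assumed fallback).
        return list(posterior)
    return [p / total for p in kept]

def update_label(label_dist, posterior, similarity, threshold=0.8):
    """Return L_x' when the similarity clears the threshold, else L_x unchanged."""
    if similarity >= threshold:
        return prune_and_renormalize(posterior)
    return list(label_dist)
```

The returned distribution would then replace L_x as the target in the loss function for the remaining training iterations.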

The processing of the update unit 15d can be started at any point during the training of the model parameters 14b by the learning unit 15c. For example, the update unit 15d may start its processing when the number of updates of the model parameters 14b by training in the learning unit 15c reaches a predetermined threshold. In a more elaborate scheme, the update unit 15d may, for example, start its processing for the first time after the model parameters 14b have been updated 1000 times in the learning unit 15c, and start its second and subsequent runs after every further 100 updates of the model parameters 14b.
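The two-stage schedule in the example above can be sketched as a small predicate. The function and parameter names are hypothetical, and reading "after 100 updates" as "every further 100 updates following the first run" is an assumption.

```python
def should_run_update(step, first_at=1000, every=100):
    """Return True when label updating should start after `step` parameter updates."""
    if step < first_at:
        return False
    # First run at `first_at`, then again after each further `every` updates.
    return (step - first_at) % every == 0
```

A training loop would call this after each parameter update and run the label update for the steps where it returns True, that is, at 1000, 1100, 1200, and so on under the default values.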

[Estimation process]
Next, the estimation process performed by the estimation device 10 is described. FIG. 4 and FIG. 5 are flowcharts showing the estimation process procedure. The estimation process of this embodiment includes a learning process and an update process. First, FIG. 4 shows the learning process procedure. The flowchart in FIG. 4 starts, for example, when an input instructing the start of the learning process is received.

First, the acquisition unit 15a acquires training data 14a including non-verbal or paralinguistic information and correct labels, assigned by multiple workers, that represent the mental states expressed in that non-verbal or paralinguistic information (step S1). The acquisition unit 15a stores the acquired training data 14a in the memory unit 14. Alternatively, the acquisition unit 15a may transfer the acquired training data 14a to the calculation unit 15b without storing it in the memory unit 14.

The calculation unit 15b then calculates the posterior probability of the mental state for the acquired non-verbal or paralinguistic information (step S2).

Next, the learning unit 15c uses the training data 14a and the calculated posterior probability of the mental state to learn the model parameters 14b of a model that estimates the mental state expressed in input non-verbal or paralinguistic information (step S3). This completes the series of learning processes.

Next, FIG. 5 shows the update process procedure. The flowchart in FIG. 5 starts, for example, when an input instructing the start of the update process is received.

First, the update unit 15d uses the learned model parameters 14b to calculate the posterior probability of the level of understanding for the training data 14a (step S11).

Next, the update unit 15d updates the correct label of the training data 14a when the similarity between the calculated posterior probability of the mental state and that correct label is equal to or greater than a predetermined threshold (step S12). This completes the series of update processes.

[Effects]
As described above, in the estimation device 10 of this embodiment, the acquisition unit 15a acquires training data 14a including non-verbal or paralinguistic information and correct labels, assigned by multiple workers, that represent the mental states expressed in that non-verbal or paralinguistic information. The calculation unit 15b calculates the posterior probability of the mental state for the acquired non-verbal or paralinguistic information. The learning unit 15c uses the training data 14a and the calculated posterior probability of the mental state to learn the model parameters 14b of a model that estimates the mental state expressed in input non-verbal or paralinguistic information.

As a result, the estimation device 10 can estimate with high accuracy the mental state expressed in non-verbal or paralinguistic information by learning the global labeling from the correct labels assigned by multiple people. The estimation device 10 can therefore use the estimation results to assign labels representing mental states with high accuracy. In this way, the estimation device 10 makes it possible to correct, with high accuracy, labels representing mental states expressed in non-verbal and paralinguistic information.

In addition, the update unit 15d updates the correct labels of the training data 14a using the learned model parameters 14b. Specifically, the update unit 15d updates a correct label of the training data 14a when the similarity between the posterior probability of the mental state calculated using the learned model parameters 14b and that correct label is equal to or greater than a predetermined threshold.

As a result, the estimation device 10 corrects a correct label only when the estimate is reasonably similar to the distribution of the labels assigned by multiple people. This eliminates the possibility of mistakenly assigning a label of a semantically distant class, so labels can be corrected locally without changing the global labeling. Referring to the tendency of human labeling also imposes a kind of constraint, which allows stable label correction with a small amount of data. In this way, the estimation device 10 can correct labels with high accuracy even for data that is difficult to label.

The processing of the update unit 15d is started when the number of updates of the model parameters 14b by learning in the learning unit 15c reaches or exceeds a predetermined threshold. This enables label correction with even higher accuracy.
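
The gating described above can be sketched as a training loop that counts parameter updates and invokes label correction only once the count reaches a threshold. The function name, the callback interface, and the choice to run the label update after every subsequent training step (rather than once) are assumptions made for illustration.

```python
def train_with_gated_label_update(num_steps, start_update_after,
                                  train_step, update_labels):
    """Run train_step num_steps times.

    Once the number of parameter updates reaches start_update_after,
    update_labels is called after each training step (assumed schedule).
    Returns how many times update_labels was called.
    """
    update_count = 0
    relabel_calls = 0
    for _ in range(num_steps):
        train_step()            # one update of the model parameters
        update_count += 1
        if update_count >= start_update_after:
            update_labels()     # label correction with the current model
            relabel_calls += 1
    return relabel_calls


log = []
calls = train_with_gated_label_update(
    num_steps=5,
    start_update_after=3,
    train_step=lambda: log.append("train"),
    update_labels=lambda: log.append("relabel"),
)
```

Delaying label correction in this way means the labels are only revised by a model that has already been trained for some number of updates.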

[Program]
A program in which the processing executed by the estimation device 10 according to the above embodiment is written in a computer-executable language can also be created. In one embodiment, the estimation device 10 can be implemented by installing, on a desired computer, an estimation program that executes the above estimation processing as packaged software or online software. For example, by causing an information processing device to execute the estimation program, the information processing device can be made to function as the estimation device 10. Information processing devices also include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants). The functions of the estimation device 10 may also be implemented on a cloud server.

FIG. 6 is a diagram showing an example of a computer that executes the estimation program. The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored, for example, in the hard disk drive 1031 or the memory 1010.

The estimation program is stored in the hard disk drive 1031, for example, as the program module 1093 in which instructions to be executed by the computer 1000 are written. Specifically, the program module 1093, in which each process executed by the estimation device 10 described in the above embodiment is written, is stored in the hard disk drive 1031.

Data used for information processing by the estimation program is stored as the program data 1094, for example, in the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as needed and executes each of the procedures described above.

The program module 1093 and the program data 1094 of the estimation program need not be stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, they may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

An embodiment to which the invention made by the present inventors is applied has been described above, but the present invention is not limited by the descriptions and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment fall within the scope of the present invention.

Reference Signs List
10 Estimation device
11 Input unit
12 Output unit
13 Communication control unit
14 Storage unit
14a Training data
14b Model parameters
15 Control unit
15a Acquisition unit
15b Calculation unit
15c Learning unit
15d Update unit

Claims

1. An estimation method executed by an estimation device, the estimation method comprising:
an acquisition step of acquiring training data including non-verbal information or paralinguistic information and correct labels, assigned by a plurality of workers, that represent mental states expressed in the non-verbal information or paralinguistic information;
a calculation step of calculating a posterior probability of the mental state for the acquired non-verbal information or paralinguistic information;
a learning step of learning, using the training data and the calculated posterior probability of the mental state, model parameters of a model that estimates the mental state expressed in input non-verbal information or paralinguistic information; and
an updating step of updating the correct labels of the training data using the trained model parameters.

2. The estimation method according to claim 1, wherein the updating step updates the correct label of the training data when a similarity between the posterior probability of the mental state calculated using the trained model parameters and the correct label of the training data is equal to or greater than a predetermined threshold.

3. The estimation method according to claim 1, wherein the updating step is initiated when the number of updates of the model parameters by learning in the learning step reaches or exceeds a predetermined threshold.

4. An estimation device comprising:
an acquisition unit that acquires training data including non-verbal information or paralinguistic information and correct labels, assigned by a plurality of workers, that represent mental states expressed in the non-verbal information or paralinguistic information;
a calculation unit that calculates a posterior probability of the mental state for the acquired non-verbal information or paralinguistic information;
a learning unit that learns, using the training data and the calculated posterior probability of the mental state, model parameters of a model that estimates the mental state expressed in input non-verbal information or paralinguistic information; and
an update unit that updates the correct labels of the training data using the trained model parameters.

5. An estimation program for causing a computer to execute:
an acquisition step of acquiring training data including non-verbal information or paralinguistic information and correct labels, assigned by a plurality of workers, that represent mental states expressed in the non-verbal information or paralinguistic information;
a calculation step of calculating a posterior probability of the mental state for the acquired non-verbal information or paralinguistic information;
a learning step of learning, using the training data and the calculated posterior probability of the mental state, model parameters of a model that estimates the mental state expressed in input non-verbal information or paralinguistic information; and
an updating step of updating the correct labels of the training data using the trained model parameters.