JP7624660B2 - Prediction device, subjective impression prediction method, and program

Info

Publication number: JP7624660B2
Application number: JP2021135890A
Authority: JP
Inventors: 陽子石井; 桃子中谷; 雄貴蔵内; 隼平大土; 和弘大塚
Original assignee: Nippon Telegraph and Telephone Corp; Yokohama National University NUC; NTT Inc USA
Current assignee: Yokohama National University NUC; NTT Inc; NTT Inc USA
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2025-01-31
Anticipated expiration: 2041-08-23
Also published as: JP2023030649A

Description

本開示内容は、予測装置、主観的印象予測方法、及びプログラムに関する。 This disclosure relates to a prediction device, a subjective impression prediction method, and a program.

人と人との対話中に生じる非言語行動の中でも、頭部運動は様々な役割を担うことが知られている。例えば、話し手は発話の強調や反応確認の際に、また、聞き手は話し手に対する相槌や応答あるいは同意のサインとして頭部運動を表出する。このように頭部運動には複数の機能があり、１つの頭部運動が同時に複数の意味を持つ場合もある。このような頭部運動の機能の多重性と曖昧性に着目し、非特許文献１では、頭部運動の機能カテゴリを定義し、機能の重複を許容する非排他的なアノテーションを行ったデータを用い、頭部姿勢および発話の有無の時系列から個々の機能の有無を検出する畳み込みニューラルネットワーク(CNN)を提案している。 Head movements are known to play a variety of roles among non-verbal behaviors that occur during human dialogue. For example, speakers use head movements to emphasize what they are saying or to confirm a response, while listeners use head movements to provide interjections, responses, or signs of agreement. As such, head movements have multiple functions, and a single head movement may simultaneously have multiple meanings. Focusing on the multiplicity and ambiguity of the functions of head movements, Non-Patent Document 1 defines functional categories of head movements and proposes a convolutional neural network (CNN) that uses data that has been non-exclusively annotated to allow for overlapping functions and detects the presence or absence of individual functions from the time series of head posture and the presence or absence of speech.

K. Otsuka and M. Tsumori, "Analyzing multifunctionality of head movements in face-to-face conversations using deep convolutional neural networks," IEEE Access, vol. 8, pp. 217169-217195, 2020.

しかしながら、上述のように、頭部運動機能は対話中の行動や心情と深く関わっていることが考えられるが、未だ、対話中の対話者自身が抱く主観的印象を予測する手法は確立されていない。 However, as mentioned above, although head movement function is thought to be deeply related to behavior and emotions during a conversation, there is still no established method to predict the subjective impressions that interlocutors themselves have during a conversation.

本発明は、上記の点に鑑みてなされたものであって、対話中の対話者自身が抱く主観的印象を予測することを目的とする。 The present invention has been made in consideration of the above points, and aims to predict the subjective impressions held by the interlocutors themselves during a conversation.

上記課題を解決するため、請求項１に係る発明は、対話中の対話者自身が抱く主観的印象を予測する予測装置であって、前記対話者が撮影されることで得られた映像データから、前記対話者の頭部運動に関するデータを取得するデータ取得手段と、頭部運動機能検出モデルを用いて、前記頭部運動に関するデータに対する頭部運動機能を検出する頭部運動機能検出手段と、前記頭部運動機能検出手段による検出結果に基づいて、前記頭部運動機能の特徴量を算出する特徴量算出手段と、回帰モデルを用いて、前記頭部運動機能の特徴量に対する前記対話者の主観的印象を予測する主観的印象予測手段と、を有することを特徴とする予測装置である。 In order to solve the above problem, the invention of claim 1 is a prediction device that predicts an interlocutor's own subjective impression during a conversation, comprising: a data acquisition means for acquiring data regarding the head movement of the interlocutor from video data obtained by photographing the interlocutor; a head movement function detection means for detecting a head movement function for the data regarding head movement using a head movement function detection model; a feature calculation means for calculating a feature of the head movement function based on the detection result by the head movement function detection means; and a subjective impression prediction means for predicting the subjective impression of the interlocutor for the feature of the head movement function using a regression model.

また、請求項２に係る発明は、対話中の対話者自身が抱く主観的印象を予測する予測装置であって、前記対話者が撮影されることで得られた映像データから、前記対話者の頭部運動に関するデータを取得するデータ取得手段と、頭部運動検出モデルを用いて、前記頭部運動に関するデータに対する前記頭部運動の有無を検出する頭部運動検出手段と、頭部運動機能検出モデルを用いて、前記頭部運動の有無を示す頭部運動の有無情報に対する頭部運動機能を検出する頭部運動機能検出手段と、前記頭部運動機能検出手段による検出結果に基づいて、前記頭部運動機能の特徴量を算出する特徴量算出手段と、回帰モデルを用いて、前記頭部運動機能の特徴量に対する前記対話者の主観的印象を予測する主観的印象予測手段と、を有することを特徴とする予測装置である。 The invention of claim 2 is a prediction device that predicts an interlocutor's own subjective impression during a conversation, comprising: a data acquisition means for acquiring data regarding head movement of the interlocutor from video data obtained by photographing the interlocutor; a head movement detection means for detecting the presence or absence of head movement for the data regarding head movement using a head movement detection model; a head movement function detection means for detecting a head movement function for head movement presence/absence information indicating the presence or absence of head movement using a head movement function detection model; a feature calculation means for calculating a feature of the head movement function based on the detection result by the head movement function detection means; and a subjective impression prediction means for predicting the subjective impression of the interlocutor for the feature of the head movement function using a regression model.

また、請求項５に係る発明は、対話中の対話者自身が抱く主観的印象を予測する予測装置であって、対話者に関する対話者データを受信し、当該対話者データから前記対話者の発話区間を示す発話区間の情報を取得するデータ取得手段と、頭部運動検出モデルを用いて、前記発話区間の情報に対する発話の有無を検出する頭部運動検出手段と、頭部運動機能検出モデルを用いて、前記発話の有無を示す発話の有無情報に対する頭部運動機能を検出する頭部運動機能検出手段と、前記頭部運動機能検出手段による検出結果に基づいて、前記頭部運動機能の特徴量を算出する特徴量算出手段と、回帰モデルを用いて、前記頭部運動機能の特徴量に対する前記対話者の主観的印象を予測する主観的印象予測手段と、を有することを特徴とする予測装置。
The invention of claim 5 is a prediction device that predicts an interlocutor's own subjective impression during a conversation, comprising: a data acquisition means for receiving interlocutor data related to an interlocutor and acquiring, from the interlocutor data, speech section information indicating the speech sections of the interlocutor; a head movement detection means for detecting the presence or absence of speech for the speech section information using a head movement detection model; a head movement function detection means for detecting a head movement function for speech presence/absence information indicating the presence or absence of speech using a head movement function detection model; a feature calculation means for calculating a feature of the head movement function based on the detection result by the head movement function detection means; and a subjective impression prediction means for predicting the subjective impression of the interlocutor for the feature of the head movement function using a regression model.

以上説明したように本発明によれば、対話中の対話者自身が抱く主観的印象を予測することができるという効果を奏する。 As described above, the present invention has the effect of being able to predict the subjective impressions held by the interlocutors during a conversation.

通信システムの概略図である。 FIG. 1 is a schematic diagram of a communication system.
予測装置の電気的なハードウェア構成図である。 FIG. 2 is a diagram showing the electrical hardware configuration of the prediction device.
予測装置の機能構成図である。 FIG. 3 is a functional configuration diagram of the prediction device.
各機能における入力情報（データ）及び出力情報（データ）の関係を示した図である。 FIG. 4 is a diagram showing the relationship between input information (data) and output information (data) in each function.
データ学習の挙動を示すフローチャートである。 FIG. 5 is a flowchart showing the behavior of data learning.
データ予測の挙動を示すフローチャートである。 FIG. 6 is a flowchart showing the behavior of data prediction.
頭部運動機能のカテゴリ及び機能内容が示された図である。 FIG. 7 is a diagram showing the categories and functional contents of head movement functions.
本実施形態で使用される頭部運動機能のカテゴリ及び機能内容が示された図である。 FIG. 8 is a diagram showing the categories and functional contents of the head movement functions used in this embodiment.
運動学的特徴の一覧を示す図である。 FIG. 9 is a diagram showing a list of kinematic features.
頭部運動機能の特徴を示す図である。 FIG. 10 is a diagram showing the features of head movement functions.
項目毎に内観報告スコアとしての主観的印象の予測値を示す図である。 FIG. 11 is a diagram showing, for each item, the predicted value of the subjective impression as an introspection report score.

以下、図面に基づいて本発明の実施形態を説明する。 The following describes an embodiment of the present invention with reference to the drawings.

〔実施形態のシステム構成〕
まず、図１を用いて、本実施形態の通信システムの構成の概略について説明する。図１は、本発明の実施形態に係る通信システムの概略図である。 [System configuration of the embodiment]
First, an outline of the configuration of a communication system according to the present embodiment will be described with reference to Fig. 1. Fig. 1 is a schematic diagram of a communication system according to an embodiment of the present invention.

図１に示されているように、本実施形態の通信システム１は、入力装置２、予測装置３、及びモニタ４によって構築されている。また、入力装置２は、カメラ２ａ、スマートフォン２ｂ、マイク付きヘッドフォン２ｃ、マイク２ｄ等である。また、カメラ２ａは、輝度撮影用カメラ、LiDAR(Light Detection and Ranging)、赤外線カメラ、又はサーモカメラ等である。 As shown in FIG. 1, the communication system 1 of this embodiment is constructed by an input device 2, a prediction device 3, and a monitor 4. The input device 2 is a camera 2a, a smartphone 2b, a headphone with a microphone 2c, a microphone 2d, etc. The camera 2a is a luminance camera, a LiDAR (Light Detection and Ranging), an infrared camera, a thermal camera, etc.

また、入力装置２と予測装置３は、インターネット等の通信ネットワーク１００を介して通信することができる。通信ネットワーク１００の接続形態は、無線又は有線のいずれでも良い。 The input device 2 and the prediction device 3 can communicate with each other via a communication network 100 such as the Internet. The communication network 100 may be connected in either a wireless or wired manner.

予測装置３は、単数又は複数のコンピュータによって構成されている。予測装置３が複数のコンピュータによって構成されている場合には、「予測装置」と示しても良いし、「予測システム」と示しても良い。 The prediction device 3 is composed of one or more computers. When the prediction device 3 is composed of multiple computers, it may be referred to as a "prediction device" or a "prediction system."

予測装置３は、入力装置２によって取得された対話者に関する対話者データに基づいて、対話者の主観的印象（心情）を予測し、予測結果を示す結果データを出力する。出力方法としては、出力装置としてのモニタ４に結果データを送信することにより、モニタ４側で結果データに係るグラフ等を表示することが挙げられる。 The prediction device 3 predicts the subjective impression (feelings) of the interlocutor based on the interlocutor data about the interlocutor acquired by the input device 2, and outputs result data indicating the prediction result. As an output method, the result data may be sent to a monitor 4 as an output device, and a graph or the like related to the result data may be displayed on the monitor 4.

〔ハードウェア構成〕
＜予測装置のハードウェア構成＞
次に、図２を用いて、予測装置３の電気的なハードウェア構成を説明する。図２は、予測装置の電気的なハードウェア構成図である。 [Hardware configuration]
<Hardware configuration of prediction device>
Next, the electrical hardware configuration of the prediction device 3 will be described with reference to Fig. 2. Fig. 2 is a diagram showing the electrical hardware configuration of the prediction device.

予測装置３は、コンピュータとして、図２に示されているように、ＣＰＵ(Central Processing Unit)３０１、ＲＯＭ(Read Only Memory)３０２、ＲＡＭ(Random Access Memory)３０３、ＨＤ(Hard Disk)３０４、ＨＤＤ(Hard Disk Drive)コントローラ３０５、外部機器接続Ｉ／Ｆ(Interface)３０８、ネットワークＩ／Ｆ３０９、バスライン３１０、メディアＩ／Ｆ３１４を備えている。 As shown in FIG. 2, the prediction device 3 is a computer and includes a CPU (Central Processing Unit) 301, a ROM (Read Only Memory) 302, a RAM (Random Access Memory) 303, a HD (Hard Disk) 304, a HDD (Hard Disk Drive) controller 305, an external device connection I/F (Interface) 308, a network I/F 309, a bus line 310, and a media I/F 314.

これらのうち、ＣＰＵ３０１は、予測装置３全体の動作を制御する。ＲＯＭ３０２は、ＩＰＬ(Initial Program Loader)等のＣＰＵ３０１の駆動に用いられるプログラムを記憶する。ＲＡＭ３０３は、ＣＰＵ３０１のワークエリアとして使用される。 Of these, CPU 301 controls the operation of the entire prediction device 3. ROM 302 stores programs used to drive CPU 301, such as IPL (Initial Program Loader). RAM 303 is used as a work area for CPU 301.

ＨＤ３０４は、プログラム等の各種データを記憶する。ＨＤＤコントローラ３０５は、ＣＰＵ３０１の制御にしたがってＨＤ３０４に対する各種データの読み出し又は書き込みを制御する。なお、ＨＤ３０４及びＨＤＤコントローラ３０５の代わりに、ＳＳＤ(Solid State Drive)及びＳＳＤコントローラが搭載されるようにしてもよい。 HD304 stores various data such as programs. HDD controller 305 controls the reading and writing of various data from HD304 under the control of CPU301. Note that instead of HD304 and HDD controller 305, an SSD (Solid State Drive) and SSD controller may be installed.

外部機器接続Ｉ／Ｆ３０８は、各種の外部機器を接続するためのインターフェースである。この場合の外部機器は、ディスプレイ、スピーカ、キーボード、マウス、ＵＳＢ(Universal Serial Bus)メモリ、及びプリンタ等である。 The external device connection I/F 308 is an interface for connecting various external devices. In this case, the external devices include a display, a speaker, a keyboard, a mouse, a USB (Universal Serial Bus) memory, a printer, etc.

ネットワークＩ／Ｆ３０９は、通信ネットワーク１００を介してデータ通信をするためのインターフェースである。バスライン３１０は、図２に示されているＣＰＵ３０１等の各構成要素を電気的に接続するためのアドレスバスやデータバス等である。 The network I/F 309 is an interface for data communication via the communication network 100. The bus line 310 is an address bus, a data bus, etc., for electrically connecting each component such as the CPU 301 shown in FIG. 2.

また、メディアＩ／Ｆ３１４は、フラッシュメモリ等の記録メディア３１３に対するデータの読み出し又は書き込み（記憶）を制御する。記録メディア３１３には、ＤＶＤ(Digital Versatile Disc)やＢｌｕ-ｒａｙＤｉｓｃ（登録商標）等も含まれる。 The media I/F 314 also controls the reading and writing (storing) of data from and to a recording medium 313 such as a flash memory. The recording medium 313 also includes a DVD (Digital Versatile Disc) and a Blu-ray Disc (registered trademark).

〔予測装置の機能構成〕
次に、図３を用いて、予測装置の機能構成について説明する。図３は、予測装置の機能構成図である。 [Functional configuration of the prediction device]
Next, the functional configuration of the prediction device will be described with reference to Fig. 3. Fig. 3 is a diagram showing the functional configuration of the prediction device.

図３において、予測装置３は、データ取得部３１、検出部３２、特徴量算出部３３、及び主観的印象予測部３４、及びモデル記憶部３９を有する。データ取得部３１、検出部３２、特徴量算出部３３、及び主観的印象予測部３４は、プログラムに基づき図２のＣＰＵ３０１による命令によって実現される機能である。データ取得部３１及び検出部３２は、データ学習及びデータ予測の挙動を行う。データ取得部３１、検出部３２、特徴量算出部３３及び主観的印象予測部３４は、データ予測の挙動を行う。また、検出部３２には、頭部運動検出部３２ａ、及び頭部運動機能検出部３２ｂが含まれている。 In FIG. 3, the prediction device 3 has a data acquisition unit 31, a detection unit 32, a feature calculation unit 33, a subjective impression prediction unit 34, and a model storage unit 39. The data acquisition unit 31, the detection unit 32, the feature calculation unit 33, and the subjective impression prediction unit 34 are functions realized by instructions from the CPU 301 in FIG. 2 based on a program. The data acquisition unit 31 and the detection unit 32 perform data learning and data prediction behavior. The data acquisition unit 31, the detection unit 32, the feature calculation unit 33, and the subjective impression prediction unit 34 perform data prediction behavior. In addition, the detection unit 32 includes a head movement detection unit 32a, and a head movement function detection unit 32b.

更に、図２のＲＡＭ３０３又はＨＤ３０４には、モデル記憶部３９が構築されている。モデル記憶部３９には、頭部運動検出モデル４０ａ、頭部運動機能検出モデル４０ｂ、及び回帰モデル５０が記憶されている。 Furthermore, a model storage unit 39 is constructed in the RAM 303 or HD 304 in FIG. 2. The model storage unit 39 stores a head movement detection model 40a, a head movement function detection model 40b, and a regression model 50.

＜各モデル＞
続いて、各モデルについて説明する。 <Each model>
Next, each model will be described.

（頭部運動検出モデル）
頭部運動検出モデル４０ａは、CNN(Convolutional Neural Networks)モデルの構造として、Otsuka & Tsumori によって提案された Late-fusion CNNモデル（非特許文献１参照）をベースとし、全結合層の fc4 に Dropout 層を挿入したものが用いられている。頭部運動検出モデル４０ａは、入力情報（データ）として、対象とするフレームを中心とした映像データにおける３２のビデオフレームの頭部姿勢角(Azimuth，Elevation，Roll)の角速度を使用し、この入力情報に対し、頭部運動の有無を出力するように CNN の学習を行う。このビデオフレーム数は任意に設定可能である。以降では，頭部運動が連続して検出されたビデオフレームの区間を１つの「頭部運動区間」と呼ぶ。機械学習が行われた頭部運動検出モデル４０ａはモデル記憶部３９に記憶される。 (Head movement detection model)
The head movement detection model 40a uses, as its CNN (convolutional neural network) structure, the late-fusion CNN model proposed by Otsuka & Tsumori (see Non-Patent Document 1), with a Dropout layer inserted at the fully connected layer fc4. Its input information (data) is the angular velocity of the head pose angles (Azimuth, Elevation, Roll) over 32 video frames of the video data centered on the target frame, and the CNN is trained to output the presence or absence of head movement for this input. The number of video frames can be set arbitrarily. Hereinafter, a run of consecutive video frames in which head movement is detected is referred to as one "head movement section." The trained head movement detection model 40a is stored in the model storage unit 39.
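The windowing and the grouping into head movement sections described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `centered_window` and `movement_sections` are hypothetical, and the zero padding at the edges of the recording and the placement of the center in an even-length window (16 frames before, the target frame, 15 frames after) are assumptions that the text does not specify.

```python
from typing import List, Sequence, Tuple

# One frame of input: angular velocity of (azimuth, elevation, roll).
Pose = Tuple[float, float, float]


def centered_window(series: Sequence[Pose], center: int, width: int = 32) -> List[Pose]:
    """Return `width` frames around `center`.

    Assumption: frames outside the recording are padded with zeros, and for an
    even width the window covers center - width//2 .. center + width//2 - 1.
    """
    start = center - width // 2
    window: List[Pose] = []
    for i in range(start, start + width):
        if 0 <= i < len(series):
            window.append(series[i])
        else:
            window.append((0.0, 0.0, 0.0))  # zero padding (assumption)
    return window


def movement_sections(flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Group consecutive frames flagged as head movement into sections.

    Each section is returned as an inclusive (first_frame, last_frame) pair.
    """
    sections: List[Tuple[int, int]] = []
    start = None
    for i, moving in enumerate(flags):
        if moving and start is None:
            start = i  # a new section begins
        elif not moving and start is not None:
            sections.append((start, i - 1))  # the section ended on the previous frame
            start = None
    if start is not None:
        sections.append((start, len(flags) - 1))  # section runs to the last frame
    return sections
```

In such a sketch, the per-frame outputs of the detection model would be passed to `movement_sections`, and each frame inside a returned section would then be windowed again as input for the head movement function detection model.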

（頭部運動機能検出モデル）
頭部運動機能検出モデル４０ｂは、頭部運動検出モデル４０ａと同様のLate-fusion CNN モデルを使用する。ここでは、頭部運動検出モデル４０ａによって検出された頭部運動区間内の各ビデオフレームを予測対象のビデオフレームとして、その各ビデオフレームを中心とした３２フレーム分の頭部姿勢角の角速度、及び発話区間データの少なくとも一方を入力情報として用いる。このビデオフレーム数は任意に設定可能である。 (Head movement function detection model)
Head movement function detection model 40b uses the same late-fusion CNN model as head movement detection model 40a. Here, each video frame within a head movement section detected by head movement detection model 40a is treated as a prediction target, and at least one of the angular velocity of the head pose angles and the speech section data over the 32 frames centered on that video frame is used as input information. The number of video frames can be set arbitrarily.

なお、発話データとして、発話の有無を表す２値の時系列に対して、前後５０のビデオフレームの幅を持つ移動平均フィルタにより平滑化を施した時系列を使用する。このビデオフレームの幅は任意に設定可能とする。事前の機械学習においては、頭部運動機能検出モデル４０ｂは、モデル記憶部３９に保存されている。 As the speech data, the binary time series indicating the presence or absence of speech is smoothed with a moving-average filter having a width of 50 video frames before and after each frame, and the smoothed time series is used. This width can be set arbitrarily. For the preliminary machine learning, the head movement function detection model 40b is held in the model storage unit 39.
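The smoothing of the binary speech series can be sketched as below. This is a minimal sketch under stated assumptions: the function name `smooth_speech` is hypothetical, the filter is read as covering 50 frames before and 50 frames after the current frame (101 frames in total), and the window is truncated at the start and end of the recording and normalized by the number of frames actually covered. The text does not state how the edges are handled.

```python
from typing import List, Sequence


def smooth_speech(binary: Sequence[int], half_width: int = 50) -> List[float]:
    """Moving-average smoothing of a 0/1 speech-presence time series.

    Assumption: the window spans `half_width` frames on each side of frame t,
    truncated at the edges and divided by the number of frames it covers.
    """
    n = len(binary)
    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0]
    for value in binary:
        prefix.append(prefix[-1] + value)

    smoothed: List[float] = []
    for t in range(n):
        lo = max(0, t - half_width)
        hi = min(n, t + half_width + 1)  # exclusive upper bound
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

The result is a value between 0 and 1 per frame, which can be read as the local proportion of frames containing speech.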

図７は、頭部運動機能のカテゴリ及び機能内容が示された図である。事前学習においては、コーパス（対話に関する情報が時系列データとして記載されているデータ）を用いて、図７に記載の任意のカテゴリの機能の１つ以上について頭部運動機能検出モデルの学習を行う。コーパスでは、図７に記載の頭部運動機能におけるカテゴリの機能の１つ以上が、非排他的なラベルが付されたデータを含む。例えば、２０歳代から３０歳代の女性８人の２グループによる合計４つの合意形成型対話(計27分)を対象とし、図７に記載の３２種類の頭部運動機能の各々について、ビデオフレーム毎にその有無が非排他的にラベル付けされたデータなどがその一例である。但し、今回の実施形態では、図８に記載の１０種の機能のそれぞれについて頭部運動機能検出の機械学習を行った例を説明する。なお、図８は、本実施形態で使用される頭部運動機能のカテゴリ及び機能内容が示された図である。機械学習が行われた頭部運動機能検出モデル４０ｂはモデル記憶部３９に記憶される。 FIG. 7 is a diagram showing the categories and functional contents of head movement functions. In pre-training, a corpus (data in which information about a dialogue is recorded as time-series data) is used to train the head movement function detection model for one or more of the functions in any category listed in FIG. 7. The corpus contains data in which one or more of the functions in the categories of FIG. 7 are labeled non-exclusively. One example is data covering a total of four consensus-building dialogues (27 minutes in total) by two groups of eight women in their 20s and 30s, in which the presence or absence of each of the 32 types of head movement functions listed in FIG. 7 is labeled non-exclusively for every video frame. In this embodiment, however, an example is described in which machine learning for head movement function detection was performed for each of the 10 functions listed in FIG. 8. FIG. 8 is a diagram showing the categories and functional contents of the head movement functions used in this embodiment. The trained head movement function detection model 40b is stored in the model storage unit 39.

（回帰モデル）
回帰モデル５０は、主観的印象の予測値としての内観報告スコアを予測するためのランダムフォレスト回帰モデル（参考文献１）である。 (Regression model)
Regression model 50 is a random forest regression model (Reference 1) for predicting introspection report scores as predicted values of subjective impressions.

（参考文献１）L. Breiman, "Random forests," Machine Learning, vol.45, pp.5- 32, 2001.
なお、回帰モデル５０は、ランダムフォレストに限らず、Liner Regression, Support Vector Regressionなど何でも良い。 (Reference 1) L. Breiman, "Random forests," Machine Learning, vol.45, pp.5-32, 2001.
The regression model 50 is not limited to a random forest, but may be any model such as Linear Regression or Support Vector Regression.

入力情報（ここでは、特徴量）として、運動学的特徴６種に、頭部運動機能特徴２２種のうちの少なくとも１種を加えたモデル(MK＋F)が構築される。そして、回帰モデル５０は、内観報告スコア４項目である「雰囲気」、「楽しさ」、「やる気」、「集中度」を出力情報（正解データ）として、この４項目の機械学習を行う。 A model (MK+F) is constructed that adds at least one of the 22 head movement function features to six kinematic features as input information (feature amounts in this case). Then, the regression model 50 performs machine learning on the four introspection report score items "atmosphere," "fun," "motivation," and "concentration" as output information (correct answer data).

なお、内観報告スコアの項目数は任意に設定可能である。この内観報告スコアは、任意の時間で区切られ入力されたデータであり、区切られた区間を「スロット」と呼ぶ。このスロットが予測の単位である。この区間の長さは一意に決めた値でも、区間の長さが変化しても良い。区間の長さが変化する場合は、それぞれの区間の長さを別途、モデル記憶部３９で保持する。 The number of items in the introspection report score can be set arbitrarily. This introspection report score is data that is divided into arbitrary time intervals and input, and each divided interval is called a "slot." This slot is the unit of prediction. The length of this interval may be a uniquely determined value, or the length of the interval may vary. If the length of the interval varies, the length of each interval is stored separately in the model storage unit 39.
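The division of a recording into slots, with either a single fixed length or separately stored variable lengths, can be sketched as follows. This is an illustrative sketch only. The names `slot_bounds` and `occurrence_ratio` are hypothetical, dropping a trailing partial slot in the fixed-length case is an assumption, and the per-slot ratio is merely a stand-in summary for illustration; it is not one of the features listed in FIG. 9 or FIG. 10.

```python
from typing import List, Optional, Sequence, Tuple


def slot_bounds(
    n_frames: int,
    fixed_len: Optional[int] = None,
    lengths: Optional[Sequence[int]] = None,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges, one per slot.

    Give exactly one of `fixed_len` (uniform slots) or `lengths`
    (variable slot lengths, as they would be held separately).
    """
    if (fixed_len is None) == (lengths is None):
        raise ValueError("give exactly one of fixed_len or lengths")
    if lengths is None:
        if fixed_len <= 0:
            raise ValueError("fixed_len must be positive")
        # Assumption: a trailing partial slot is dropped.
        lengths = [fixed_len] * (n_frames // fixed_len)

    bounds: List[Tuple[int, int]] = []
    start = 0
    for length in lengths:
        if length <= 0 or start + length > n_frames:
            raise ValueError("slot lengths do not fit the recording")
        bounds.append((start, start + length))
        start += length
    return bounds


def occurrence_ratio(flags: Sequence[bool], bounds: Sequence[Tuple[int, int]]) -> List[float]:
    """Hypothetical per-slot summary: share of frames in which a function was detected."""
    return [sum(flags[s:e]) / (e - s) for s, e in bounds]
```

Because each slot is the unit of prediction, any per-slot summary of this kind would yield one input row per slot for the regression model, paired with that slot's introspection report scores.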

＜各機能構成＞
続いて、図３を用いて、予測装置の各機能構成について説明する。 <Functional configuration>
Next, each functional configuration of the prediction device will be described with reference to FIG. 3.

データ取得部３１は、カメラ２ａ等によって対話者が撮影されることで得られた映像データから、対話者の頭部運動に関する座標情報を取得する。 The data acquisition unit 31 acquires coordinate information regarding the head movement of the interlocutor from video data obtained by capturing an image of the interlocutor using the camera 2a or the like.

また、データ取得部３１は、カメラ２ａ等によって得られた対話者に関する対話者データを受信し、対話者データから対話者の発話区間を示す発話区間の情報を取得する。データ取得部３１が発話区間の情報を取得する場合、以下のようなパターンが存在する。
（１）データ取得部３１は、対話者が撮影されることで得られた対話者データとしての映像データに含まれる音声データから発話区間の情報を取得する。
（２）データ取得部３１は、対話者の音声が収音されることで得られた対話者データとしての音声データから発話区間の情報を取得する。
（３）データ取得部３１は、映像データにおける対話者の口の周りに存在する特徴点の動きベクトルが所定の閾値以上の場合に発話したとみなすことで、発話区間の情報を取得する。
（４）データ取得部３１は、映像データに対して発話内容が時系列情報を持ちつつ書き起こしされたコーパスから発話区間の情報を取得する。 Furthermore, the data acquisition unit 31 receives interlocutor data on the interlocutor obtained by the camera 2a or the like, and acquires information on speech sections indicating speech sections of the interlocutor from the interlocutor data. When the data acquisition unit 31 acquires information on speech sections, there are the following patterns.
(1) The data acquisition unit 31 acquires information about speech sections from audio data included in video data serving as interlocutor data obtained by photographing an interlocutor.
(2) The data acquisition unit 31 acquires information about speech sections from voice data serving as interlocutor data obtained by collecting the voices of interlocutors.
(3) The data acquisition unit 31 acquires information about a speech section by treating the interlocutor as speaking when the motion vector of a feature point around the interlocutor's mouth in the video data is equal to or greater than a predetermined threshold value.
(4) The data acquisition unit 31 acquires information on speech sections from a corpus in which speech content is transcribed with time-series information for video data.

なお、データ取得部３１は、入力装置２から映像データ（対話者データ）を取得した後、リアルタイムで、頭部運動に関する座標情報（発話区間の情報）を取得するがこれに限るものではない。例えば、データ取得部３１は、入力装置２から、映像データ（対話者データ）を取得した後、一旦、ＲＡＭ３０３等に記憶しておき、予測装置３が主観的印象の予測値を出力する際に、データ取得部３１がＲＡＭ３０３等から映像データ（対話者データ）を読み出してもよい。 After acquiring the video data (interlocutor data) from the input device 2, the data acquisition unit 31 acquires the coordinate information related to head movement (the speech section information) in real time, but the processing is not limited to this. For example, after acquiring the video data (interlocutor data) from the input device 2, the data acquisition unit 31 may first store it in the RAM 303 or the like and read it back from the RAM 303 or the like when the prediction device 3 outputs a predicted value of the subjective impression.

頭部運動検出部３２ａは、頭部運動検出モデルを用いて、頭部運動に関する座標情報に対する頭部運動の有無を検出する。また、頭部運動検出部３２ａは、頭部運動検出モデルを用いて、発話区間の情報に対する発話の有無を検出する。 The head movement detection unit 32a uses a head movement detection model to detect the presence or absence of head movement in relation to coordinate information relating to head movement. The head movement detection unit 32a also uses the head movement detection model to detect the presence or absence of speech in relation to information about a speech section.

なお、対話の情報が書き起こされているコーパスが存在する場合、頭部運動検出部３２ａは、そのコーパスから発話の有無を出力しても良い。この場合、頭部運動検出部３２ａの処理は省略する。 If a corpus exists in which dialogue information is transcribed, the head movement detection unit 32a may output the presence or absence of speech from the corpus. In this case, the processing of the head movement detection unit 32a is omitted.

また、データ取得部３１で取得されたデータとして映像データの入力がある場合、頭部運動検出部３２ａは、その映像データからOpenFaceなどのライブラリを用いることで、頭部に位置する座標の時系列情報を抽出してもよい。この抽出された時系列情報が、ある時刻間においての移動のユークリッド距離が閾値を超えている場合に、頭部運動が有ると判断することもでき、その場合、頭部運動検出部３２ａの処理は省略し、データ取得部３１から直接、頭部運動機能検出部３２ｂへデータを入力する。 Furthermore, when video data is input as data acquired by the data acquisition unit 31, the head movement detection unit 32a may extract time series information of the coordinates located at the head from the video data by using a library such as OpenFace. If the Euclidean distance of movement between certain times in this extracted time series information exceeds a threshold, it can be determined that there is head movement, in which case the processing by the head movement detection unit 32a is omitted and the data is input directly from the data acquisition unit 31 to the head movement function detection unit 32b.
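The threshold-based alternative described above can be sketched as follows. This is a minimal sketch, assuming that head positions have already been extracted as per-frame coordinates (for example by a face tracking library); the function name `detect_head_movement` and the `lag` parameter, which sets how many frames apart the two compared times are, are hypothetical.

```python
import math
from typing import List, Sequence


def detect_head_movement(
    points: Sequence[Sequence[float]],
    threshold: float,
    lag: int = 1,
) -> List[bool]:
    """Flag frame t as head movement when the Euclidean distance between the
    head position at frame t - lag and at frame t exceeds `threshold`.

    Frames earlier than `lag` have no reference frame and are flagged False
    (assumption).
    """
    flags = [False] * len(points)
    for t in range(lag, len(points)):
        flags[t] = math.dist(points[t - lag], points[t]) > threshold
    return flags
```

The coordinates may be 2D or 3D as long as every frame has the same number of components.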

頭部運動機能検出部３２ｂは、頭部運動機能検出モデルを用いて、頭部運動の有無を示す頭部運動の有無情報に対する頭部運動機能を検出する。また、頭部運動機能検出部３２ｂは、頭部運動機能検出モデルを用いて、発話の有無を示す発話の有無情報に対する頭部運動機能を検出する。 The head movement function detection unit 32b uses a head movement function detection model to detect head movement function in response to head movement presence/absence information indicating the presence or absence of head movement. The head movement function detection unit 32b also uses a head movement function detection model to detect head movement function in response to speech presence/absence information indicating the presence or absence of speech.

特徴量算出部３３は、頭部運動機能検出部３２ｂによる検出結果に基づいて、頭部運動機能の特徴量を算出する。また、特徴量算出部３３は、頭部運動検出部３２ａによって検出された頭部運動の有無に基づいて、頭部運動学的な特徴量を算出する。更に、特徴量算出部３３は、頭部運動検出部３２ａによって検出された発話の有無に基づいて、頭部運動学的な特徴量を算出する。 The feature amount calculation unit 33 calculates the feature amount of the head movement function based on the detection result by the head movement function detection unit 32b. The feature amount calculation unit 33 also calculates the head kinematic feature amount based on the presence or absence of head movement detected by the head movement detection unit 32a. Furthermore, the feature amount calculation unit 33 calculates the head kinematic feature amount based on the presence or absence of speech detected by the head movement detection unit 32a.

主観的印象予測部３４は、回帰モデルを用いて、頭部運動機能の特徴量及び頭部運動学的な特徴量のうち、少なくとも頭部運動機能の特徴量に対する対話者の主観的印象を予測する。主観的印象予測部３４の出力情報（データ）の例が、後述の図１１に示されている。 The subjective impression prediction unit 34 uses a regression model to predict the interlocutor's subjective impression from, of the head movement function features and the head kinematic features, at least the head movement function features. An example of the output information (data) of the subjective impression prediction unit 34 is shown in FIG. 11, which will be described later.

＜入出力情報の関係＞
ここで、図４を用いて、各機能における入力情報（データ）及び出力情報（データ）の関係を説明する。図４は、各機能における入力情報（データ）及び出力情報（データ）の関係を示した図である。 <Relationship between input and output information>
Here, the relationship between input information (data) and output information (data) in each function will be described with reference to Fig. 4. Fig. 4 is a diagram showing the relationship between input information (data) and output information (data) in each function.

図４に示されているように、データ取得部３１が映像データを取得した場合、頭部運動検出部３２ａにおいて、入力情報が頭部運動に関する座標情報のときには、出力情報は頭部運動の有無情報である。また、データ取得部３１が音声データ等を取得した場合、頭部運動検出部３２ａにおいて、出力情報は発話の有無情報である。 As shown in FIG. 4, when the data acquisition unit 31 acquires video data, if the input information in the head movement detection unit 32a is coordinate information related to head movement, the output information in the head movement detection unit 32a is information on the presence or absence of head movement. Also, when the data acquisition unit 31 acquires voice data, etc., the output information in the head movement detection unit 32a is information on the presence or absence of speech.

次に、頭部運動機能検出部３２ｂにおいて、入力情報は頭部運動の有無情報及び発話の有無情報のうち少なくとも一方であり、出力情報は頭部運動機能の検出結果である。 Next, in the head movement function detection unit 32b, the input information is at least one of information on the presence or absence of head movement and information on the presence or absence of speech, and the output information is the detection result of the head movement function.

次に、特徴量算出部３３において、入力情報が頭部運動機能の検出結果の場合には、出力情報は頭部運動機能の特徴量（後述の図１０参照）である。また、特徴量算出部３３において、入力情報が頭部運動の有無情報及び発話の有無情報のうちの少なくとも一方である場合には、出力情報は頭部運動学的な特徴量（後述の図９参照）である。なお、入力情報として、頭部運動機能の検出結果は必須であるが、頭部運動の有無情報及び発話の有無情報は必須ではない。即ち、特徴量算出部３３は、頭部運動機能の特徴量の算出は必須であるが、頭部運動学的な特徴量の算出は必須ではない。 Next, in the feature calculation unit 33, when the input information is the detection result of the head movement function, the output information is the head movement function features (see FIG. 10 described below). When the input information is at least one of the head movement presence/absence information and the speech presence/absence information, the output information is the head kinematic features (see FIG. 9 described below). As input information, the detection result of the head movement function is essential, whereas the head movement presence/absence information and the speech presence/absence information are not. In other words, the feature calculation unit 33 must calculate the head movement function features but does not have to calculate the head kinematic features.

次に、主観的印象予測部３４において、入力情報が頭部運動機能の特徴量及び頭部運動学的な特徴量のうち少なくとも頭部運動機能の特徴量であり、出力情報が対話者の主観的印象の予測値である。 Next, in the subjective impression prediction unit 34, the input information is at least the head movement function feature among the head movement function feature and the head kinematic feature, and the output information is a predicted value of the interlocutor's subjective impression.

〔実施形態の処理又は動作〕
続いて、図５乃至図１１を用いて、本実施形態の処理又は動作について説明する。 [Processing or Operation of the Embodiment]
Next, the processing or operation of this embodiment will be described with reference to FIGS. 5 to 11.

＜データ学習の挙動＞
まず、図５を用いて、データ学習の挙動を説明する。図５は、データ学習の挙動を示したフローチャートである。 <Data learning behavior>
First, the behavior of data learning will be described with reference to Fig. 5. Fig. 5 is a flowchart showing the behavior of data learning.

データ取得部３１は、入力装置２から時系列に出力された対話者データを受信する（Ｓ１１）。そして、データ取得部３１は、対話者データから、頭部運動に関する座標情報（発話区間を示す情報）を取得する（Ｓ１２）。 The data acquisition unit 31 receives interlocutor data output in time series from the input device 2 (S11). The data acquisition unit 31 then acquires coordinate information related to head movement (information indicating speech intervals) from the interlocutor data (S12).

ここで、データ取得部３１の処理について具体的に説明する。 Here, we will explain in detail the processing of the data acquisition unit 31.

（映像データの活用）
まずは、データ取得部３１が、映像データを活用する場合について説明する。データ取得部３１は、カメラ２ａ、スマートフォン２ｂ等から映像データを取得する。データ取得部３１は、映像データを受け取った場合，OpenFaceの顔認証技術を用いて顔追跡を行い、頭部姿勢角を計測する。データ取得部３１が、頭部姿勢角として、例えば、Azimuth(方位角)、Elevation(仰角)、及びRoll(ロール角)の３成分を検出する。 (Use of video data)
First, a case where the data acquisition unit 31 utilizes video data will be described. The data acquisition unit 31 acquires video data from the camera 2a, the smartphone 2b, etc. When the data acquisition unit 31 receives the video data, it performs face tracking using OpenFace face recognition technology and measures the head pose angles. As the head pose angles, the data acquisition unit 31 detects, for example, the three components Azimuth, Elevation, and Roll.

なお、Azimuth(方位角)、Elevation(仰角)、及びRoll(ロール角)ではなく、他座標系による角度でも良い。また、データ取得部３１は、角加速度を検出しても良い。 In addition, instead of Azimuth, Elevation, and Roll, angles based on other coordinate systems may be used. Furthermore, the data acquisition unit 31 may detect angular acceleration.
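One way to obtain angular velocity and angular acceleration from measured head pose angles is a backward finite difference, sketched below. This is an assumption for illustration: the text does not say how these quantities are derived, the function names and the `fps` (frame rate) parameter are hypothetical, and no angle wrap-around handling is included because head pose angles are assumed to stay well within a single turn.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[float, float, float]  # (azimuth, elevation, roll) components


def _backward_difference(values: Sequence[Triple], fps: float) -> List[Triple]:
    """Per-second rate of change between consecutive frames.

    The first frame has no predecessor and is set to zero (assumption).
    """
    if not values:
        return []
    rates: List[Triple] = [(0.0, 0.0, 0.0)]
    for prev, cur in zip(values, values[1:]):
        rates.append(tuple((c - p) * fps for p, c in zip(prev, cur)))
    return rates


def angular_velocity(angles: Sequence[Triple], fps: float) -> List[Triple]:
    """Angular velocity per frame, in angle units per second."""
    return _backward_difference(angles, fps)


def angular_acceleration(angles: Sequence[Triple], fps: float) -> List[Triple]:
    """Angular acceleration per frame, as the difference of angular velocities."""
    return _backward_difference(angular_velocity(angles, fps), fps)
```

With a 30 fps recording, a one-degree change between two consecutive frames corresponds to an angular velocity of 30 degrees per second.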

データ取得部３１は、頭部運動に関する座標情報として、対話映像中の人の動きをもとにコンピュータで生成された任意のプログラムから取得しても良い。その場合、データ取得部３１は、時間フレームt に対応するプログラムから出力される座標を抽出する。更に、データ取得部３１は、ウェアラブルセンサ、加速度センサ、モーションキャプチャ、磁気式センサ、又は、これらが内蔵された機器などを活用し、頭部運動に関する座標情報を取得しても良い。この場合、データ取得部３１は、時間フレームt に対応する角速度や角加速度の情報を抽出する。 The data acquisition unit 31 may acquire coordinate information related to head movement from any program generated by a computer based on the movements of a person in the dialogue video. In this case, the data acquisition unit 31 extracts coordinates output from the program corresponding to time frame t. Furthermore, the data acquisition unit 31 may acquire coordinate information related to head movement using a wearable sensor, an acceleration sensor, motion capture, a magnetic sensor, or a device with these built in. In this case, the data acquisition unit 31 extracts information on angular velocity and angular acceleration corresponding to time frame t.

以上により、データ取得部３１は、頭部運動に関する座標情報として、時間フレームt における各成分を以下に示す値とし、頭部運動検出部３２ａへ出力する。 As a result of the above, the data acquisition unit 31 sets each component in time frame t to the following values as coordinate information related to head movement, and outputs these to the head movement detection unit 32a.

（対話者データの活用）
次に、データ取得部３１が、対話者データを活用する場合について説明する。データ取得部３１は、カメラ２ａ、スマートフォン２ｂ、マイク付きヘッドフォン２ｃ、及びマイク２ｄ等から対話者データを取得する。データ取得部３１は、対話者ごとの発話区間を取得する。

(Utilizing interlocutor data)
Next, a case where the data acquisition unit 31 utilizes interlocutor data will be described. The data acquisition unit 31 acquires interlocutor data from the camera 2a, the smartphone 2b, the headphones with microphone 2c, the microphone 2d, etc. The data acquisition unit 31 acquires speech sections for each interlocutor.

データ取得部３１は、対話者データとして音声データを取得した場合、音圧の情報から発話区間データを取得する。例えば、データ取得部３１は、任意の閾値以上の値を持つ音圧を発話している区間として検出する。 When the data acquisition unit 31 acquires voice data as interlocutor data, it acquires speech section data from sound pressure information. For example, the data acquisition unit 31 detects a sound pressure having a value equal to or greater than an arbitrary threshold as a speaking section.
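The thresholding described above can be sketched as follows. This is an illustration under assumptions: the sound pressure is taken to be one value per time frame, and the function name `detect_speech_sections`, the sample values, and the threshold are hypothetical.

```python
def detect_speech_sections(sound_pressure, threshold):
    """Return (start, end) frame index pairs, end exclusive, where pressure >= threshold."""
    sections = []
    start = None
    for t, value in enumerate(sound_pressure):
        if value >= threshold and start is None:
            start = t                      # a speech section begins
        elif value < threshold and start is not None:
            sections.append((start, t))    # the section ended at the previous frame
            start = None
    if start is not None:                  # section still open at the end of the data
        sections.append((start, len(sound_pressure)))
    return sections


# Hypothetical per-frame sound pressure values.
pressure = [0.1, 0.6, 0.7, 0.2, 0.1, 0.9, 0.8]
sections = detect_speech_sections(pressure, threshold=0.5)
print(sections)  # [(1, 3), (5, 7)]
```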

データ取得部３１は、対話者データとして映像データを取得した場合、映像データから音声データ抽出する。また、データ取得部３１は、映像データを用いて、OpenFaceの顔認証技術で取得できるある特徴点（例えば口の周りに存在する特徴点）を抽出し、この特徴点の動きベクトルがある閾値以上の場合に発話しているとみなしても良い。また、既に映像データに対して発話内容が時系列情報を持ちつつ書き起こしされたコーパスなどがある場合は、データ取得部３１は、このコーパスの情報を用い、発話内容が存在する区間を発話区間データとしても良い。 When the data acquisition unit 31 acquires video data as interlocutor data, it extracts voice data from the video data. The data acquisition unit 31 may also use the video data to extract certain feature points (for example, feature points around the mouth) that can be acquired using OpenFace facial recognition technology, and may determine that speech is occurring when the motion vector of this feature point is equal to or greater than a certain threshold. Furthermore, if there is already a corpus in which speech content has been transcribed with time-series information for the video data, the data acquisition unit 31 may use the information from this corpus to treat the section in which the speech content exists as speech section data.
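The motion-vector criterion for a mouth feature point can be sketched as below. The sketch assumes the feature point position is already available per frame as image coordinates; it does not call OpenFace, and the function name `speaking_flags`, the coordinates, and the threshold are hypothetical.

```python
import math


def speaking_flags(mouth_points, threshold):
    """Flag each frame as speaking when the feature point moved at least
    `threshold` pixels since the previous frame. The first frame has no
    previous frame and is flagged False."""
    flags = [False] * len(mouth_points)
    for t in range(1, len(mouth_points)):
        (x0, y0), (x1, y1) = mouth_points[t - 1], mouth_points[t]
        displacement = math.hypot(x1 - x0, y1 - y0)  # motion vector magnitude
        flags[t] = displacement >= threshold
    return flags


# Hypothetical (x, y) positions of one feature point near the mouth.
points = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
flags = speaking_flags(points, threshold=2.0)
print(flags)  # [False, False, True, False]
```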

更に、データ取得部３１は、対話映像中の発話区間の情報をもとにコンピュータで生成された任意のプログラムから取得しても良い。なお、この場合、後述の頭部運動検出部３２ａは、時間フレームt に対応するプログラムから出力される発話の有無情報を抽出する。 Furthermore, the data acquisition unit 31 may acquire the speech section information from any program generated by a computer based on information about speech sections in the dialogue video. In this case, the head movement detection unit 32a described below extracts the speech presence/absence information output from the program for time frame t.

以上により、データ取得部３１は、発話区間の情報（時間フレームtの情報を含む）を頭部運動検出部３２ａへ出力する。なお、上述のように、データ取得部３１は、頭部運動に関する座標情報ではなく、頭部運動に関するデータとして取得してもよい。 As a result, the data acquisition unit 31 outputs information about the speech section (including information about time frame t) to the head movement detection unit 32a. Note that, as described above, the data acquisition unit 31 may acquire data about head movement instead of coordinate information about head movement.

次に、頭部運動検出部３２ａは、頭部運動検出モデルを用いて、頭部運動に関する座標情報（発話区間を示す情報）に対する頭部運動の有無（発話の有無）を検出するように機械学習を実施する（Ｓ１３）。なお、ステップＳ１２において、頭部運動に関するデータが取得された場合、頭部運動検出部３２ａは、頭部運動検出モデルを用いて、頭部運動に関するデータに対する頭部運動の有無を検出するように機械学習を実施する。また、上述のように、頭部運動検出部３２ａによるステップＳ１３は省略してもよい。この場合、ステップＳ１２で取得されたデータが、直接、頭部運動機能検出部３２ｂに入力される。 Next, the head movement detection unit 32a performs machine learning using the head movement detection model to detect the presence or absence of head movement (presence or absence of speech) for coordinate information related to head movement (information indicating a speech section) (S13). Note that, if data related to head movement is acquired in step S12, the head movement detection unit 32a performs machine learning using the head movement detection model to detect the presence or absence of head movement for the data related to head movement. Also, as described above, step S13 by the head movement detection unit 32a may be omitted. In this case, the data acquired in step S12 is directly input to the head movement function detection unit 32b.

次に、頭部運動検出部３２ａは、モデル記憶部３９に対して、機械学習後の頭部運動検出モデルを記憶する（Ｓ１４）。 Next, the head movement detection unit 32a stores the head movement detection model after machine learning in the model storage unit 39 (S14).

次に、頭部運動機能検出部３２ｂは、頭部運動機能検出モデルを用いて、頭部運動の有無情報（発話の有無情報）に対する頭部運動の機能を検出するように機械学習を実施する（Ｓ１５）。なお、頭部運動検出部３２ａによるステップＳ１３が省略された場合、頭部運動機能検出部３２ｂは、頭部運動機能検出モデルを用いて、ステップＳ１２で取得されたデータに対する頭部運動の機能を検出するように機械学習を実施する。 Next, head movement function detection unit 32b performs machine learning using the head movement function detection model to detect the head movement function for head movement presence/absence information (information on the presence/absence of speech) (S15). Note that if step S13 by head movement detection unit 32a is omitted, head movement function detection unit 32b performs machine learning using the head movement function detection model to detect the head movement function for the data acquired in step S12.

次に、頭部運動機能検出部３２ｂは、モデル記憶部３９に対して、機械学習後の頭部運動機能検出モデルを記憶する（Ｓ１６）。 Next, the head movement function detection unit 32b stores the head movement function detection model after machine learning in the model storage unit 39 (S16).

以上により、データ学習の挙動の説明について終了する。 This concludes the explanation of data learning behavior.

＜データ予測の挙動＞
続いて、図６乃至図１１を用いて、データ予測の挙動を説明する。図６は、データ予測の挙動を示したフローチャートである。なお、下記ステップＳ２１，Ｓ２２，Ｓ２３，Ｓ２４は、それぞれ上述のステップＳ１１，Ｓ１２，Ｓ１３，Ｓ１５と同様の処理であるため、重複する処理の説明を省略する。 <Data prediction behavior>
Next, the behavior of data prediction will be described with reference to Fig. 6 to Fig. 11. Fig. 6 is a flowchart showing the behavior of data prediction. Note that steps S21, S22, S23, and S24 below are the same as steps S11, S12, S13, and S15 described above, respectively, and therefore descriptions of the overlapping processes will be omitted.

データ取得部３１は、入力装置２から時系列に出力された対話者データを受信する（Ｓ２１）。そして、データ取得部３１は、対話者データから、頭部運動に関する座標情報（発話区間を示す情報）を取得する（Ｓ２２）。なお、頭部運動に関する座標情報、及び発話区間を示す情報の取得方法は、機械学習とは異なる手法であってもよい。また、上述のように、データ取得部３１は、頭部運動に関する座標情報ではなく、頭部運動に関するデータとして取得してもよい。 The data acquisition unit 31 receives interlocutor data output in time series from the input device 2 (S21). Then, the data acquisition unit 31 acquires coordinate information related to head movement (information indicating a speech section) from the interlocutor data (S22). Note that the method of acquiring the coordinate information related to head movement and the information indicating a speech section may be a method other than machine learning. Also, as described above, the data acquisition unit 31 may acquire data related to head movement instead of coordinate information related to head movement.

次に、頭部運動検出部３２ａは、学習済みの頭部運動検出モデルを用いて、頭部運動に関する座標情報（発話区間を示す情報）に対する頭部運動の有無（発話の有無）を検出する（Ｓ２３）。なお、ステップＳ２２において、頭部運動に関するデータが取得された場合、頭部運動検出部３２ａは、学習済みの頭部運動検出モデルを用いて、頭部運動に関するデータに対する頭部運動の有無を検出する。また、上述のように、頭部運動検出部３２ａによるステップＳ２３は省略してもよい。この場合、ステップＳ２２で取得されたデータが、直接、頭部運動機能検出部３２ｂに入力される。 Next, head movement detection unit 32a uses the trained head movement detection model to detect the presence or absence of head movement (presence or absence of speech) for coordinate information related to head movement (information indicating a speech section) (S23). Note that, if data related to head movement is acquired in step S22, head movement detection unit 32a uses the trained head movement detection model to detect the presence or absence of head movement for the data related to head movement. Also, as described above, step S23 by head movement detection unit 32a may be omitted. In this case, the data acquired in step S22 is directly input to head movement function detection unit 32b.

次に、頭部運動機能検出部３２ｂは、学習済みの頭部運動機能検出モデルを用いて、頭部運動の有無情報（発話の有無情報）に対する頭部運動の機能を検出する（Ｓ２４）。 Next, the head movement function detection unit 32b uses the trained head movement function detection model to detect the head movement function in response to head movement presence/absence information (information on the presence/absence of speech) (S24).

この場合、頭部運動機能検出部３２ｂは、機械学習された１０種の頭部運動機能のうち少なくとも１つ以上を含む検出を行う。１０種全ての頭部運動機能の検出を行う場合、頭部運動機能検出部３２ｂは、ある対話中のフレームt における１０種の頭部運動機能の検出結果を図８に示されている番号を用いて、以下のように表す。 In this case, head movement function detection unit 32b performs detection including at least one of the 10 types of head movement functions that have been machine-learned. When detecting all 10 types of head movement functions, head movement function detection unit 32b expresses the detection results of the 10 types of head movement functions at frame t during a conversation as follows, using the numbers shown in Figure 8.
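One way to hold such a per-frame result is a 0/1 indicator vector with one entry per function, which allows several functions to be present in the same frame. This is a sketch only: the numbers 1 to 10 stand in for the numbering of Figure 8, which is not reproduced in the text, and the function name `to_indicator` is hypothetical.

```python
NUM_FUNCTIONS = 10  # the text states ten head movement functions


def to_indicator(detected_numbers):
    """Convert the function numbers (1..10) detected in one frame to a 0/1 vector."""
    vector = [0] * NUM_FUNCTIONS
    for number in detected_numbers:
        if not 1 <= number <= NUM_FUNCTIONS:
            raise ValueError(f"function number out of range: {number}")
        vector[number - 1] = 1
    return vector


# Hypothetical frame in which functions 2 and 7 were both detected.
frame_result = to_indicator({2, 7})
print(frame_result)  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```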

検出結果には、上述の参考文献１が示唆するように、誤差が含まれていると考えられるが、本研究では誤差も含めて、この検出結果を頭部運動機能とみなし、頭部運動機能検出部３２ｂが検出結果を特徴量算出部３３へ出力する。

As suggested by Reference 1 mentioned above, the detection results are likely to contain errors. However, in this study, the detection results, including the errors, are regarded as head movement function, and the head movement function detection unit 32b outputs the detection results to the feature calculation unit 33.

なお、上述のように、ステップＳ２３が省略された場合、頭部運動機能検出部３２ｂは、学習済みの頭部運動機能検出モデルを用いて、ステップＳ２２で取得されたデータに対する頭部運動の機能を検出する。 As described above, if step S23 is omitted, the head movement function detection unit 32b uses the trained head movement function detection model to detect the head movement function for the data acquired in step S22.

続いて、特徴量算出部３３は、頭部運動の機能の検出結果に基づき、頭部運動機能の特徴量を算出する（Ｓ２５）。また、特徴量算出部３３は、頭部運動の有無情報（発話の有無情報）に基づき、頭部運動学的な特徴量を算出する（Ｓ２６）。なお、このステップＳ２６は必須の処理ではない。 Then, the feature calculation unit 33 calculates the feature of the head movement function based on the detection result of the head movement function (S25). The feature calculation unit 33 also calculates the head kinematic feature based on the information on the presence or absence of head movement (information on the presence or absence of speech) (S26). Note that this step S26 is not a required process.

ここで、特徴量算出部３３が実行する処理について更に詳細に説明する。 Now, we will explain in more detail the processing performed by the feature calculation unit 33.

特徴量算出部３３は，頭部運動の有無情報及び頭部運動機能の検出結果のうち、少なくとも頭部運動機能の検出結果を取得し、次段の回帰モデル５０に入力するための特徴量を計算する。 The feature calculation unit 33 acquires at least the head movement function detection results from the information on the presence or absence of head movement and the head movement function detection results, and calculates the feature amounts to be input to the next stage regression model 50.

以下、頭部運動学的な特徴量６種と頭部運動機能の特徴量２２種を定義し、少なくとも頭部運動機能の特徴量２２種を使用する場合について説明する。なお、使用する特徴量はこれらに限らず、他の特徴量を活用しても良い。 Below, we define six types of head kinematic features and 22 types of head movement function features, and explain the case where at least the 22 types of head movement function features are used. Note that the features used are not limited to these, and other features may also be used.

図９は、運動学的特徴の一覧を示す図である。ここで、スロットに含まれるフレーム数をT 、スロット内の頭部運動の検出されたフレーム数の合計をNF 、スロット内の頭部運動区間の数をNI とすると、各特徴量は図９示されるように定義される。これについて、更に詳細に説明する。 Figure 9 shows a list of kinematic features. Here, if the number of frames included in a slot is T, the total number of frames in which head movement is detected within the slot is NF, and the number of head movement intervals within the slot is NI, then each feature amount is defined as shown in Figure 9. This will be explained in more detail.

Hrate は、１つのスロットのフレーム数に対する頭部運動区間の検出されたフレームの割合を示す。Hnum は各スロット内の頭部運動区間の回数を示す。Hdurは各スロットで検出された頭部運動区間の一回あたりの平均フレーム数を示す。これらは（式１）で表される。 Hrate indicates the ratio of frames in which head movement segments are detected to the total number of frames in one slot. Hnum indicates the number of head movement segments in each slot. Hdur indicates the average number of frames per head movement segment detected in each slot. These are expressed by (Equation 1).

また、頭部姿勢角に関する特徴量は、スロット内の各頭部姿勢角の標準偏差を示し、スロット内の頭部姿勢角の平均を用いて表される。 The head attitude angle features indicate the standard deviation of each head attitude angle within the slot, and are expressed using the mean of that head attitude angle within the slot.
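From the definitions of T, NF, and NI given above, the kinematic features can be sketched as follows: Hrate = NF / T, Hnum = NI, and Hdur = NF / NI, plus one standard deviation per head attitude angle. Several points are assumptions rather than statements from the specification: the population standard deviation (`statistics.pstdev`) is used because the normalization in the original equation is not reproduced, Hdur is reported as 0 when no head movement interval exists, and the function and key names are hypothetical.

```python
from statistics import pstdev


def kinematic_features(movement_flags, azimuth, elevation, roll):
    """Compute six kinematic features for one slot.

    movement_flags: per-frame booleans, True where head movement was detected.
    azimuth, elevation, roll: per-frame head attitude angles for the same slot.
    """
    T = len(movement_flags)
    if T == 0:
        raise ValueError("slot contains no frames")
    NF = sum(1 for flag in movement_flags if flag)
    # An interval starts wherever a True frame follows a False frame (or the slot start).
    previous = [False] + list(movement_flags[:-1])
    NI = sum(1 for prev, cur in zip(previous, movement_flags) if cur and not prev)
    return {
        "Hrate": NF / T,
        "Hnum": NI,
        "Hdur": NF / NI if NI else 0.0,  # assumption: 0 when there is no interval
        "std_azimuth": pstdev(azimuth),
        "std_elevation": pstdev(elevation),
        "std_roll": pstdev(roll),
    }


# Hypothetical slot of eight frames with two head movement intervals.
flags = [False, True, True, False, True, False, False, False]
features = kinematic_features(
    flags,
    azimuth=[0, 2, 0, 2, 0, 2, 0, 2],
    elevation=[0] * 8,
    roll=[1] * 8,
)
print(features["Hrate"], features["Hnum"], features["Hdur"])  # 0.375 2 1.5
```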

続いて、図１０は、頭部運動機能の特徴を示す図である。図１０において、各スロットにおける総フレーム数をT ，発話関連機能si,t の検出フレーム数を| si |、反応機能ri,t の検出フレーム数を| ri |、その他機能c1,t の検出フレーム数を| c1 |とし、各機能区分の検出フレームの総和を|s|, |r|, |c| とすると，各特徴量は以下のように定義される。

Next, Fig. 10 is a diagram showing the features of head movement functions. In Fig. 10, if the total number of frames in each slot is T, the number of detection frames for speech-related functions si,t is |si|, the number of detection frames for reaction functions ri,t is |ri|, the number of detection frames for other functions c1,t is |c1|, and the sums of detection frames for each function division are |s|, |r|, |c|, then each feature is defined as follows:

（ａ）機能出現率（第１の特徴量）
機能出現率は、各スロットのフレーム数T に対して各頭部運動機能の出現したフレームの割合を示し、（式２）のように表される。

(a) Function occurrence rate (first feature amount)
The function occurrence rate is the proportion of frames in which each head movement function appears, relative to the number of frames T in each slot, and is expressed by (Equation 2).

この（式２）は、１スロット内の各機能の現出量を表し，対話活動の量的側面を示唆する。

This (Equation 2) represents the amount of each function occurring within one slot, and suggests the quantitative aspect of dialogue activity.
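Based on the verbal definition, the function occurrence rate of a function is its number of detection frames divided by T. A minimal sketch, assuming the detection results for a slot are stored as one 0/1 list per frame; three functions are used instead of ten to keep the example short, and the name `occurrence_rates` is hypothetical.

```python
def occurrence_rates(frames):
    """frames: list of per-frame 0/1 lists, one entry per head movement function.
    Returns, for each function, detected frames divided by total frames T."""
    T = len(frames)
    if T == 0:
        raise ValueError("slot contains no frames")
    counts = [sum(column) for column in zip(*frames)]  # detection frames per function
    return [count / T for count in counts]


# Hypothetical slot of four frames and three functions.
frames = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
rates = occurrence_rates(frames)
print(rates)  # [0.5, 0.5, 0.0]
```

Because functions may overlap within a frame, the rates do not need to sum to 1.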

（ｂ）機能含有率（第２の特徴量）
機能含有率は、スロット内の発話関連機能の総フレーム数| s |に対する各発話関連機能の占める割合を示す。これは発話活動において、どの発話機能が相対的に多く現出していたかという発話活動の性質を表す特徴を有する。また、反応機能については、スロット内の反応機能の総フレーム数| r |に対する各反応機能の占める割合を示す。これは、他者に対する反応の中で、どの反応機能が相対的に多く現出されていたかという反応の性質を表す特徴を有する。これらは（式３）で表せられる。

(b) Function content rate (second feature amount)
For speech-related functions, the function content rate is the proportion of each speech-related function relative to the total number of frames |s| of speech-related functions in the slot. It characterizes the nature of the speech activity, namely which speech functions appeared relatively often. Likewise, for reaction functions, it is the proportion of each reaction function relative to the total number of frames |r| of reaction functions in the slot. It characterizes the nature of the reaction, namely which reaction functions appeared relatively often in response to others. These are expressed by (Equation 3).
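Read from the verbal definition, the content rate of a speech-related function is |si| / |s|, and that of a reaction function is |ri| / |r|. The sketch below applies the same helper to both groups. The counts are hypothetical, the number of functions per group is not taken from the specification, and returning 0 when a group never appears in the slot is an assumption, since the text does not say how that case is handled.

```python
def content_rates(counts):
    """counts: detection frame counts of the functions in one group.
    Returns each function's share of the group's total detection frames."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # assumption: report 0 when the group is absent
    return [count / total for count in counts]


# Hypothetical detection frame counts within one slot.
speech_counts = [30, 10, 0]   # |s_i| for each speech-related function
reaction_counts = [5, 15]     # |r_i| for each reaction function

speech_content = content_rates(speech_counts)
reaction_content = content_rates(reaction_counts)
print(speech_content)    # [0.75, 0.25, 0.0]
print(reaction_content)  # [0.25, 0.75]
```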

上述の機能出現率が、スロット内において機能の現出した量を表すものであるのに対して、機能含有率は、機能現出の絶対的な時間の長さには依存せず、これらの表出に含まれる機能の違いを機能間の相対的な量として捉える特徴を有する。例えば、発話時の頭部運動の時間長が同じであったとしても、機能含有率によって、強調の多い発話や問いかけの多い発話など話し手の振る舞いの違いを定量化することができる。同様に聞き手においても相槌に終始する場合や長く思考する場合など、機能含有率によって、傾聴時の振る舞いの性質を捉えることができる。

While the above-mentioned function occurrence rate represents the amount of a function that appears within a slot, the function content rate has the characteristic of not depending on the absolute length of time that the function appears, but rather capturing the difference in the functions contained in these expressions as the relative amount between functions. For example, even if the length of time of head movement during speech is the same, the function content rate can quantify the difference in speaker behavior, such as speech with a lot of emphasis or speech with a lot of questions. Similarly, the function content rate can capture the nature of the listener's behavior during attentive listening, such as whether the listener only responds with a nod or whether the listener thinks for a long time.

（ｃ）機能区分構成比（第３の特徴量）
機能区分構成比は、スロット内の１０種の全機能の出現フレーム数の総和に対する発話関連機能、反応機能、その他機能の各々の占める割合であり、（式４）によって表される。

(c) Functional division composition ratio (third feature amount)
The functional division composition ratio is the proportion that each of the speech-related functions, the reaction functions, and the other functions occupies in the total number of frames in which all 10 functions appear within the slot, and is expressed by (Equation 4).

この（式４）は、スロット内における各参加者の対話活動の概略を示す。例えば、話し手としての活動が多い場合には発話関連機能の構成比が大きくなる。また、聞き手としての活動が多い場合には反応機能の構成比が相対的に大きくなる。

(Equation 4) gives an outline of each participant's dialogue activity within a slot. For example, when a participant is mostly active as a speaker, the ratio for the speech-related functions becomes larger. When a participant is mostly active as a listener, the ratio for the reaction functions becomes relatively larger.
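As a sketch of the composition ratio, assuming the total over all functions equals |s| + |r| + |c| as defined for Fig. 10, each division's share is its own total divided by that sum. The function name `composition_ratio` and the counts are hypothetical, and returning zeros for a slot with no detected function is an assumption.

```python
def composition_ratio(s_total, r_total, c_total):
    """Share of speech-related, reaction, and other functions in the slot,
    relative to the summed detection frames |s| + |r| + |c|."""
    total = s_total + r_total + c_total
    if total == 0:
        return (0.0, 0.0, 0.0)  # assumption: no function detected in the slot
    return (s_total / total, r_total / total, c_total / total)


# Hypothetical slot dominated by speaker activity.
ratio = composition_ratio(s_total=40, r_total=20, c_total=20)
print(ratio)  # (0.5, 0.25, 0.25)
```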

特徴量算出部３３は、以上によって算出した特徴量を主観的印象予測部３４に出力する。 The feature calculation unit 33 outputs the feature calculated as described above to the subjective impression prediction unit 34.

次に、主観的印象予測部３４は、回帰モデルを用いて、頭部運動機能の特徴量及び頭部運動学的な特徴量のうち少なくとも頭部運動機能の特徴量に対する対話者の内観報告スコアの各項目の予測値（主観的印象の予測値）を出力する（Ｓ２７）。 Next, the subjective impression prediction unit 34 uses a regression model to output predicted values for each item of the interlocutor's introspection report score (predicted values of the subjective impression) from at least the head movement function feature amounts, out of the head movement function feature amounts and the head kinematic feature amounts (S27).

ここで、主観的印象予測部３４の処理について、更に詳細に説明する。図１１は、項目毎に内観報告スコアとしての主観的印象の予測値を示す図である。図１１では、内観報告スコア４項目である「雰囲気」、「楽しさ」、「やる気」、「集中度」について２分間の区間で区切り、１から９までの整数で値が表されている。この整数の値は、任意に入力された値（Ａ）でも、何らかの機器より出力された値（Ｂ）でも良い。また、上記整数の値は、各値（Ａ），（Ｂ）が任意の手法で計算され、その結果として求められた値でも良い。 Here, the processing of the subjective impression prediction unit 34 will be described in more detail. FIG. 11 is a diagram showing predicted values of subjective impressions as introspection report scores for each item. In FIG. 11, the four introspection report score items, "atmosphere," "fun," "motivation," and "concentration," are divided into two-minute intervals and values are expressed as integers from 1 to 9. The integer value may be an arbitrarily input value (A) or a value output from some device (B). The integer value may also be a value obtained as a result of calculating each value (A) and (B) using an arbitrary method.
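If a slot is taken to coincide with the two-minute interval used for the introspection report scores, which is an assumption here, per-frame data could be cut into slots as sketched below. The function name `split_into_slots` and the toy frame rate are hypothetical, and the final slot may be shorter than the others.

```python
def split_into_slots(frames, fps, slot_seconds=120):
    """Cut a per-frame sequence into consecutive slots of slot_seconds each."""
    slot_length = int(fps * slot_seconds)
    if slot_length <= 0:
        raise ValueError("slot length must be at least one frame")
    return [frames[i:i + slot_length] for i in range(0, len(frames), slot_length)]


# Toy example: 10 frames at 1 frame per second, cut into 4-second slots.
slots = split_into_slots(list(range(10)), fps=1, slot_seconds=4)
print(slots)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```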

また、主観的印象予測部３４は、機械学習された回帰モデル５０に対し、対話中の各スロットから得られる特徴量を入力し、内観報告スコアの各項目の予測値を出力する。なお、予測値は、９段階の離散値でも、実数値でも良い。 The subjective impression prediction unit 34 also inputs the feature quantities obtained from each slot during the conversation into the machine-learned regression model 50, and outputs a predicted value for each item of the introspection report score. Note that the predicted value may be either a nine-level discrete value or a real value.
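The specification does not state which kind of regression model is used. Purely as an illustration, the sketch below assumes a linear model with already-learned weights, and shows how one score item could be output either as a real value or as a nine-level discrete value. The weights, bias, feature values, the clipping to the range 1 to 9, and the name `predict_score` are all hypothetical.

```python
def predict_score(features, weights, bias, discrete=True):
    """Apply a linear regression model to one slot's feature amounts.

    The raw output is clipped to the 1..9 score range (an assumption) and,
    if discrete is True, rounded to the nearest integer level."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    raw = bias + sum(w * x for w, x in zip(weights, features))
    clipped = min(9.0, max(1.0, raw))
    return round(clipped) if discrete else clipped


# Hypothetical slot features and model parameters for one score item.
slot_features = [0.5, 0.25]
weights = [4.0, 4.0]
bias = 3.2

level = predict_score(slot_features, weights, bias)                   # nine-level value
value = predict_score(slot_features, weights, bias, discrete=False)   # real value
```

In practice one such model, or one output, would be needed for each of the four score items.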

以上により、データ予測の挙動の説明について終了する。 This concludes the explanation of data prediction behavior.

〔実施形態の効果〕
以上説明したように本実施形態によれば、対話中の対話者自身が抱く主観的印象（対話者の心情）を予測することができるという効果を奏する。これにより、対話者の間の関係性構築や心理的安全性の確保といった対話者の対話体験を向上する技術の確立へつながることができる。 [Effects of the embodiment]
As described above, the present embodiment has an effect of being able to predict the subjective impressions (feelings) of the interlocutors during a conversation. This leads to the establishment of technology that improves the conversation experience of the interlocutors, such as building relationships between the interlocutors and ensuring psychological safety.

また、対話中の頭部運動と頭部運動機能の特徴をともに用いることで対話中の状況の違いに頑強な予測を行うことが可能になる。 In addition, by using both head movement during dialogue and features of head movement function, it becomes possible to make predictions that are robust to differences in situations during dialogue.

更に、機能出現率を考慮することで対話活動の量的側面や、機能含有率を考慮することで発話活動の性質、機能区分構成比を考慮することで各参加者の対話活動の概略を特徴として用い、予測の精度向上を図ることが可能になる。 Furthermore, prediction accuracy can be improved by using as features the quantitative aspect of dialogue activity (through the function occurrence rate), the nature of speech activity (through the function content rate), and an outline of each participant's dialogue activity (through the functional division composition ratio).

〔補足〕
本発明は上述の実施形態に限定されるものではなく、以下に示すような構成又は処理（動作）であってもよい。
（１）本発明の予測装置３はコンピュータとプログラムによっても実現できるが、このプログラムを記録媒体に記録することも、通信ネットワークを通して提供することも可能である。
（２）上記実施形態では、入力装置２の一例としてカメラ２ａ等が示されているが、これに限るものではなく、例えば、デスクトップパソコン、ノートパソコン、タブレット端末、スマートウォッチ、カーナビゲーション装置等であってもよい。
（３）各ＣＰＵ３０１，５０１は、単一だけでなく、複数であってもよい。
[Supplement]
The present invention is not limited to the above-described embodiment, and may have the following configurations or processes (operations).
(1) The prediction device 3 of the present invention can be realized by a computer and a program. This program can be recorded on a recording medium or provided via a communication network.
(2) In the above embodiment, a camera 2a or the like is shown as an example of the input device 2, but this is not limited to this and may be, for example, a desktop personal computer, a notebook computer, a tablet terminal, a smart watch, a car navigation device, etc.
(3) There may be a plurality of each of the CPUs 301 and 501, rather than only one.

１通信システム
３予測装置
４モニタ
３１データ取得部（データ取得手段の一例）
３２検出部
３２ａ頭部運動検出部（頭部運動検出手段の一例）
３２ｂ頭部運動機能検出部（頭部運動機能検出手段の一例）
３３特徴量算出部（特徴量算出手段の一例）
３４主観的印象予測部（主観的印象予測手段の一例）
３９モデル記憶部（モデル記憶手段の一例）
４０ａ頭部運動検出モデル
４０ｂ頭部運動機能検出モデル
５０回帰モデル
1 Communication system
3 Prediction device
4 Monitor
31 Data acquisition unit (an example of a data acquisition means)
32 Detection unit
32a Head movement detection unit (an example of a head movement detection means)
32b Head movement function detection unit (an example of a head movement function detection means)
33 Feature amount calculation unit (an example of a feature amount calculation means)
34 Subjective impression prediction unit (an example of a subjective impression prediction means)
39 Model storage unit (an example of a model storage means)
40a Head movement detection model
40b Head movement function detection model
50 Regression model

Claims

1. A prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition means for acquiring data on head movement of the interlocutor from video data obtained by photographing the interlocutor;
a head movement function detection means for detecting a head movement function for the data relating to the head movement using a head movement function detection model;
a feature amount calculation means for calculating a feature amount of the head movement function based on a detection result by the head movement function detection means; and
a subjective impression prediction means for predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function by using a regression model.

2. A prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition means for acquiring data on head movement of the interlocutor from video data obtained by photographing the interlocutor;
a head movement detection means for detecting the presence or absence of head movement with respect to the data relating to the head movement using a head movement detection model;
a head movement function detection means for detecting a head movement function corresponding to head movement presence/absence information indicating the presence or absence of the head movement using a head movement function detection model;
a feature amount calculation means for calculating a feature amount of the head movement function based on a detection result by the head movement function detection means; and
a subjective impression prediction means for predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function by using a regression model.

3. The prediction device according to claim 2, wherein
the data relating to the head movement is coordinate information relating to the head movement, and
the head movement detection means detects the presence or absence of the head movement with respect to the coordinate information relating to the head movement by using the head movement detection model.

4. The prediction device according to claim 2 or 3, wherein
the feature amount calculation means calculates a head kinematic feature amount based on the presence or absence of the head movement detected by the head movement detection means, and
the subjective impression prediction means predicts the subjective impression of the interlocutor with respect to the feature amount of the head movement function and the head kinematic feature amount by using the regression model.

5. A prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition means for receiving interlocutor data relating to an interlocutor and acquiring, from the interlocutor data, information on a speech section indicating an utterance section of the interlocutor;
a head movement detection means for detecting the presence or absence of speech in the speech section information by using a head movement detection model;
a head movement function detection means for detecting a head movement function corresponding to speech presence/absence information indicating the presence or absence of the speech using a head movement function detection model;
a feature amount calculation means for calculating a feature amount of the head movement function based on a detection result by the head movement function detection means; and
a subjective impression prediction means for predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function by using a regression model.

6. The prediction device according to claim 5, wherein
the feature amount calculation means calculates a head kinematic feature amount based on the presence or absence of the speech detected by the head movement detection means, and
the subjective impression prediction means predicts the subjective impression of the interlocutor with respect to the feature amount of the head movement function and the head kinematic feature amount by using the regression model.

7. The prediction device according to claim 5 or 6, wherein the data acquisition means:
acquires the information on the speech section from audio data included in video data, the video data being the interlocutor data obtained by filming the interlocutor;
acquires the information on the speech section from audio data, the audio data being the interlocutor data obtained by collecting the interlocutor's voice;
acquires the information on the speech section by determining that the interlocutor has spoken when the motion vector of a feature point existing around the mouth of the interlocutor in the video data is equal to or greater than a predetermined threshold; or
acquires the information on the speech section from a corpus in which the speech content of the video data is transcribed together with time series information.

8. A prediction method executed by a prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition step of acquiring data on head movement of the interlocutor from video data obtained by photographing the interlocutor;
a head movement function detection step of detecting a head movement function for the data relating to the head movement using a head movement function detection model;
a feature amount calculation step of calculating a feature amount of the head movement function based on a detection result in the head movement function detection step; and
a subjective impression prediction step of predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function using a regression model.

9. A prediction method executed by a prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition step of acquiring data on head movement of the interlocutor from video data obtained by photographing the interlocutor;
a head movement detection step of detecting the presence or absence of head movement with respect to the data relating to the head movement using a head movement detection model;
a head movement function detection step of detecting a head movement function corresponding to head movement presence/absence information indicating the presence or absence of the head movement using a head movement function detection model;
a feature amount calculation step of calculating a feature amount of the head movement function based on a detection result in the head movement function detection step; and
a subjective impression prediction step of predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function using a regression model.

10. A prediction method executed by a prediction device for predicting a subjective impression held by a person in a conversation, comprising:
a data acquisition step of receiving interlocutor data relating to an interlocutor and acquiring, from the interlocutor data, information on a speech section indicating an utterance section of the interlocutor;
a head movement detection step of detecting the presence or absence of speech in the speech section information by using a head movement detection model;
a head movement function detection step of detecting a head movement function corresponding to speech presence/absence information indicating the presence or absence of the speech using a head movement function detection model;
a feature amount calculation step of calculating a feature amount of the head movement function based on a detection result in the head movement function detection step; and
a subjective impression prediction step of predicting a subjective impression of the interlocutor with respect to the feature amount of the head movement function using a regression model.

11. A program for causing a computer to execute the method according to any one of claims 8 to 10.