JP7640964B2

JP7640964B2 - Speech content recognition device, method, and program

Info

Publication number: JP7640964B2
Application number: JP2021024841A
Authority: JP
Inventors: 哲嗣田村; 真之介磯部; 悟速水; 拓実西脇; 悠斗後藤; 将樹能勢
Original assignee: Ricoh Co Ltd; Tokai National Higher Education and Research System NUC
Current assignee: Ricoh Co Ltd; Tokai National Higher Education and Research System NUC
Priority date: 2021-02-19
Filing date: 2021-02-19
Publication date: 2025-03-06
Anticipated expiration: 2041-02-19
Also published as: JP2022126962A

Description

The present invention relates to a speech content recognition device, method, and program.

Speech content recognition devices that recognize the content of a speaker's speech are conventionally known. For example, one known device receives image data of a speaker's lips and recognizes the speaker's speech content using a lip reading unit that has high lip reading accuracy for lip image data captured from a corresponding direction.

Non-Patent Document 1 discloses a method called "View2View," an encoder-decoder model built on a convolutional neural network. In this method, lip reading results are output by a machine lip reading model trained in advance on frontal face image data (lip image data captured from the front of the face). When non-frontal face image data is input, it is first converted to frontal face image data and then input to the machine lip reading model, which outputs the lip reading result.

Non-Patent Document 2 discloses an end-to-end lip reading method that uses a deep learning technique called bidirectional long short-term memory. Non-Patent Document 2 states that the lip reading performance of a lip reading model improves when the model is trained on a combination of face image data captured from multiple imaging angles, such as frontal and profile views.

A conventional speech content recognition device equipped with a lip reading unit obtains correct lip reading results with high accuracy for lip image data captured from a specific direction (the corresponding direction), but its accuracy drops for lip image data captured from a direction other than the corresponding direction.

To solve the above problem, the present invention provides a speech content recognition device that recognizes the speech content of a speaker, comprising: an input unit that receives lip image data of the speaker; a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its corresponding direction; and an integration generation unit that integrates the lip reading processing results produced by the plurality of lip reading units for the lip image data received by the input unit and generates a recognition result of the speaker's speech content based on the result of the integration, wherein at least one of the plurality of lip reading units is configured so that its corresponding directions include a direction that is not included in the corresponding directions of any of the other lip reading units, and the plurality of lip reading units include a multi-directional lip reading unit having two or more corresponding directions.

According to the present invention, correct lip reading results can be obtained with high accuracy even for lip image data captured from a direction that does not match the corresponding direction of any lip reading unit. Speech content can therefore be recognized with high accuracy for a wider variety of lip image data (that is, more imaging directions) than the number of corresponding directions.

FIG. 1 is a block diagram showing a lip reading device according to Embodiment 1.
FIG. 2 is an explanatory diagram showing an example of a model of one neuron that constitutes a neural network.
FIG. 3 is an explanatory diagram showing an example of a neural network having a multi-layer structure.
FIG. 4 is an explanatory diagram showing an overview of a method (learning mode) for creating a machine lip reading model (trained model) in Embodiment 1.
FIG. 5 is a block diagram showing another example of the lip reading device according to Embodiment 1.
FIG. 6 is a block diagram showing yet another example of the lip reading device according to Embodiment 1.
FIG. 7 is a block diagram showing a lip reading device according to Modification 1.
FIG. 8 is a block diagram showing a lip reading device according to Modification 2.
FIG. 9 is a block diagram showing a multimodal speech recognition device according to Embodiment 2.
FIG. 10 is an explanatory diagram showing the configuration of a training data collection system according to Embodiment 3.
FIG. 11 is an explanatory diagram showing the camera array of the training data collection system as viewed from vertically above.

[Embodiment 1]
An embodiment in which the present invention is applied to a lip reading device serving as the speech content recognition device (hereinafter referred to as "Embodiment 1") is described below.
The lip reading device of Embodiment 1 receives, as lip image data, face image data obtained by imaging a speaker's face, analyzes the lip portion of that face image data, and outputs a recognition result (lip reading result) of the content uttered by the speaker.

FIG. 1 is a block diagram showing the lip reading device according to Embodiment 1.
The lip reading device 100 of Embodiment 1 mainly comprises an image input unit 111 serving as the input unit, two single-angle compatible lip reading units 131 and 132 serving as the plurality of lip reading units, and a lip reading result integration unit 141 serving as the integration generation unit.

The image input unit 111 accepts face image data (lip image data) of the speaker whose speech content is to be recognized. In Embodiment 1, the image input unit 111 is communicably connected, by wire or wirelessly, to a camera 1, which is an imaging device that captures the speaker's face, and to a storage medium 2 that stores face image data. The camera 1 supplies the image input unit 111 with real-time face image data of the speaker as the speaker speaks. The storage medium 2 stores face image data recorded when the speaker spoke in the past and supplies that past face image data to the image input unit 111.

Where necessary, the image input unit 111 applies image processing to the input face image data before passing it to each of the two single-angle compatible lip reading units 131 and 132. For example, it extracts the lip image portions from the input face image data, arranges them in chronological order, and passes the resulting lip image data to each of the single-angle compatible lip reading units 131 and 132.

There is no particular limitation on the imaging direction of the lip image data input to the image input unit 111, as long as the image data is captured so as to include the speaker's lips.
The lip image data input to the image input unit 111 may be in an image data format, or in a non-image data format obtained by processing or performing calculations on lip image data.
Lip image data is normally captured image data obtained by imaging a real speaker with an imaging device or the like, but it may also be image data of a virtual speaker (for example, one created with computer graphics) as seen from a given viewpoint.

Each of the two single-angle compatible lip reading units 131 and 132 performs lip reading processing that is highly accurate for lip image data captured from a specific direction (its corresponding direction) and generates a lip reading processing result. The two units are configured so that the corresponding directions of each include a direction that is not included in the corresponding directions of the other.

In Embodiment 1, a corresponding direction is expressed as an angle about the vertical axis (hereinafter the "corresponding angle"), with the imaging direction directly in front of the speaker's face taken as the reference (0°). The first single-angle compatible lip reading unit 131 has high lip reading accuracy for face image data captured from directly in front of the speaker's face (that is, its accuracy exceeds an accuracy threshold that satisfies the level required by the user), so its corresponding angle (the angle at which lip reading accuracy is high) is 0°. The second single-angle compatible lip reading unit 132 has high lip reading accuracy for face image data captured from a direction shifted 30° sideways from the front of the speaker's face, so its corresponding angle is 30°.

The single-angle compatible lip reading units 131 and 132 of Embodiment 1 run a predetermined lip reading processing program (estimation program) on a computer to perform lip reading processing on the face image data input to the image input unit 111 and to generate lip reading processing results. The lip reading processing program of Embodiment 1 uses a machine lip reading model (trained model) trained on training data that includes face image data of speakers, although a lip reading processing program hand-written by a programmer may be used instead.

The machine lip reading model (trained model) in Embodiment 1 estimates the speaker's speech content from the input data (face image data), and there is no particular limitation on the format of the estimation result (lip reading processing result) that the model outputs. As an example, Embodiment 1 describes the case in which, for the input face image data, the model outputs as the lip reading processing result data that contains one or more speech content candidates (a single character, a single word, a word sequence, or the like) together with reliability information for each candidate (hereinafter the "reliability score").
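The output format described above (candidates, each with a reliability score) could be represented as in the following minimal sketch. The class and field names (`Candidate`, `LipReadingResult`, `unit_id`, `corresponding_angle`) and the sample values are assumptions made for illustration; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One speech content candidate (character, word, or word sequence)."""
    text: str
    reliability: float  # reliability score for this candidate


@dataclass
class LipReadingResult:
    """Lip reading processing result of one lip reading unit (hypothetical layout)."""
    unit_id: int                # e.g. 131 or 132
    corresponding_angle: float  # degrees; 0.0 means frontal
    candidates: list

    def best(self) -> Candidate:
        # Candidate with the highest reliability score.
        return max(self.candidates, key=lambda c: c.reliability)


# Illustrative values only.
result = LipReadingResult(
    unit_id=131,
    corresponding_angle=0.0,
    candidates=[Candidate("hello", 0.7), Candidate("yellow", 0.2)],
)
top = result.best()
```

A structure like this keeps every candidate rather than only the top one, which is what a later integration step would need.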

A machine lip reading model specialized for a given corresponding angle (a trained model with high lip reading accuracy for face image data at that angle) can be created, for example, by machine learning or deep learning that uses a large amount of face image data captured from that angle as training data. For example, performing supervised learning on a predetermined model with such training data yields a machine lip reading model (trained model) that, on receiving unknown face image data, outputs as its estimation result data containing one or more speech content candidates and a reliability score for each candidate, according to the features learned from the training data. Embodiment 1 adopts a neural network model as the predetermined model, but other machine learning models can also be used.

In supervised learning, a machine learning device is generally given a large number of pairs of an input and a result (label). From these it learns the features present in the data set and inductively acquires a model that estimates the result from the input, that is, the relationship between the two. This can be realized with algorithms such as the neural network described below or a support vector machine (SVM).

A neural network is implemented by, for example, an arithmetic unit and memory that realize a network modeled on the neuron shown in FIG. 2. As shown in FIG. 2, a neuron produces an output y for multiple inputs x (inputs x1 to x3 are used here as an example; there may be fewer or more inputs). Each of the inputs x1 to x3 is multiplied by the weight W (W1 to W3) corresponding to that input. The neuron thus produces the output y expressed by equations (1) and (2) below, where θ is the bias and fk is the activation function.

$$y = f_k(v) \tag{1}$$

$$v = \sum_{i} W_i x_i - \theta \tag{2}$$
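Equations (1) and (2) can be written out directly in code. The sketch below assumes a sigmoid for the activation function fk, which the text does not specify; the function names and the input values are illustrative.

```python
import math


def sigmoid(v: float) -> float:
    """Example activation function f_k (an assumption; the text leaves f_k open)."""
    return 1.0 / (1.0 + math.exp(-v))


def neuron_output(x, w, theta, activation=sigmoid):
    """Compute y = f_k(v) with v = sum_i(W_i * x_i) - theta."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    v = sum(wi * xi for wi, xi in zip(w, x)) - theta  # equation (2)
    return activation(v)                              # equation (1)


# Three inputs x1..x3 with weights W1..W3, as in the FIG. 2 example.
y = neuron_output(x=[1.0, 0.5, -1.0], w=[0.4, 0.6, 0.2], theta=0.5)
```

With these illustrative values the weighted sum cancels the bias, so v is approximately 0 and y is approximately 0.5.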

A neural network operates in a learning mode and an evaluation mode. In the learning mode, the weights W are learned from training data; in the evaluation mode, those weights are used to obtain the output for evaluation data (in Embodiment 1, the speech content candidates and their reliability scores). The weights W1 to W3 can be learned by backpropagation or a similar method. Backpropagation adjusts (learns) the weights of each neuron so as to reduce the difference between the output y produced for an input x and the correct output (the correct label).
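The weight adjustment described above can be illustrated for a single neuron. This is a minimal sketch, assuming a sigmoid activation, a squared-error loss, and plain gradient descent; full backpropagation applies the same chain rule layer by layer. All names and values are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w, theta):
    # y = f(sum(W * x) - theta), matching equations (1) and (2)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - theta)


def squared_error(y, target):
    return 0.5 * (y - target) ** 2


def update_step(x, w, theta, target, lr):
    """One gradient-descent step that reduces the output error."""
    y = forward(x, w, theta)
    delta = (y - target) * y * (1.0 - y)  # dE/dv via the chain rule
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]  # dv/dW_i = x_i
    new_theta = theta + lr * delta                          # dv/dtheta = -1
    return new_w, new_theta


x, target = [1.0, 0.5, -1.0], 1.0
w, theta = [0.0, 0.0, 0.0], 0.0
loss_before = squared_error(forward(x, w, theta), target)
for _ in range(100):
    w, theta = update_step(x, w, theta, target, lr=0.5)
loss_after = squared_error(forward(x, w, theta), target)
```

After the loop, `loss_after` is smaller than `loss_before`, which is the sense in which the difference from the correct label is reduced.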

As shown in FIG. 3, a neural network can be given a multi-layer structure, an approach referred to as deep learning. In the example of FIG. 3, the intermediate (hidden) part consists of three layers. Each layer consists of multiple nodes (neurons), and the nodes of adjacent layers are connected by links that each carry a different weight W. The inputs x1 to x6 supplied to the input layer are weighted by the weights W and combined as they pass through the nodes of the intermediate layers, and the result passes through the output layer to yield the output y.
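A forward pass through such a multi-layer network might look like the following sketch. It assumes six inputs, three hidden layers of four nodes each, a single output node, sigmoid activations, and randomly initialized weights; only the six inputs and three hidden layers come from the FIG. 3 description, and the rest is illustrative.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_layer(n_in, n_out, rng):
    """Random weights (one row per node) and zero biases for one layer."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_forward(inputs, weights, biases):
    # Each node applies y = f(sum(W * x) - theta) to the previous layer's outputs.
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) - b)
        for row, b in zip(weights, biases)
    ]


def network_forward(x, layers):
    activations = x
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


rng = random.Random(0)
sizes = [6, 4, 4, 4, 1]  # input, three hidden layers, output (hidden widths assumed)
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
y = network_forward([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], layers)
```

In a trained model the weights would come from the learning mode rather than from a random generator.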

Embodiment 1 adopts a neural network model consisting of a multi-layer neural network such as that shown in FIG. 3. The machine lip reading model (trained model) is created by supervised learning on training data that includes face image data of speakers uttering known speech content, with the correct speech content used as the correct label.

FIG. 4 is an explanatory diagram showing an overview of the method (learning mode) for creating the machine lip reading model (trained model) in Embodiment 1.
In the learning mode of Embodiment 1, as shown in FIG. 4, a speaker utters the specified speech content, and recording cameras 31-1 and 31-2 capture the speaker from the respective corresponding angles (0° and 30° in Embodiment 1). The captured face image data is stored in a training data storage medium 32 separately for each corresponding angle, in a format that allows its time series to be identified. The stored face image data can therefore be matched against the times at which the speaker spoke (the time at which each piece of speech content was uttered), associated with the speaker's speech content, and used as training data.

Of the training data accumulated in the training data storage medium 32 in this way, the face image data with a corresponding angle of 0° is used by a first learning unit 33-1, and the face image data with a corresponding angle of 30° is used by a second learning unit 33-2. To improve accuracy further, the training data may include other information that is useful as a feature for identifying (estimating) the speech content. From its input training data, the first learning unit 33-1 generates a machine lip reading model (trained model) with a corresponding angle of 0°, which is installed in the first single-angle compatible lip reading unit 131 of Embodiment 1. Likewise, the second learning unit 33-2 generates a machine lip reading model (trained model) with a corresponding angle of 30°, which is installed in the second single-angle compatible lip reading unit 132 of Embodiment 1.

Parameter tuning may be performed on the generated machine lip reading model (trained model) by repeating the creation of the trained model (the learning mode) on a trial basis. The parameters adjusted in parameter tuning are the setting values and limit values of the trained model (hyperparameters). Parameter tuning is, for example, the task of searching for and setting the parameters with which the model produces its best solution. Methods such as grid search and random search can be used.
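Grid search and random search can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in for "train the model with these hyperparameters and return its validation accuracy"; the hyperparameter names and values are likewise assumptions, not taken from the patent.

```python
import itertools
import random


def toy_score(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and measuring its accuracy."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_units - 64) / 1000


def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(grid, score_fn, n_trials, seed=0):
    """Evaluate a fixed number of randomly chosen combinations."""
    rng = random.Random(seed)
    trials = [{name: rng.choice(values) for name, values in grid.items()}
              for _ in range(n_trials)]
    best_params = max(trials, key=lambda p: score_fn(**p))
    return best_params, score_fn(**best_params)


grid = {"learning_rate": [0.1, 0.01, 0.001], "hidden_units": [32, 64, 128]}
grid_best, grid_best_score = grid_search(grid, toy_score)
rand_best, rand_best_score = random_search(grid, toy_score, n_trials=4)
```

Grid search is exhaustive over the listed values, whereas random search trades completeness for a fixed evaluation budget, so its best score can never exceed the grid optimum on the same grid.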

The machine lip reading model (trained model) may also be subjected to model evaluation, for example by cross-validation or the hold-out method. The two methods can also be used in combination.

Specifically, in the hold-out method, the original data is divided in advance into training data used in the learning mode and test data used in the evaluation mode, and creation of the trained model is attempted using only the training data. The test data is then input to the resulting trained model, and the model is evaluated by comparing its outputs with the correct labels of the test data; the resulting error indicates the estimation accuracy.
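The hold-out procedure can be sketched as below. The 80/20 split ratio, the function names, and the toy data are assumptions for illustration; the text only requires that the data be split before training and that only the training portion be used to build the model.

```python
import random


def holdout_split(samples, test_ratio=0.2, seed=0):
    """Shuffle once, then reserve a fixed share of the data as test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]  # (training data, test data)


def accuracy(predict, test_set):
    """Share of test samples whose prediction matches the correct label."""
    correct = sum(1 for x, label in test_set if predict(x) == label)
    return correct / len(test_set)


# Toy data: (input, correct label) pairs.
samples = [(i, i % 2) for i in range(10)]
train_set, test_set = holdout_split(samples)

# Hypothetical "trained model"; a real one would be fitted on train_set only.
def predict(x):
    return x % 2

test_accuracy = accuracy(predict, test_set)
```

The key property is that no test sample is ever seen during training, so the test accuracy estimates performance on unknown data.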

In cross-validation, the original data is divided into, for example, five groups. In the first round, one group is used as test data and the remaining groups as training data to create a trained model and evaluate it. In the second round, a different group from the first round is used as test data; in the third round, a group different from those of the first and second rounds is used as test data; and in each case a trained model is created and evaluated in the same way. This is repeated until every one of the five groups has served as test data, and the model evaluations (estimation accuracies) from all rounds are averaged.
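Five-fold cross-validation as described above can be sketched as follows. The fold assignment (every k-th sample), the majority-label "model", and the toy data are assumptions chosen to keep the example short.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices); each group is the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(samples, train_fn, score_fn, k=5):
    """Train and evaluate k times, then average the k evaluation scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[j] for j in train_idx])
        scores.append(score_fn(model, [samples[j] for j in test_idx]))
    return sum(scores) / len(scores)


def train_majority(train_set):
    # Hypothetical "model": always predicts the most common training label.
    return Counter(label for _, label in train_set).most_common(1)[0][0]


def score_majority(model, test_set):
    return sum(1 for _, label in test_set if label == model) / len(test_set)


samples = [(i, 1 if i < 8 else 0) for i in range(10)]
mean_accuracy = cross_validate(samples, train_majority, score_majority, k=5)
```

Averaging over all folds makes the estimate less dependent on which samples happen to land in a single test split.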

The estimation program (trained model) of Embodiment 1 can also be distilled to create a new estimation program (distilled model) with equivalent functions. Specifically, face image data whose speech content is known is input to the estimation program (trained model) of Embodiment 1 as distillation input data, and the program outputs reliability scores for it. Distillation training data is then created by using the output reliability scores as the correct labels for the distillation input data, and a model is trained on this data to produce a new estimation program (distilled model) with functions equivalent to those of the original. A distilled model created in this way is generally lighter than the original trained model. With suitable choices, such as careful design of the distillation input data, the distilled model can also achieve higher estimation accuracy than the original.
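The distillation step can be sketched as building a data set whose labels are the reliability scores output by the existing model. In the sketch below, `teacher` is a hypothetical stand-in for the trained estimation program, the two-element feature vectors stand in for face image data, and the soft cross-entropy loss is one common choice for training against score-valued labels; none of these specifics come from the patent.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def teacher(features):
    """Hypothetical trained model: returns one reliability score per candidate."""
    return softmax([2.0 * features[0], 1.0 * features[1], 0.5])


def build_distillation_set(inputs, teacher_fn):
    """Pair each distillation input with the teacher's scores as its label."""
    return [(x, teacher_fn(x)) for x in inputs]


def soft_cross_entropy(student_probs, teacher_probs):
    """Loss a student would minimize to reproduce the teacher's scores."""
    return -sum(t * math.log(max(s, 1e-12))
                for s, t in zip(student_probs, teacher_probs))


inputs = [[0.9, 0.1], [0.2, 0.8]]
distillation_set = build_distillation_set(inputs, teacher)
_, soft_label = distillation_set[0]

# A student that reproduces the teacher's scores has a lower loss than one
# that outputs a uniform distribution.
loss_matched = soft_cross_entropy(soft_label, soft_label)
loss_uniform = soft_cross_entropy([1 / 3, 1 / 3, 1 / 3], soft_label)
```

A smaller student model trained to minimize this loss over the whole distillation set would approximate the teacher's behavior, which is how the distilled model ends up lighter while keeping equivalent functions.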

The machine lip reading model of Embodiment 1 is an example in which data containing speech content candidates and the reliability score of each candidate is used as the lip reading processing result. Alternatively, intermediate representation data matched to the data format used by the downstream lip reading result integration unit 141 may be used as the lip reading processing result. Specifically, the lip reading processing result may be vector data that records the internal state of the machine lip reading model at the time it performed the lip reading processing.

In Embodiment 1, lip reading processing is thus performed on lip image data using the two single-angle compatible lip reading units 131 and 132, each of which has high lip reading accuracy for lip image data captured from a specific direction (its corresponding direction). Each of the two units has a corresponding angle (an angle at which high lip reading accuracy is obtained) that is not among the corresponding angles of the other unit. Consequently, for lip image data captured from an angle that matches either corresponding angle (0° or 30°), the matching unit achieves high lip reading accuracy, and the speech content can be recognized with high accuracy from its lip reading processing result. Embodiment 1 can therefore recognize speech content with high accuracy for lip image data captured from angles of 0° and 30°.

For lip image data captured from an angle that matches neither corresponding angle (for example, 15° or 45°), however, the lip reading processing of either single-angle compatible lip reading unit 131 or 132 alone does not achieve sufficient lip reading accuracy. Using only the lip reading processing result of one of the units therefore does not allow the speech content of such lip image data to be recognized with high accuracy.

Embodiment 1 therefore provides the lip reading result integration unit 141, which integrates the lip reading processing results obtained by the two single-angle compatible lip reading units 131 and 132 and, based on the integrated result, generates the recognition result of the speaker's speech content as the final lip reading result. In the individual results of units 131 and 132, the likelihood (reliability score) of the correct speech content may be not significantly higher than, or may even be lower than, that of other, incorrect speech content. Even in that case, integrating the results makes the likelihood of the correct speech content stand out, so that it becomes significantly higher than the likelihood of the incorrect speech content.

There is no particular limitation on the integration processing performed by the lip reading result integration unit 141, as long as it integrates the lip reading processing results obtained by the two single-angle compatible lip reading units 131 and 132 in a way that yields a highly accurate recognition result (in Embodiment 1, in a way that makes the reliability score of the correct speech content relatively high).

As one example of the integration processing performed by the lip reading result integration unit 141, the integrated result may be the candidate with the highest reliability score among the speech content candidates (words, word sequences, or the like) that appear in both of the lip reading processing results obtained by the two single-angle compatible lip reading units 131 and 132. Alternatively, the integrated result may be two or more such candidates, taken in descending order of reliability score.
As another example, the reliability scores of each speech content candidate contained in the lip reading processing results of the two units may be summed, and the candidate with the highest total may be used as the integrated result. Alternatively, two or more candidates for which totals have been calculated may be used as the integrated result.
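The two score-based integration rules above can be sketched as follows. Each lip reading processing result is represented here as a dictionary mapping a candidate to its reliability score; that representation, the function names, and the sample scores are assumptions for illustration.

```python
def integrate_common_best(results):
    """Pick the highest-scoring candidate among those present in every result."""
    common = set(results[0]).intersection(*results[1:])
    if not common:
        return None
    return max(common, key=lambda text: max(r[text] for r in results))


def integrate_by_sum(results):
    """Sum reliability scores per candidate; return candidates, best first."""
    totals = {}
    for result in results:
        for text, score in result.items():
            totals[text] = totals.get(text, 0.0) + score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative results for an image captured between the two corresponding
# angles: neither unit alone is confident about the correct candidate.
unit_131 = {"good morning": 0.40, "good evening": 0.35, "goodbye": 0.25}
unit_132 = {"good morning": 0.38, "good mourning": 0.42, "goodbye": 0.20}

common_best = integrate_common_best([unit_131, unit_132])
ranked = integrate_by_sum([unit_131, unit_132])
```

In this example unit 132 on its own ranks a wrong candidate first, yet both rules select "good morning" because it is supported by both units. Returning the full ranked list also covers the variant in which two or more candidates form the integrated result.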

As another example of the integration processing performed by the lip reading result integration unit 141, a trained model (integration model) that obtains a recognition result from the intermediate representations of the two single-angle compatible lip reading units 131 and 132 is trained in advance, for example as a neural network model. The intermediate representations (vectors or the like) contained in the lip reading processing results of the two units are then concatenated into a single intermediate representation, and the computer of the lip reading result integration unit 141 runs the trained integration model on it to obtain recognition results for one or more pieces of speech content, which serve as the integrated result. The recognition results obtained from the single intermediate representation may also include their respective reliability scores.
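The concatenation-based integration can be sketched as below. The integration model is reduced here to a single linear layer followed by a softmax; the hand-set weight matrix, the two-element intermediate vectors, and the two-word vocabulary are all hypothetical, whereas the patent's integration model would be trained in advance.

```python
import math


def concatenate(vectors):
    """Join the per-unit intermediate representations into one vector."""
    return [value for vector in vectors for value in vector]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integration_model(joint, weights, vocabulary):
    """Stand-in for a trained integration model: linear layer plus softmax."""
    scores = [sum(w * x for w, x in zip(row, joint)) for row in weights]
    probabilities = softmax(scores)
    ranked = sorted(zip(vocabulary, probabilities), key=lambda p: p[1], reverse=True)
    return ranked  # (speech content, reliability score), best first


h_131 = [0.2, 0.8]  # hypothetical internal state of unit 131
h_132 = [0.6, 0.1]  # hypothetical internal state of unit 132
joint = concatenate([h_131, h_132])

weights = [[1.0, 0.0, 1.0, 0.0],   # row for "yes" (hand-set, not learned)
           [0.0, 1.0, 0.0, 1.0]]   # row for "no"
recognition = integration_model(joint, weights, ["yes", "no"])
```

Because the model sees both intermediate representations at once, it can learn how the two views relate, which simple score summation cannot express.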

According to the lip reading device 100 of Embodiment 1, the lip reading result integration unit 141 integrates the lip reading process results of the two single-angle compatible lip reading units 131 and 132. This makes it possible to recognize utterance content with high accuracy not only at the corresponding angles of 0° and 30°, at which each single-angle compatible lip reading unit 131, 132 produces highly accurate lip reading results, but also for lip image data captured at angles other than these corresponding angles (for example, 15° or 45°). As a result, utterance content can be recognized with high accuracy for lip image data captured from a greater variety of angles than the number of corresponding angles (two: 0° and 30°) of the two single-angle compatible lip reading units 131 and 132.

The recognition result generated by the lip reading result integration unit 141 is output to, for example, a downstream information processing device that uses the recognition result of the speaker's utterance content, or an information storage device that accumulates such recognition results. The output mode is not particularly limited. For example, the recognition result may be displayed on a display unit provided in the lip reading device 100 of Embodiment 1, or output as audio from an audio output unit provided in the lip reading device 100.

Although Embodiment 1 described above uses two lip reading units, three or more lip reading units may be used. For example, as shown in FIG. 5, a third single-angle compatible lip reading unit 133 may be added that has high lip reading accuracy for face image data captured from a direction shifted 60° sideways from the front direction of the speaker's face.

It is also possible to create a multiple-angle compatible lip reading unit, that is, a single lip reading unit capable of producing highly accurate lip reading results at two or more corresponding angles. Specifically, for example, as shown in FIG. 4, the face image data captured from two corresponding angles (0° and 30°) and stored in the learning data storage medium 32 is input to a single learning unit as training data. Training on this data generates a machine lip reading model (trained model) capable of highly accurate lip reading processing at the two corresponding angles of 0° and 30°.

Therefore, for example, as shown in FIG. 6, a first multiple-angle compatible lip reading unit 134 capable of highly accurate lip reading processing at the two corresponding angles of 0° and 30° may be provided in place of the second single-angle compatible lip reading unit 132 described above. Furthermore, as shown in FIG. 6, a second multiple-angle compatible lip reading unit 135 capable of highly accurate lip reading processing at the three corresponding angles of 0°, 30°, and 60° may be provided in place of the third single-angle compatible lip reading unit 133 described above.

However, a multiple-angle compatible lip reading unit has both an advantage and a disadvantage compared with creating multiple single-angle compatible lip reading units, one for each of its corresponding angles. The advantage is that lip reading accuracy can usually be expected to improve over a wider range of angles. The disadvantage is that the training cost increases. This disadvantage can be assessed, for example, by comparing the amount of training data needed to build the machine lip reading models that execute the respective lip reading processes. To obtain a predetermined high accuracy (accuracy exceeding a predetermined accuracy threshold) at every one of the corresponding angles, the machine lip reading model of a multiple-angle compatible lip reading unit usually requires far more training data than the total amount needed to build the machine lip reading models of the multiple single-angle compatible lip reading units. In addition, as the training data grows, costs such as parameter tuning also increase.

On the other hand, given the above advantage of the multiple-angle compatible lip reading unit, a configuration that mixes single-angle compatible lip reading units and multiple-angle compatible lip reading units, as in the example of FIG. 6, is preferable. In such a configuration, for example, the most frequent imaging angle in the input face image data (lip image data) and the angles near it (corresponding angles at which high lip reading accuracy is obtained) are handled by a single-angle compatible lip reading unit, and the other angles are handled by a multiple-angle compatible lip reading unit. Integrating these lip reading process results in the lip reading result integration unit 141 makes it possible to recognize utterance content with high accuracy for lip image data captured from a wider variety of angles.

In this respect, it is theoretically possible to build a single lip reading unit that recognizes utterance content with high accuracy for lip image data captured from as wide a variety of angles as in Embodiment 1. However, it is extremely difficult to realize such a single lip reading unit with a hand-written lip reading program. Realizing it with a machine lip reading model (trained model) would require an enormous amount of training data, so building such a model is also difficult in practice.

In contrast, each lip reading unit used in the lip reading device 100 of Embodiment 1 is specialized either for one corresponding angle (a single-angle compatible lip reading unit) or for a small number of corresponding angles (a multiple-angle compatible lip reading unit), and such lip reading units are relatively easy to build. Embodiment 1 therefore also has the advantage that a lip reading device capable of recognizing utterance content with high accuracy for lip image data from various angles can be created more easily.

The corresponding angles of the lip reading units whose lip reading process results are integrated by the lip reading result integration unit 141 may partially overlap. That is, the corresponding angles of a multiple-angle compatible lip reading unit may include some or all of the corresponding angles of another single-angle compatible lip reading unit or another multiple-angle compatible lip reading unit, provided that the combination is not exactly the same. For example, as shown in FIG. 6, 0° may be a corresponding angle of all of the lip reading units 131, 134, and 135, and 30° may be a corresponding angle of the two multiple-angle compatible lip reading units 134 and 135.
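The overlap condition can be checked mechanically, as in the following small sketch. The dictionary layout and function names are hypothetical. The sketch also applies the "not exactly the same combination" condition to every pair of units, which is a slightly broader reading than the text, which states it for multiple-angle compatible lip reading units.

```python
from itertools import combinations


def angle_sets_are_distinct(units: dict[str, frozenset[int]]) -> bool:
    """Return True when no two units have exactly the same set of corresponding angles."""
    return all(a != b for a, b in combinations(units.values(), 2))


def shared_angles(units: dict[str, frozenset[int]]) -> dict[tuple[str, str], frozenset[int]]:
    """Map each pair of unit names to the corresponding angles they have in common."""
    return {
        (n1, n2): units[n1] & units[n2]
        for n1, n2 in combinations(units, 2)
    }


# Configuration of FIG. 6: unit reference numeral -> corresponding angles in degrees.
figure6 = {
    "131": frozenset({0}),
    "134": frozenset({0, 30}),
    "135": frozenset({0, 30, 60}),
}
is_valid = angle_sets_are_distinct(figure6)
overlaps = shared_angles(figure6)
```

For the FIG. 6 configuration, every pair of units shares 0°, and units 134 and 135 additionally share 30°, while no two units have an identical set.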

[Modification 1]
Next, a modification of the lip reading device 100 of Embodiment 1 described above (hereinafter referred to as "Modification 1") will be described.
FIG. 7 is a block diagram showing the lip reading device 100 according to Modification 1.
As shown in FIG. 7, the lip reading device 100 of Modification 1 has three lip reading units: a first single-angle compatible lip reading unit 131 capable of highly accurate lip reading processing at a corresponding angle of 0°, a second single-angle compatible lip reading unit 136 capable of highly accurate lip reading processing at a corresponding angle of 45°, and a multiple-angle compatible lip reading unit 137 capable of highly accurate lip reading processing at the two corresponding angles of 0° and 45°.

The lip reading device 100 of Modification 1 also includes angle conversion units 121, 122, and 123 serving as data conversion units. Based on the lip image data input to the image input unit 111, these units generate converted data whose imaging direction matches a corresponding angle of at least one of the lip reading units 131, 136, and 137. In the example of FIG. 7, an angle conversion unit 121, 122, 123 is provided as a preceding processing stage for each of the three lip reading units 131, 136, and 137, and each angle conversion unit converts the lip image data input to the image input unit 111 so that the imaging direction matches one of the corresponding angles of its lip reading unit. In other words, each angle conversion unit 121, 122, 123 converts the input lip image so that it becomes approximately equivalent to a lip image captured from a corresponding angle of the respective lip reading unit 131, 136, 137.

For example, when lip image data captured at an angle of 30° is input to the image input unit 111, the first angle conversion unit 121 converts the input lip image data so that the imaging direction matches 0°, the corresponding angle of the first single-angle compatible lip reading unit 131. Similarly, the second angle conversion unit 122 converts the input lip image data so that the imaging direction matches 45°, the corresponding angle of the second single-angle compatible lip reading unit 136. The third angle conversion unit 123 converts the input lip image data so that the imaging direction matches one of 0° and 45° (here, 0°), the corresponding angles of the multiple-angle compatible lip reading unit 137.

Each angle conversion unit 121, 122, 123 may perform the conversion using a linear mapping such as an affine transformation, or using a conversion model based on machine learning or deep learning. The converted data output by each angle conversion unit 121, 122, 123 only needs to match the input data format of the respective lip reading unit 131, 136, 137. For example, it may be in the form of image data or in the form of an intermediate representation of the conversion model.
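As a rough illustration of the affine option, the sketch below rescales the horizontal coordinates of lip landmark points to imitate a change of imaging angle. It relies on a strong simplifying assumption that is not taken from the description: the lip region is treated as a flat patch seen under orthographic projection, so its apparent width varies with the cosine of the yaw angle. The function names are hypothetical, and the model is only meaningful for moderate angles well below 90°.

```python
import math


def yaw_affine(source_deg: float, target_deg: float, center_x: float) -> list[list[float]]:
    """Build a 2x3 affine matrix that rescales x about center_x.

    Assumption: planar lip region, so width scales by cos(target) / cos(source).
    """
    scale = math.cos(math.radians(target_deg)) / math.cos(math.radians(source_deg))
    return [
        [scale, 0.0, center_x * (1.0 - scale)],
        [0.0, 1.0, 0.0],
    ]


def apply_affine(matrix: list[list[float]], points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Hypothetical landmarks from an image captured at 60 degrees, mapped toward 0 degrees.
landmarks = [(50.0, 10.0), (60.0, 10.0)]
to_front = yaw_affine(source_deg=60.0, target_deg=0.0, center_x=50.0)
converted = apply_affine(to_front, landmarks)
```

A learned conversion model, the other option mentioned above, would replace `yaw_affine` entirely and would not be limited by the planar assumption.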

According to Modification 1, the data (image data or intermediate representation) that each lip reading unit 131, 136, 137 receives from its angle conversion unit 121, 122, 123 matches a corresponding angle of that lip reading unit. Each lip reading unit 131, 136, 137 can therefore perform lip reading processing with high accuracy and produce a highly accurate lip reading process result. As a result, the final lip reading result (the recognition result of the utterance content) obtained by integrating these lip reading process results in the lip reading result integration unit 141 is also highly accurate.

[Modification 2]
Next, another modification of the lip reading device 100 of Embodiment 1 described above (hereinafter referred to as "Modification 2") will be described.
FIG. 8 is a block diagram showing the lip reading device 100 according to Modification 2.
As shown in FIG. 8, the lip reading device 100 of Modification 2 has three lip reading units: a first single-angle compatible lip reading unit 131 capable of highly accurate lip reading processing at a corresponding angle of 0°, a second single-angle compatible lip reading unit 136 capable of highly accurate lip reading processing at a corresponding angle of 45°, and a multiple-angle compatible lip reading unit 134 capable of highly accurate lip reading processing at the two corresponding angles of 0° and 30°.

The lip reading device 100 of Modification 2 also includes an angle estimation unit 112 serving as an imaging direction estimation unit that estimates the imaging direction of the lip image data input to the image input unit 111. For example, a model that estimates the imaging angle of input face image data is trained by machine learning or deep learning, using the large amount of face image data captured from various angles and stored in the learning data storage medium 32 described above. The computer of the angle estimation unit 112 then executes the resulting angle estimation model to estimate the imaging direction of the lip image data input to the image input unit 111.

When the angle estimation unit 112 of Modification 2 receives face image data from the image input unit 111, it estimates the imaging angle of that face image data and outputs, as the angle estimation result, a certainty indicating the likelihood of each preset angle. For example, when the preset angles are 0°, 30°, 45°, and 60°, the angle estimation unit 112 outputs, as the estimation result for the imaging angle of the input face image data, information such as a certainty of 0.3 for 0°, 0.4 for 30°, 0.2 for 45°, and 0.1 for 60°.

In Modification 2, the angle estimation result of the angle estimation unit 112 is sent to the lip reading result integration unit 141. The lip reading result integration unit 141 of Modification 2 uses the angle estimation result received from the angle estimation unit 112 to integrate the lip reading process results obtained by the three lip reading units 131, 136, and 134, and generates the recognition result of the speaker's utterance content as the final lip reading result.

As an example of the integration process in Modification 2, the reliability score of each lip reading process result obtained by the three lip reading units 131, 136, and 134 is multiplied by the certainty, contained in the angle estimation result of the angle estimation unit 112, of the estimated angle that matches the corresponding angle of that lip reading unit. In the example above, the reliability scores of the first single-angle compatible lip reading unit 131 (corresponding angle 0°) are multiplied by 0.3, those of the second single-angle compatible lip reading unit 136 (corresponding angle 45°) are multiplied by 0.2, and those of the multiple-angle compatible lip reading unit 134 (corresponding angles 0° and 30°) are multiplied by 0.4, the higher of the certainties for 0° and 30°.
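The weighting in this example can be reproduced with the short sketch below. The certainty values and the unit-to-angle assignment come from the example above. The function names, the candidate strings, and the reliability scores are hypothetical, and treating a corresponding angle with no certainty entry as 0.0 is an assumption.

```python
def angle_weight(corresponding_angles: tuple[int, ...], certainty: dict[int, float]) -> float:
    """Weight of a lip reading unit: the highest certainty among its corresponding angles.

    Assumption: an angle absent from the angle estimation result counts as 0.0.
    """
    return max(certainty.get(angle, 0.0) for angle in corresponding_angles)


def weight_result(result: dict[str, float], weight: float) -> dict[str, float]:
    """Multiply every reliability score in one lip reading process result by the weight."""
    return {cand: score * weight for cand, score in result.items()}


# Angle estimation result from the example: certainty per preset angle (degrees).
certainty = {0: 0.3, 30: 0.4, 45: 0.2, 60: 0.1}

# Corresponding angles of the three lip reading units of Modification 2.
units = {"131": (0,), "136": (45,), "134": (0, 30)}

weights = {name: angle_weight(angles, certainty) for name, angles in units.items()}

# Made-up lip reading process result of unit 134, weighted by its 0.4 certainty.
weighted_134 = weight_result({"hello": 0.5, "hollow": 0.25}, weights["134"])
```

Here `weights` holds 0.3 for unit 131, 0.2 for unit 136, and 0.4 for unit 134, matching the multipliers given in the text.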

After applying the angle estimation result in this way, the lip reading result integration unit 141 may, as in Embodiment 1 described above, use as the integration result the utterance content candidate with the highest reliability score among the common utterance content candidates included in all of the lip reading process results obtained by the three lip reading units 131, 136, and 134. Alternatively, two or more utterance content candidates, taken in descending order of reliability score, may be used as the integration result. Also, for example, after applying the angle estimation result, the lip reading result integration unit 141 may calculate the total reliability score for each utterance content candidate included in the lip reading process results obtained by the three lip reading units 131, 136, and 134, and use the candidate with the highest total as the integration result. Alternatively, two or more utterance content candidates for which totals have been calculated may be used as the integration result.

According to Modification 2, the lip reading process results of the lip reading units 131, 136, and 134 can be weighted using the angle estimation result, that is, the estimate of the imaging angle of the face image data received from the image input unit 111. Specifically, the larger the certainty of the estimated angle that matches a lip reading unit's corresponding angle, the larger the weight given to that unit's lip reading process result, which increases its influence on the recognition result of the utterance content. This makes the final lip reading result (the recognition result of the utterance content) obtained by the lip reading result integration unit 141 more accurate.

[Embodiment 2]
Next, an embodiment in which the present invention is applied to a multimodal speech recognition device serving as a speech content recognition device (hereinafter referred to as "Embodiment 2") will be described.
The multimodal speech recognition device of Embodiment 2 uses two types of speech content recognition processing, namely lip reading processing and speech recognition processing, to output a recognition result of the content uttered by a speaker.

FIG. 9 is a block diagram showing the multimodal speech recognition device according to Embodiment 2.
The multimodal speech recognition device 300 of Embodiment 2 is composed of a lip reading recognition processing unit 101, a speech recognition processing unit 201, and a recognition result integration unit 301.

The lip reading recognition processing unit 101 can adopt the configuration of the lip reading device 100 of Embodiment 1 described above. The example of FIG. 9 adopts the configuration of the lip reading device 100 shown in FIG. 1.

The speech recognition processing unit 201 is mainly composed of a speech input unit 211 and a speech recognition unit 231.

The speech input unit 211 accepts input of the speech data of a speaker who is speaking. The speech input unit 211 of Embodiment 2 is communicably connected, by wire or wirelessly, to a microphone 3 that picks up the speaker's voice and to a storage medium 2 that stores the speaker's speech data. Real-time speech data of the speaker currently speaking is input from the microphone 3 to the speech input unit 211. The storage medium 2 stores speech data from the speaker's past utterances, and this past speech data is input from the storage medium 2 to the speech input unit 211.

The speech input unit 211 processes the input speech data as necessary so that it matches the input of the speech recognition unit 231, and passes it to the speech recognition unit 231. For example, it extracts a noise-removed speech signal from the input speech data and passes the data of that speech signal to the speech recognition unit 231.

The speech recognition unit 231 of Embodiment 2 executes a predetermined speech recognition program on a computer to perform speech recognition processing on the speech data input to the speech input unit 211 and generate a speech recognition result. The speech recognition program of Embodiment 2 uses a speech recognition model (trained model) trained on training data that includes speakers' speech data, but a hand-written speech recognition program may be used instead.

The speech recognition model (trained model) of Embodiment 2 estimates the speaker's utterance content from the input data (speech data). As with the machine lip reading model described above, the format of the estimation result (speech recognition result) output from the speech recognition model is not particularly limited. As an example, in Embodiment 2, to match the format of the lip reading recognition processing unit 101, data containing one or more utterance content candidates (a single word, a word sequence, or the like) and a reliability score for each candidate is output as the speech recognition result for the input data (speech data).

The recognition result integration unit 301 integrates the recognition result (lip reading result) output from the lip reading result integration unit 141 of the lip reading recognition processing unit 101 with the recognition result (speech recognition result) output from the speech recognition unit 231 of the speech recognition processing unit 201, and outputs the final recognition result of the utterance content.

The integration process performed by the recognition result integration unit 301 is not particularly limited, as long as it integrates the recognition result (lip reading result) of the lip reading recognition processing unit 101 and the recognition result (speech recognition result) of the speech recognition processing unit 201 so that a highly accurate recognition result is obtained (for example, so that the reliability score of the correct utterance content becomes relatively high).

As an example of the integration process performed by the recognition result integration unit 301, the utterance content candidate with the highest reliability score among the common utterance content candidates included in the recognition results of both recognition processing units 101 and 201 may be used as the integration result. Alternatively, two or more utterance content candidates, taken in descending order of reliability score, may be used as the integration result.
Also, for example, the total reliability score may be calculated for each utterance content candidate included in the recognition results of the two recognition processing units 101 and 201, and the candidate with the highest total may be used as the integration result. Alternatively, two or more utterance content candidates for which totals have been calculated may be used as the integration result.

As another example of the integration process performed by the recognition result integration unit 301, an integration model, that is, a trained model that obtains recognition results from the intermediate representations of the two recognition processing units 101 and 201, is trained in advance, for example as a neural network model. The intermediate representations (vectors and the like) included in the recognition results obtained by the two recognition processing units 101 and 201 are then concatenated into a single intermediate representation, and the computer of the recognition result integration unit 301 executes the trained integration model to obtain one or more recognition results of the utterance content from that single intermediate representation. These recognition results are used as the integration result. The recognition results may also include their respective reliability scores.

Because the lip reading recognition processing unit 101 adopts the configuration of the lip reading device 100 of Embodiment 1 described above, the multimodal speech recognition device 300 of Embodiment 2 can obtain lip reading results with high accuracy.

In addition, the multimodal speech recognition device 300 of Embodiment 2 obtains recognition results using two different speech content recognition methods, lip reading and speech recognition, and integrates these results to output the final recognition result of the utterance content. Therefore, even in situations where it is difficult for the speech recognition processing unit 201 to recognize the speaker's utterance content with high accuracy (for example, a noisy environment, or a meeting in which multiple speakers often talk at the same time), the lip reading recognition processing unit 101 can recognize that content with high accuracy. Conversely, even in situations where it is difficult for the lip reading recognition processing unit 101 to recognize the speaker's utterance content with high accuracy (for example, a dark, poorly lit environment, or an environment in which the speaker's lips are hard to capture because the speaker moves around), the speech recognition processing unit 201 can recognize that content with high accuracy.

Thus, the multimodal speech recognition device 300 of Embodiment 2 provides a robust speech content recognition device whose recognition accuracy is less affected by the speaker's environment. Specifically, such a multimodal speech recognition device 300 is well suited to automatic meeting-minutes generation systems for conference rooms or online meetings, and to speech input interfaces on smartphones.

〔実施形態３〕
次に、上述した実施形態２のマルチモーダル音声認識装置３００における読唇認識処理部１０１で用いられる機械読唇モデル及び音声認識処理部２０１で用いられる音声認識モデルを構築するための学習データを収集する学習データ収集システムの一実施形態（以下、本実施形態を「実施形態３」という。）について説明する。 [Embodiment 3]
Next, we will explain one embodiment of a training data collection system (hereinafter, this embodiment will be referred to as "embodiment 3") that collects training data for constructing a machine lip reading model used in the lip reading recognition processing unit 101 and a speech recognition model used in the speech recognition processing unit 201 in the multimodal speech recognition device 300 of embodiment 2 described above.

図１０は、本実施形態３における学習データ収集システムの構成を示す説明図である。
本実施形態３の学習データ収集システムは、複数の撮像装置を有するカメラアレイ３１と、音声取得装置としての収録用マイクロフォン２１と、指示装置としてのディスプレイ４２と、制御装置４３とを備えている。そのほか、本実施形態３の学習データ収集システムは、通報部４１と、記憶装置としての学習データ記憶媒体３２とを備えている。 FIG. 10 is an explanatory diagram showing the configuration of a learning data collection system in the third embodiment.
The learning data collection system of the present embodiment 3 includes a camera array 31 having a plurality of imaging devices, a recording microphone 21 as a voice acquisition device, a display 42 as an instruction device, and a control device 43. In addition, the learning data collection system of the present embodiment 3 includes a notification unit 41 and a learning data storage medium 32 as a storage device.

カメラアレイ３１は、所定位置の話者を互いに異なる複数の撮像方向から撮像する複数のカメラ（撮像装置）３１－１～３１－１０によって構成されている。本実施形態３では、図１１に示すように、１０個の収録用カメラ３１－１～３１－１０が等間隔で配置されている。具体的には、話者の顔の正面方向から撮像したときの撮像方向を基準（０°）にした鉛直軸回りの角度を撮像角度とすると、各収録用カメラ３１－１～３１－１０は、０°～９０°までの間を１０°間隔で配置されている。このカメラアレイ３１により、発話する話者の口唇画像を各収録用カメラ３１－１～３１－１０によりそれぞれの撮像角度から同時に撮像することが可能である。 The camera array 31 is composed of multiple cameras (imaging devices) 31-1 to 31-10 that capture images of a speaker at a predetermined position from multiple different imaging directions. In this embodiment 3, as shown in FIG. 11, 10 recording cameras 31-1 to 31-10 are arranged at equal intervals. Specifically, if the imaging angle is the angle around the vertical axis based on the imaging direction when the speaker's face is imaged from the front direction (0°), the recording cameras 31-1 to 31-10 are arranged at 10° intervals between 0° and 90°. This camera array 31 makes it possible for the recording cameras 31-1 to 31-10 to simultaneously capture images of the lips of the speaker from their respective imaging angles.
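As an illustration of the layout described above, the imaging angle of each recording camera can be tabulated as follows. This is only a minimal sketch: the function name `camera_angles` and the use of the reference signs as dictionary keys are assumptions made for illustration and do not appear in the embodiment.

```python
def camera_angles(num_cameras=10, step_deg=10):
    """Map camera labels 31-1..31-N to imaging angles about the vertical axis.

    0 degrees is the imaging direction from the front of the speaker's face.
    """
    return {f"31-{k + 1}": k * step_deg for k in range(num_cameras)}


angles = camera_angles()
```

With the default arguments, the ten cameras cover 0° to 90° in 10° steps, which matches the arrangement of FIG. 11 as described.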

カメラアレイ３１は、学習データ記憶媒体３２に接続されており、各収録用カメラ３１－１～３１－１０によって撮像された話者の顔画像データ（口唇画像データ）は、学習データ記憶媒体３２に記憶され、蓄積される。また、カメラアレイ３１は、制御装置４３に接続され、制御装置４３により撮像動作が制御される。 The camera array 31 is connected to a learning data storage medium 32, and the speaker's facial image data (lip image data) captured by each recording camera 31-1 to 31-10 is stored and accumulated in the learning data storage medium 32. The camera array 31 is also connected to a control device 43, and the image capture operation is controlled by the control device 43.

なお、カメラアレイ３１を構成するカメラの数には特に制限はない。また、カメラアレイ３１を構成するカメラの配置は、本実施形態３では鉛直軸回りの撮像角度が互いに異なるように複数の収録用カメラ３１－１～３１－１０を水平面に沿って並べているが、これに限られない。例えば、水平軸回りや鉛直軸に対して傾斜した傾斜軸回りの撮像角度が互いに異なるように複数の収録用カメラ３１－１～３１－１０を並べてもよい。 There is no particular limit to the number of cameras that make up the camera array 31. In addition, in the third embodiment, the arrangement of the cameras that make up the camera array 31 is such that the multiple recording cameras 31-1 to 31-10 are arranged along a horizontal plane so that the imaging angles around the vertical axis are different from each other, but this is not limited to this. For example, the multiple recording cameras 31-1 to 31-10 may be arranged so that the imaging angles around the horizontal axis or around an inclined axis inclined relative to the vertical axis are different from each other.

収録用マイクロフォン２１は、カメラアレイ３１で撮像する対象である話者の音声を取得する。収録用マイクロフォン２１は、学習データ記憶媒体３２に接続されており、収録用マイクロフォン２１によって集音した音声データは、学習データ記憶媒体３２に記憶され、蓄積される。また、収録用マイクロフォン２１は、制御装置４３に接続され、制御装置４３により動作が制御される。収録用マイクロフォン２１は、例えば、ピンマイクを用いて話者の襟元などに設置しても良いし、スタンドマイクを話者の近傍に設置しても良い。このとき、カメラアレイ３１の各収録用カメラ３１－１～３１－１０によって話者の口唇画像を撮像するにあたり、収録用マイクロフォン２１が邪魔にならないように設置することが望ましい。 The recording microphone 21 acquires the voice of the speaker who is the subject of imaging by the camera array 31. The recording microphone 21 is connected to a learning data storage medium 32, and the voice data collected by the recording microphone 21 is stored and accumulated in the learning data storage medium 32. The recording microphone 21 is also connected to a control device 43, and its operation is controlled by the control device 43. The recording microphone 21 may be, for example, a lapel microphone installed on the speaker's collar, or a stand microphone may be installed near the speaker. In this case, it is desirable to install the recording microphone 21 so as not to get in the way when the speaker's lip image is captured by each of the recording cameras 31-1 to 31-10 of the camera array 31.

なお、カメラアレイ３１を構成するカメラが音声取得装置としての機能を備えている場合には、収録用マイクロフォン２１としてカメラの音声取得装置を利用してもよい。 If the cameras that make up the camera array 31 have a function as an audio capture device, the camera's audio capture device may be used as the recording microphone 21.

ディスプレイ４２は、話者に指示する発話内容を表示する。ディスプレイ４２は、有線または無線で接続された制御装置４３によって表示内容が制御され、制御装置４３の制御の下、話者に対して指示する発話内容や、発話やり直しの指示などを行う。 The display 42 displays the utterance content that the speaker is instructed to speak. The display contents of the display 42 are controlled by a control device 43 connected by wire or wirelessly, and under the control of the control device 43, the display 42 presents the utterance content to the speaker, instructs the speaker to redo an utterance, and so on.

通報部４１は、有線または無線で接続された制御装置４３によって動作が制御され、制御装置４３の制御の下、発話の開始と終了のタイミングを光や音等によって話者に通報する。 The operation of the notification unit 41 is controlled by a control device 43 connected by wire or wirelessly, and under the control of the control device 43, the notification unit 41 notifies the speaker of the start and end timing of speech using light, sound, etc.

学習データ記憶媒体３２は、上述したように、カメラアレイ３１の各収録用カメラ３１－１～３１－１０で撮像した話者の顔画像データと、収録用マイクロフォン２１で集音した話者の音声データとを、時系列が特定できる形式で記憶する。具体的には、通報部４１によって発せられる発話開始同期信号及び発話収容同期信号を、各収録用カメラ３１－１～３１－１０で撮像した顔画像データ及び収録用マイクロフォン２１で集音した音声データに埋め込む。これにより、学習データ記憶媒体３２に記憶された顔画像データ及び音声データは、話者がディスプレイ４２により指示された発話内容を発話した時期と照らし合わせることができる。よって、ディスプレイ４２を介して話者に指示された発話内容と、その発話内容を発した時の話者の顔画像データ及び音声データとが対応づけられている。 As described above, the learning data storage medium 32 stores the speaker's facial image data captured by each of the recording cameras 31-1 to 31-10 of the camera array 31 and the speaker's voice data collected by the recording microphone 21 in a format in which the time series can be identified. Specifically, the speech start synchronization signal and the speech end synchronization signal emitted by the notification unit 41 are embedded in the facial image data captured by each of the recording cameras 31-1 to 31-10 and in the voice data collected by the recording microphone 21. This allows the facial image data and voice data stored in the learning data storage medium 32 to be matched against the time at which the speaker spoke the utterance content instructed via the display 42. As a result, the utterance content instructed to the speaker via the display 42 is associated with the speaker's facial image data and voice data recorded while that utterance content was spoken.
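The association between an instructed utterance and the recorded data can be sketched as follows. This is a hypothetical illustration, not the storage format of the embodiment: the `Stream` class, its field names, and the idea of representing each embedded synchronization signal as an index into the recorded data are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """One recorded stream (a camera or the microphone); all fields are hypothetical."""
    name: str
    rate_hz: float      # frames or samples per second
    start_marker: int   # index at which the speech start sync signal was embedded
    end_marker: int     # index at which the speech end sync signal was embedded
    data: list          # recorded frames or samples


def cut_utterance(stream):
    """Return the portion of the stream between the two sync markers."""
    return stream.data[stream.start_marker:stream.end_marker]


def build_training_record(prompt, streams):
    """Associate the instructed utterance with the segment cut from every stream."""
    return {
        "prompt": prompt,
        "segments": {s.name: cut_utterance(s) for s in streams},
        "durations_s": {
            s.name: (s.end_marker - s.start_marker) / s.rate_hz for s in streams
        },
    }


# Made-up example: one 30 fps camera and one 16 kHz microphone.
cam = Stream("31-1", 30.0, 15, 75, list(range(120)))
mic = Stream("21", 16000.0, 8000, 40000, list(range(48000)))
record = build_training_record("hello", [cam, mic])
```

Because every stream carries the same pair of markers, segments recorded at different rates can be aligned to the same utterance without relying on a shared clock.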

学習データ記憶媒体３２は、カメラアレイ３１の各収録用カメラ３１－１～３１－１０及び収録用マイクロフォン２１のそれぞれに接続される複数の記憶媒体から構成される分散型の記憶装置であってもよいし、一台の記憶装置（ファイルサーバ等）によって構成されてもよい。 The learning data storage medium 32 may be a distributed storage device consisting of multiple storage media connected to each of the recording cameras 31-1 to 31-10 and the recording microphone 21 of the camera array 31, or may be composed of a single storage device (such as a file server).

制御装置４３は、パーソナルコンピュータ等の情報処理装置によって構成され、本システム全体を制御する。具体的には、制御装置４３は、オペレータの指示操作により、カメラアレイ３１及び収録用マイクロフォン２１の収録を開始し、通報部４１を通じて発話開始同期信号を発するとともに発話開始タイミングを話者に指示し、ディスプレイ４２に発話内容を表示させて、話者に当該発話内容を発話させる。また、通報部４１を通じて発話終了同期信号を発するとともに発話終了タイミングを話者に指示し、カメラアレイ３１及び収録用マイクロフォン２１の収録を終了する。また、制御装置４３は、オペレータの指示操作により、ディスプレイ４２を通じて話者に対して発話やり直しを指示する。 The control device 43 is composed of an information processing device such as a personal computer, and controls the entire system. Specifically, the control device 43 starts recording with the camera array 31 and the recording microphone 21 in response to instructions from the operator, issues a speech start synchronization signal through the notification unit 41 and instructs the speaker when to start speaking, displays the speech content on the display 42, and causes the speaker to speak the speech content. It also issues a speech end synchronization signal through the notification unit 41 and instructs the speaker when to end speaking, and ends recording with the camera array 31 and the recording microphone 21. The control device 43 also instructs the speaker to repeat the speech through the display 42 in response to instructions from the operator.
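The order of operations described for the control device 43 can be sketched as a simple command sequence. This is a minimal sketch under stated assumptions: `LoggingDevice`, the command names, and the device labels are hypothetical stand-ins that only record what they are told, and no real device interface is implied.

```python
class LoggingDevice:
    """Stand-in for a real device; it only records the commands it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def command(self, action, payload=None):
        self.log.append((self.name, action, payload))


def record_utterance(prompt, cameras, microphone, notifier, display):
    """Run one recording cycle in the order described for control device 43."""
    cameras.command("start_recording")
    microphone.command("start_recording")
    notifier.command("emit_sync", "speech_start")  # also cues the speaker to start
    display.command("show_prompt", prompt)
    # The speaker reads the prompt aloud at this point.
    notifier.command("emit_sync", "speech_end")    # also cues the speaker to stop
    cameras.command("stop_recording")
    microphone.command("stop_recording")


def request_retake(display):
    """Ask the speaker to redo the utterance, triggered by the operator."""
    display.command("show_retake_request")


log = []
cameras = LoggingDevice("camera_array_31", log)
microphone = LoggingDevice("microphone_21", log)
notifier = LoggingDevice("notifier_41", log)
display = LoggingDevice("display_42", log)
record_utterance("good morning", cameras, microphone, notifier, display)
```

Recording starts before the start signal and stops after the end signal, so both synchronization signals fall inside every recorded stream.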

本実施形態３の学習データ収集システムによれば、ディスプレイ４２によって指示された発話内容を発話する話者の口唇画像を複数の収録用カメラ３１－１～３１－１０によって互いに異なる撮像角度から同時に撮像するとともに、その時の話者の音声を収録用マイクロフォン２１によって取得して、これらを学習データ記憶媒体３２に記憶することができる。これにより、異なる撮像角度から撮像された顔画像データ（口唇画像データ）とこれに対応する音声データとを迅速かつ大量に収集することができる。よって、上述した実施形態２のマルチモーダル音声認識装置３００における読唇認識処理部１０１で用いられる機械読唇モデル及び音声認識処理部２０１で用いられる音声認識モデルを構築するために必要となる大量の学習データを、容易かつ迅速に収集することができる。 According to the learning data collection system of the third embodiment, the lip images of the speaker who is speaking the utterance content instructed by the display 42 are simultaneously captured from different imaging angles by the recording cameras 31-1 to 31-10, the speaker's voice at that time is acquired by the recording microphone 21, and both can be stored in the learning data storage medium 32. This makes it possible to quickly collect a large amount of face image data (lip image data) captured from different imaging angles together with the corresponding voice data. Therefore, it is possible to easily and quickly collect the large amount of learning data required to construct the machine lip reading model used in the lip reading recognition processing unit 101 and the voice recognition model used in the voice recognition processing unit 201 in the multimodal voice recognition device 300 of the above-mentioned second embodiment.

なお、本実施形態３の学習データ収集システムは、上述した実施形態１の読唇装置１００の読唇装置１００で用いられる機械読唇モデルを構築するために必要となる大量の学習データを、容易かつ迅速に収集することにも有益である。この場合、収録用マイクロフォン２１による収録は必ずしも必要ではない。 The learning data collection system of the third embodiment is also useful for easily and quickly collecting a large amount of learning data required to construct a machine lip reading model used in the lip reading device 100 of the first embodiment described above. In this case, recording using the recording microphone 21 is not necessarily required.

以上に説明したものは一例であり、本発明は、次の態様毎に特有の効果を奏する。
［第１態様］
第１態様は、話者の発話内容を認識する発話内容認識装置（例えば、読唇装置１００、マルチモーダル音声認識装置３００）であって、話者の口唇画像データ（例えば顔画像データ）を入力する入力部（例えば画像入力部１１１）と、対応方向（例えば対応角度）から撮像された口唇画像データに対する読唇精度の高い複数の読唇部（例えば、単一角度対応読唇部１３１，１３２，１３３，１３６、複数角度対応読唇部１３４，１３５，１３７）と、前記入力部に入力された口唇画像データに対する前記複数の読唇部の各読唇処理結果を統合し、当該統合の結果に基づいて前記話者の発話内容の認識結果を生成する統合生成部（例えば読唇結果統合部１４１）とを有し、前記複数の読唇部のうちの少なくとも１つの読唇部は、当該対応方向の中に、他のいずれかの読唇部における対応方向に含まれていない方向を含むように構成されていることを特徴とするものである。
一般に、入力部に入力された口唇画像データの読唇処理を行う読唇部は、入力される口唇画像データの撮像方向が特定の方向（対応方向）であるときに、高い精度（利用者の要求レベルを満たす精度）で読唇処理を行うことができ、発話内容の認識精度が高い。具体的には、例えば、対応方向が話者の顔の正面方向である読唇部は、話者の顔を正面から撮像したときの口唇画像データが入力されたときには読唇精度が高い。一方、この読唇部に対し、話者の顔を横や斜めから撮像したときの口唇画像データを入力したときには、読唇精度が落ち、高い読唇精度が得られない場合が多い。
本態様では、互いに異なる対応方向を有する複数の読唇部を用いて、入力部に入力された口唇画像データに対する読唇処理を実行する。このとき、本態様で用いられる複数の読唇部のうちの少なくとも１つの読唇部は、対応方向（高い読唇精度が得られる方向）が他のいずれかの読唇部における対応方向に含まれていない方向を含んでいる。そのため、これらの読唇部の対応方向のいずれかの方向と一致する方向から撮像された口唇画像データであれば、当該読唇部で高い読唇精度が得られ、その読唇処理結果から発話内容を高精度に認識することが可能である。したがって、本態様によれば、これらの複数の読唇部における対応方向の数の分だけ、発話内容を高精度に認識できる口唇画像データの撮像方向を増やすことができる。
ここで、複数の読唇部におけるいずれの対応方向とも一致しない方向から撮像された口唇画像データが入力部に入力された場合、個々の読唇部の読唇処理では十分な読唇精度が得られない。そのため、いずれかの読唇部の読唇処理結果だけを用いたのでは、このような口唇画像データについて発話内容を高精度に認識することはできない。
そこで、本態様では、統合生成部において、入力部に入力された口唇画像データに対する複数の読唇部の各読唇処理結果を統合し、その統合結果に基づいて発話内容の認識結果を生成するようにしている。これにより、個々の読唇部の各読唇処理結果は、正解である発話内容の確からしさ（信頼度）が不正解である他の発話内容の確からしさと比較して有意に高くない又は逆に低いという結果であっても、これらの読唇処理結果を統合することで、正解である発話内容の確からしさを際立たせ、不正解である他の発話内容の確からしさに対して有意な違いを出すことができる。例えば、個々の読唇部の各読唇処理結果に含まれる信頼度を発話内容候補ごとに積み上げることで、正解である発話内容について、不正解である他の発話内容に対して有意な違いをもった信頼度を導き出すことができる。したがって、上述した複数の読唇部の各読唇処理結果を統合し、その統合結果に基づいて発話内容の認識結果を生成することで、個々の読唇部の読唇処理では十分な読唇精度が得られない方向から撮像された口唇画像データについて発話内容を高精度に認識することができる。
よって、本態様によれば、上述した複数の読唇部における対応方向の数を超える、様々な種類（撮像方向）の口唇画像データについて発話内容を高精度に認識することができる。 The above description is merely an example, and the present invention provides unique effects for each of the following aspects.
[First aspect]
The first aspect is a speech content recognition device (e.g., lip reading device 100, multimodal speech recognition device 300) that recognizes the speech content of a speaker, and includes an input unit (e.g., image input unit 111) that inputs lip image data (e.g., face image data) of the speaker, a plurality of lip reading units (e.g., single-angle compatible lip reading units 131, 132, 133, 136, multiple-angle compatible lip reading units 134, 135, 137) that have high lip reading accuracy for lip image data captured from corresponding directions (e.g., corresponding angles), and an integration generation unit (e.g., lip reading result integration unit 141) that integrates the lip reading processing results of the plurality of lip reading units for the lip image data input to the input unit and generates a recognition result of the speaker's speech content based on the result of the integration, and is characterized in that at least one of the plurality of lip reading units is configured to include a direction in the corresponding direction that is not included in the corresponding direction of any of the other lip reading units.
In general, a lip reading unit that performs lip reading processing of lip image data input to an input unit can perform lip reading processing with high accuracy (accuracy that meets the user's requirements) when the imaging direction of the input lip image data is a specific direction (corresponding direction), and the recognition accuracy of the spoken content is high. Specifically, for example, a lip reading unit whose corresponding direction is the front direction of the speaker's face has high lip reading accuracy when lip image data obtained when the speaker's face is imaged from the front is input. On the other hand, when lip image data obtained when the speaker's face is imaged from the side or diagonally is input to this lip reading unit, the lip reading accuracy drops, and in many cases high lip reading accuracy cannot be obtained.
In this aspect, lip reading processing is performed on the lip image data input to the input unit using a plurality of lip reading units having mutually different corresponding directions. At least one of these lip reading units has corresponding directions (directions in which high lip reading accuracy is obtained) that include a direction not included in the corresponding directions of any other lip reading unit. Therefore, if the lip image data is captured from a direction that matches one of the corresponding directions of these lip reading units, that lip reading unit achieves high lip reading accuracy, and the speech content can be recognized with high accuracy from its lip reading processing result. Thus, according to this aspect, the number of imaging directions for which the speech content can be recognized with high accuracy increases with the number of corresponding directions covered by the plurality of lip reading units.
Here, when lip image data captured from a direction that matches none of the corresponding directions of the plurality of lip reading units is input to the input unit, the lip reading processing of any individual lip reading unit does not achieve sufficient accuracy. Therefore, if only the lip reading processing result of a single lip reading unit is used, the speech content of such lip image data cannot be recognized with high accuracy.
Therefore, in this aspect, the integration generation unit integrates the lip reading processing results of the plurality of lip reading units for the lip image data input to the input unit, and generates a recognition result of the speech content based on the integration result. As a result, even if each individual lip reading processing result gives the correct speech content a likelihood (reliability) that is not significantly higher than, or is even lower than, that of incorrect speech content, integrating these results makes the likelihood of the correct speech content stand out and produces a significant difference from the likelihood of the incorrect speech content. For example, by accumulating, for each speech content candidate, the reliability contained in the lip reading processing results of the individual lip reading units, a reliability can be derived for the correct speech content that differs significantly from that of the incorrect speech content. Therefore, by integrating the lip reading processing results of the plurality of lip reading units and generating the recognition result of the speech content based on the integration result, the speech content can be recognized with high accuracy even for lip image data captured from a direction for which no individual lip reading unit achieves sufficient accuracy.
Therefore, according to this aspect, it is possible to recognize the speech content with high accuracy for various types (imaging directions) of lip image data, which exceeds the number of corresponding directions of the above-mentioned multiple lip reading units.
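The accumulation of reliability per speech content candidate described above can be illustrated with a short sketch. The function name `integrate_by_sum`, the dictionary layout, and the scores are assumptions chosen for illustration. Simple summation is only one possible way of integrating the results.

```python
from collections import defaultdict


def integrate_by_sum(results):
    """Sum reliability scores per candidate across lip reading units.

    `results` holds one dict per lip reading unit, mapping a candidate
    utterance to the reliability score that unit assigned to it.
    """
    totals = defaultdict(float)
    for unit_scores in results:
        for candidate, score in unit_scores.items():
            totals[candidate] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Three units, none of which is decisive on its own (made-up scores).
results = [
    {"hello": 0.40, "yellow": 0.45, "fellow": 0.15},
    {"hello": 0.42, "yellow": 0.20, "fellow": 0.38},
    {"hello": 0.41, "yellow": 0.24, "fellow": 0.35},
]
best, totals = integrate_by_sum(results)
```

In this made-up example the first unit alone would pick "yellow", and no unit separates the candidates clearly, yet the accumulated totals put "hello" well ahead of both alternatives.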

［第２態様］
第２態様は、第１態様において、前記複数の読唇部は、当該対応方向が１つである単方向読唇部（例えば、単一角度対応読唇部１３１，１３２，１３３，１３６）を含むことを特徴とするものである。
対応方向が１つである単方向読唇部は、対応方向が２つ以上である複方向読唇部よりも簡易に構築することが可能である。よって、発話内容認識装置を簡易に実現しやすい。 [Second aspect]
The second aspect is characterized in that, in the first aspect, the plurality of lip reading units include a unidirectional lip reading unit (e.g., single-angle compatible lip reading units 131, 132, 133, 136) having a single corresponding direction.
A unidirectional lip reading unit having a single corresponding direction can be constructed more easily than a multi-directional lip reading unit having two or more corresponding directions, and therefore a speech content recognition device can be realized easily.

［第３態様］
第３態様は、第１又は第２態様において、前記複数の読唇部は、当該対応方向が２つ以上である複方向読唇部（例えば、複数角度対応読唇部１３４，１３５，１３７）を含むことを特徴とするものである。
これによれば、読唇部の数を少なくでき、簡素な発話内容認識装置を実現しやすい。 [Third aspect]
The third aspect is characterized in that, in the first or second aspect, the multiple lip reading units include a multi-directional lip reading unit (e.g., multiple angle compatible lip reading units 134, 135, 137) having two or more corresponding directions.
This allows the number of lip reading units to be reduced, making it easier to realize a simple speech content recognition device.

［第４態様］
第４態様は、第１乃至第３態様のいずれかにおいて、前記入力部に入力された口唇画像データに基づいて、撮像方向が前記複数の読唇部のうちの少なくとも１つの読唇部の対応方向になるように変換したデータを生成するデータ変換部（例えば角度変換部１２１～１２３）を有し、前記少なくとも１つの読唇部は、前記データ変換部で変換されたデータを用いて読唇処理を行うことを特徴とするものである。
これによれば、複数の読唇部には、それぞれの対応方向に合致した撮像方向の口唇画像データがそれぞれ入力されるので、各読唇部から高い精度の読唇処理結果を得ることができる。その結果、これらの読唇処理結果を統合生成部によって統合して得られる発話内容の認識結果も高精度なものとすることができる。 [Fourth aspect]
A fourth aspect is characterized in that, in any of the first to third aspects, the device has a data conversion unit (e.g., angle conversion units 121 to 123) that generates, based on the lip image data input to the input unit, data converted so that the imaging direction becomes the corresponding direction of at least one of the plurality of lip reading units, and the at least one lip reading unit performs lip reading processing using the data converted by the data conversion unit.
According to this, since the lip reading units are each input with lip image data of an imaging direction that matches the corresponding direction, each lip reading unit can obtain a lip reading process result with high accuracy. As a result, the recognition result of the speech content obtained by integrating these lip reading process results by the integration generation unit can also be highly accurate.

［第５態様］
第５態様は、第１乃至第４態様のいずれかにおいて、前記入力部に入力された口唇画像データの撮像方向を推定する撮像方向推定部（例えば角度推定部１１２）を有し、前記統合生成部は、前記撮像方向推定部の推定結果を用いて、前記話者の発話内容の認識結果を生成することを特徴とするものである。
これによれば、入力部から顔画像データの撮像角度を撮像方向推定部により推定した推定結果を用いて、各読唇部の読唇処理結果の重み付けを行うことができる。すなわち、撮像方向推定部での推定結果を用い、対応方向に合致する推定角度の確信度が高い読唇部の読唇処理結果ほど重み付けを大きくして、当該読唇部の読唇処理結果が発話内容の認識結果に与える影響度を高めることができる。これにより、統合生成部によって得られる発話内容の認識結果を、より高精度なものとすることができる。 [Fifth aspect]
A fifth aspect is characterized in that, in any of the first to fourth aspects, it has an imaging direction estimation unit (e.g., an angle estimation unit 112) that estimates the imaging direction of the lip image data input to the input unit, and the integration generation unit generates a recognition result of the speech content of the speaker using the estimation result of the imaging direction estimation unit.
According to this, the lip reading process results of each lip reading unit can be weighted using the estimation result of the imaging direction estimation unit estimating the imaging angle of the face image data from the input unit. In other words, using the estimation result of the imaging direction estimation unit, the lip reading process results of the lip reading unit with a higher degree of certainty of the estimated angle matching the corresponding direction can be weighted more heavily, thereby increasing the influence of the lip reading process results of the lip reading unit on the recognition result of the utterance content. This makes it possible to increase the accuracy of the recognition result of the utterance content obtained by the integration generation unit.
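The weighting described above can be sketched as follows. This is a hypothetical illustration: the function name `integrate_weighted`, the unit names, and the use of the angle estimator's confidence directly as a multiplicative weight are assumptions, and other weighting schemes are equally conceivable.

```python
def integrate_weighted(results, angle_confidence):
    """Weight each unit's scores by how likely the input matches its directions.

    `results` maps a unit name to {candidate: reliability score}.
    `angle_confidence` maps a unit name to the estimated confidence that the
    input image was captured from one of that unit's corresponding directions.
    """
    totals = {}
    for unit, scores in results.items():
        weight = angle_confidence.get(unit, 0.0)
        for candidate, score in scores.items():
            totals[candidate] = totals.get(candidate, 0.0) + weight * score
    return max(totals, key=totals.get), totals


results = {
    "frontal_unit": {"hello": 0.7, "yellow": 0.3},
    "profile_unit": {"hello": 0.2, "yellow": 0.8},
}
# The angle estimator believes the image is close to frontal (made-up values).
angle_confidence = {"frontal_unit": 0.9, "profile_unit": 0.1}
best, totals = integrate_weighted(results, angle_confidence)
```

In this made-up example an unweighted sum would favor "yellow" (1.1 against 0.9), whereas weighting by the angle estimate lets the unit whose corresponding direction matches the input dominate the result.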

［第６態様］
第６態様は、第１乃至第５態様のいずれかにおいて、前記複数の読唇部は、読唇処理により推定された１又は第２以上の発話内容候補と発話内容候補ごとの信頼度情報（例えば信頼度スコア）とを含む読唇処理結果を生成し、前記統合生成部は、前記複数の読唇部の各読唇処理結果に含まれる信頼度情報を発話内容候補ごとに統合することを特徴とするものである。
これによれば、簡易な方法で、統合生成部において発話内容の認識結果を高精度に得ることができる。 [Sixth aspect]
A sixth aspect is characterized in that, in any of the first to fifth aspects, the multiple lip reading units generate lip reading processing results including one or more speech content candidates estimated by the lip reading process and reliability information (e.g., a reliability score) for each speech content candidate, and the integration generation unit integrates the reliability information included in each lip reading processing result of the multiple lip reading units for each speech content candidate.
This makes it possible to obtain a highly accurate recognition result of the speech content in the integration generation unit using a simple method.

［第７態様］
第７態様は、第１乃至第５態様のいずれかにおいて、前記複数の読唇部は、発話内容候補を推定するための中間情報（例えば中間表現）を読唇処理により読唇処理結果として生成し、前記統合生成部は、前記複数の読唇部の各読唇処理結果に含まれる中間情報を統合することを特徴とするものである。
これによれば、中間情報の学習データによって学習した学習済みモデル（統合モデル）を用いて、複数の読唇部の各読唇処理結果を統合することができ、統合生成部において発話内容の認識結果を高精度に得ることができる。 [Seventh aspect]
A seventh aspect is characterized in that, in any of the first to fifth aspects, the multiple lip reading units generate intermediate information (e.g., intermediate expressions) for estimating candidate speech content as a lip reading processing result by lip reading processing, and the integration generation unit integrates the intermediate information contained in each lip reading processing result of the multiple lip reading units.
According to this, the lip reading processing results of multiple lip reading units can be integrated using a trained model (integrated model) trained using intermediate information learning data, and the integrated generation unit can obtain highly accurate recognition results of the speech content.

［第８態様］
第８態様は、第１乃至第７態様のいずれかにおいて、前記話者の音声データから該話者の発話内容を認識する音声認識部（例えば音声認識処理部２０１）を有し、前記統合生成部（例えば、読唇結果統合部１４１及び認識結果統合部３０１）は、前記音声認識部の認識結果を用いて、前記話者の発話内容の認識結果を生成することを特徴とするものである。
これによれば、読唇処理と音声認識処理という２種類の発話内容認識処理を用いて話者が発話する発話内容の認識結果を出力するマルチモーダルの発話内容認識装置（例えばマルチモーダル音声認識装置３００）を実現できる。これにより、話者の発話内容の認識精度が話者の環境に左右されにくい、ロバスト性に優れた発話内容認識装置を実現できる。 [Eighth aspect]
The eighth aspect is characterized in that, in any of the first to seventh aspects, it has a voice recognition unit (e.g., the voice recognition processing unit 201) that recognizes the content of the speaker's speech from the speaker's voice data, and the integration generation unit (e.g., the lip reading result integration unit 141 and the recognition result integration unit 301) generates a recognition result of the content of the speaker's speech using the recognition result of the voice recognition unit.
This makes it possible to realize a multimodal speech content recognition device (e.g., the multimodal speech recognition device 300) that outputs a recognition result of the speech content uttered by a speaker using two types of speech content recognition processing, namely lip reading processing and speech recognition processing. This makes it possible to realize a speech content recognition device with excellent robustness in which the recognition accuracy of the speaker's speech content is less affected by the speaker's environment.

［第９態様］
第９態様は、第１乃至第８態様のいずれかにおいて、前記複数の読唇部は、話者の口唇画像データを含む学習データを用いて学習した機械読唇モデルをコンピュータに実行させることにより、前記入力部に入力された口唇画像データの読唇処理を行う読唇部を含むことを特徴とするものである。
これによれば、機械読唇モデルにより読唇処理を行うため、より高精度な読唇処理結果を得ることができる。 [Ninth aspect]
A ninth aspect is characterized in that in any of the first to eighth aspects, the plurality of lip reading units include a lip reading unit that performs lip reading processing of lip image data input to the input unit by having a computer execute a machine lip reading model trained using training data including lip image data of a speaker.
According to this, lip reading processing is performed using a machine lip reading model, so that lip reading processing results with higher accuracy can be obtained.

［第１０態様］
第１０態様は、第９態様の発話内容認識装置で用いられる前記機械読唇モデルを構築するための学習データを収集する学習データ収集システムであって、所定位置の話者を互いに異なる複数の撮像方向から撮像する複数の撮像装置（例えば収録用カメラ３１－１～３１－１０）と、前記話者の音声を取得する音声取得装置（例えば収録用マイクロフォン２１）と、前記話者に発話内容を指示する指示装置（例えばディスプレイ４２）と、前記指示装置に発話内容を指示させ、指示された発話内容を発話する前記話者の口唇画像を前記複数の撮像装置により同時に撮像するとともに、該話者の音声を前記音声取得装置により取得し、得られた口唇画像データ及び音声データを記憶装置（例えば学習データ記憶媒体３２）に記憶する制御を実行する制御装置４３とを有することを特徴とするものである。
これによれば、指示装置によって指示された発話内容を発話する話者の口唇画像を複数の撮像装置によって互いに異なる撮像角度から同時に撮像するとともに、その時の話者の音声を音声取得装置によって取得して、これらを記憶装置に記憶することができる。これにより、異なる撮像角度から撮像された口唇画像データとこれに対応する音声データとを迅速かつ大量に収集することができる。よって、上述した第９態様のマルチモーダル発話内容認識装置における読唇処理で用いられる機械読唇モデル及び音声認識処理で用いられる音声認識モデルを構築するために必要となる大量の学習データを、容易かつ迅速に収集することができる。 [Tenth aspect]
The tenth aspect is a learning data collection system that collects learning data for constructing the machine lip reading model used in the speech content recognition device of the ninth aspect, characterized in that it has a plurality of imaging devices (e.g., recording cameras 31-1 to 31-10) that image a speaker at a predetermined position from a plurality of different imaging directions, a voice acquisition device (e.g., recording microphone 21) that acquires the voice of the speaker, an instruction device (e.g., display 42) that instructs the speaker on the content of the utterance, and a control device 43 that executes control to have the instruction device instruct the content of the utterance, simultaneously image images of the lips of the speaker uttering the instructed content by the plurality of imaging devices, acquire the voice of the speaker by the voice acquisition device, and store the obtained lip image data and voice data in a storage device (e.g., learning data storage medium 32).
According to this, the lip images of the speaker who speaks the utterance contents instructed by the instruction device can be simultaneously captured from different imaging angles by the multiple imaging devices, and the voice of the speaker at that time can be acquired by the voice acquisition device and stored in the storage device. This makes it possible to quickly and massively collect lip image data captured from different imaging angles and corresponding voice data. Therefore, it is possible to easily and quickly collect a large amount of learning data required to build a machine lip reading model used in the lip reading process and a voice recognition model used in the voice recognition process in the multimodal utterance content recognition device of the ninth aspect described above.

［第１１態様］
第１１態様は、発話内容認識装置により話者の発話内容を認識する方法であって、話者の口唇画像データを前記発話内容認識装置に入力する入力工程と、前記発話内容認識装置が、対応方向から撮像された口唇画像データに対する読唇精度の高い複数の読唇部を用いて、前記入力工程で入力された口唇画像データの読唇処理を行う読唇工程と、前記発話内容認識装置が、前記読唇工程によって得られた前記複数の読唇部の各読唇処理結果を統合し、当該統合の結果に基づいて前記話者の発話内容の認識結果を生成する統合生成工程とを有し、前記複数の読唇部のうちの少なくとも１つの読唇部は、当該対応方向の中に、他のいずれかの読唇部における対応方向に含まれていない方向を含むように構成されていることを特徴とするものである。
本態様によれば、上述した複数の読唇部における対応方向の数を超える、様々な種類（撮像方向）の口唇画像データについて発話内容を高精度に認識することができる。 [Eleventh aspect]
An eleventh aspect is a method for recognizing the content of a speaker's utterance by a speech content recognition device, comprising: an input step of inputting lip image data of the speaker into the speech content recognition device; a lip reading step in which the speech content recognition device performs lip reading processing of the lip image data input in the input step using a plurality of lip reading units having high lip reading accuracy for lip image data captured from corresponding directions; and an integration generation step in which the speech content recognition device integrates the lip reading processing results of the plurality of lip reading units obtained by the lip reading step and generates a recognition result of the speaker's speech content based on the result of the integration, wherein at least one lip reading unit of the plurality of lip reading units is configured to include a direction in its corresponding direction that is not included in the corresponding direction of any of the other lip reading units.
According to this aspect, it is possible to recognize the speech content with high accuracy for various types (imaging directions) of lip image data, which exceeds the number of corresponding directions of the above-mentioned multiple lip reading units.

［第１２態様］
第１２態様は、話者の発話内容を認識する発話内容認識装置のコンピュータに実行されるプログラムであって、前記発話内容認識装置に入力された口唇画像データに対し、対応方向から撮像された口唇画像データに対する読唇精度の高い複数の読唇手段によりそれぞれ読唇処理した各読唇処理結果を統合し、当該統合の結果に基づいて前記話者の発話内容の認識結果を生成する統合生成手段として、前記コンピュータを機能させるものであり、前記複数の読唇部のうちの少なくとも１つの読唇部は、当該対応方向の中に、他のいずれかの読唇部における対応方向に含まれていない方向を含むように構成されていることを特徴とするものである。
本態様によれば、上述した複数の読唇部における対応方向の数を超える、様々な種類（撮像方向）の口唇画像データについて発話内容を高精度に認識することができる。 [Twelfth aspect]
A twelfth aspect is a program executed by a computer of a speech content recognition device that recognizes the speech content of a speaker, the program causing the computer to function as integration generation means that integrates the lip reading processing results obtained by processing lip image data input to the speech content recognition device with each of a plurality of lip reading means having high lip reading accuracy for lip image data captured from corresponding directions, and that generates a recognition result of the speaker's speech content based on the result of the integration, and is characterized in that at least one of the plurality of lip reading units is configured such that its corresponding directions include a direction not included in the corresponding directions of any other lip reading unit.
According to this aspect, it is possible to recognize the speech content with high accuracy for various types (imaging directions) of lip image data, which exceeds the number of corresponding directions of the above-mentioned multiple lip reading units.

［第１３態様］
第１３態様は、第１０態様の学習データ収集システムにより、前記発話内容認識装置で用いられる前記機械読唇モデルを構築するための学習データを収集する方法であって、前記指示装置に発話内容を指示させ、指示された発話内容を発話する話者の口唇画像を前記複数の撮像装置により同時に撮像するとともに、該話者の音声を前記音声取得装置により取得し、得られた口唇画像データ及び音声データを記憶装置に記憶することを特徴とするものである。
本態様によれば、上述した第９態様のマルチモーダル発話内容認識装置における読唇処理で用いられる機械読唇モデル及び音声認識処理で用いられる音声認識モデルを構築するために必要となる大量の学習データを、容易かつ迅速に収集することができる。 [Thirteenth aspect]
A thirteenth aspect is a method for collecting learning data for constructing the machine lip reading model used in the speech content recognition device by the learning data collection system of the tenth aspect, characterized in that the instruction device is caused to indicate the speech content, lip images of a speaker speaking the indicated speech content are simultaneously captured by the multiple imaging devices, and the speaker's voice is acquired by the voice acquisition device, and the obtained lip image data and voice data are stored in a storage device.
According to this aspect, it is possible to easily and quickly collect a large amount of training data required to construct a machine lip reading model used in the lip reading process and a speech recognition model used in the speech recognition process in the multimodal speech content recognition device of the ninth aspect described above.
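The collection procedure can be sketched as a simple loop. This is an illustrative sketch under stated assumptions, not the system of the tenth aspect: the `Sample` record, the `collect` function, and the callables standing in for the instruction device, imaging devices, and voice acquisition device are all hypothetical. The loop reads the cameras one after another, so it only approximates simultaneous capture; a real system would need a shared trigger or clock.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    prompt: str               # speech content indicated to the speaker
    frames: Dict[str, bytes]  # one lip image per camera, keyed by camera id
    audio: bytes              # voice data recorded for the same utterance


def collect(prompts: List[str],
            cameras: Dict[str, Callable[[], bytes]],
            microphone: Callable[[], bytes],
            indicate: Callable[[str], None],
            storage: List[Sample]) -> None:
    """Indicate each prompt, read every camera and the microphone, then store."""
    for prompt in prompts:
        indicate(prompt)  # stands in for the instruction device
        frames = {cam_id: grab() for cam_id, grab in cameras.items()}
        audio = microphone()
        storage.append(Sample(prompt, frames, audio))


# Hypothetical stand-ins for the hardware, used only to exercise the loop.
shown: List[str] = []
store: List[Sample] = []
collect(["hello"],
        {"cam1": lambda: b"a", "cam2": lambda: b"b"},
        lambda: b"w",
        shown.append,
        store)
```

Keeping the prompt, all camera views, and the audio in one record means each stored sample can serve as training data for both the lip reading model and the speech recognition model.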

[Fourteenth aspect]
A fourteenth aspect is a program executed by a computer of the control device in the learning data collection system of the tenth aspect. The program causes the computer to function as control means that executes control to cause the instruction device to indicate speech content, to capture lip images of a speaker uttering the indicated speech content simultaneously with the plurality of imaging devices while acquiring the speaker's voice with the voice acquisition device, and to store the obtained lip image data and voice data in a storage device.
According to this aspect, it is possible to easily and quickly collect a large amount of training data required to construct a machine lip reading model used in the lip reading process and a speech recognition model used in the speech recognition process in the multimodal speech content recognition device of the ninth aspect described above.

1: Camera
2: Storage medium
3: Microphone
21: Recording microphone
31: Camera array
31-1 to 31-10: Recording camera
32: Learning data storage medium
33-1: First learning unit
33-2: Second learning unit
41: Reporting unit
42: Display
43: Control device
100: Lip reading device
101: Lip reading recognition processing unit
111: Image input unit
112: Angle estimation unit
121 to 123: Angle conversion unit
131, 132, 133, 136: Single angle compatible lip reading unit
134, 135, 137: Multiple angle compatible lip reading unit
141: Lip reading result integration unit
201: Speech recognition processing unit
211: Speech input unit
231: Speech recognition unit
300: Multimodal speech recognition device
301: Recognition result integration unit

A. Koumparoulis et al., "Deep view2view mapping for view-invariant lipreading", IEEE SLT, 2018, p.588-594
S. Petridis et al., "End-to-end Multiview Lip Reading", IEEE ICASSP, 2018, p.6548-6552

Claims

1. A speech content recognition device that recognizes the content of a speaker's speech, comprising:
an input unit for inputting lip image data of a speaker;
a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions; and
an integration generation unit that integrates the lip reading process results of the plurality of lip reading units for the lip image data input to the input unit and generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units, and
the plurality of lip reading units includes a multi-directional lip reading unit having two or more corresponding directions.

2. A speech content recognition device that recognizes the content of a speaker's speech, comprising:
an input unit for inputting lip image data of a speaker;
a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions;
an integration generation unit that integrates the lip reading process results of the plurality of lip reading units for the lip image data input to the input unit and generates a recognition result of the speech content of the speaker based on the integration result; and
an imaging direction estimation unit that estimates a plurality of imaging directions of the lip image data input to the input unit and certainty factor information for each imaging direction,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units, and
the integration generation unit integrates the lip reading process results using the plurality of estimation results from the imaging direction estimation unit, and generates the recognition result of the speaker's speech content based on the integration result.
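One way to picture integration that uses several direction estimates is a certainty-weighted sum. The sketch below is an assumption made for illustration, not the integration rule of this claim: the function name, the weighting scheme (each lip reading unit weighted by the total certainty of the estimated directions it supports), and all numbers are hypothetical.

```python
from typing import Dict, Set


def integrate_with_direction_certainty(
    direction_certainty: Dict[int, float],
    unit_directions: Dict[str, Set[int]],
    unit_scores: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Weight each unit's candidate scores by the certainty that the input was
    captured from a direction the unit supports, then sum per candidate."""
    combined: Dict[str, float] = {}
    for unit, scores in unit_scores.items():
        weight = sum(certainty
                     for direction, certainty in direction_certainty.items()
                     if direction in unit_directions[unit])
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weight * score
    return combined


# Hypothetical example: the estimator favours 0 degrees over 30 degrees.
combined = integrate_with_direction_certainty(
    direction_certainty={0: 0.75, 30: 0.25},
    unit_directions={"front": {0}, "side": {30}},
    unit_scores={"front": {"yes": 0.5, "no": 0.5},
                 "side": {"yes": 1.0, "no": 0.0}},
)
best = max(combined, key=combined.get)
```

Using all estimated directions with their certainties, rather than only the single most likely direction, lets a unit for a less likely direction still contribute when the estimate is uncertain.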

3. A speech content recognition device that recognizes the content of a speaker's speech, comprising:
an input unit for inputting lip image data of a speaker;
a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions; and
an integration generation unit that integrates the lip reading process results of the plurality of lip reading units for the lip image data input to the input unit and generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units,
the plurality of lip reading units generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
the integration generation unit integrates, for each speech content candidate, the reliability information included in the lip reading process results of the plurality of lip reading units.
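Per-candidate integration of reliability information can be sketched as follows. This is a minimal illustration under assumptions, not the claimed integration rule: it assumes reliability values are non-negative scores that can simply be added, and the function name `integrate_reliability` and the example candidates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def integrate_reliability(
    results: List[Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Sum the reliability of each candidate over all units and pick the best."""
    totals: Dict[str, float] = defaultdict(float)
    for result in results:
        for candidate, reliability in result.items():
            totals[candidate] += reliability
    if not totals:
        raise ValueError("no speech content candidates to integrate")
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical outputs of three lip reading units for the same utterance.
best, totals = integrate_reliability([
    {"hello": 0.5, "yellow": 0.25},
    {"hello": 0.5, "mellow": 0.25},
    {"yellow": 0.5},
])
```

Because the units may propose different candidate sets, the totals are keyed by candidate so that a candidate proposed by several units accumulates support from each of them.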

4. A speech content recognition device that recognizes the content of a speaker's speech, comprising:
an input unit for inputting lip image data of a speaker;
a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions;
an integration generation unit that integrates the lip reading process results of the plurality of lip reading units for the lip image data input to the input unit and generates a recognition result of the speech content of the speaker based on the integration result; and
a speech recognition unit that outputs, as a recognition result, speech content candidates of the speaker and reliability information for each of the speech content candidates from voice data of the speaker,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units,
the plurality of lip reading units generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
the integration generation unit integrates the lip reading process results of the plurality of lip reading units and the recognition result of the speech recognition unit, and generates the recognition result of the speaker's speech content based on the result of the integration.
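Integration of lip reading results with a speech recognition result can be pictured as a weighted late fusion of candidate scores. The sketch below is an assumption for illustration only, not the claimed method: the function name `fuse_audio_visual`, the `audio_weight` parameter, and the example scores are hypothetical.

```python
from typing import Dict


def fuse_audio_visual(visual: Dict[str, float],
                      audio: Dict[str, float],
                      audio_weight: float = 0.5) -> Dict[str, float]:
    """Blend per-candidate reliability from lip reading and speech recognition."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    # A candidate missing from one modality contributes 0.0 for that modality.
    candidates = set(visual) | set(audio)
    return {
        c: audio_weight * audio.get(c, 0.0)
           + (1.0 - audio_weight) * visual.get(c, 0.0)
        for c in candidates
    }


# Hypothetical scores: lip reading is undecided, audio prefers "b".
fused = fuse_audio_visual({"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75})
best = max(fused, key=fused.get)
```

A lower `audio_weight` would shift the decision toward the lip reading results, which may be preferable when the audio is noisy.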

5. The speech content recognition device according to claim 1,
wherein the plurality of lip reading units includes a unidirectional lip reading unit having one corresponding direction.

6. The speech content recognition device according to claim 1, further comprising:
a data conversion unit that, based on the lip image data input to the input unit, generates converted data whose imaging direction corresponds to a corresponding direction of at least one of the plurality of lip reading units,
wherein the at least one lip reading unit performs lip reading processing using the data converted by the data conversion unit.

7. The speech content recognition device according to claim 1, 2 or 4,
wherein the plurality of lip reading units generate, as lip reading process results, intermediate information for estimating speech content candidates, and
the integration generation unit integrates the intermediate information included in each of the lip reading process results of the plurality of lip reading units.

8. The speech content recognition device according to claim 1,
wherein the plurality of lip reading units includes a lip reading unit that performs lip reading processing on the lip image data input to the input unit by having a computer execute a machine lip reading model trained using training data including lip image data of a speaker.

9. A speech content recognition method in which a speech content recognition device recognizes the content of a speaker's speech, comprising:
an input step of inputting lip image data of a speaker into the speech content recognition device;
a lip reading step in which the speech content recognition device performs lip reading processing on the lip image data input in the input step using a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions; and
an integration generation step in which the speech content recognition device integrates the lip reading process results of the plurality of lip reading units obtained in the lip reading step and generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units, and
the plurality of lip reading units includes a multi-directional lip reading unit having two or more corresponding directions.

10. A speech content recognition method in which a speech content recognition device recognizes the content of a speaker's speech, comprising:
an input step of inputting lip image data of a speaker into the speech content recognition device;
a lip reading step in which the speech content recognition device performs lip reading processing on the lip image data input in the input step using a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions;
an integration generation step in which the speech content recognition device integrates the lip reading process results of the plurality of lip reading units obtained in the lip reading step and generates a recognition result of the speech content of the speaker based on the integration result; and
an imaging direction estimation step of estimating a plurality of imaging directions of the lip image data input in the input step and certainty factor information for each imaging direction,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units, and
in the integration generation step, the lip reading process results are integrated using the plurality of estimation results from the imaging direction estimation step, and the recognition result of the speaker's speech content is generated based on the integration result.

11. A speech content recognition method in which a speech content recognition device recognizes the content of a speaker's speech, comprising:
an input step of inputting lip image data of a speaker into the speech content recognition device;
a lip reading step in which the speech content recognition device performs lip reading processing on the lip image data input in the input step using a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions; and
an integration generation step in which the speech content recognition device integrates the lip reading process results of the plurality of lip reading units obtained in the lip reading step and generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units,
the plurality of lip reading units generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
in the integration generation step, the reliability information included in the lip reading process results of the plurality of lip reading units is integrated for each speech content candidate.

12. A speech content recognition method in which a speech content recognition device recognizes the content of a speaker's speech, comprising:
an input step of inputting lip image data of a speaker into the speech content recognition device;
a lip reading step in which the speech content recognition device performs lip reading processing on the lip image data input in the input step using a plurality of lip reading units, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions;
an integration generation step in which the speech content recognition device integrates the lip reading process results of the plurality of lip reading units obtained in the lip reading step and generates a recognition result of the speech content of the speaker based on the integration result; and
a speech recognition step of outputting, as a recognition result, speech content candidates of the speaker and reliability information for each of the speech content candidates from voice data of the speaker,
wherein at least one of the plurality of lip reading units includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading units,
the plurality of lip reading units generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
in the integration generation step, the lip reading process results of the plurality of lip reading units and the recognition result of the speech recognition step are integrated, and the recognition result of the speaker's speech content is generated based on the result of the integration.

13. A program executed on a computer of a speech content recognition device that recognizes the content of a speaker's speech,
the program causing the computer to function as integration generation means that integrates the lip reading process results obtained when a plurality of lip reading means, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions, process the lip image data input to the speech content recognition device, and that generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading means includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading means, and
the plurality of lip reading means includes a multi-directional lip reading means having two or more corresponding directions.

14. A program executed on a computer of a speech content recognition device that recognizes the content of a speaker's speech,
the program causing the computer to function as: integration generation means that integrates the lip reading process results obtained when a plurality of lip reading means, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions, process the lip image data input to the speech content recognition device, and that generates a recognition result of the speech content of the speaker based on the integration result; and imaging direction estimation means that estimates a plurality of imaging directions of the input lip image data and certainty factor information for each imaging direction,
wherein at least one of the plurality of lip reading means includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading means, and
the integration generation means integrates the lip reading process results using the plurality of estimation results from the imaging direction estimation means, and generates the recognition result of the speaker's speech content based on the integration result.

15. A program executed on a computer of a speech content recognition device that recognizes the content of a speaker's speech,
the program causing the computer to function as integration generation means that integrates the lip reading process results obtained when a plurality of lip reading means, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions, process the lip image data input to the speech content recognition device, and that generates a recognition result of the speech content of the speaker based on the integration result,
wherein at least one of the plurality of lip reading means includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading means,
the plurality of lip reading means generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
the integration generation means integrates, for each speech content candidate, the reliability information included in the lip reading process results of the plurality of lip reading means.

16. A program executed on a computer of a speech content recognition device that recognizes the content of a speaker's speech,
the program causing the computer to function as: integration generation means that integrates the lip reading process results obtained when a plurality of lip reading means, each having high lip reading accuracy for lip image data captured from its one or more corresponding directions, process the lip image data input to the speech content recognition device, and that generates a recognition result of the speech content of the speaker based on the integration result; and speech recognition means that outputs, as a recognition result, speech content candidates of the speaker and reliability information for each of the speech content candidates from voice data of the speaker,
wherein at least one of the plurality of lip reading means includes, among its corresponding directions, a direction that is not included in the corresponding directions of any of the other lip reading means,
the plurality of lip reading means generate lip reading process results each including one or more speech content candidates estimated by the lip reading process and reliability information for each of the speech content candidates, and
the integration generation means integrates the lip reading process results of the plurality of lip reading means and the recognition result of the speech recognition means, and generates the recognition result of the speaker's speech content based on the result of the integration.