JP7624707B2 - Facial synthesis lip reading device and facial synthesis lip reading method

Info

Publication number: JP7624707B2
Application number: JP2021045840A
Authority: JP
Inventors: 剛史齊藤
Original assignee: Kyushu Institute of Technology NUC
Current assignee: Kyushu Institute of Technology NUC
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2025-01-31
Anticipated expiration: 2041-03-19
Also published as: JP2022144707A

Description

The present invention relates to a face synthesis lip reading device and a face synthesis lip reading method that use machine learning to infer, with high accuracy, the speech content of an unspecified recognition target speaker.

Conventionally, speech recognition technology, which converts audio information into text, achieves a sufficient recognition rate in low-noise environments such as laboratories and is gradually becoming widespread. However, it is difficult to use in noisy environments such as offices and outdoor locations, where it is easily affected by ambient noise, and in public places such as trains and hospitals, where speaking aloud is difficult, so it lacks practicality. In addition, people with speech impairments who have difficulty speaking cannot use speech recognition technology, so it also lacks versatility.
In contrast, lip reading technology infers the speech content from the movement of the speaker's lips and the like. Because the speaker does not need to produce sound (no audio information is required) and the speech content can be inferred from video alone, lip reading can be expected to work in noisy environments and public places, and it can also be used by people with speech impairments. Computer-based lip reading in particular requires no special training and can be used easily by anyone, so its widespread adoption is anticipated.
For example, Patent Document 1 discloses a word spotting lip reading device comprising: an imaging means for acquiring a facial image including a lip area; an area extraction means for extracting the lip area from the acquired image; a feature measurement means for measuring shape features from the extracted lip area; a keyword DB for registering the features of keyword speech scenes measured in a registration mode; a determination means that, in a recognition mode, recognizes (infers) the speech content of the lips by comparing the features of the registered keywords with features measured from a sentence speech scene, thereby performing word spotting lip reading that recognizes keywords within a sentence; and a display means for displaying the recognition result produced by the determination means.

Patent Document 1: JP 2012-59017 A

In conventional computer-based lip reading technology, including that of Patent Document 1, a model is trained by machine learning in a registration mode (learning mode, at learning time) using speech scenes of the speakers to be registered (learning target speakers), and the speech content of the speaker to be recognized (recognition target speaker) is inferred in a recognition mode. This recognition task generally takes two forms: speaker-dependent recognition, in which the recognition target speaker is among the learning target speakers, and speaker-independent recognition, in which the recognition target speaker is not among the learning target speakers. In speaker-dependent recognition, the learning target speakers and the recognition target speaker are specified (limited), so the recognition accuracy is usually higher than in speaker-independent recognition. However, speech scenes of the user (recognition target speaker) must be registered (learned) in advance with the user as a learning target speaker, so building the learning model is laborious. In contrast, speaker-independent recognition requires no such prior registration (anyone can be a learning target speaker), and learning can be performed on speech scenes of unspecified learning target speakers (for example, using an existing database). This makes the learning model easy to build, but recognition is more difficult than in speaker-dependent recognition.
The present invention has been made in view of these circumstances, and aims to provide a highly practical face synthesis lip reading device and face synthesis lip reading method that improve lip reading recognition accuracy by applying image processing to the speaker's facial image.
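The distinction between speaker-dependent and speaker-independent recognition can be illustrated with a small data split. The sketch below is only an illustration: the sample records, the `speaker` and `word` keys, and the function name `split_by_speaker` are hypothetical and do not come from the patent. It shows the defining property of the speaker-independent setting, namely that no recognition target speaker appears among the learning target speakers.

```python
def split_by_speaker(samples, test_speakers):
    """Speaker-independent split: test speakers never appear in training."""
    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test


# Hypothetical speech-scene records (speaker label plus spoken word).
samples = [
    {"speaker": "A", "word": "hello"},
    {"speaker": "A", "word": "thanks"},
    {"speaker": "B", "word": "hello"},
    {"speaker": "C", "word": "thanks"},
]

# Speaker C is held out, so C plays the role of an unspecified
# recognition target speaker.
train, test = split_by_speaker(samples, test_speakers={"C"})
```

In a speaker-dependent setup, by contrast, scenes from the same speaker would appear on both sides of the split.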

A face synthesis lip reading device according to a first invention that achieves the above object comprises: an image acquisition unit that, at learning time, reads a learning target image in which a speech scene of a learning target speaker is recorded and, at recognition time, reads a recognition target image in which a speech scene of a recognition target speaker is recorded; an image processing unit that applies image processing to the learning target image and the recognition target image read by the image acquisition unit to extract learning target data and recognition target data, respectively; a learning processing unit that, at learning time, performs machine learning of lip reading based on the learning target data and constructs a learning model; a lip reading database that stores the learning model; and a recognition processing unit that, at recognition time, infers the speech content of the recognition target speaker by machine learning from the recognition target data and the learning model stored in the lip reading database.
The image processing unit includes: a face detection means that detects a learning-time face image of the learning target speaker from the learning target image and a recognition-time face image of the recognition target speaker from the recognition target image; a face synthesis means that converts the learning-time face image and the recognition-time face image detected by the face detection means into a learning-time synthetic face image and a recognition-time synthetic face image, respectively, using a face image of a specific speaker; a lip area extraction means that extracts a learning-time lip area and a recognition-time lip area from the learning-time synthetic face image and the recognition-time synthetic face image created by the face synthesis means, respectively; and a feature extraction means that extracts learning-time lip features from the learning-time lip area as the learning target data and extracts recognition-time lip features from the recognition-time lip area as the recognition target data.
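The order of the four means inside the image processing unit can be sketched as a simple pipeline. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `ImageProcessingUnit`, its field names, and the string-tagging stand-in functions are all hypothetical, and real components would operate on image arrays.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ImageProcessingUnit:
    """Hypothetical container chaining the four means in the stated order."""
    detect_face: Callable[[Any], Any]            # face detection means
    synthesize_face: Callable[[Any], Any]        # face synthesis means
    extract_lip_region: Callable[[Any], Any]     # lip area extraction means
    extract_features: Callable[[List[Any]], Any]  # feature extraction means

    def process(self, frames: List[Any]) -> Any:
        lip_regions = []
        for frame in frames:
            face = self.detect_face(frame)
            synthetic = self.synthesize_face(face)  # convert to specific speaker
            lip_regions.append(self.extract_lip_region(synthetic))
        # Lip features are computed over the whole sequence of lip areas.
        return self.extract_features(lip_regions)


# Stand-in functions that only tag their input, to make the order visible.
unit = ImageProcessingUnit(
    detect_face=lambda f: f"face({f})",
    synthesize_face=lambda f: f"synth({f})",
    extract_lip_region=lambda f: f"lips({f})",
    extract_features=lambda regions: regions,
)
result = unit.process(["frame0", "frame1"])
```

The same `process` method serves both learning and recognition, which mirrors the text: both kinds of input pass through identical face synthesis before any features are extracted.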

In the face synthesis lip reading device according to the first invention, the image processing unit may have a facial feature point detection means that detects learning-time facial feature points and recognition-time facial feature points from the learning-time synthetic face image and the recognition-time synthetic face image, respectively, and the lip area extraction means may extract the learning-time lip area and the recognition-time lip area from the learning-time facial feature points and the recognition-time facial feature points, respectively.

In the face synthesis lip reading device according to the first invention, the feature extraction means may also extract, as the learning target data, learning-time facial expression features from the learning-time facial feature points in addition to the learning-time lip features, and, as the recognition target data, recognition-time facial expression features from the recognition-time facial feature points in addition to the recognition-time lip features.

The face synthesis lip reading device according to the first invention may include an imaging means that captures the speech scenes of the learning target speaker and the recognition target speaker, and a recognition result output unit that outputs the speech content of the recognition target speaker inferred by the recognition processing unit.

In the face synthesis lip reading device according to the first invention, the recognition result output unit preferably includes a display that shows the speech content of the recognition target speaker inferred by the recognition processing unit as text and/or a speaker that outputs it as audio.

A face synthesis lip reading method according to a second invention that achieves the above object comprises: a first learning step of reading a learning target image in which a speech scene of a learning target speaker is recorded; a second learning step of detecting a learning-time face image of the learning target speaker from the learning target image; a third learning step of converting the learning-time face image into a learning-time synthetic face image using a face image of a specific speaker; a fourth learning step of extracting a learning-time lip area from the learning-time synthetic face image; a fifth learning step of extracting learning-time lip features from the learning-time lip area as learning target data; a sixth learning step of repeating the first to fifth learning steps, performing machine learning of lip reading based on the learning target data, and constructing a learning model; a seventh learning step of saving the learning model; a first recognition step of reading the saved learning model; a second recognition step of reading a recognition target image in which a speech scene of a recognition target speaker is recorded; a third recognition step of detecting a recognition-time face image of the recognition target speaker from the recognition target image; a fourth recognition step of converting the recognition-time face image into a recognition-time synthetic face image using the face image of the specific speaker; a fifth recognition step of extracting a recognition-time lip area from the recognition-time synthetic face image; a sixth recognition step of extracting recognition-time lip features from the recognition-time lip area as recognition target data; and a seventh recognition step of inferring the speech content of the recognition target speaker by machine learning from the recognition target data and the learning model.

In the face synthesis lip reading method according to the second invention, the fourth learning step may detect learning-time facial feature points from the learning-time synthetic face image and extract the learning-time lip area from those feature points, and the fifth recognition step may detect recognition-time facial feature points from the recognition-time synthetic face image and extract the recognition-time lip area from those feature points.

In the face synthesis lip reading method according to the second invention, the fifth learning step may also extract, as the learning target data, learning-time facial expression features from the learning-time facial feature points in addition to the learning-time lip features, and the sixth recognition step may also extract, as the recognition target data, recognition-time facial expression features from the recognition-time facial feature points in addition to the recognition-time lip features.

The face synthesis lip reading device according to the first invention and the face synthesis lip reading method according to the second invention convert the face images of unspecified speakers (the learning target speaker and the recognition target speaker are collectively referred to as speakers) into face images of a specific speaker before performing the learning and recognition processes. As a result, the speech content can be inferred with high accuracy at recognition time, and recognition accuracy is improved.

FIG. 1 is a block diagram showing the configuration of a face synthesis lip reading device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the functions of the image processing unit of the face synthesis lip reading device.
FIG. 3 is a flowchart showing the operation at learning time of a face synthesis lip reading method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the operation at recognition time of the face synthesis lip reading method.

Next, an embodiment of the present invention will be described to aid understanding of the invention.
The face synthesis lip reading device 10 shown in FIG. 1 and the face synthesis lip reading method according to an embodiment of the present invention apply to computer-based (machine learning) lip reading. They convert the face images of unspecified learning target speakers and unspecified recognition target speakers into face images of a specific speaker before performing the learning and recognition processes, thereby inferring the speech content of an unspecified recognition target speaker with high accuracy and improving recognition accuracy (lip reading accuracy).
As shown in FIG. 1, the face synthesis lip reading device 10 includes an imaging means 11 that captures (records) the speech scenes of the learning target speaker and the recognition target speaker. The device also includes an image acquisition unit 13 that, at learning time, reads from the imaging means 11 a learning target image in which a speech scene of the learning target speaker is recorded and, at recognition time, reads from the imaging means 11 a recognition target image in which a speech scene of the recognition target speaker is recorded. The device further includes an image processing unit 14 that applies image processing to the learning target image and the recognition target image read by the image acquisition unit 13 to extract the learning target data and recognition target data required for machine learning. In addition, the device includes a learning processing unit 15 that, at learning time, performs machine learning of lip reading based on the learning target data and constructs a learning model, and a lip reading database 16 that stores the learning model. Finally, the device includes a recognition processing unit 17 that, at recognition time, infers the speech content of the recognition target speaker by machine learning from the recognition target data and the learning model stored in the lip reading database 16.

As shown in FIG. 1, the face synthesis lip reading device 10 comprises the image acquisition unit 13, the image processing unit 14, the learning processing unit 15, the lip reading database 16, and the recognition processing unit 17. A program that executes the face synthesis lip reading method used in the device is installed on a computer 18, and when the CPU of the computer 18 executes the program, the computer 18 functions as the image acquisition unit 13, the image processing unit 14, the learning processing unit 15, the lip reading database 16, and the recognition processing unit 17. A desktop or notebook computer is preferably used, but the form of the computer is not limited to these and can be selected as appropriate. Some or all of the image acquisition unit 13, the image processing unit 14, the learning processing unit 15, the lip reading database 16, and the recognition processing unit 17 can also be used over a network through cloud computing.
A video camera is preferably used as the imaging means, but the face synthesis lip reading device does not need a dedicated imaging means. Any imaging means that has captured a speech scene, or any storage means (storage medium) on which a speech scene is recorded, can be connected to the computer (image acquisition unit) to read the learning target image or the recognition target image. A smartphone or similar device with a video recording function may therefore be used as the imaging means. Instead of connecting the imaging means to the computer (image acquisition unit), a storage medium such as a memory card can also be removed from the imaging means and inserted into the computer (image acquisition unit) to read the image.

The face synthesis lip reading device 10 also includes a recognition result output unit 19 that outputs the speech content of the recognition target speaker inferred by the recognition processing unit 17. In this embodiment, the recognition result output unit 19 includes a display 20 that shows the inferred speech content as text and a speaker 21 that outputs it as audio, and either or both can be selected as appropriate for the location and environment in which the device is used. The display and the speaker may be accessories or built-in components of the computer, or may be attached to the computer separately (externally). The recognition result output unit may also be configured with only one of the display and the speaker.

Next, the image processing unit 14 will be described in detail with reference to FIG. 2.
The image processing unit 14 includes a face detection means 22 that, at learning time, detects a learning-time face image of the learning target speaker from the learning target image and, at recognition time, detects a recognition-time face image of the recognition target speaker from the recognition target image. The image processing unit 14 also includes a face synthesis means 23 that converts the learning-time face image and the recognition-time face image detected by the face detection means 22 into a learning-time synthetic face image and a recognition-time synthetic face image, respectively, using the face image of a specific speaker, and a lip area extraction means 24 that extracts a learning-time lip area and a recognition-time lip area from the learning-time synthetic face image and the recognition-time synthetic face image created by the face synthesis means 23, respectively. The image processing unit 14 further includes a feature extraction means 25 that extracts learning-time lip features from the learning-time lip area as the learning target data and extracts recognition-time lip features from the recognition-time lip area as the recognition target data.

The face synthesis means 23 can move (animate) a face image (still image) of a specific person so that it follows the speaking motion in the learning-time and recognition-time face images of an unspecified person. In other words, the learning-time and recognition-time face images of unspecified speakers are converted into learning-time and recognition-time synthetic face images using the face image of a specific speaker, which improves lip reading recognition accuracy. The face image of the specific speaker may be, for example, the face image of a person registered in an existing database, or the face image of any other person. It need not depict a real person: it may be a face drawn with computer graphics or animation, or an anthropomorphized face of a non-human subject such as a robot, car, animal, or insect. The specific speaker is not limited to one person and may be a small number of specified people. Recognition accuracy improves because the learning target data and the recognition target data, which would otherwise come from many unspecified speakers, are consolidated onto one or a few specific speakers.
When the specific speaker is selected from among the learning target speakers or the recognition target speakers, the face image of that selected speaker may be converted into the face image of the specific speaker in the same way as for the other speakers, or it may be left unconverted. The First Order Motion Model (FOMM) is preferably used in the face synthesis means 23, but the face synthesis means is not limited to it.
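The idea of driving a still face image with another person's motion can be illustrated at the level of keypoints. The following is a toy sketch only: it is not the FOMM implementation, which learns its keypoints and renders the output image with neural networks. The function name `transfer_motion` and the sample coordinates are hypothetical. Each keypoint of the specific speaker's still image is shifted by the displacement that the corresponding keypoint of the driving (unspecified) speaker has made since the first frame.

```python
def transfer_motion(source_kp, driving_kp_initial, driving_kp_current):
    """Shift each source keypoint by its driving counterpart's displacement.

    All arguments are lists of (x, y) tuples of equal length. The source
    keypoints belong to the specific speaker's still image; the driving
    keypoints belong to the speaker actually being filmed.
    """
    return [
        (sx + (dx - dx0), sy + (dy - dy0))
        for (sx, sy), (dx0, dy0), (dx, dy)
        in zip(source_kp, driving_kp_initial, driving_kp_current)
    ]


# Two hypothetical lip keypoints on the specific speaker's still image.
source_kp = [(10.0, 20.0), (30.0, 20.0)]
# The same two keypoints on the driving speaker, first frame and now.
driving_first = [(100.0, 200.0), (120.0, 200.0)]
driving_now = [(100.0, 205.0), (120.0, 203.0)]

animated = transfer_motion(source_kp, driving_first, driving_now)
```

Because only displacements are transferred, the animated keypoints keep the specific speaker's own geometry, which is the property the text relies on when it consolidates many speakers onto one face.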

The image processing unit 14 uses the lip area extraction means 24 to extract the learning-time lip area and the recognition-time lip area by applying image processing to the learning-time synthetic face image and the recognition-time synthetic face image. The means and method of extraction can be selected as appropriate. For example, the image processing unit 14 can first detect learning-time facial feature points and recognition-time facial feature points from the respective synthetic face images with a facial feature point detection means, and then extract the lip areas from those feature points with the lip area extraction means. The facial feature points represent, for example, the contour of the specific speaker's face and the positions and shapes of the eyebrows, eyes, nose, and mouth, and the number of feature points is selected as appropriate.
When, as in this embodiment, the face synthesis means 23 first converts the learning-time face image of the learning target speaker and the recognition-time face image of the recognition target speaker into synthetic face images using the face image of a specific speaker, the number of facial feature points can be reduced to speed up or simplify processing in the facial feature point detection means, or the facial feature point detection means can be omitted altogether to simplify processing in the image processing unit.
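Extracting a lip area from facial feature points can be sketched as taking a padded bounding box around the mouth points. The sketch below rests on assumptions that are not in the patent: it assumes the widely used 68-point landmark layout in which indices 48 to 67 outline the mouth, and the function name `lip_bounding_box` and the `margin` parameter are hypothetical.

```python
# Assumption: 68-point landmark layout with the mouth at indices 48-67.
MOUTH_INDICES = range(48, 68)


def lip_bounding_box(landmarks, margin=0.2):
    """Return (left, top, right, bottom) around the mouth landmarks.

    `landmarks` is a list of (x, y) tuples. `margin` pads the box by a
    fraction of its width and height so the area around the lips is kept.
    """
    xs = [landmarks[i][0] for i in MOUTH_INDICES]
    ys = [landmarks[i][1] for i in MOUTH_INDICES]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    pad_x = (right - left) * margin
    pad_y = (bottom - top) * margin
    return (left - pad_x, top - pad_y, right + pad_x, bottom + pad_y)


# Synthetic landmarks: 48 dummy points followed by 20 mouth points.
landmarks = [(0.0, 0.0)] * 48 + [
    (100.0 + i, 150.0 + (i % 5)) for i in range(20)
]
box = lip_bounding_box(landmarks, margin=0.5)
```

Since every synthetic face image shows the same specific speaker, a fixed crop could plausibly replace the landmark step, which is consistent with the text's remark that feature point detection can be reduced or omitted.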

The feature extraction means extracts learning-time lip features from the learning-time lip area as the learning target data and recognition-time lip features from the recognition-time lip area as the recognition target data. The learning processing unit and the recognition processing unit can then perform machine learning based on the respective lip features (features of the motion of the area around the lips) to construct the learning model and infer the speech content. As described above, when the image processing unit has a facial feature point detection means and the learning-time and recognition-time facial feature points have been detected from the respective synthetic face images, the feature extraction means can additionally extract learning-time facial expression features from the learning-time facial feature points and recognition-time facial expression features from the recognition-time facial feature points. Performing machine learning that takes into account the expression features of the whole face (for example, changes in the position, shape, and angle of the eyebrows, eyes, and mouth) in addition to the lip features can further raise the recognition accuracy (recognition rate) of the speech content, although sufficient recognition accuracy is obtained with the learning-time and recognition-time lip features alone.
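Facial expression features derived from feature points can be sketched as a few normalized distances. This is an illustrative sketch only: the patent does not specify which quantities are used, so the point names (`left_eye`, `mouth_top`, and so on), the three features, and the function names are all hypothetical choices.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def expression_features(points):
    """Compute simple scale-normalized shape features for one frame."""
    scale = distance(points["left_eye"], points["right_eye"])
    mouth_width = distance(points["mouth_left"], points["mouth_right"])
    mouth_height = distance(points["mouth_top"], points["mouth_bottom"])
    brow_raise = distance(points["left_brow"], points["left_eye"])
    return {
        "mouth_open_ratio": mouth_height / mouth_width,
        "mouth_width_norm": mouth_width / scale,
        "brow_raise_norm": brow_raise / scale,
    }


def feature_changes(frames):
    """Frame-to-frame change of each feature across a speech scene."""
    feats = [expression_features(f) for f in frames]
    return [
        {k: cur[k] - prev[k] for k in cur}
        for prev, cur in zip(feats, feats[1:])
    ]


# One hypothetical frame of named feature points.
frame = {
    "left_eye": (30.0, 40.0), "right_eye": (70.0, 40.0),
    "left_brow": (30.0, 30.0),
    "mouth_left": (40.0, 80.0), "mouth_right": (60.0, 80.0),
    "mouth_top": (50.0, 75.0), "mouth_bottom": (50.0, 85.0),
}
features = expression_features(frame)
```

Dividing by the inter-eye distance makes the features independent of how large the face appears in the image.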

次に、図３により、本発明の一実施の形態に係る顔合成読唇方法の学習時の動作について説明する。
Next, the operation during learning of the facial synthesis lip reading method according to the embodiment of the present invention will be described with reference to FIG.
First, in the first learning step, a learning target image in which a speech scene of a learning target speaker is recorded is read into the image acquisition unit 13 (S1). Next, in the second learning step, the face detection means 22 of the image processing unit 14 detects a learning-time face image of the learning target speaker from the learning target image (S2). In the third learning step, the face synthesis means 23 of the image processing unit 14 converts the learning-time face image into a learning-time synthetic face image using the face image of a specific speaker (S3), and in the fourth learning step, the lip area extraction means 24 of the image processing unit 14 extracts a learning-time lip area from the learning-time synthetic face image (S4). Furthermore, in the fifth learning step, the feature extraction means 25 of the image processing unit 14 extracts learning-time lip features from the learning-time lip area as learning target data (S5). The first to fifth learning steps are repeated once for each speech scene to be learned. Then, in the sixth learning step, the learning processing unit 15 performs machine learning of lip reading based on the learning target data extracted from each speech scene and constructs a learning model (S6). The learning model thus constructed is stored in the lip reading database 16 in the seventh learning step (S7).
As described above, the means and method for extracting the learning-time lip area may be selected as appropriate. For example, in the fourth learning step, the facial feature point detection means of the image processing unit may detect learning-time facial feature points from the learning-time synthetic face image, and the lip area extraction means may then extract the learning-time lip area based on those feature points. When the learning-time facial feature points have been detected, in the fifth learning step the feature extraction means of the image processing unit can extract, as learning target data, learning-time facial expression features from the learning-time facial feature points in addition to the learning-time lip features, so that machine learning takes into account the facial expression features of the entire face in addition to the lip features.
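The learning flow S1 to S7 can be sketched as a small pipeline. The following is only an illustrative sketch in Python, not the implementation of the embodiment. The callables `detect_face`, `synthesize_face`, `extract_lip_area`, `extract_lip_features` and `fit_model`, the per-scene label, and the dictionary standing in for the lip reading database 16 are all hypothetical placeholders introduced here. They are passed in as arguments so that any concrete method can be substituted.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_learning(
    scenes: List[Tuple[Any, str]],                       # (learning target image, label); the label is an assumption
    specific_face: Any,                                  # face image of the specific speaker
    detect_face: Callable[[Any], Any],                   # hypothetical stand-in for face detection (S2)
    synthesize_face: Callable[[Any, Any], Any],          # hypothetical stand-in for face synthesis (S3)
    extract_lip_area: Callable[[Any], Any],              # hypothetical stand-in for lip area extraction (S4)
    extract_lip_features: Callable[[Any], Any],          # hypothetical stand-in for feature extraction (S5)
    fit_model: Callable[[List[Tuple[Any, str]]], Any],   # hypothetical stand-in for machine learning (S6)
    database: Dict[str, Any],                            # stands in for the lip reading database 16
) -> Any:
    learning_data: List[Tuple[Any, str]] = []
    # S1 to S5 are repeated once per speech scene.
    for image, label in scenes:                          # S1: read one learning target image
        face = detect_face(image)                        # S2: learning-time face image
        synthetic = synthesize_face(face, specific_face) # S3: learning-time synthetic face image
        lip_area = extract_lip_area(synthetic)           # S4: learning-time lip area
        features = extract_lip_features(lip_area)        # S5: learning-time lip features
        learning_data.append((features, label))
    model = fit_model(learning_data)                     # S6: construct the learning model
    database["learning_model"] = model                   # S7: store the learning model
    return model
```

Passing the stages as callables keeps the sketch independent of any particular face detector, face synthesis technique, or learning algorithm, all of which the description leaves open.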

Next, the operation during recognition of the facial synthesis lip reading method will be described with reference to FIG. 4.
First, in the first recognition step, the learning model (trained model) stored in the lip reading database 16 is read (S1). Then, in the second recognition step, a recognition target image in which a speech scene of a recognition target speaker is recorded is read into the image acquisition unit 13 (S2). Next, in the third recognition step, the face detection means 22 of the image processing unit 14 detects a recognition-time face image of the recognition target speaker from the recognition target image (S3). In the fourth recognition step, the face synthesis means 23 of the image processing unit 14 converts the recognition-time face image into a recognition-time synthetic face image using the face image of the specific speaker (S4), and in the fifth recognition step, the lip area extraction means 24 of the image processing unit 14 extracts a recognition-time lip area from the recognition-time synthetic face image (S5). Furthermore, in the sixth recognition step, the feature extraction means 25 of the image processing unit 14 extracts recognition-time lip features from the recognition-time lip area as recognition target data (S6). Then, in the seventh recognition step, the speech content of the recognition target speaker is inferred from the recognition target data and the learning model by machine learning (lip reading processing) (S7). The inferred speech content (evaluation result) is converted into text and/or audio and output from the display 20 and/or the speaker 21 of the evaluation result output unit 19 (S8).
As described above, when the image processing unit has a facial feature point detection means, in the fifth recognition step the recognition-time facial feature points can be detected from the recognition-time synthetic face image and the recognition-time lip area can be extracted based on those feature points. In the sixth recognition step, the feature extraction means can extract, as recognition target data, recognition-time facial expression features from the recognition-time facial feature points in addition to the recognition-time lip features, so that the facial expression features of the entire face are taken into account in addition to the lip features.
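When facial feature points are available, one simple way to obtain a lip area from them is a padded bounding box around the feature points that belong to the lips. The sketch below illustrates this under stated assumptions: the axis-aligned box, the `margin_ratio` padding, and the function name `lip_area_from_points` are choices made here for illustration, since the description does not prescribe how the area is derived from the feature points.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def lip_area_from_points(
    lip_points: List[Point],
    image_width: int,
    image_height: int,
    margin_ratio: float = 0.2,   # assumed padding, relative to the tight box size
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) around the lip feature points.

    right and bottom are exclusive, and the box is clamped to the image.
    """
    if not lip_points:
        raise ValueError("at least one lip feature point is required")
    xs = [p[0] for p in lip_points]
    ys = [p[1] for p in lip_points]
    # Pad the tight box so that lip motion between frames stays inside it.
    pad_x = (max(xs) - min(xs)) * margin_ratio
    pad_y = (max(ys) - min(ys)) * margin_ratio
    left = max(0, int(min(xs) - pad_x))
    top = max(0, int(min(ys) - pad_y))
    right = min(image_width, int(max(xs) + pad_x) + 1)
    bottom = min(image_height, int(max(ys) + pad_y) + 1)
    return left, top, right, bottom
```

Because the same function would be applied to the synthetic face image at learning time and at recognition time, the lip areas fed to feature extraction are cropped consistently in both phases.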

Next, the results of evaluations carried out to confirm the effects of the present invention will be described.
(Comparative Example 1)
A conventional lip reading method was evaluated using CUAVE, a database published for lip reading. CUAVE contains scenes in which 36 registered speakers (19 male, 17 female) each utter the ten digits 0 to 9 in English. As the conventional lip reading method, the third learning step and the fourth recognition step of the facial synthesis lip reading method of the present invention were omitted, and in the subsequent steps the learning-time face image and the recognition-time face image were used as they were in place of the learning-time synthetic face image and the recognition-time synthetic face image. The existing leave-one-person-out method was used for evaluation. That is, one of the 36 registered speakers was selected as the recognition target speaker, the remaining 35 registered speakers were used as learning target speakers for learning and evaluation, and this was repeated with a different recognition target speaker each time, for 36 recognition experiments (one set). The average recognition accuracy was 73%.
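The leave-one-person-out procedure described above can be sketched as follows. This is a minimal illustration, assuming a caller-supplied `evaluate` function (hypothetical, not part of this specification) that trains on the learning target speakers and returns the recognition accuracy for the held-out recognition target speaker.

```python
from typing import Callable, List, Sequence


def leave_one_person_out(
    speakers: Sequence[str],
    evaluate: Callable[[List[str], str], float],   # hypothetical: (learning speakers, target speaker) -> accuracy
) -> float:
    """Run one set of experiments (one per held-out speaker) and return the average accuracy."""
    accuracies = []
    for target in speakers:
        # Every speaker except the recognition target is used for learning.
        learners = [s for s in speakers if s != target]
        accuracies.append(evaluate(learners, target))
    return sum(accuracies) / len(accuracies)
```

With the 36 registered speakers of CUAVE, the loop runs 36 times with 35 learning target speakers each, which corresponds to one set of recognition experiments.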

(Example 1)
Using the same CUAVE database as in Comparative Example 1, the facial synthesis lip reading method of the present invention was used as the lip reading method. First, to select the face image of the specific speaker used in the third learning step and the fourth recognition step, one frame image was taken from the speech scene of each registered speaker to prepare face images (still images) of the 36 registered speakers. Each registered speaker's face image was then selected in turn as the face image of the specific speaker, and for each specific speaker, 36 recognition experiments (one set) were performed by the same leave-one-person-out method as in Comparative Example 1. Of the 36 sets of recognition experiments for the 36 specific speakers, 35 sets (97%) showed higher recognition accuracy than the recognition experiments of Comparative Example 1, and the average recognition accuracy was 83%. In addition, for the one specific speaker with the highest recognition accuracy among the 36 sets, the results of the 36 recognition experiments (one set) were compared with those of Comparative Example 1 for each recognition target speaker. Recognition accuracy was higher than in Comparative Example 1 in 27 of the 36 recognition experiments, that is, for 27 of the 36 recognition target speakers (75%).
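The structure of this protocol and the reported percentages can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and because the per-set accuracies are not listed in this description, `share_improved` takes them as input rather than reproducing them.

```python
from typing import Sequence


def count_experiments(num_speakers: int) -> int:
    """One leave-one-person-out set (num_speakers runs) per candidate specific speaker."""
    return num_speakers * num_speakers


def share_improved(set_accuracies: Sequence[float], baseline: float) -> float:
    """Percentage of sets whose accuracy exceeds the baseline accuracy."""
    improved = sum(1 for a in set_accuracies if a > baseline)
    return 100.0 * improved / len(set_accuracies)


print(count_experiments(36))    # 1296 individual recognition experiments for CUAVE
print(round(100 * 35 / 36))     # 97, the share of improved sets
print(round(100 * 27 / 36))     # 75, the share of improved recognition target speakers
```

The total of 1296 runs follows from performing one set of 36 experiments for each of the 36 candidate specific speakers.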

(Comparative Example 2)
OuluVS was used as the database in place of CUAVE. OuluVS contains scenes in which 20 registered speakers (17 male, 3 female) each utter 10 greeting phrases in English. The evaluation was performed in the same manner as in Comparative Example 1 except for the number of registered speakers. That is, one of the 20 registered speakers was selected as the recognition target speaker, the remaining 19 registered speakers were used as learning target speakers for learning and evaluation, and this was repeated with a different recognition target speaker each time, for 20 recognition experiments (one set). The average recognition accuracy was 81%.

(Example 2)
The evaluation was performed in the same manner as in Example 1 except that the same OuluVS database as in Comparative Example 2 was used. That is, each of the prepared face images (still images) of the 20 registered speakers was selected in turn as the face image of the specific speaker, and for each specific speaker, 20 recognition experiments (one set) were performed by the leave-one-person-out method. Of the 20 sets of recognition experiments for the 20 specific speakers, 17 sets (85%) showed higher recognition accuracy than the recognition experiments of Comparative Example 2, and the average recognition accuracy was 87%. In addition, for the one specific speaker with the highest recognition accuracy among the 20 sets, the results of the 20 recognition experiments (one set) were compared with those of Comparative Example 2 for each recognition target speaker. Recognition accuracy was higher than in Comparative Example 2 in 12 of the 20 recognition experiments, that is, for 12 of the 20 recognition target speakers (60%).

From the above, it was found that using the facial synthesis lip reading method of the present invention improves lip reading recognition accuracy by about 6 to 7% on average over the conventional lip reading method, confirming the effectiveness of the facial synthesis lip reading method of the present invention. In Examples 1 and 2, the recognition experiments were performed by converting all face images of the learning target speakers and the recognition target speakers into the face image of a single specific speaker. However, even when there are multiple specific speakers, recognition accuracy is expected to improve as long as the number of specific speakers is at least smaller than the number of learning target speakers.

The present invention has been described above with reference to an embodiment, but the present invention is in no way limited to the configuration described in the above embodiment, and also includes other embodiments and modifications conceivable within the scope of the matters described in the claims.
For example, in the present embodiment, a gated recurrent unit (GRU), a type of deep learning, is used for the machine learning, but the machine learning algorithm may be selected as appropriate.
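For reference, a single GRU update over a sequence of feature vectors can be sketched in plain Python as follows. This is a generic textbook formulation offered as an illustration, not the network of the embodiment: the layer sizes, the weights, and the assumption that each input vector holds the features of one frame are all introduced here. The blend uses the convention h' = (1 - z) * h + z * candidate; some libraries swap the roles of z and 1 - z.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[Vector]
# (Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh): input weights, recurrent weights, bias per gate
Params = Tuple[Matrix, Matrix, Vector, Matrix, Matrix, Vector, Matrix, Matrix, Vector]


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def _affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b row by row."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


def gru_step(x: Vector, h: Vector, params: Params) -> Vector:
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = [_sigmoid(v) for v in _affine(Wz, x, Uz, h, bz)]   # update gate
    r = [_sigmoid(v) for v in _affine(Wr, x, Ur, h, br)]   # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]            # reset gate applied to the previous state
    candidate = [math.tanh(v) for v in _affine(Wh, x, Uh, gated_h, bh)]
    # Blend the previous state and the candidate state with the update gate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def run_gru(sequence: Sequence[Vector], h0: Vector, params: Params) -> Vector:
    """Feed a sequence of per-frame feature vectors and return the final hidden state."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, params)
    return h
```

In a lip reading setting, the final hidden state would typically be passed to a classification layer over the candidate utterances; that layer is omitted here.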

10: facial synthesis lip reading device, 11: imaging means, 13: image acquisition unit, 14: image processing unit, 15: learning processing unit, 16: lip reading database, 17: recognition processing unit, 18: computer, 19: recognition result output unit, 20: display, 21: speaker, 22: face detection means, 23: face synthesis means, 24: lip area extraction means, 25: feature extraction means

Claims

an image acquisition unit that loads a learning target image in which a speech scene of a learning target speaker is recorded during learning, and loads a recognition target image in which a speech scene of the recognition target speaker is recorded during recognition, an image processing unit that performs image processing on the learning target image and the recognition target image loaded into the image acquisition unit to extract learning target data and recognition target data, respectively, a learning processing unit that performs machine learning of lip reading based on the learning target data during learning and constructs a learning model, a lip reading database that stores the learning model, and a recognition processing unit that infers the speech content of the recognition target speaker by machine learning from the recognition target data and the learning model stored in the lip reading database during recognition,
said image processing unit comprises face detection means for detecting a learning-time face image of the learning target speaker from said learning target image and for detecting a recognition-time face image of the recognition target speaker from said recognition target image; face synthesis means for converting the learning-time face image and the recognition-time face image detected by said face detection means into a learning-time synthetic face image and a recognition-time synthetic face image, respectively, using a face image of a specific speaker; lip area extraction means for extracting a learning-time lip area and a recognition-time lip area from the learning-time synthetic face image and the recognition-time synthetic face image created by said face synthesis means, respectively; and feature extraction means for extracting learning-time lip features from the learning-time lip area as said learning target data and extracting recognition-time lip features from the recognition-time lip area as said recognition target data.

A facial synthesis lip reading device according to claim 1, characterized in that the image processing unit has a facial feature point detection means for detecting the learning-time facial feature points and the recognition-time facial feature points from the learning-time synthetic face image and the recognition-time synthetic face image, respectively, and the lip area extraction means extracts the learning-time lip area and the recognition-time lip area from the learning-time facial feature points and the recognition-time facial feature points, respectively.

A facial synthesis lip reading device according to claim 2, wherein the feature extraction means extracts, as the learning target data, in addition to the learning lip features, learning facial expression features from the learning facial feature points, and, as the recognition target data, in addition to the recognition lip features, recognition facial expression features from the recognition facial feature points.

A facial synthesis lip-reading device according to any one of claims 1 to 3, characterized in that it is provided with an imaging means for imaging the speech scenes of the learning target speaker and the recognition target speaker, and a recognition result output unit for outputting the speech content of the recognition target speaker estimated by the recognition processing unit.

A facial synthesis lip-reading device according to claim 4, characterized in that the recognition result output unit is provided with a display that displays the speech content of the speaker to be recognized that is estimated by the recognition processing unit in text and/or a speaker that outputs the speech content in audio.

A face synthesis and lip reading method using machine learning by a computer, comprising:
a first learning step of reading a learning target image in which a speech scene of a learning target speaker is recorded; a second learning step of detecting a learning-time face image of the learning target speaker from the learning target image; a third learning step of converting the learning-time face image into a learning-time synthetic face image using a face image of a specific speaker; a fourth learning step of extracting a learning-time lip area from the learning-time synthetic face image; a fifth learning step of extracting learning-time lip features from the learning-time lip area as learning target data; a sixth learning step of repeating the first learning step to the fifth learning step and constructing a learning model based on the learning target data; a seventh learning step of storing the learning model; a third recognition step of detecting a recognition-time face image of the recognition target speaker from the recognition target image; a fourth recognition step of converting the recognition-time face image into a recognition-time synthetic face image using the face image of the specific speaker; a fifth recognition step of extracting a recognition-time lip area from the recognition-time synthetic face image; a sixth recognition step of extracting recognition-time lip features from the recognition-time lip area as recognition target data; and a seventh recognition step of inferring the speech content of the recognition target speaker from the recognition target data and the learning model by machine learning.

A face synthesis lip reading method according to claim 6, characterized in that in the fourth learning step, learning facial feature points are detected from the learning synthetic face image and the learning lip area is extracted from the learning facial feature points, and in the fifth recognition step, recognition facial feature points are detected from the recognition synthetic face image and the recognition lip area is extracted from the recognition facial feature points.

A face synthesis lip reading method according to claim 7, characterized in that in the fifth learning step, in addition to the learning lip features, learning facial expression features are extracted from the learning facial feature points as the learning target data, and in the sixth recognition step, in addition to the recognition lip features, recognition facial expression features are extracted from the recognition facial feature points as the recognition target data.