JP7790111B2 - Information processing method, program, information processing device, and information processing system

Info

Publication number: JP7790111B2
Application number: JP2021193077A
Authority: JP
Inventors: 紘之長野
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2025-12-23
Anticipated expiration: 2041-11-29
Also published as: JP2023079562A

Description

本発明は、情報処理方法、プログラム、情報処理装置、情報処理システムに関する。 The present invention relates to an information processing method, a program, an information processing device, and an information processing system.

従来から、映像とともに記録されている音声を認識する音声認識において、映像を画像認識した結果に基づいて言語モデルを選択し、選択された言語モデルを用いて音声認識を行う技術が知られている。具体的には、例えば、映像の中で発話している人物の性別や年齢等の属性を画像認識結果として言語モデルを選択し、選択された言語モデルを用いて音声認識を行うことが知られている。 Conventionally, in speech recognition for recognizing audio recorded along with video, a technology has been known in which a language model is selected based on the results of image recognition of the video, and the selected language model is used to perform speech recognition. Specifically, for example, a language model is selected using attributes such as the gender and age of a person speaking in the video as the image recognition results, and speech recognition is performed using the selected language model.

しかしながら、人物の発話の様態は、同一人物であっても、発話の相手や会話を行った場面等、発話が行われたときの状況に応じて様々に変化する。このため、上述した従来の技術のように、人物の属性等に基づき言語モデルを選択する場合、状況に応じた発話の様態が言語モデルに反映されず、音声認識の精度を低下させる可能性がある。 However, even for the same person, a person's speech style can vary depending on the situation when the speech is made, such as the person they are speaking with and the setting in which the conversation takes place. For this reason, when selecting a language model based on a person's attributes, as in the conventional technology described above, the speech style according to the situation may not be reflected in the language model, which could reduce the accuracy of speech recognition.

開示の技術は、上記事情に鑑みたものであり、音声認識の精度を向上させることを目的とする。 The disclosed technology was developed in consideration of the above circumstances and aims to improve the accuracy of voice recognition.

開示の技術は、コンピュータによる情報処理方法であって、前記コンピュータが、話者の映像を含む映像データの入力を受け付けて、前記映像データが入力されると、記憶部に格納された複数の言語モデルと対応する複数の発話の様態のそれぞれについて、各発話の様態が、前記話者の発話の様態と合致する確率を示す推定情報を出力し、前記話者の発話の様態に応じた言語モデルを選択する設定がされている場合に、記憶部に格納された複数の言語モデルのうち、前記確率が最も高い発話の様態と対応する言語モデルを、前記話者の発話の様態に応じた言語モデルとして選択し、前記話者の発話の様態に応じた新たな言語モデルを生成する設定がされている場合に、前記推定情報に基づき前記話者の発話の様態に応じた新たな言語モデルを生成し、選択された前記話者の発話の様態に応じた言語モデル、又は、前記新たな言語モデルを、音声認識を行う音声認識部に提供し、選択された前記話者の発話の様態に応じた言語モデル、又は、前記新たな言語モデルを用いて前記映像データに含まれる音声データの音声認識を行った結果のテキストデータを、表示装置に表示させる、情報処理方法である。
The disclosed technology is an information processing method performed by a computer. The computer receives input of video data including video of a speaker. When the video data is input, the computer outputs, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker. When a setting is made to select a language model according to the speaker's speech style, the computer selects, from the plurality of language models stored in the storage unit, the language model corresponding to the speech style with the highest probability as the language model according to the speaker's speech style. When a setting is made to generate a new language model according to the speaker's speech style, the computer generates the new language model based on the estimation information. The computer provides the selected language model or the new language model to a speech recognition unit that performs speech recognition, and causes a display device to display text data resulting from speech recognition of the audio data included in the video data using the selected language model or the new language model.
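The two settings described above (select an existing language model, or generate a new one from the estimation information) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: each language model is reduced to a toy token-probability table, the names `select_style`, `interpolate_models`, and `provide_model` are hypothetical, and probability-weighted interpolation is only one assumed way of generating a new model from the estimation information.

```python
from typing import Dict

# Toy stand-in for a language model: token -> probability (assumption).
ToyModel = Dict[str, float]


def select_style(style_probs: Dict[str, float]) -> str:
    """Return the speech style with the highest estimated probability."""
    return max(style_probs, key=style_probs.get)


def interpolate_models(style_probs: Dict[str, float],
                       models: Dict[str, ToyModel]) -> ToyModel:
    """Hypothetical 'generate' mode: mix the per-style models,
    weighting each one by its estimated style probability."""
    total = sum(style_probs.values())
    mixed: ToyModel = {}
    for style, weight in style_probs.items():
        for token, prob in models[style].items():
            mixed[token] = mixed.get(token, 0.0) + (weight / total) * prob
    return mixed


def provide_model(style_probs: Dict[str, float],
                  models: Dict[str, ToyModel],
                  mode: str) -> ToyModel:
    """Dispatch on the configured setting ('select' or 'generate')."""
    if mode == "select":
        return models[select_style(style_probs)]
    if mode == "generate":
        return interpolate_models(style_probs, models)
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the "select" setting ignores all but the most probable style, while the "generate" setting keeps a contribution from every style in proportion to its estimated probability.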

音声認識の精度を向上させることができる。 This can improve the accuracy of voice recognition.

第一の実施形態の情報処理システムの一例を示す図である。 FIG. 1 is a diagram illustrating an example of an information processing system according to the first embodiment.
情報処理装置のハードウェア構成の一例を示す図である。 FIG. 2 is a diagram illustrating an example of the hardware configuration of the information processing device.
端末装置のハードウェア構成の一例を示す図である。 FIG. 3 is a diagram illustrating an example of the hardware configuration of the terminal device.
第一の実施形態の情報処理装置の機能を説明する図である。 FIG. 4 is a diagram illustrating the functions of the information processing device according to the first embodiment.
第一の実施形態の端末装置の機能構成を説明する図である。 FIG. 5 is a diagram illustrating the functional configuration of the terminal device according to the first embodiment.
第一の実施形態の情報処理装置の処理を説明するフローチャートである。 FIG. 6 is a flowchart illustrating the processing of the information processing device according to the first embodiment.
第一の実施形態の情報処理装置の処理を説明する第一の図である。 FIG. 7 is a first diagram illustrating the processing of the information processing device according to the first embodiment.
第一の実施形態の情報処理装置の処理を説明する第二の図である。 FIG. 8 is a second diagram illustrating the processing of the information processing device according to the first embodiment.
第二の実施形態の情報処理装置の処理を説明するフローチャートである。 FIG. 9 is a flowchart illustrating the processing of the information processing device according to the second embodiment.
第三の実施形態の情報処理装置の処理を説明するフローチャートである。 FIG. 10 is a flowchart illustrating the processing of the information processing device according to the third embodiment.
第四の実施形態のシステム構成の一例を示す図である。 FIG. 11 is a diagram illustrating an example of a system configuration according to the fourth embodiment.
第五の実施形態のシステム構成の一例を示す図である。 FIG. 12 is a diagram illustrating an example of a system configuration according to the fifth embodiment.

（第一の実施形態） (First embodiment)
以下に図面を参照して、第一の実施形態について説明する。図１は、第一の実施形態の情報処理システムの一例を示す図である。 The first embodiment will be described below with reference to the drawings. Fig. 1 is a diagram illustrating an example of an information processing system according to the first embodiment.

本実施形態の情報処理システム１００は、情報処理装置２００と、端末装置３００とを含み、情報処理装置２００と端末装置３００とは、ネットワーク等を介して接続されている。 The information processing system 100 of this embodiment includes an information processing device 200 and a terminal device 300, and the information processing device 200 and the terminal device 300 are connected via a network or the like.

本実施形態の情報処理システム１００において、情報処理装置２００は、音声認識処理部２２０を有する。つまり、本実施形態の情報処理システム１００は、音声認識システムの一例である。 In the information processing system 100 of this embodiment, the information processing device 200 has a voice recognition processing unit 220. In other words, the information processing system 100 of this embodiment is an example of a voice recognition system.

情報処理装置２００は、端末装置３００から、音声データを含む映像データの入力を受け付けると、音声認識処理部２２０により、映像データから、発話が行われた状況に対応した話者の発話の仕方を推定する。 When the information processing device 200 receives input of video data including audio data from the terminal device 300, the speech recognition processing unit 220 estimates, from the video data, the speaker's speaking style corresponding to the situation in which the speech was made.

そして、情報処理装置２００は、推定された発話の仕方と対応した言語モデルを用いて、映像データに含まれる音声データに対する音声認識を行い、音声認識の結果であるテキストデータを出力する。テキストデータは、例えば、端末装置３００に対して出力されて、端末装置３００において表示されてもよい。 Then, the information processing device 200 performs speech recognition on the audio data included in the video data using a language model corresponding to the estimated speaking style, and outputs text data that is the result of the speech recognition. The text data may be output to the terminal device 300 and displayed on the terminal device 300, for example.

話者の発話の仕方とは、言い換えれば、話者の発話における言葉遣いや話し方を含む発話の様態である。以下の説明では、話者の発話の様態を、話者の発話スタイルと表現する場合がある。 In other words, a speaker's manner of speaking is the mode of speech, including the wording and the way of talking used in the speaker's utterances. In the following description, the speaker's mode of speech may be referred to as the speaker's speech style.

本実施形態における、発話が行われた状況とは、具体的には、例えば、話者が紙面に印刷された文章やディスプレイに表示された文章を読み上げている状況、仲の良い複数の話者が会話を楽しんでいる状況、複数の話者の関係が上司と部下であり部下が上司に対して報告を行っている状況や、互いに初対面である状況等を含む。 In this embodiment, the situations in which speech occurs specifically include, for example, a situation in which a speaker is reading aloud a sentence printed on paper or a sentence displayed on a display, a situation in which multiple speakers who are close friends are enjoying a conversation, a situation in which multiple speakers are in a superior-subordinate relationship and the subordinate is reporting to the superior, and a situation in which each speaker is meeting the other for the first time.

また、話者の発話スタイルには、例えば、書き言葉による発話、話し言葉による発話、くだけた話し言葉による発話などがある。 Furthermore, a speaker's speaking style may include, for example, written speech, spoken speech, and informal speech.

書き言葉とは、文章を書く際に使われる言葉である。話し言葉とは、会話で使う言葉であり、くだけた話し言葉よりも丁寧な表現を用いた言葉である。くだけた話し言葉とは、親しみやすい言葉や日常会話的な言葉である。 Written language is the language used in writing. Spoken language is the language used in conversation, and is more polite than informal speech. Informal speech is familiar language or everyday conversational language.

例えば、発話の状況が、文章を読み上げている状況である場合、話者の発話スタイルは、文章を書く際に使われる書き言葉になる可能性が高い。また、発話の状況が、話者同士が親しい間柄という状況である場合、話者の発話スタイルは、くだけた話し言葉になる可能性が高い。また、発話の状況が、話者同士が初対面という状況である場合、話者の発話スタイルは、話し言葉になる可能性が高い。 For example, if the speech situation is one in which a text is being read aloud, the speaker's speech style is likely to be the written language used when writing text. If the speech situation is one in which the speakers are on close terms, the speaker's speech style is likely to be casual spoken language. If the speech situation is one in which the speakers are meeting for the first time, the speaker's speech style is likely to be spoken language.

本実施形態の情報処理装置２００は、例えば、映像データが示す発話の状況から、話者の発話スタイルが書き言葉を用いた発話である可能性が高いと推定された場合、書き言葉と対応する言語モデルを用いた音声認識を行う。 In this embodiment, the information processing device 200 performs speech recognition using a language model corresponding to written language when, for example, it is estimated from the speech situation shown in the video data that the speaker's speech style is likely to be written language.

本実施形態の端末装置３００は、例えば、音声データを含む映像データを情報処理装置２００に送信するものであり、スマートフォンやタブレット端末等であってよい。また、端末装置３００は、例えば、映像データを取得する撮像装置であってもよいし、撮像装置そのものであってもよい。 The terminal device 300 of this embodiment transmits, for example, video data including audio data to the information processing device 200, and may be a smartphone, tablet terminal, or the like. Furthermore, the terminal device 300 may be, for example, an imaging device that acquires video data, or may be the imaging device itself.

また、本実施形態の端末装置３００は、全天球型の撮像装置であってもよく、例えば、会話の場の中心に設置されてもよい。このように、全天球型の撮像装置を端末装置３００とすることで、会話に参加している話者全員の映像データを撮像することができる。 Furthermore, the terminal device 300 of this embodiment may be a spherical imaging device and may be installed, for example, at the center of a conversation. In this way, by using a spherical imaging device as the terminal device 300, it is possible to capture video data of all speakers participating in a conversation.

このように、本実施形態の情報処理装置２００では、映像データが示す発話の状況から、話者の発話スタイルを推定し、発話スタイルに応じた言語モデルを用いて音声認識を行う。また、本実施形態の情報処理装置２００は、音声認識の対象となる発話の前後の状況等を考慮した時系列の情報である映像データを用いて、音声認識を行う。 In this way, the information processing device 200 of this embodiment estimates the speaker's speaking style from the speech situation indicated by the video data, and performs speech recognition using a language model corresponding to the speaking style. Furthermore, the information processing device 200 of this embodiment performs speech recognition using video data, which is time-series information that takes into account the situation before and after the utterance that is the target of speech recognition.

このため、例えば、本実施形態では、同一の人物が、異なる状況で発話を行っていた場合等であっても、発話が行われた状況に適した言語モデルで音声認識を行うことができ、音声認識の精度を向上させることができる。 For this reason, in this embodiment, for example, even if the same person speaks in different situations, speech recognition can be performed using a language model appropriate to the situation in which the utterance was made, thereby improving the accuracy of speech recognition.

なお、図１の例では、情報処理システム１００に含まれる端末装置３００を１台としているが、これに限定されない。情報処理システム１００は、端末装置３００が複数有し、音声データを情報処理装置２００に送信する端末装置３００と、情報処理装置２００から出力されたテキストデータを受信する端末装置３００と、をそれぞれ別々の端末装置としてもよい。 Note that although the example of Figure 1 shows one terminal device 300 in the information processing system 100, the configuration is not limited to this. The information processing system 100 may include multiple terminal devices 300, in which case the terminal device 300 that transmits voice data to the information processing device 200 and the terminal device 300 that receives the text data output from the information processing device 200 may be separate terminal devices.

また、本実施形態では、端末装置３００から音声データを受信し、情報処理装置２００の有する表示装置にテキストデータを出力してもよい。また、本実施形態では、情報処理装置２００に対して直接音声データし、テキストデータを端末装置３００に出力してもよい。また、本実施形態では、情報処理装置２００に対して直接音声データを入力し、テキストデータを情報処理装置２００の有する表示装置に出力してもよい。 In addition, in this embodiment, voice data may be received from the terminal device 300, and text data may be output to a display device of the information processing device 200. In addition, in this embodiment, voice data may be input directly to the information processing device 200, and text data may be output to the terminal device 300. In this embodiment, voice data may be input directly to the information processing device 200, and text data may be output to a display device of the information processing device 200.

また、図１の例では、情報処理装置２００が音声認識処理部２２０を有するものとしたが、これに限定されない。音声認識処理部２２０は、複数の情報処理装置２００で実現されてもよい。 Furthermore, in the example of FIG. 1, the information processing device 200 has the voice recognition processing unit 220, but this is not limited to this. The voice recognition processing unit 220 may be realized by multiple information processing devices 200.

次に、図２、図３を参照して、情報処理装置２００と端末装置３００のハードウェア構成について説明する。図２は、情報処理装置のハードウェア構成の一例を示す図である。 Next, the hardware configuration of the information processing device 200 and the terminal device 300 will be described with reference to Figures 2 and 3. Figure 2 is a diagram showing an example of the hardware configuration of an information processing device.

情報処理装置２００は、コンピュータによって構築されており、図２に示されているように、ＣＰＵ２０１、ＲＯＭ２０２、ＲＡＭ２０３、ＨＤ２０４、ＨＤＤ(Hard Disk Drive)コントローラ２０５、ディスプレイ２０６、外部機器接続Ｉ／Ｆ(Interface)２０８、ネットワークＩ／Ｆ２０９、バスラインＢ１、キーボード２１１、ポインティングデバイス２１２、ＤＶＤ－ＲＷ(Digital Versatile Disk Rewritable)ドライブ２１４、メディアＩ／Ｆ２１６を備えている。 The information processing device 200 is constructed by a computer, and as shown in FIG. 2, includes a CPU 201, ROM 202, RAM 203, HD 204, HDD (Hard Disk Drive) controller 205, display 206, external device connection I/F (Interface) 208, network I/F 209, bus line B1, keyboard 211, pointing device 212, DVD-RW (Digital Versatile Disk Rewritable) drive 214, and media I/F 216.

これらのうち、ＣＰＵ２０１は、情報処理装置２００全体の動作を制御する。ＲＯＭ２０２は、ＩＰＬ等のＣＰＵ２０１の駆動に用いられるプログラムを記憶する。ＲＡＭ２０３は、ＣＰＵ２０１のワークエリアとして使用される。ＨＤ２０４は、プログラム等の各種データを記憶する。ＨＤＤコントローラ２０５は、ＣＰＵ２０１の制御にしたがってＨＤ２０４に対する各種データの読み出し又は書き込みを制御する。 Of these, the CPU 201 controls the overall operation of the information processing device 200. The ROM 202 stores programs used to drive the CPU 201, such as IPL. The RAM 203 is used as a work area for the CPU 201. The HD 204 stores various data such as programs. The HDD controller 205 controls the reading and writing of various data from and to the HD 204 under the control of the CPU 201.

ディスプレイ（表示装置）２０６は、カーソル、メニュー、ウィンドウ、文字、又は画像などの各種情報を表示する。外部機器接続Ｉ／Ｆ２０８は、各種の外部機器を接続するためのインターフェースである。この場合の外部機器は、例えば、ＵＳＢ(Universal Serial Bus)メモリやプリンタ等である。ネットワークＩ／Ｆ２０９は、通信ネットワークを利用してデータ通信をするためのインターフェースである。バスラインＢ１は、図２に示されているＣＰＵ２０１等の各構成要素を電気的に接続するためのアドレスバスやデータバス等である。 The display (display device) 206 displays various types of information such as a cursor, menu, window, text, or image. The external device connection I/F 208 is an interface for connecting various external devices. In this case, external devices include, for example, USB (Universal Serial Bus) memory and printers. The network I/F 209 is an interface for data communication using a communication network. The bus line B1 is an address bus, data bus, etc. for electrically connecting each component such as the CPU 201 shown in Figure 2.

また、キーボード２１１は、文字、数値、各種指示などの入力のための複数のキーを備えた入力手段の一種である。ポインティングデバイス２１２は、各種指示の選択や実行、処理対象の選択、カーソルの移動などを行う入力手段の一種である。ＤＶＤ－ＲＷドライブ２１４は、着脱可能な記録媒体の一例としてのＤＶＤ－ＲＷ２１３に対する各種データの読み出し又は書き込みを制御する。なお、ＤＶＤ－ＲＷに限らず、ＤＶＤ－Ｒ等であってもよい。メディアＩ／Ｆ２１６は、フラッシュメモリ等の記録メディア２１５に対するデータの読み出し又は書き込み（記憶）を制御する。 The keyboard 211 is a type of input device equipped with multiple keys for inputting characters, numbers, various instructions, etc. The pointing device 212 is a type of input device for selecting and executing various instructions, selecting processing targets, moving the cursor, etc. The DVD-RW drive 214 controls the reading and writing of various data from a DVD-RW 213, which is an example of a removable recording medium. Note that this is not limited to a DVD-RW, and may be a DVD-R or the like. The media I/F 216 controls the reading and writing (storing) of data from a recording medium 215, such as a flash memory.

図３は、端末装置のハードウェア構成の一例を示す図である。本実施形態の端末装置３００は、ＣＰＵ３０１、ＲＯＭ３０２、ＲＡＭ３０３、ＥＥＰＲＯＭ３０４、ＣＭＯＳセンサ３０５、撮像素子Ｉ／Ｆ３０６、加速度・方位センサ３０７、メディアＩ／Ｆ３０９、ＧＰＳ受信部３１１を備えている。 Figure 3 shows an example of the hardware configuration of a terminal device. The terminal device 300 of this embodiment includes a CPU 301, ROM 302, RAM 303, EEPROM 304, a CMOS sensor 305, an image sensor I/F 306, an acceleration and orientation sensor 307, a media I/F 309, and a GPS receiver 311.

これらのうち、ＣＰＵ３０１は、端末装置３００全体の動作を制御する演算処理装置である。ＲＯＭ３０２は、ＣＰＵ３０１やＩＰＬ等のＣＰＵ３０１の駆動に用いられるプログラムを記憶する。ＲＡＭ３０３は、ＣＰＵ３０１のワークエリアとして使用される。ＥＥＰＲＯＭ３０４は、ＣＰＵ３０１の制御にしたがって、スマートフォン用プログラム等の各種データの読み出し又は書き込みを行う。ＲＯＭ３０２、ＲＡＭ３０３、ＥＥＰＲＯＭ３０４は、端末装置３００の記憶装置の一例である。 Of these, the CPU 301 is an arithmetic processing unit that controls the overall operation of the terminal device 300. The ROM 302 stores programs used to drive the CPU 301, such as an IPL. The RAM 303 is used as a work area for the CPU 301. The EEPROM 304 reads and writes various data, such as smartphone programs, under the control of the CPU 301. The ROM 302, RAM 303, and EEPROM 304 are examples of storage devices of the terminal device 300.

ＣＭＯＳ(Complementary Metal Oxide Semiconductor)センサ３０５は、ＣＰＵ３０１の制御に従って被写体（主に自画像）を撮像して動画データを得る内蔵型の撮像手段の一種である。なお、ＣＭＯＳセンサではなく、ＣＣＤ(Charge Coupled Device)センサ等の撮像手段であってもよい。 The CMOS (Complementary Metal Oxide Semiconductor) sensor 305 is a type of built-in imaging device that captures images of a subject (mainly a self-portrait) under the control of the CPU 301 to obtain video data. Note that instead of a CMOS sensor, an imaging device such as a CCD (Charge Coupled Device) sensor may also be used.

撮像素子Ｉ／Ｆ３０６は、ＣＭＯＳセンサ３０５の駆動を制御する回路である。加速度・方位センサ３０７は、地磁気を検知する電子磁気コンパスやジャイロコンパス、加速度センサ等の各種センサである。メディアＩ／Ｆ３０９は、フラッシュメモリ等の記録メディア３０８に対するデータの読み出し又は書き込み（記憶）を制御する。ＧＰＳ受信部３１１は、ＧＰＳ衛星からＧＰＳ信号を受信する。 The imaging element I/F 306 is a circuit that controls the operation of the CMOS sensor 305. The acceleration/azimuth sensor 307 is a variety of sensors, such as an electronic magnetic compass that detects geomagnetism, a gyrocompass, and an acceleration sensor. The media I/F 309 controls the reading and writing (storage) of data from the recording media 308, such as flash memory. The GPS receiver 311 receives GPS signals from GPS satellites.

また、端末装置３００は、遠距離通信回路３１２、遠距離通信回路３１２のアンテナ３１２ａ、ＣＭＯＳセンサ３１３、撮像素子Ｉ／Ｆ３１４、マイク（集音装置）３１５、スピーカ３１６、音入出力Ｉ／Ｆ３１７、ディスプレイ（表示装置）３１８、外部機器接続Ｉ／Ｆ(Interface)３１９、近距離通信回路３２０、近距離通信回路３２０のアンテナ３２０ａ、及びタッチパネル３２１を備えている。 The terminal device 300 also includes a long-distance communication circuit 312, an antenna 312a for the long-distance communication circuit 312, a CMOS sensor 313, an image sensor I/F 314, a microphone (sound collection device) 315, a speaker 316, a sound input/output I/F 317, a display (display device) 318, an external device connection I/F (Interface) 319, a short-distance communication circuit 320, an antenna 320a for the short-distance communication circuit 320, and a touch panel 321.

これらのうち、遠距離通信回路３１２は、通信ネットワークを介して、他の機器と通信する回路である。ＣＭＯＳセンサ３１３は、ＣＰＵ３０１の制御に従って被写体を撮像して動画データを得る内蔵型の撮像手段の一種である。撮像素子Ｉ／Ｆ３１４は、ＣＭＯＳセンサ３１３の駆動を制御する回路である。マイク３１５は、音を電気信号に変える内蔵型の回路である。スピーカ３１６は、電気信号を物理振動に変えて音楽や音声などの音を生み出す内蔵型の回路である。音入出力Ｉ／Ｆ３１７は、ＣＰＵ３０１の制御に従ってマイク３１５及びスピーカ３１６との間で音信号の入出力を処理する回路である。 Of these, the long-distance communication circuit 312 is a circuit that communicates with other devices via a communication network. The CMOS sensor 313 is a type of built-in imaging means that captures an image of a subject and obtains video data under the control of the CPU 301. The image sensor I/F 314 is a circuit that controls the driving of the CMOS sensor 313. The microphone 315 is a built-in circuit that converts sound into an electrical signal. The speaker 316 is a built-in circuit that converts an electrical signal into physical vibrations to generate sound such as music or voice. The sound input/output I/F 317 is a circuit that processes the input and output of sound signals between the microphone 315 and the speaker 316 under the control of the CPU 301.

なお、ＣＭＯＳセンサ３０５、ＣＭＯＳセンサ３１３は、何れか一方がディスプレイ３１８の近傍に配置され、他方が端末装置３００の背面に配置されてもよい。 Note that one of the CMOS sensor 305 and the CMOS sensor 313 may be disposed near the display 318, and the other may be disposed on the back surface of the terminal device 300.

ディスプレイ３１８は、被写体の画像や各種アイコン等を表示する液晶や有機ＥＬ(Electro Luminescence)などの表示手段の一種である。外部機器接続Ｉ／Ｆ３１９は、各種の外部機器を接続するためのインターフェースである。近距離通信回路３２０は、ＮＦＣ(Near Field Communication)やＢｌｕｅｔｏｏｔｈ（登録商標）等の通信回路である。タッチパネル３２１は、利用者がディスプレイ３１８を押下することで、端末装置３００を操作する入力手段の一種である。ディスプレイ３１８は、端末装置３００の有する表示部の一例である。 The display 318 is a type of display means, such as a liquid crystal display or organic electroluminescence (EL) display, that displays images of subjects and various icons. The external device connection I/F 319 is an interface for connecting various external devices. The short-range communication circuit 320 is a communication circuit such as NFC (Near Field Communication) or Bluetooth (registered trademark). The touch panel 321 is a type of input means that allows the user to operate the terminal device 300 by pressing the display 318. The display 318 is an example of a display unit that the terminal device 300 has.

なお、本実施形態では、端末装置３００は、スマートフォンやタブレット端末としたが、これに限定されない。端末装置３００は、図２に示す情報処理装置２００と同様のハードウェア構成を有する一般的なコンピュータであってもよい。 In this embodiment, the terminal device 300 is a smartphone or tablet terminal, but is not limited to this. The terminal device 300 may also be a general computer having a hardware configuration similar to that of the information processing device 200 shown in FIG. 2.

次に、図４を参照して、本実施形態の情報処理装置２００の機能について説明する。図４は、第一の実施形態の情報処理装置の機能を説明する図である。 Next, the functions of the information processing device 200 of this embodiment will be described with reference to Figure 4. Figure 4 is a diagram explaining the functions of the information processing device of the first embodiment.

本実施形態の情報処理装置２００は、音声認識処理部２２０、入力受付部２３０、発話区間検出部２３１、音声取得部２３２、発話スタイル推定部２３３を有する。 The information processing device 200 of this embodiment has a speech recognition processing unit 220, an input receiving unit 230, a speech period detection unit 231, a speech acquisition unit 232, and a speech style estimation unit 233.

また、音声認識処理部２２０は、音声認識部２２１、言語モデル提供部２２２、言語モデル記憶部２２３、出力テキスト生成部２２４を含む。 The speech recognition processing unit 220 also includes a speech recognition unit 221, a language model providing unit 222, a language model storage unit 223, and an output text generation unit 224.

言語モデル記憶部２２３は、複数の言語モデルが格納されている。具体的には、言語モデル記憶部２２３には、第一の言語モデル２４１、第二の言語モデル２４２、第三の言語モデル２４３が格納されている。 The language model storage unit 223 stores multiple language models. Specifically, the language model storage unit 223 stores a first language model 241, a second language model 242, and a third language model 243.

第一の言語モデル２４１は、書き言葉に対応した言語モデルであり、第二の言語モデル２４２は、話し言葉に対応した言語モデルであり、第三の言語モデル２４３は、くだけた話し言葉に対応した言語モデルである。 The first language model 241 is a language model corresponding to written language, the second language model 242 is a language model corresponding to spoken language, and the third language model 243 is a language model corresponding to casual spoken language.

本実施形態の第一の言語モデル２４１、第二の言語モデル２４２、第三の言語モデル２４３のそれぞれは、音声認識部２２１が出力する音声認識の結果を入力とし、次に現れる音素、もしくは文字、もしくは音声認識部２２１の処理単位毎の単語の出現確率を言語モデル提供部２２２に対して出力する。 In this embodiment, the first language model 241, second language model 242, and third language model 243 each receive the speech recognition results output by the speech recognition unit 221 as input, and output to the language model providing unit 222 the probability of the next phoneme, character, or word appearing for each processing unit of the speech recognition unit 221.
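As a rough illustration of the interface described above (recognition result so far in, probability of the next processing unit out), the following Python sketch uses a toy bigram model. The bigram approach, the class name `BigramModel`, and the method `next_probs` are assumptions made for illustration; the source does not state what kind of language model is used.

```python
from collections import Counter, defaultdict
from typing import Dict, List


class BigramModel:
    """Toy language model: predicts the next unit from the previous one."""

    def __init__(self, corpus: List[List[str]]) -> None:
        # counts[prev][nxt] = how often `nxt` followed `prev` in the corpus.
        self.counts = defaultdict(Counter)
        for sentence in corpus:
            tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def next_probs(self, history: List[str]) -> Dict[str, float]:
        """Return P(next unit | last recognized unit) as a dict."""
        prev = history[-1] if history else "<s>"
        followers = self.counts.get(prev)
        if not followers:
            return {}  # unseen context: no prediction in this toy model
        total = sum(followers.values())
        return {unit: n / total for unit, n in followers.items()}
```

One such model could be trained per speech style (written, spoken, casual spoken), each on a corpus in that style, so that the same recognition history yields different next-unit probabilities depending on the style.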

入力受付部２３０は、外部から送信される映像データの入力を受け付ける。なお、本実施形態の映像データは、音声データと動画データとを含む。 The input receiving unit 230 receives input of video data transmitted from an external device. Note that in this embodiment, the video data includes audio data and moving-image data.

発話区間検出部２３１は、入力された映像データに含まれる音声データに基づき、映像データにおいて、発話が行われている区間を検出する。また、発話区間検出部２３１は、検出した発話区間毎の音声データを音声取得部２３２に出力し、動画データを発話スタイル推定部２３３に出力する。 The speech section detection unit 231 detects sections of the video data in which speech is taking place, based on the audio data included in the input video data. The speech section detection unit 231 also outputs the audio data for each detected speech section to the audio acquisition unit 232, and outputs the moving-image data to the speech style estimation unit 233.
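The source does not say how speech sections are detected from the audio data. As one assumed approach, the Python sketch below thresholds per-frame energy and returns contiguous runs of active frames; the function name `detect_speech_sections` and its parameters are hypothetical.

```python
from typing import List, Tuple


def detect_speech_sections(frame_energies: List[float],
                           threshold: float,
                           min_frames: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for runs of
    frames whose energy is at or above the threshold."""
    sections: List[Tuple[int, int]] = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i  # a new speech section begins here
        elif start is not None:
            if i - start >= min_frames:
                sections.append((start, i))
            start = None
    # Close a section that runs to the end of the recording.
    if start is not None and len(frame_energies) - start >= min_frames:
        sections.append((start, len(frame_energies)))
    return sections
```

The same section boundaries could then be used to cut both the audio passed on for recognition and the moving images passed on for speech style estimation, so that each estimate lines up with the utterance it applies to.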

音声取得部２３２は、発話区間検出部２３１から出力された音声データを取得し、音声データから音声特徴量を抽出して、音声認識処理部２２０の音声認識部２２１に対して出力する。つまり、音声取得部２３２は、発話区間とされた所定期間において取得された音声データを音声認識部２２１に対して出力する。 The speech acquisition unit 232 acquires the speech data output from the speech period detection unit 231, extracts speech features from the speech data, and outputs them to the speech recognition unit 221 of the speech recognition processing unit 220. In other words, the speech acquisition unit 232 outputs the speech data acquired during a predetermined period designated as a speech period to the speech recognition unit 221.

音声特徴量としてはＭＦＣＣが知られているが、ＬＰＣ（Linear Predictive Coding）、ＦＢＡＮＫ（Log Mel-Filterbank Coefficients）等を使用してよい。 MFCC is a well-known speech feature, but LPC (Linear Predictive Coding), FBANK (Log Mel-Filterbank Coefficients), etc. may also be used.
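For readers unfamiliar with these features, the following is a minimal NumPy sketch of log mel-filterbank (FBANK) extraction. The frame length, hop size, FFT size, and filter count are common defaults chosen here as assumptions, and the function names are hypothetical; the source does not specify any of these parameters.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising slope
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def log_mel_features(signal, sample_rate=16000, frame_len=400, hop=160,
                     n_fft=512, n_filters=40):
    """Return an array of shape (n_frames, n_filters) of log mel energies.
    The signal must contain at least frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(energies + 1e-10)  # small offset avoids log(0)
```

MFCCs are conventionally obtained by applying a discrete cosine transform to features of this kind, which is why FBANK is often treated as an interchangeable front end.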

発話スタイル推定部２３３は、発話区間検出部２３１から出力された動画データから、話者による発話が行われた状況を判別し、発話の状況に基づき話者の発話スタイルを推定する。そして、発話スタイル推定部２３３は、推定結果を示す発話スタイル推定情報を生成し、音声認識処理部２２０の言語モデル提供部２２２に対し、発話スタイル推定情報を出力する。つまり、発話スタイル推定部２３３は、発話区間とされた所定期間毎に、発話スタイル推定情報を生成し、言語モデル提供部２２２に対して出力する。発話スタイル推定情報の詳細は後述する。 The speech style estimation unit 233 determines the circumstances under which the speaker spoke from the video data output from the speech section detection unit 231, and estimates the speaker's speech style based on the circumstances of the speech. The speech style estimation unit 233 then generates speech style estimation information indicating the estimation results, and outputs the speech style estimation information to the language model provision unit 222 of the speech recognition processing unit 220. In other words, the speech style estimation unit 233 generates speech style estimation information for each predetermined period of time that is considered to be a speech section, and outputs it to the language model provision unit 222. Details of the speech style estimation information will be described later.

音声認識部２２１は、音声取得部２３２から入力された音声特徴量と、言語モデル提供部２２２によって選択された言語モデルとに基づき、音声認識を行い、認識結果を出力する。具体的には、音声認識部２２１は、発話区間毎に、音声認識を行った結果であるテキストデータを、出力テキスト生成部２２４と、言語モデル記憶部２２３に格納された各言語モデルとに対し、順次出力する。 The speech recognition unit 221 performs speech recognition based on the speech features input from the speech acquisition unit 232 and the language model selected by the language model provision unit 222, and outputs the recognition results. Specifically, the speech recognition unit 221 sequentially outputs text data, which is the result of speech recognition for each utterance section, to the output text generation unit 224 and each language model stored in the language model storage unit 223.

本実施形態の音声認識部２２１における音声認識の処理単位は、音素、文字、単語等であってよい。また、音声認識部２２１における音声認識の処理単位と、言語モデル記憶部２２３に格納された第一の言語モデル２４１、第二の言語モデル２４２、第三の言語モデル２４３の処理単位は同じものである。 The processing unit of the speech recognition in the speech recognition unit 221 of this embodiment may be a phoneme, a character, a word, etc. The processing unit of the speech recognition in the speech recognition unit 221 is the same as the processing units of the first language model 241, the second language model 242, and the third language model 243 stored in the language model storage unit 223.

言語モデル提供部２２２は、発話スタイル推定部２３３から入力された発話スタイル推定情報に基づき、言語モデル記憶部２２３に格納された複数の言語モデルから、話者の発話スタイルと対応する言語モデルを選択する。そして、言語モデル提供部２２２は、音声認識部２２１に選択した言語モデルを提供する。 The language model providing unit 222 selects a language model corresponding to the speaker's speech style from among the multiple language models stored in the language model storage unit 223, based on the speech style estimation information input from the speech style estimation unit 233. Then, the language model providing unit 222 provides the selected language model to the speech recognition unit 221.

なお、本実施形態では、言語モデル提供部２２２に対して、言語モデル選択信号を入力することで、予め提供する言語モデルを特定しておくことができる。具体的には、例えば、映像データに登場する話者の発話スタイルが予め特定されている場合等には、事前に言語モデル選択信号によって、音声認識部２２１に提供する言語モデルを特定しておいてもよい。言語モデル選択信号は、例えば、情報処理システム１００の利用者等によって予め入力されていてもよい。 In this embodiment, the language model to be provided can be specified in advance by inputting a language model selection signal to the language model providing unit 222. Specifically, for example, if the speaking style of a speaker appearing in the video data has been specified in advance, the language model to be provided to the speech recognition unit 221 may be specified in advance by the language model selection signal. The language model selection signal may be input in advance by, for example, a user of the information processing system 100.

このようにすることで、発話スタイル推定部２３３の推定結果に影響されることなく、特定された言語モデルを用いた音声認識を行うことができる。 By doing this, speech recognition can be performed using the identified language model without being affected by the estimation results of the speech style estimation unit 233.
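The override behaviour of the language model selection signal can be sketched as follows. This is a simplified illustration; the function name `choose_model_name`, the style keys, and the use of `None` to mean "no selection signal was input" are all assumptions.

```python
from typing import Dict, Optional


def choose_model_name(style_probs: Dict[str, float],
                      selection_signal: Optional[str] = None) -> str:
    """Pick which stored language model to provide to the recognizer.

    If a selection signal was input in advance, it takes precedence and the
    estimation result is ignored; otherwise the most probable style wins.
    """
    if selection_signal is not None:
        if selection_signal not in style_probs:
            raise KeyError(f"no language model for style: {selection_signal}")
        return selection_signal
    return max(style_probs, key=style_probs.get)
```

For instance, with estimated probabilities favouring the spoken style, passing `selection_signal="written"` would still return `"written"`, which matches the case where the speaker's style is known beforehand.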

出力テキスト生成部２２４は、音声認識部２２１から出力される音素、文字、単語といった、処理単位の音声認識の結果であるテキストデータを時系列につなげて、出力用テキストデータを生成する。そして、出力テキスト生成部２２４は、生成した出力用テキストデータを端末装置３００に対して出力する。 The output text generation unit 224 generates output text data by chronologically concatenating text data that is the result of speech recognition for processing units, such as phonemes, characters, and words, output from the speech recognition unit 221. The output text generation unit 224 then outputs the generated output text data to the terminal device 300.
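A minimal sketch of this concatenation step is shown below, assuming each recognition result arrives as a `(start_time, text)` pair; that representation and the name `build_output_text` are hypothetical.

```python
from typing import List, Tuple


def build_output_text(units: List[Tuple[float, str]],
                      separator: str = "") -> str:
    """Join per-unit recognition results in chronological order.

    The default separator is empty because Japanese text is written
    without spaces; a space could be passed for word-level English output.
    """
    ordered = sorted(units, key=lambda unit: unit[0])
    return separator.join(text for _, text in ordered)
```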

ここで、発話スタイル推定部２３３について説明する。本実施形態の発話スタイル推定部２３３は、動画データを入力データとし、動画中の話者の発話スタイル（話者の発話における言葉遣い）を正解データとする学習データのペアにより事前に機械学習されたものである。 Here, we will explain the speaking style estimation unit 233. The speaking style estimation unit 233 of this embodiment is machine-learned in advance using pairs of training data in which video data is input data and the speaking style of the speaker in the video (the language used in the speaker's speech) is used as correct answer data.

言い換えれば、本実施形態の発話スタイル推定部２３３は、動画データを入力データとし、複数の発話スタイルについて、各発話スタイルが動画中の話者の発話スタイルと合致する確率を含む発話スタイル推定情報を出力データとする学習済みのモデルである。 In other words, the speech style estimation unit 233 of this embodiment is a trained model that takes video data as input data and outputs speech style estimation information for multiple speech styles, including the probability that each speech style matches the speech style of the speaker in the video.
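The shape of the speech style estimation information can be illustrated with a short sketch. Assuming the trained model produces one raw score per speech style (an assumption; the source does not describe the model's internals), a softmax turns those scores into the per-style probabilities described above. The function name and style keys are hypothetical.

```python
import math
from typing import Dict


def scores_to_style_probs(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-style scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {style: math.exp(value - peak) for style, value in scores.items()}
    total = sum(exps.values())
    return {style: e / total for style, e in exps.items()}
```

The resulting dictionary, for example with keys `"written"`, `"spoken"`, and `"casual_spoken"`, is the kind of structure a language model providing unit could consume when choosing a model.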

本実施形態において、例えば、学習データに含まれる入力データを、発表者が発表資料が投影されたスクリーンを用いたプレゼンテーションを多数の聴講者に対して行っている様子を撮像した映像データとする。この場合、映像データのうち、発表者が聴講者に向かって話している時間帯の映像データ（入力データ）に対する正解データは、話し言葉の発話スタイルとなる。また、映像データのうち、発表者がスクリーンや手元の資料を読み上げている時間帯の映像データに対する正解データは、書き言葉の発話スタイルとなる。また、映像データのうち、聴講者の中の一人が立ち上がり発話している時間帯の映像データに対する正解データは、話し言葉の発話スタイルとなる。 In this embodiment, for example, the input data included in the learning data is video data captured of a presenter giving a presentation to a large audience using a screen on which presentation materials are projected. In this case, the correct answer data for the video data (input data) during the period in which the presenter is speaking to the audience is a spoken speaking style. The correct answer data for the video data during the period in which the presenter is reading from the screen or materials in front of him/her is a written speaking style. The correct answer data for the video data during the period in which one audience member stands up and speaks is a spoken speaking style.

また、本実施形態において、例えば、学習データに含まれる入力データを、会議室等において少人数が参加している会議の様子を撮像した映像データとする。この場合、映像データのうち、発話者が資料を読み上げている時間帯の映像データに対する正解データは、書き言葉の発話スタイルとなり、畏まった雰囲気の時間帯の映像データに対する正解データは、話し言葉の発話スタイルとなる。また、映像データのうち、同僚と気軽な雰囲気で会話をしている時間帯の映像データと対応する正解データは、くだけた話し言葉の発話スタイルとなる。 In addition, in this embodiment, for example, the input data included in the learning data is video data captured of a meeting with a small number of participants in a conference room or the like. In this case, the correct answer data for video data taken during a time period in which the speaker is reading a document will be a written speaking style, and the correct answer data for video data taken during a time period in which the atmosphere is formal will be a spoken speaking style. Furthermore, the correct answer data for video data taken during a time period in which the speaker is having a casual conversation with a colleague will be a casual spoken speaking style.

また、会議の参加者に上長が含まれる場合には、会議を撮像した映像データのうち、上長が発言している時間帯の映像データと対応する正解データは、くだけた話し言葉の発話スタイルとなる。また、上長以外の参加者が上長に対して発話している時間帯の映像データと対応する正解データは、話し言葉の発話スタイルとなる。 Furthermore, if a superior is among the participants in a meeting, the correct answer data corresponding to the video data of the meeting during the time period in which the superior is speaking is a casual spoken speaking style. The correct answer data corresponding to the video data during the time period in which participants other than the superior are speaking to the superior is a spoken speaking style.

また、本実施形態では、入力データとなる映像データに登場する話者が限定的である場合（例えば、ある会社のある部署メンバーのみ）、メンバー個々を認識し、各人の発話のくせや人間関係も含めた学習がなされてもよい。このように発話スタイル推定部２３３を学習すれば、話者の発話スタイルの推定精度を向上させることができる。 Furthermore, in this embodiment, if the speakers appearing in the video data that serves as input data are limited (for example, only the members of a certain department of a certain company), individual members may be recognized and the training may also cover each person's speaking habits and interpersonal relationships. Training the speech style estimation unit 233 in this way can improve the accuracy of estimating the speaker's speaking style.

次に、本実施形態の発話スタイル推定情報について説明する。 Next, we will explain the speech style estimation information of this embodiment.

本実施形態の発話スタイル推定情報は、複数の発話スタイル毎の確率を示す情報を含む。具体的には、例えば、発話スタイル推定情報は、入力された映像データに対して、書き言葉を用いた発話スタイル（第一の発話スタイル）となる確率と、話し言葉を用いた発話スタイル（第二の発話スタイル）となる確率と、くだけた話し言葉を用いた発話スタイル（第三の発話スタイル）となる確率とを含む。 The speech style estimation information of this embodiment includes information indicating the probability of each of multiple speech styles. Specifically, for example, the speech style estimation information includes the probability that the input video data will be a speech style using written language (first speech style), a speech style using spoken language (second speech style), and a speech style using casual spoken language (third speech style).

なお、発話スタイル毎の確率とは、複数の発話スタイルのそれぞれが、動画における話者の発話スタイルと合致する確率である。複数の発話スタイルは、予め決められていてもよい。 Note that the probability for each speaking style is the probability that each of the multiple speaking styles matches the speaking style of the speaker in the video. The multiple speaking styles may be determined in advance.

発話スタイル推定情報の一例としては、例えば、「第一の発話スタイルの確率が５％、第二の発話スタイルの確率が７０％、第三の発話スタイルの確率が２５％」等がある。この場合、動画における話者の発話スタイルが、話し言葉を用いた発話スタイル（第二の発話スタイル）である可能性が最も高いことがわかる。 An example of speech style estimation information is, for example, "the probability of the first speech style is 5%, the probability of the second speech style is 70%, and the probability of the third speech style is 25%." In this case, it can be seen that the speech style of the speaker in the video is most likely to be a spoken language style (the second speech style).
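As an illustration of the example above, the speech style estimation information can be pictured as a mapping from each speech style to a probability. The following is a minimal sketch under that assumption, not the implementation described in this document; the names `SpeechStyle` and `most_likely_style` are hypothetical, and only the three probabilities (5%, 70%, 25%) come from the text.

```python
from enum import Enum


class SpeechStyle(Enum):
    """The three example speech styles named in the description."""
    WRITTEN = 1  # first speech style: written language
    SPOKEN = 2   # second speech style: spoken language
    CASUAL = 3   # third speech style: casual spoken language


def most_likely_style(estimation: dict[SpeechStyle, float]) -> SpeechStyle:
    """Return the speech style with the highest probability."""
    if not estimation:
        raise ValueError("estimation information is empty")
    return max(estimation, key=estimation.get)


# Example values from the text: 5%, 70%, 25%.
estimation = {
    SpeechStyle.WRITTEN: 0.05,
    SpeechStyle.SPOKEN: 0.70,
    SpeechStyle.CASUAL: 0.25,
}
top = most_likely_style(estimation)
```

With these values `top` is the second speech style (spoken language), which matches the reading given in the text.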

なお、図４の例では、情報処理装置２００に、入力受付部２３０、発話区間検出部２３１、音声取得部２３２、音声認識処理部２２０が設けられるものとしたが、これに限定されない。これらの各部の一部又は全部は、情報処理装置２００と通信が可能な情報処理装置２００以外の装置に設けられていてもよい。言い換えれば、情報処理装置２００は、複数の情報処理装置によって実現されてもよい。 In the example of FIG. 4, the information processing device 200 is provided with the input receiving unit 230, the speech section detection unit 231, the audio acquisition unit 232, and the speech recognition processing unit 220, but the configuration is not limited to this. Some or all of these units may be provided in a device other than the information processing device 200 that is capable of communicating with the information processing device 200. In other words, the information processing device 200 may be realized by multiple information processing devices.

次に、図５を参照して、端末装置３００の機能構成について説明する。図５は、第一の実施形態の端末装置の機能構成を説明する図である。 Next, the functional configuration of the terminal device 300 will be described with reference to Figure 5. Figure 5 is a diagram illustrating the functional configuration of the terminal device of the first embodiment.

本実施形態の端末装置３００は、映像データ取得部３３０、出力部３４０、通信部３５０を含む。 The terminal device 300 of this embodiment includes a video data acquisition unit 330, an output unit 340, and a communication unit 350.

映像データ取得部３３０は、映像データを取得する。具体的には、映像データ取得部３３０は、端末装置３００が有する撮像装置等によって撮像された映像データを取得する。 The video data acquisition unit 330 acquires video data. Specifically, the video data acquisition unit 330 acquires video data captured by an imaging device or the like included in the terminal device 300.

出力部３４０は、端末装置３００からの各種の情報の出力を行う。具体的には、出力部３４０は、ディスプレイ３１８に、情報処理装置２００から受信したテキストデータを表示させる。 The output unit 340 outputs various types of information from the terminal device 300. Specifically, the output unit 340 displays text data received from the information processing device 200 on the display 318.

通信部３５０は、端末装置３００と情報処理装置２００との通信を制御する。具体的には、通信部３５０は、端末装置３００から情報処理装置２００へ、映像データを送信し、情報処理装置２００からテキストデータを受信する。 The communication unit 350 controls communication between the terminal device 300 and the information processing device 200. Specifically, the communication unit 350 transmits video data from the terminal device 300 to the information processing device 200 and receives text data from the information processing device 200.

次に、図６を参照して、本実施形態の情報処理装置２００の処理について説明する。図６は、第一の実施形態の情報処理装置の処理を説明するフローチャートである。 Next, the processing of the information processing device 200 of this embodiment will be described with reference to Figure 6. Figure 6 is a flowchart illustrating the processing of the information processing device of the first embodiment.

本実施形態の情報処理装置２００は、入力受付部２３０により、端末装置３００から映像データの入力を受け付ける（ステップＳ６０１）。続いて、情報処理装置２００は、発話区間検出部２３１により、映像データから発話区間を検出する（ステップＳ６０２）。 In this embodiment, the information processing device 200 receives input of video data from the terminal device 300 via the input receiving unit 230 (step S601). Next, the information processing device 200 detects speech sections from the video data via the speech section detection unit 231 (step S602).

具体的には、例えば、発話区間検出部２３１は、あるタイミングからあるタイミングまでの第一の期間を第一の話者の発話区間として検出し、第一の期間に続く第二の期間を第二の話者の発話区間として検出する。 Specifically, for example, the speech section detection unit 231 detects a first period, running from one point in time to another, as a speech section of a first speaker, and detects a second period following the first period as a speech section of a second speaker.
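The per-speaker speech sections described above can be illustrated with a small sketch. It assumes that a label of the active speaker is already available for every frame, which this document does not specify; the names `SpeechSection` and `detect_sections` and the frame labels are hypothetical, and the code only groups consecutive frames of the same speaker into sections.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass(frozen=True)
class SpeechSection:
    speaker: str
    start: int  # index of the first frame of the section (inclusive)
    end: int    # index just past the last frame of the section (exclusive)


def detect_sections(frame_speakers: list[Optional[str]]) -> list[SpeechSection]:
    """Group consecutive frames with the same active speaker into sections.

    Frames labeled None (nobody speaking) do not belong to any section.
    """
    sections = []
    position = 0
    for speaker, run in groupby(frame_speakers):
        length = len(list(run))
        if speaker is not None:
            sections.append(SpeechSection(speaker, position, position + length))
        position += length
    return sections


# Hypothetical per-frame labels: P1 speaks, a pause, then P2 speaks.
frames = ["P1", "P1", "P1", None, "P2", "P2"]
sections = detect_sections(frames)
```

Here the first period (frames 0 to 2) becomes the section of the first speaker and the following period (frames 4 to 5) becomes the section of the second speaker.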

続いて、情報処理装置２００は、発話区間検出部２３１により、映像データから音声データと動画データとを抽出し、音声データを音声認識部２２１へ出力し、動画データを発話スタイル推定部２３３に出力する（ステップＳ６０３）。 Next, the information processing device 200 uses the speech section detection unit 231 to extract audio data and moving image data from the video data, outputs the audio data to the speech recognition unit 221, and outputs the moving image data to the speech style estimation unit 233 (step S603).

情報処理装置２００は、ステップＳ６０３に続いて、音声認識部２２１により、音声データから音声特徴量を抽出し（ステップＳ６０４）、後述するステップＳ６０７へ進む。 Following step S603, the information processing device 200 extracts speech features from the speech data using the speech recognition unit 221 (step S604), and proceeds to step S607, which will be described later.

また、情報処理装置２００は、ステップＳ６０３に続いて、動画データを発話スタイル推定部２３３へ入力し、発話スタイル推定情報を取得する（ステップＳ６０５）。 Following step S603, the information processing device 200 inputs the video data to the speech style estimation unit 233 and acquires speech style estimation information (step S605).

続いて、情報処理装置２００は、発話スタイル推定情報を言語モデル提供部２２２へ入力し、言語モデル記憶部２２３に格納された言語モデルから、音声認識部２２１に提供する言語モデルを選択し、音声認識部２２１へ提供する（ステップＳ６０６）。 Next, the information processing device 200 inputs the speech style estimation information to the language model providing unit 222, selects a language model to be provided to the speech recognition unit 221 from the language models stored in the language model storage unit 223, and provides it to the speech recognition unit 221 (step S606).

具体的には、言語モデル提供部２２２は、発話スタイル推定情報に含まれる言語モデル毎の確率を参照し、言語モデル記憶部２２３に格納された言語モデルのうち、確率が最も高い言語モデルを選択し、音声認識部２２１に提供する。 Specifically, the language model providing unit 222 refers to the probability of each language model included in the speech style estimation information, selects the language model with the highest probability from the language models stored in the language model storage unit 223, and provides it to the speech recognition unit 221.

続いて、情報処理装置２００は、通知された言語モデルと、音声特徴量とに基づき、音声認識部２２１で処理単位で音声認識を行い、処理単位のテキストデータを出力テキスト生成部２２４に対して出力する（ステップＳ６０７）。 Next, the information processing device 200 performs speech recognition for each processing unit in the speech recognition unit 221 based on the notified language model and speech features, and outputs the text data for each processing unit to the output text generation unit 224 (step S607).

続いて、情報処理装置２００は、出力テキスト生成部２２４により、処理単位のテキストデータをまとめた出力用テキストデータを生成し、端末装置３００に出力する（ステップＳ６０８）。 Next, the information processing device 200 uses the output text generation unit 224 to generate output text data that combines the text data of the individual processing units, and outputs the output text data to the terminal device 300 (step S608).

本実施形態では、このように、話者を撮像した動画データに基づき、話者の発話の状況に応じた発話スタイルを推定し、推定された発話スタイルと対応する言語モデルを選択して音声認識に用いる。 In this embodiment, the speaking style according to the speaker's speech situation is estimated based on video data of the speaker, and a language model corresponding to the estimated speaking style is selected and used for speech recognition.

したがって、本実施形態によれば、音声認識の対象となる発話の前後の状況等を考慮した時系列の情報（映像データ）に基づき、話者の発話スタイルを推定することができる。したがって、話者が発話を行った状況に応じた発話スタイルを音声認識に用いることができ、音声認識の精度を向上させることができる。 Accordingly, this embodiment makes it possible to estimate a speaker's speaking style based on time-series information (video data) that takes into account the circumstances before and after the utterance that is the subject of speech recognition. Therefore, a speaking style that corresponds to the circumstances in which the speaker made the utterance can be used for speech recognition, thereby improving the accuracy of speech recognition.
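The flow of Figure 6 from step S603 onward can be sketched as follows. This is only an illustrative outline under stated assumptions: the function `transcribe` and all of its parameters are hypothetical, the components passed in are stand-ins that do no real recognition, and the two branches after step S603 are run one after the other here purely for simplicity.

```python
def transcribe(sections, extract_features, estimate_style, language_models, recognize):
    """Sketch of steps S603 to S608 for speech sections that are already detected."""
    texts = []
    for audio, frames in sections:                    # S603: audio and image data per section
        features = extract_features(audio)            # S604: speech features
        estimation = estimate_style(frames)           # S605: probability per speech style
        style = max(estimation, key=estimation.get)   # S606: most probable speech style
        model = language_models[style]                #       and its language model
        texts.append(recognize(features, model))      # S607: recognition with that model
    return "\n".join(texts)                           # S608: combined output text


# Stand-in components so that the sketch runs (all hypothetical).
language_models = {"written": "LM241", "spoken": "LM242", "casual": "LM243"}
output = transcribe(
    sections=[("audio-P1", "frames-P1")],
    extract_features=lambda audio: f"features({audio})",
    estimate_style=lambda frames: {"written": 0.05, "spoken": 0.25, "casual": 0.70},
    language_models=language_models,
    recognize=lambda features, model: f"{model} decoded {features}",
)
```

Passing the components in as arguments keeps the sketch independent of any particular recognizer or estimator, in line with the note that the units may be spread over several devices.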

以下に、図７及び図８を参照して、本実施形態の情報処理装置２００の処理について、具体的に説明する。図７は、第一の実施形態の情報処理装置の処理を説明する第一の図である。 The processing of the information processing device 200 of this embodiment will be described in detail below with reference to Figures 7 and 8. Figure 7 is the first diagram illustrating the processing of the information processing device of the first embodiment.

図７において、入力受付部２３０が受け付けた映像データＧ１は、比較的親しい関係にある２人の人物Ｐ１と人物Ｐ２とが飲食をしながら会話を楽しんでいる状態を撮像したものである。 In Figure 7, the video data G1 received by the input receiving unit 230 is a captured image of two people, P1 and P2, who are relatively close, enjoying a conversation while eating and drinking.

発話区間検出部２３１は、この映像データＧ１から、人物Ｐ１の発話区間と、人物Ｐ２の発話区間とを検出し、検出された区間の音声データを音声取得部２３２に出力し、動画データを発話スタイル推定部２３３に出力する。 The speech section detection unit 231 detects speech sections of person P1 and person P2 from this video data G1, outputs the audio data of the detected sections to the audio acquisition unit 232, and outputs the video data to the speech style estimation unit 233.

発話スタイル推定部２３３は、話者が人物Ｐ１である発話区間の動画データが入力されると、発話区間における発話スタイル推定情報を出力する。 When video data of an utterance section in which person P1 is the speaker is input, the speech style estimation unit 233 outputs speech style estimation information for the utterance section.

図７の例では、人物Ｐ１と人物Ｐ２とは、比較的親しい関係である。このため、発話スタイル推定部２３３は、第三の言語モデル２４３となる確率が最も高く、次に第二の言語モデル２４２となる確率が高く、第一の言語モデル２４１となる確率が最も低くなる発話スタイル推定情報を生成し、言語モデル提供部２２２に出力する。 In the example of Figure 7, person P1 and person P2 have a relatively close relationship. Therefore, the speech style estimation unit 233 generates speech style estimation information in which the probability for the third language model 243 is the highest, the probability for the second language model 242 is the next highest, and the probability for the first language model 241 is the lowest, and outputs this information to the language model providing unit 222.

言語モデル提供部２２２は、この発話スタイル推定情報が入力されると、最も確率の高い第三の言語モデル２４３を音声認識部２２１に提供する。 When this speech style estimation information is input, the language model providing unit 222 provides the third language model 243 with the highest probability to the speech recognition unit 221.

音声認識部２２１は、人物Ｐ１の音声データの音声特徴量と、第三の言語モデル２４３とに基づき、音声認識を行った結果のテキストデータを出力する。 The speech recognition unit 221 outputs text data resulting from speech recognition based on the speech features of person P1's speech data and the third language model 243.

次に、発話スタイル推定部２３３は、話者が人物Ｐ２である発話区間の動画データが入力されると、発話区間における発話スタイル推定情報を出力する。このとき、発話スタイル推定部２３３は、人物Ｐ１のときと同様に、第三の言語モデル２４３となる確率が最も高く、次に第二の言語モデル２４２となる確率が高く、第一の言語モデル２４１となる確率が最も低くなる発話スタイル推定情報を生成し、言語モデル提供部２２２に出力する。 Next, when video data of a speech section in which the speaker is person P2 is input, the speech style estimation unit 233 outputs speech style estimation information for that speech section. At this time, just as with person P1, the speech style estimation unit 233 generates speech style estimation information in which the probability for the third language model 243 is the highest, the probability for the second language model 242 is the next highest, and the probability for the first language model 241 is the lowest, and outputs this information to the language model providing unit 222.

音声認識部２２１は、人物Ｐ２の音声データの音声特徴量と、第三の言語モデル２４３とに基づき、音声認識を行った結果のテキストデータを出力する。 The speech recognition unit 221 outputs text data resulting from speech recognition based on the speech features of person P2's speech data and the third language model 243.

続いて、情報処理装置２００は、２つのテキストデータをまとめた出力用テキストデータを生成し、端末装置３００に表示させる。 The information processing device 200 then generates output text data that combines the two pieces of text data and displays it on the terminal device 300.

図７に示す画面７１は、端末装置３００に出力用テキストデータが表示された画面の例である。 Screen 71 shown in Figure 7 is an example of a screen on which output text data is displayed on the terminal device 300.

画面７１は、表示領域７２、７３を含む。表示領域７２には、人物Ｐ１の音声データに第三の言語モデル２４３を用いて音声認識を行った結果であるテキストデータＴ１が表示されている。また、表示領域７３には、人物Ｐ２の音声データに対して第三の言語モデル２４３を用いて音声認識を行った結果であるテキストデータＴ２が表示されている。 Screen 71 includes display areas 72 and 73. Display area 72 displays text data T1, which is the result of speech recognition performed on person P1's speech data using the third language model 243. Display area 73 displays text data T2, which is the result of speech recognition performed on person P2's speech data using the third language model 243.

つまり、本実施形態では、画面７１に、音声認識を行った結果のテキストデータが、話者毎に表示されている。 In other words, in this embodiment, the text data resulting from speech recognition is displayed on screen 71 for each speaker.

画面７１において、テキストデータＴ１は、「このお菓子、おいしいね」であり、くだけた話し言葉となっている。また、画面７１において、テキストデータＴ２は、「やっぱりおいしいよね」であり、くだけた話し言葉となっている。 On screen 71, text data T1 is "This candy is delicious," which is casual speech. Also on screen 71, text data T2 is "It's delicious, after all," which is also casual speech.

図８は、第一の実施形態の情報処理装置の処理を説明する第二の図である。図８において、入力受付部２３０が受け付けた映像データＧ２は、人物Ｐ３が、ディスプレイ８５に表示された資料を表示させて、プレゼンテーションを行っている状態を撮像したものである。 Figure 8 is a second diagram illustrating the processing of the information processing device of the first embodiment. In Figure 8, the video data G2 received by the input receiving unit 230 is an image of person P3 giving a presentation while displaying materials on the display 85.

発話区間検出部２３１は、この映像データＧ２から、人物Ｐ３の発話区間を検出し、検出された区間の音声データを音声取得部２３２に出力し、動画データを発話スタイル推定部２３３に出力する。 The speech section detection unit 231 detects speech sections of person P3 from this video data G2, outputs the audio data of the detected sections to the audio acquisition unit 232, and outputs the video data to the speech style estimation unit 233.

発話スタイル推定部２３３は、話者が人物Ｐ３である発話区間の動画データが入力されると、発話区間における発話スタイル推定情報を出力する。 When video data of an utterance section in which person P3 is the speaker is input, the speech style estimation unit 233 outputs speech style estimation information for the utterance section.

図８の例では、人物Ｐ３は、ディスプレイ８５に表示された文章を読んでいる状態である。このため、発話スタイル推定部２３３は、第一の言語モデル２４１となる確率が最も高く、次に第二の言語モデル２４２となる確率が高く、第三の言語モデル２４３となる確率が最も低くなる発話スタイル推定情報を生成し、言語モデル提供部２２２に出力する。 In the example of Figure 8, person P3 is reading text displayed on display 85. Therefore, the speech style estimation unit 233 generates speech style estimation information in which the probability for the first language model 241 is the highest, the probability for the second language model 242 is the next highest, and the probability for the third language model 243 is the lowest, and outputs this information to the language model providing unit 222.

言語モデル提供部２２２は、この発話スタイル推定情報が入力されると、最も確率の高い第一の言語モデル２４１を音声認識部２２１に提供する。 When this speech style estimation information is input, the language model providing unit 222 provides the most likely first language model 241 to the speech recognition unit 221.

音声認識部２２１は、人物Ｐ３の音声データの音声特徴量と、第一の言語モデル２４１とに基づき、音声認識を行った結果のテキストデータを出力する。 The speech recognition unit 221 outputs text data resulting from speech recognition based on the speech features of person P3's speech data and the first language model 241.

図８に示す画面８１は、端末装置３００に出力用テキストデータが表示された画面の例である。 Screen 81 shown in Figure 8 is an example of a screen on which output text data is displayed on the terminal device 300.

画面８１には、人物Ｐ３の音声データに対して、第一の言語モデルを用いて音声認識を行った結果であるテキストデータＴ３が表示される。 Screen 81 displays text data T3, which is the result of speech recognition performed on person P3's voice data using the first language model.

画面８１において、テキストデータＴ３は、「この場合、このような結果になります。」であり、書き言葉となっている。 On screen 81, the text data T3 is "In this case, the result will be like this.", which is written language.

このように、本実施形態では、話者が発話をしている状況を撮像した映像データ（動画データ）に基づき、話者の発話における言葉遣いや話し方を示す発話スタイルを推定し、発話スタイルに応じた言語モデルを用いて音声認識を行う。 In this way, in this embodiment, the speech style, which indicates the wording and manner of speaking in the speaker's speech, is estimated based on video data (moving image data) capturing the situation in which the speaker is speaking, and speech recognition is performed using a language model corresponding to that speech style.

したがって、例えば、図７の人物Ｐ１と図８の人物Ｐ３とが同一人物であったとしても、発話が行われた状況に応じた発話スタイルに応じた言語モデルを用いて、音声認識を行うことができ、音声認識の精度を向上させることができる。 Therefore, for example, even if person P1 in Figure 7 and person P3 in Figure 8 are the same person, speech recognition can be performed using a language model that corresponds to the speaking style that corresponds to the situation in which the utterance was made, thereby improving the accuracy of speech recognition.

また、本実施形態では、例えば、音声認識を行った結果であるテキストデータを表示させる際に、音声認識に用いた言語モデルと対応する発話スタイルを端末装置３００に表示させてもよい。 Furthermore, in this embodiment, for example, when displaying text data resulting from speech recognition, the terminal device 300 may display the speech style corresponding to the language model used for speech recognition.

具体的には、例えば、図７に示す画面７１では、テキストデータＴ１、Ｔ２と対応付けて、「くだけた話し言葉と対応した言語モデルを用いました」等という情報を表示させてもよい。 Specifically, for example, on screen 71 shown in FIG. 7, information such as "A language model corresponding to casual speech was used" may be displayed in association with text data T1 and T2.

このような情報を表示させることで、画面７１を閲覧しているユーザに対して、発話スタイルの推定結果を提示することができる。 By displaying this information, the estimated speaking style can be presented to the user viewing screen 71.

また、本実施形態では、発話スタイル推定部２３３は、複数の発話スタイル毎の確率を含む情報を発話スタイル推定情報として出力するものとしたが、これに限定されない。発話スタイル推定部２３３は、例えば、最も確率の高い発話スタイルのみを発話スタイル推定情報として出力してもよい。 In addition, in this embodiment, the speech style estimation unit 233 outputs information including the probability of each of multiple speech styles as the speech style estimation information, but the embodiment is not limited to this. The speech style estimation unit 233 may, for example, output only the speech style with the highest probability as the speech style estimation information.

（第二の実施形態） (Second embodiment)
以下に、図面を参照して第二の実施形態について説明する。第二の実施形態は、発話スタイル推定情報と、言語モデル記憶部２２３に格納された複数の言語モデルとを用いて、音声認識部２２１に提供する言語モデルを生成する点が、第一の実施形態と相違する。以下の第二の実施形態の説明では、第一の実施形態との相違点について説明し、第一の実施形態の同様の機能構成を有するものには、第一の実施形態の説明で用いた符号と同様の部号を付与し、その説明を省略する。 The second embodiment will be described below with reference to the drawings. The second embodiment differs from the first embodiment in that a language model to be provided to the speech recognition unit 221 is generated using speech style estimation information and a plurality of language models stored in the language model storage unit 223. In the following description of the second embodiment, differences from the first embodiment will be described, and components having similar functional configurations to those of the first embodiment will be assigned the same reference numerals as those used in the description of the first embodiment, and their description will be omitted.

図９は、第二の実施形態の情報処理装置の処理を説明するフローチャートである。図９のステップＳ９０１からステップＳ９０５の処理は、図６のステップＳ６０１からステップＳ６０５の処理と同様であるから、説明を省略する。 Figure 9 is a flowchart illustrating the processing of an information processing device according to the second embodiment. The processing from steps S901 to S905 in Figure 9 is similar to the processing from steps S601 to S605 in Figure 6, and therefore will not be described further.

図９のステップＳ９０５に続いて、情報処理装置２００は、言語モデル提供部２２２により、発話スタイル推定部２３３から出力された発話スタイル推定情報に基づき、音声認識部２２１に提供する言語モデルを生成する（ステップＳ９０６）。 Following step S905 in FIG. 9, the information processing device 200 uses the language model providing unit 222 to generate a language model to be provided to the speech recognition unit 221 based on the speech style estimation information output from the speech style estimation unit 233 (step S906).

以下に、本実施形態の言語モデル提供部２２２の処理について説明する。 The processing performed by the language model providing unit 222 in this embodiment is described below.

本実施形態の言語モデル提供部２２２は、発話スタイル推定情報の入力を受け付けて、発話スタイル推定情報に含まれる各発話スタイルの確率と、各発話スタイルと対応する言語モデルとを用いて新たな言語モデルを生成し、音声認識部２２１へ提供する。 The language model providing unit 222 of this embodiment accepts input of speech style estimation information, generates a new language model using the probability of each speech style included in the speech style estimation information and the language model corresponding to each speech style, and provides it to the speech recognition unit 221.

具体的には、例えば、言語モデル提供部２２２に対し、発話区間の発話スタイルが、書き言葉を用いた発話スタイル（第一の発話スタイル）である確率が５％、話し言葉を用いた発話スタイル（第二の発話スタイル）である確率が７０％、くだけた話し言葉を用いた発話スタイル（第三の発話スタイル）である確率が２５％であることを示す発話スタイル推定情報が入力されたとする。 Specifically, for example, suppose that the language model providing unit 222 receives speech style estimation information indicating that the speech style of the speech section has a 5% probability of being a speech style using written language (first speech style), a 70% probability of being a speech style using spoken language (second speech style), and a 25% probability of being a speech style using casual spoken language (third speech style).

この場合において、例えば、第一の言語モデル２４１に対して、音声認識部２２１の処理単位の単語の音声認識の結果であるテキストデータが入力された場合に、次の処理単位に単語Ａが出現する出現確率が５％であったとする。また、同様に、このテキストデータが第二の言語モデル２４２に対して入力された場合に、次の処理単位に単語Ａが出現する出現確率が１０％であり、このテキストデータが第三の言語モデル２４３に対して入力された場合に、次の処理単位に単語Ａが出現する出現確率が２０％であったとする。 In this case, for example, when text data resulting from speech recognition of words in a processing unit by the speech recognition unit 221 is input to the first language model 241, the probability that word A will appear in the next processing unit is 5%. Similarly, when this text data is input to the second language model 242, the probability that word A will appear in the next processing unit is 10%, and when this text data is input to the third language model 243, the probability that word A will appear in the next processing unit is 20%.

この場合、本実施形態の言語モデル提供部２２２は、発話スタイル推定情報に含まれる各発話スタイルの確率と、各発話スタイルと対応する言語モデルにおける単語Ａの出現確率とを乗算した結果を加算する。そして、言語モデル提供部２２２は、次の処理単位に単語Ａが出現する出現確率が、演算した結果の値となるような言語モデルを生成し、音声認識部２２１に提供する。 In this case, the language model providing unit 222 of this embodiment adds the results of multiplying the probability of each speech style included in the speech style estimation information by the probability of appearance of word A in the language model corresponding to each speech style. The language model providing unit 222 then generates a language model in which the probability of appearance of word A in the next processing unit is equal to the calculated value, and provides this to the speech recognition unit 221.

この場合、言語モデル提供部２２２は、次の処理単位に単語Ａが出現する出現確率が、
（第一の発話スタイルである確率５％×第一の言語モデル２４１での出現確率５％）＋（第二の発話スタイルである確率７０％×第二の言語モデル２４２での出現確率１０％）＋（第三の発話スタイルである確率２５％×第三の言語モデル２４３での出現確率２０％）＝１２．２５％
となるような言語モデルを生成し、音声認識部２２１に提供する。 In this case, the language model providing unit 222 generates a language model in which the probability that word A appears in the next processing unit is
(probability of the first speech style, 5% × appearance probability in the first language model 241, 5%) + (probability of the second speech style, 70% × appearance probability in the second language model 242, 10%) + (probability of the third speech style, 25% × appearance probability in the third language model 243, 20%) = 12.25%
and provides it to the speech recognition unit 221.
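The weighted sum in this example can be checked with a short sketch. It treats each language model simply as a table of next-word probabilities for one fixed context, which is an assumption made for illustration; the function name `mix_language_models` and the catch-all entry "other" are hypothetical, while the style probabilities and the probabilities of word A are the ones given in the text.

```python
def mix_language_models(style_probs, next_word_probs):
    """Weight each style's next-word probabilities by the style probability and add them."""
    mixed = {}
    for style, weight in style_probs.items():
        for word, prob in next_word_probs[style].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed


# Probabilities of the first, second, and third speech styles (from the text).
style_probs = {"written": 0.05, "spoken": 0.70, "casual": 0.25}

# Appearance probability of word A in each language model (from the text).
# "other" lumps together all remaining words so that each table sums to 1 (assumption).
next_word_probs = {
    "written": {"A": 0.05, "other": 0.95},  # first language model 241
    "spoken": {"A": 0.10, "other": 0.90},   # second language model 242
    "casual": {"A": 0.20, "other": 0.80},   # third language model 243
}

mixed = mix_language_models(style_probs, next_word_probs)
```

The value of `mixed["A"]` is 0.05 × 0.05 + 0.70 × 0.10 + 0.25 × 0.20 = 0.1225, that is, 12.25%, up to floating-point rounding. Because the style probabilities sum to 1, the mixed table is again a probability distribution.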

本実施形態の情報処理装置２００は、ステップＳ９０６に続いて、ステップＳ９０７へ進む。ステップＳ９０７とステップＳ９０８は、図６のステップＳ６０７とステップＳ６０８と同様であるから、説明を省略する。 In this embodiment, the information processing device 200 proceeds to step S907 after step S906. Steps S907 and S908 are similar to steps S607 and S608 in Figure 6, so their description will be omitted.

本実施形態では、このように、発話スタイル推定情報に含まれる各発話スタイルの確率と、各言語モデルにおける処理単位の単語の出現確率とに基づいて生成した言語モデルを音声認識に用いる。このため、本実施形態では、発話スタイル推定部２３３による推定結果に誤りがあった場合であっても、誤りによる影響を抑えることができる。 In this embodiment, a language model generated based on the probability of each speech style included in the speech style estimation information and the occurrence probability of words in each processing unit in each language model is used for speech recognition. Therefore, in this embodiment, even if there is an error in the estimation result by the speech style estimation unit 233, the impact of the error can be reduced.

また、本実施形態では、例えば、各発話スタイルと対応した言語モデル毎の単語の出現確率が反映された言語モデルを用いて音声認識を行う。このため、本実施形態では、発話スタイル推定情報に含まれる複数の発話スタイルの確率のそれぞれが近い値であり、発話スタイルの推定が困難な場合等であっても、音声認識の精度を向上させることができる。 Furthermore, in this embodiment, speech recognition is performed using, for example, a language model that reflects the probability of occurrence of words for each language model corresponding to each speaking style. Therefore, in this embodiment, the accuracy of speech recognition can be improved even in cases where the probabilities of multiple speaking styles included in the speaking style estimation information are close to each other and it is difficult to estimate the speaking style.

（第三の実施形態） (Third embodiment)
以下に、図面を参照して第三の実施形態について説明する。第三の実施形態は、情報処理装置に対する設定に応じて、言語モデルを選択するか、又は、言語モデルを生成する点が、第一の実施形態と相違する。以下の第三の実施形態の説明では、第一の実施形態との相違点について説明し、第一の実施形態の同様の機能構成を有するものには、第一の実施形態の説明で用いた符号と同様の部号を付与し、その説明を省略する。 A third embodiment will be described below with reference to the drawings. The third embodiment differs from the first embodiment in that a language model is selected or generated according to settings for the information processing device. In the following description of the third embodiment, differences from the first embodiment will be described, and components having similar functional configurations to those of the first embodiment will be assigned the same reference numerals as those used in the description of the first embodiment, and descriptions thereof will be omitted.

本実施形態の情報処理装置２００は、言語モデル提供部２２２に対し、言語モデル記憶部２２３に格納された言語モデルを発話スタイル推定情報の推定結果に応じて選択するか、又は、新たに言語モデルを生成するか、設定することができる。 The information processing device 200 of this embodiment can set the language model providing unit 222 to select a language model stored in the language model storage unit 223 based on the estimation result of the speech style estimation information, or to generate a new language model.

本実施形態の言語モデル提供部２２２は、発話スタイル推定部２３３から発話スタイル推定情報が入力されると、設定に応じて言語モデルの選択、又は、言語モデルの生成を行う。また、本実施形態の言語モデル提供部２２２は、言語モデルの選択も言語モデル生成も設定されていない場合には、言語モデル選択信号によって指定された言語モデルを音声認識部２２１に提供する。 When the language model providing unit 222 of this embodiment receives speech style estimation information from the speech style estimation unit 233, it selects or generates a language model depending on the settings. Furthermore, if neither language model selection nor language model generation is set, the language model providing unit 222 of this embodiment provides the speech recognition unit 221 with the language model specified by the language model selection signal.

図１０は、第三の実施形態の情報処理装置の処理を説明するフローチャートである。図１０のステップＳ１００１からステップＳ１００３の処理は、図６のステップＳ６０１からステップＳ６０３の処理と同様であるから、説明を省略する。 Figure 10 is a flowchart illustrating the processing of an information processing device according to the third embodiment. The processing from steps S1001 to S1003 in Figure 10 is similar to the processing from steps S601 to S603 in Figure 6, and therefore will not be described further.

本実施形態の情報処理装置２００において、言語モデル提供部２２２は、発話スタイル推定情報が入力されると、言語モデルを選択する設定が行われているか否かを判定する（ステップＳ１００４）。 In the information processing device 200 of this embodiment, when speech style estimation information is input, the language model providing unit 222 determines whether a setting for selecting a language model has been made (step S1004).

ステップＳ１００４において、言語モデルを選択する設定が行われている場合、情報処理装置２００は、図６のステップＳ６０４、６０５へ進む。 If a setting to select a language model is made in step S1004, the information processing device 200 proceeds to steps S604 and S605 in Figure 6.

ステップＳ１００４において、言語モデルを選択する設定が行われておらず、言語モデルを生成する設定が行われていた場合、情報処理装置２００は、図９のステップＳ９０４、９０５へ進む。 If, in step S1004, the setting to select a language model has not been made and the setting to generate a language model has been made, the information processing device 200 proceeds to steps S904 and S905 in Figure 9.

ステップＳ１００５において、言語モデルを生成する設定が行われていない場合、音声認識処理部２２０は、言語モデル選択信号に応じて、言語モデル記憶部２２３に格納されている複数の言語モデルから、言語モデルを選択する（ステップＳ１００６）。 If the setting to generate a language model has not been made in step S1005, the speech recognition processing unit 220 selects a language model from multiple language models stored in the language model storage unit 223 in accordance with the language model selection signal (step S1006).

続いて、情報処理装置２００は、音声取得部２３２により、発話区間検出部２３１から出力された音声データの音声特徴量を抽出し（ステップＳ１００７）、ステップＳ１００８へ進む。 Next, the information processing device 200 uses the audio acquisition unit 232 to extract speech features from the audio data output from the speech section detection unit 231 (step S1007), and proceeds to step S1008.

図１０のステップＳ１００８とステップＳ１００９の処理は、図６のステップＳ６０７とステップＳ６０８の処理と同様であるから、説明を省略する。 The processing of steps S1008 and S1009 in Figure 10 is similar to the processing of steps S607 and S608 in Figure 6, so description thereof will be omitted.

このように、本実施形態では、言語モデルを発話スタイル推定情報に含まれる各発話スタイルの確率に応じて選択するか、又は、言語モデルを発話スタイル推定情報を用いて新たに生成するか、設定することができる。 In this way, in this embodiment, it is possible to set whether a language model is selected based on the probability of each speech style included in the speech style estimation information, or whether a language model is newly generated using the speech style estimation information.
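The three cases handled by the language model providing unit 222 in this embodiment can be summarized in a small sketch. The function `decide_provision`, the setting values "select" and "generate", and the tuple it returns are all hypothetical names chosen for illustration; the sketch only decides which path is taken and does not build a language model.

```python
from typing import Optional


def decide_provision(setting: Optional[str], estimation: dict[str, float], selection_signal: str):
    """Return which path the language model providing unit would take."""
    if setting == "select":
        # Selection is set: pick the model of the most probable speech style.
        return ("select", max(estimation, key=estimation.get))
    if setting == "generate":
        # Generation is set: build a new model weighted by the style probabilities.
        return ("generate", dict(estimation))
    # Neither is set: use the model specified by the language model selection signal.
    return ("signal", selection_signal)


estimation = {"written": 0.05, "spoken": 0.25, "casual": 0.70}
by_selection = decide_provision("select", estimation, "spoken")
by_generation = decide_provision("generate", estimation, "spoken")
by_signal = decide_provision(None, estimation, "spoken")
```

In this sketch the estimation information is ignored only in the last case, where the choice comes entirely from the selection signal.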

また、本実施形態では、例えば、発話スタイル推定情報に含まれる各発話スタイルの確率に対して所定の閾値を設定してもよい。そして、本実施形態では、発話スタイル推定情報に含まれる各発話スタイルの確率と、所定の閾値との関係に応じて、言語モデルを選択する設定が行われているか否かを判定してもよい。 Furthermore, in this embodiment, for example, a predetermined threshold may be set for the probability of each speech style included in the speech style estimation information. Then, in this embodiment, it may be determined whether or not a setting for selecting a language model has been made based on the relationship between the probability of each speech style included in the speech style estimation information and the predetermined threshold.

具体的には、本実施形態では、発話スタイル推定情報に含まれる各発話スタイルの確率のうち、最も高い値が所定の閾値以上である場合に、言語モデルを選択する設定が行われているものと判定してもよい。 Specifically, in this embodiment, if the highest probability value of each speech style included in the speech style estimation information is equal to or greater than a predetermined threshold, it may be determined that a setting to select a language model has been made.

言い換えれば、本実施形態では、発話スタイル推定情報に含まれる各発話スタイルの確率のうち、最も高い値が所定の閾値未満である場合は、新たな言語モデルを生成する。所定の閾値は、例えば、５０％としてもよい。 In other words, in this embodiment, if the highest probability of each speech style included in the speech style estimation information is less than a predetermined threshold, a new language model is generated. The predetermined threshold may be, for example, 50%.
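The threshold rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `decide_mode` and `pick_style`, the style labels, and the representation of the speech style estimation information as a dictionary of probabilities are all assumptions made for the example. Only the comparison against the threshold (50% in the example given in the text) comes from the description.

```python
from typing import Mapping


def decide_mode(style_probs: Mapping[str, float], threshold: float = 0.5) -> str:
    """Return "select" if the most probable style reaches the threshold, else "generate".

    style_probs is a hypothetical stand-in for the speech style estimation
    information: a mapping from speech style name to estimated probability.
    """
    if not style_probs:
        raise ValueError("style_probs must not be empty")
    highest = max(style_probs.values())
    # Highest probability >= threshold: an existing model is selected.
    # Highest probability <  threshold: a new model is generated.
    return "select" if highest >= threshold else "generate"


def pick_style(style_probs: Mapping[str, float]) -> str:
    """Return the speech style with the highest probability."""
    return max(style_probs, key=lambda style: style_probs[style])
```

With probabilities such as 0.7 for one style and 0.3 for another, the sketch chooses selection; when three styles are estimated at 0.4, 0.35, and 0.25, no style reaches the 50% threshold and generation is chosen.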

本実施形態では、このようにすることで、発話スタイル推定情報に含まれる各発話スタイルの確率が、それぞれ近い値である場合には、各発話スタイルに応じた言語モデルにおける単語の出現確率を反映させた言語モデルを音声認識に用いることができる。したがって、本実施形態によれば、音声認識の精度を向上させることができる。 In this embodiment, by doing this, if the probabilities of each speech style included in the speech style estimation information are close to each other, a language model that reflects the word appearance probability in the language model corresponding to each speech style can be used for speech recognition. Therefore, according to this embodiment, the accuracy of speech recognition can be improved.
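One way to picture a generated language model that reflects the word appearance probabilities of each style-specific model is a weighted mixture. The sketch below is an assumption for illustration only: it treats each language model as a simple unigram table (word to probability) and combines the tables by linear interpolation, using the speech style probabilities as weights. The function name `interpolate_language_models` and the data layout are hypothetical; this passage does not state the actual generation procedure.

```python
from typing import Dict, Mapping


def interpolate_language_models(
    models: Mapping[str, Mapping[str, float]],
    style_probs: Mapping[str, float],
) -> Dict[str, float]:
    """Mix per-style unigram models, weighting each by its style probability.

    models:      hypothetical {style: {word: probability}} tables.
    style_probs: hypothetical {style: probability} estimation information.
    """
    total = sum(style_probs.get(style, 0.0) for style in models)
    if total <= 0.0:
        raise ValueError("at least one model needs a positive style probability")
    mixed: Dict[str, float] = {}
    for style, unigram in models.items():
        # Normalize so the weights of the available models sum to 1.
        weight = style_probs.get(style, 0.0) / total
        for word, prob in unigram.items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed
```

When two styles are estimated as equally likely, each word's probability in the mixed table is the average of its probabilities in the two source tables, so vocabulary typical of either style remains available to the recognizer.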

（第四の実施形態） (Fourth embodiment)
以下に、図面を参照して、第四の実施形態について説明する。第四の実施形態では、第一乃至第三の実施形態の情報処理システムの具体的な利用シーンの一例を示している。 The fourth embodiment will be described below with reference to the drawings. The fourth embodiment shows an example of a specific usage scene of the information processing system according to the first to third embodiments.

図１１は、第四の実施形態のシステム構成の一例を示す図である。図１１では、第一乃至第三の実施形態のいずれか一つを、遠隔会議システムに利用した場合を示している。 Figure 11 is a diagram showing an example of the system configuration of the fourth embodiment. Figure 11 shows a case where any one of the first to third embodiments is used in a remote conference system.

本実施形態の遠隔会議システム１００Ａは、情報処理装置２００、半天球型撮像装置４００、電子黒板５００を含み、それぞれがネットワークを介して接続されている。 The remote conference system 100A of this embodiment includes an information processing device 200, a hemispherical imaging device 400, and an electronic whiteboard 500, all of which are connected via a network.

本実施形態では、半天球型撮像装置４００と、電子黒板５００とは、それぞれが、地理的に離れた場所に設置されていてもよい。具体的には、例えば、半天球型撮像装置４００は、Ａ県Ａ市に所在する事業所の会議室に設置されており、電子黒板５００は、Ｂ県Ｂ市に所在する事業所の会議室に設置されていてもよい。 In this embodiment, the hemispherical imaging device 400 and the electronic whiteboard 500 may be installed in geographically separate locations. Specifically, for example, the hemispherical imaging device 400 may be installed in a conference room of a business office located in City A, Prefecture A, and the electronic whiteboard 500 may be installed in a conference room of a business office located in City B, Prefecture B.

半天球型撮像装置４００は、会議室内の半天球画像データを撮像する。また、半天球型撮像装置４００は、集音装置を有しており、会議室内で行われた発話の音声データを取得する。また、半天球型撮像装置４００は、半天球画像データと音声データとを含む映像データを情報処理装置２００へ送信する通信装置とを含んでもよい。 The hemispherical imaging device 400 captures hemispherical image data within the conference room. The hemispherical imaging device 400 also has a sound collection device and acquires audio data of speech made within the conference room. The hemispherical imaging device 400 may also include a communication device that transmits video data including the hemispherical image data and audio data to the information processing device 200.

電子黒板５００は、例えば、タッチパネル付大型ディスプレイを有し、ユーザが指示した盤面の座標を検出し座標を接続してストロークを表示するものであり、表示装置の一例である。なお、電子黒板５００は、電子情報ボード、電子ホワイトボードと呼ばれる場合もある。 The electronic whiteboard 500 has, for example, a large display with a touch panel, detects the coordinates on the board surface indicated by the user, and displays strokes by connecting those coordinates; it is an example of a display device. Note that the electronic whiteboard 500 (電子黒板, literally "electronic blackboard") is also sometimes called an electronic information board (電子情報ボード) or an electronic whiteboard (電子ホワイトボード).

本実施形態の情報処理装置２００は、音声認識処理部２２０を有し、例えば、半天球型撮像装置４００が設置された会議室で取得された映像データに基づき、会議に参加していた話者毎の発話をテキストデータに変換して電子黒板５００に表示させる。なお、電子黒板５００には、テキストデータと共に、半天球型撮像装置４００が取得した映像データが表示されてもよい。 The information processing device 200 of this embodiment has a voice recognition processing unit 220, and converts the speech of each speaker participating in a conference into text data based on, for example, video data acquired in a conference room where a hemispherical imaging device 400 is installed, and displays the text data on the electronic whiteboard 500. Note that the electronic whiteboard 500 may also display the video data acquired by the hemispherical imaging device 400 together with the text data.
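The per-speaker text output described above could be represented, for example, as one record per utterance that is formatted into a display line. The sketch below is only an illustration under assumed names: `RecognizedUtterance`, its fields, and `format_for_display` are hypothetical and do not correspond to any interface defined in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedUtterance:
    """Hypothetical record for one recognized utterance."""
    speaker: str  # label of the speaker, e.g. "Speaker A"
    style: str    # speech style of the language model that was used
    text: str     # text data produced by speech recognition


def format_for_display(utterance: RecognizedUtterance, show_style: bool = True) -> str:
    """Build one line of text for the display device."""
    if show_style:
        label = f"{utterance.speaker} [{utterance.style}]"
    else:
        label = utterance.speaker
    return f"{label}: {utterance.text}"
```

Showing the style label next to the text is optional in this sketch; with `show_style=False` only the speaker label and the recognized text are emitted.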

本実施形態では、このように情報処理装置２００を用いることで、例えば、立場の異なる複数の人物が発話している会議等において、発話した人物の立場に応じた発話スタイルと対応する言語モデルを用いて音声認識を行うことができる。 In this embodiment, using the information processing device 200 in this manner makes it possible, for example in a conference where multiple people in different positions are speaking, to perform speech recognition using the language model that corresponds to the speech style suited to the position of the person speaking.

したがって、本実施形態によれば、例えば、電子黒板５００において、音声認識の結果であるテキストデータのみが表示された場合であっても、テキストデータに話者毎の発話スタイルが反映される。このため、本実施形態では、例えば、電子黒板５００の閲覧者に対して、話者の立場や、人間関係を把握させることができる。 According to this embodiment, even if only text data resulting from speech recognition is displayed on the electronic whiteboard 500, the speaking style of each speaker is reflected in the text data. Therefore, in this embodiment, for example, viewers of the electronic whiteboard 500 can understand the speaker's position and interpersonal relationships.

なお、図１１では、映像データを半天球型撮像装置４００により撮像するものとしたが、これに限定されない。映像データは、一般的な撮像装置や、全天球型撮像装置等によって取得されてもよい。 Note that while FIG. 11 illustrates the video data being captured by a hemispherical imaging device 400, this is not limiting. The video data may also be acquired by a general imaging device, a spherical imaging device, or the like.

（第五の実施形態） (Fifth embodiment)
以下に、図面を参照して、第五の実施形態について説明する。第五の実施形態では、第一乃至第三の実施形態の情報処理システムの具体的な利用シーンの一例を示している。 The fifth embodiment will be described below with reference to the drawings. The fifth embodiment shows an example of a specific usage scene of the information processing system according to the first to third embodiments.

図１２は、第五の実施形態のシステム構成の一例を示す図である。図１２では、第一乃至第三の実施形態の何れか一つを、カウンセリング支援システムに利用した場合を示している。 Figure 12 is a diagram showing an example of the system configuration of the fifth embodiment. Figure 12 shows a case where any one of the first to third embodiments is used in a counseling support system.

本実施形態のカウンセリング支援システム１００Ｂは、例えば、臨床心理士等のカウンセラによるカウンセリングが行われる場所に導入されてもよい。本実施形態のカウンセリング支援システム１００Ｂは、情報処理装置２００、撮像装置７００、端末装置８００を含み、それぞれがネットワークを介して接続されている。 The counseling support system 100B of this embodiment may be installed in a location where counseling is provided by a counselor such as a clinical psychologist. The counseling support system 100B of this embodiment includes an information processing device 200, an imaging device 700, and a terminal device 800, all of which are connected via a network.

撮像装置７００は、例えば、カウンセリングが行われるカウンセリングルーム等に設定されてもよく、カウンセリングを受ける人物を含む映像データを取得する。言い換えれば、撮像装置７００は、カウンセリングを受けている人物の動画と音声を含む映像データを取得する。そして、撮像装置７００は、取得した映像データを情報処理装置２００へ送信する。以下の説明では、カウンセリングを受けている人物を相談者と表現する場合がある。 The imaging device 700 may be set up, for example, in a counseling room where counseling is conducted, and acquires video data including the person receiving counseling. In other words, the imaging device 700 acquires video data including video and audio of the person receiving counseling. The imaging device 700 then transmits the acquired video data to the information processing device 200. In the following description, the person receiving counseling may be referred to as the client.

なお、撮像装置７００により取得される映像データには、カウンセラの画像やカウンセラの発話を示す音声データが含まれてもよい。 The video data acquired by the imaging device 700 may also include images of the counselor and audio data showing the counselor's speech.

本実施形態の端末装置８００は、例えば、カウンセラが所持している端末装置であってよい。端末装置８００は、例えば、タブレット型の端末装置であってもよくもディスプレイを有するものである。 The terminal device 800 of this embodiment may be, for example, a terminal device carried by the counselor. The terminal device 800 may be, for example, a tablet-type terminal device, and has a display.

情報処理装置２００は、撮像装置７００から取得した映像データを取得すると、カウンセリング中の相談者の画像から、相談者の発話スタイルを推定し、相談者の音声データをテキストデータに変換する。そして、情報処理装置２００は、端末装置８００からの映像データの再生要求等を受け付けた場合に、映像データに、テキストデータを重ねて、端末装置８００に表示させる。 When the information processing device 200 acquires video data from the imaging device 700, it estimates the client's speaking style from the image of the client during counseling and converts the client's voice data into text data. Then, when the information processing device 200 receives a request to play the video data from the terminal device 800, it overlays the text data on the video data and displays it on the terminal device 800.

カウンセリング中における相談者の発話スタイルには、相談者の心理状態が反映される可能性が高い。例えば、相談者の音声データから変換されたテキストデータが、書き言葉である場合には、相談者は緊張した状態であることがわかる。また、例えば、相談者の音声データから変換されたテキストデータが、話し言葉である場合には、相談者は比較的リラックスした状態であることがわかる。 The client's speech style during counseling is likely to reflect the client's psychological state. For example, if the text data converted from the client's voice data is in a written-language style, it can be inferred that the client is tense. Likewise, if the text data converted from the client's voice data is in a spoken-language style, it can be inferred that the client is relatively relaxed.

また、例えば、相談者の音声データから変換されたテキストデータが、話し言葉から書き言葉に変化した場合には、カウンセリングでの話題が、相談者を緊張させるような話題に変化したことがわかる。 Also, for example, if the text data converted from the client's voice data changes from spoken to written language, it can be determined that the topic of the counseling session has changed to one that is making the client nervous.
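The observation above, that a change from spoken to written style may mark a topic that makes the client tense, can be sketched as a scan over per-utterance style labels. This is a hedged illustration: the `Utterance` record, its fields, the style labels "spoken" and "written", and the function `find_style_shifts` are assumptions introduced for the example and are not part of the described system.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one utterance of the client."""
    start_sec: float  # start time of the utterance within the video
    style: str        # speech style reflected in the converted text
    text: str         # text data converted from the voice data


def find_style_shifts(
    utterances: Sequence[Utterance],
    from_style: str = "spoken",
    to_style: str = "written",
) -> List[Tuple[float, str]]:
    """Return (start time, text) of each utterance where the style shifts."""
    shifts: List[Tuple[float, str]] = []
    # Compare each utterance with the one immediately before it.
    for previous, current in zip(utterances, utterances[1:]):
        if previous.style == from_style and current.style == to_style:
            shifts.append((current.start_sec, current.text))
    return shifts
```

A reviewer could use the returned start times to jump to the corresponding points in the recorded video; the reverse shift (written to spoken) can be found by swapping the two style arguments.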

本実施形態では、カウンセリング中の相談者の音声データを、相談者の発話スタイルに応じたテキストデータに変換して、カウンセラに提示することで、カウンセラによる、相談者の心理状態の把握を支援することができる。また、本実施形態では、カウンセリング中の相談者の心理状態をカウンセラに把握させることで、カウンセラに対して、適切な話題の選択等が行われていたかを学習させることができる。 In this embodiment, the voice data of the client during counseling is converted into text data that matches the client's speaking style and presented to the counselor, thereby assisting the counselor in understanding the client's psychological state. Furthermore, in this embodiment, by allowing the counselor to understand the client's psychological state during counseling, the counselor can learn whether appropriate topics were selected, etc.

なお、本実施形態では、カウンセリング支援システム１００Ｂを、臨床心理士等によるカウンセリングに適用したものとしたが、これに限定されない。カウンセリング支援システム１００Ｂは、例えば、学生や求職者に対する就職相談や、企業等の組織内での面談等に用いられてもよい。 In this embodiment, the counseling support system 100B is applied to counseling by a clinical psychologist or the like, but is not limited to this. The counseling support system 100B may also be used, for example, for career counseling for students and job seekers, or for interviews within organizations such as companies.

上記で説明した実施形態の各機能は、一又は複数の処理回路によって実現することが可能である。ここで、本明細書における「処理回路」とは、電子回路により実装されるプロセッサのようにソフトウェアによって各機能を実行するようプログラミングされたプロセッサや、上記で説明した各機能を実行するよう設計されたASIC（Application Specific Integrated Circuit）、DSP（digital signal processor）、FPGA（field programmable gate array）や従来の回路モジュール等のデバイスを含むものとする。 The functions of the above-described embodiments can be realized by one or more processing circuits. In this specification, the term "processing circuit" includes processors programmed to perform functions by software, such as processors implemented by electronic circuits, as well as devices such as ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), FPGAs (Field Programmable Gate Arrays), and conventional circuit modules designed to perform the above-described functions.

また、実施形態に記載された装置群は、本明細書に開示された実施形態を実施するための複数のコンピューティング環境のうちの１つを示すものにすぎない。 Furthermore, the devices described in the embodiments represent only one of several computing environments for implementing the embodiments disclosed herein.

ある実施形態では、情報処理装置２００は、サーバクラスタといった複数のコンピューティングデバイスを含む。複数のコンピューティングデバイスは、ネットワークや共有メモリなどを含む任意のタイプの通信リンクを介して互いに通信するように構成されており、本明細書に開示された処理を実施する。同様に、情報処理装置２００は、互いに通信するように構成された複数のコンピューティングデバイスを含むことができる。 In one embodiment, the information processing device 200 includes multiple computing devices, such as a server cluster. The multiple computing devices are configured to communicate with each other over any type of communication link, including a network or shared memory, and perform the processing disclosed herein. Similarly, the information processing device 200 may include multiple computing devices configured to communicate with each other.

さらに、情報処理装置２００は、開示された処理ステップを様々な組み合わせで共有するように構成できる。例えば、情報処理装置２００によって実行されるプロセスは、他の情報処理装置によって実行され得る。同様に、情報処理装置２００の機能は、他の情報処理装置によって実行することができる。また、情報処理装置と他の情報処理装置の各要素は、１つの情報処理装置にまとめられていても良いし、複数の装置に分けられていても良い。 Furthermore, the information processing device 200 can be configured to share the disclosed processing steps in various combinations. For example, a process performed by the information processing device 200 can be performed by another information processing device. Similarly, the functions of the information processing device 200 can be performed by another information processing device. Furthermore, the elements of the information processing device and the other information processing device may be integrated into a single information processing device, or may be separated into multiple devices.

以上、各実施形態に基づき本発明の説明を行ってきたが、上記実施形態に示した要件に本発明が限定されるものではない。これらの点に関しては、本発明の主旨をそこなわない範囲で変更することができ、その応用形態に応じて適切に定めることができる。 The present invention has been described above based on various embodiments, but the present invention is not limited to the requirements set forth in the above embodiments. These aspects can be modified without departing from the spirit of the present invention, and can be determined appropriately depending on the application form.

REFERENCE SIGNS LIST
１００ 音声認識システム 100 Speech recognition system
２００ 情報処理装置 200 Information processing device
２２０ 音声認識処理部 220 Speech recognition processing unit
２２１ 音声認識部 221 Speech recognition unit
２２２ 言語モデル提供部 222 Language model providing unit
２２３ 言語モデル記憶部 223 Language model storage unit
２２４ 出力テキスト生成部 224 Output text generation unit
２３０ 入力受付部 230 Input receiving unit
２３１ 発話区間検出部 231 Speech period detection unit
２３２ 音声取得部 232 Speech acquisition unit
２３３ 発話スタイル推定部 233 Speech style estimation unit
３００ 端末装置 300 Terminal device

特開２００４－３３３７３８号公報 Japanese Patent Application Laid-Open No. 2004-333738

Claims

1. An information processing method performed by a computer, the method comprising:
accepting input of video data including video of a speaker;
when the video data is input, outputting, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
when a setting to select a language model according to the speech style of the speaker has been made, selecting the language model corresponding to the speech style with the highest probability as the language model according to the speech style of the speaker;
when a setting to generate a new language model according to the speech style of the speaker has been made, generating, based on the estimation information, a new language model according to the speech style of the speaker;
providing the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition; and
displaying on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

2. An information processing method performed by a computer, the method comprising:
accepting input of video data including video of a speaker;
when the video data is input, outputting, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
when a setting to select a language model according to the speech style of the speaker has been made, selecting, based on the estimation information, a language model according to the speech style of the speaker;
when a setting to generate a new language model according to the speech style of the speaker has been made, generating a new language model using the plurality of language models and the probability for each speech style included in the estimation information;
providing the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition; and
displaying on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

3. The information processing method according to claim 1 or 2, wherein the computer determines that the setting to generate a new language model according to the speech style of the speaker has been made when the highest of the probabilities that each of the plurality of speech styles matches the speech style of the speaker is less than a predetermined threshold.

4. The information processing method according to claim 1, wherein the computer displays the text data on the display device for each utterance of the speaker.

5. The information processing method according to claim 4, wherein the computer displays, on the display device together with the text data, information indicating the speech style corresponding to the language model provided to the speech recognition unit.

6. A program that causes a computer to execute a process comprising:
accepting input of video data including video of a speaker;
when the video data is input, outputting, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
when a setting to select a language model according to the speech style of the speaker has been made, selecting, from among the plurality of language models stored in the storage unit, the language model corresponding to the speech style with the highest probability as the language model according to the speech style of the speaker;
when a setting to generate a new language model according to the speech style of the speaker has been made, generating, based on the estimation information, a new language model according to the speech style of the speaker;
providing the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition; and
displaying on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

7. A program that causes a computer to execute a process comprising:
accepting input of video data including video of a speaker;
when the video data is input, outputting, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
when a setting to select a language model according to the speech style of the speaker has been made, selecting, based on the estimation information, a language model according to the speech style of the speaker;
when a setting to generate a new language model according to the speech style of the speaker has been made, generating a new language model using the probability for each speech style included in the estimation information and the plurality of language models;
providing the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition; and
displaying on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

8. An information processing device comprising:
an input receiving unit that receives input of video data including video of a speaker;
an estimation unit that, when the video data is input, outputs, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
a model providing unit that, when a setting to select a language model according to the speech style of the speaker has been made, selects the language model corresponding to the speech style with the highest probability as the language model according to the speech style of the speaker, and that, when a setting to generate a new language model according to the speech style of the speaker has been made, generates, based on the estimation information, a new language model according to the speech style of the speaker; and
an output text generation unit that provides the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition, and that displays on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

9. An information processing device comprising:
an input receiving unit that receives input of video data including video of a speaker;
an estimation unit that, when the video data is input, outputs, for each of a plurality of speech styles corresponding to a plurality of language models stored in a storage unit, estimation information indicating the probability that the speech style matches the speech style of the speaker;
a model providing unit that, when a setting to select a language model according to the speech style of the speaker has been made, selects, based on the estimation information, a language model according to the speech style of the speaker, and that, when a setting to generate a new language model according to the speech style of the speaker has been made, generates a new language model according to the speech style of the speaker using the probability for each speech style included in the estimation information and the plurality of language models; and
an output text generation unit that provides the selected language model according to the speech style of the speaker, or the new language model, to a speech recognition unit that performs speech recognition, and that displays on a display device text data resulting from speech recognition of audio data contained in the video data using the selected language model according to the speech style of the speaker, or the new language model.

10. An information processing system comprising: the information processing device according to claim 8; and an imaging device that captures the video data.

11. The information processing system according to claim 10, wherein the imaging device is a spherical imaging device or a hemispherical imaging device.