JP7575804B2 - Voice recognition program, voice recognition method, voice recognition device, and voice recognition system

Info

Publication number: JP7575804B2
Application number: JP2022106669A
Authority: JP
Inventors: 康弘眞井; 幸一朗森重
Original assignee: 株式会社ミルプラトー
Priority date: 2020-12-18
Filing date: 2022-06-30
Publication date: 2024-10-30
Anticipated expiration: 2040-12-18
Also published as: JP7103681B2; JP2022096852A; JP2022121643A

Description

この発明は、録音等の音声認識を行う音声認識プログラム、音声認識方法、音声認識装置および音声認識システムに関する。 This invention relates to a voice recognition program, a voice recognition method, a voice recognition device, and a voice recognition system for performing voice recognition on recorded audio, etc.

ＩＣレコーダや録音アプリケーション（アプリ）により録音した音声は、ＩＣレコーダ等に多数保持可能である。録音後の音声ファイルについて、多数のうちから必要なものを効率的に見つけ出し再生できることが望まれている。また、音声データに含まれる話者を具体的にユーザに提示できることが望まれている。 A large number of audio files recorded using IC recorders or recording applications (apps) can be stored on IC recorders, etc. It is desirable to be able to efficiently find and play back the required audio files from among the many recorded audio files. It is also desirable to be able to specifically present to the user the speakers included in the audio data.

音声ファイルに含まれる話者人数は、ｋ－ｍｅａｎｓ法等のクラスタリング技術により推定することができる。クラスタリングでは、話者人数を事前に設定することで話者人数に基づき音声ファイルに含まれる音声を話者ごとに分割する。話者人数の推定に関する技術としては、例えば、会議等の打合せの録音前に話者人数を事前に設定し、話者分のマイクを用意し話者別の方向を検出する処理等により、話者人数を推定する技術がある（例えば、下記特許文献１参照。）。 The number of speakers included in an audio file can be estimated using clustering techniques such as the k-means method. In clustering, the number of speakers is set in advance and the audio included in the audio file is divided into speakers based on the number of speakers. One technique for estimating the number of speakers is to set the number of speakers in advance before recording a meeting or other discussion, prepare microphones for each speaker, and estimate the number of speakers by processing to detect the direction of each speaker (see, for example, Patent Document 1 below).
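The clustering approach described above can be illustrated with a minimal sketch. It assumes that each short audio segment has already been reduced to a feature vector (the 2-D vectors below are hypothetical), and it uses a plain k-means with a deterministic initialisation chosen only to keep the example reproducible. The point it shows is that the speaker count `k` has to be supplied in advance.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means; k (the speaker count) must be chosen in advance."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each segment joins its nearest centroid.
        labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

# Hypothetical feature vectors, one per short audio segment.
segments = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (0.0, 0.3), (5.1, 5.0)]
labels = kmeans(segments, k=2)  # one cluster label per segment
```

With a different `k` the same audio is split into a different number of groups, which is why a wrongly preset speaker count directly affects the result.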

特開２００９－３０１１２５号公報 Patent Document 1: JP 2009-301125 A

しかしながら、従来技術では、録音データに含まれる話者人数の推定には、打合せの録音前に話者人数を事前に設定し、話者分のマイクを用意し話者別の方向を検出する処理等の事前準備が必要となり煩雑であった。また、推定した話者人数は、所定の精度を有しているが、実際に録音した話者人数と異なる場合があり、このような場合において推定した話者人数の修正を簡単に行えなかった。 However, with conventional technology, estimating the number of speakers included in recorded data required cumbersome advance preparations, such as setting the number of speakers before recording the meeting, preparing microphones for each speaker, and processing to detect the direction of each speaker. In addition, although the estimated number of speakers had a certain degree of accuracy, it could differ from the number of speakers actually recorded, and in such cases the estimated number of speakers could not be easily corrected.

加えて、録音後の音声ファイルに含まれる話者が具体的に誰であるかの話者推定についても、簡単に推定できることが望まれる。 In addition, it would be desirable to be able to easily estimate who the specific speaker is in a recorded audio file.

本発明は、上記課題に鑑み、事前設定せずとも音声ファイルに含まれる音声の話者人数および話者を簡単に推定できることを目的とする。 In view of the above problems, an object of the present invention is to make it possible to easily estimate both the number of speakers and the identity of each speaker in the audio contained in an audio file, without any prior setting.

上記目的を達成するため、本発明の音声認識プログラムは、コンピュータに、音声ファイルに含まれる話者別の話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した前記話者人数のそれぞれの話者を認識し、前記音声ファイルに含まれる話者をタグ付けする、処理を実行させることを特徴とする。 To achieve the above object, the voice recognition program of the present invention is characterized in that it causes a computer to execute a process of estimating the number of distinct speakers included in an audio file, referring to trained models prepared in advance for individual speakers, recognizing each of the estimated number of speakers, and tagging the speakers included in the audio file.

また、前記認識の処理は、推定した前記話者人数の情報、および前記話者人数に対応する話者候補をユーザに提示し、前記ユーザによる前記話者候補から前記話者を特定する操作に基づき、前記話者人数のそれぞれの話者を認識し、前記タグ付けの処理は、前記ユーザの操作に基づき話者をタグ付けする、ことを特徴とする。 The recognition process is characterized in that it presents to a user information on the estimated number of speakers and speaker candidates corresponding to the number of speakers, and recognizes each speaker of the number of speakers based on an operation by the user to identify the speaker from the speaker candidates, and the tagging process tags the speakers based on the operation by the user.

また、前記推定の処理は、推定した前記話者人数をユーザに提示し、前記ユーザによる前記話者人数の変更操作に基づき、前記音声ファイルに含まれる話者人数の推定を再度実行する、ことを特徴とする。 The estimation process is also characterized in that it presents the estimated number of speakers to the user, and re-estimates the number of speakers included in the audio file based on the user's operation to change the number of speakers.

さらに、前記タグ付け後の話者の情報の学習および蓄積を行い、前記認識の処理は、前記学習済モデルに基づき、推定した前記話者人数のそれぞれの話者を認識する、ことを特徴とする。 Furthermore, the tagged speaker information is learned and accumulated, and the recognition process is characterized in that each speaker of the estimated number of speakers is recognized based on the learned model.

また、前記音声の録音時あるいは再生時に、前記音声ファイルに含まれる文字をリアルタイムに生成することを特徴とする。 In addition, text for the speech contained in the audio file is generated in real time while the audio is being recorded or played back.

また、本発明の音声認識方法は、コンピュータが、音声ファイルに含まれる話者別の話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した前記話者人数のそれぞれの話者を認識し、前記音声ファイルに含まれる話者をタグ付けする、処理を実行することを特徴とする。 The voice recognition method of the present invention is characterized in that a computer executes a process of estimating the number of distinct speakers included in an audio file, referring to trained models prepared in advance for individual speakers, recognizing each of the estimated number of speakers, and tagging the speakers included in the audio file.

また、本発明の音声認識装置は、音声ファイルに含まれる話者人数と話者を認識する制御部、を備え、前記制御部は、音声ファイルに含まれる話者別の話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した前記話者人数のそれぞれの話者を認識し、前記音声ファイルに含まれる話者をタグ付けする、ことを特徴とする。 The voice recognition device of the present invention includes a control unit that recognizes the number of speakers and the speakers included in an audio file. The control unit estimates the number of distinct speakers included in the audio file, refers to trained models prepared in advance for individual speakers, recognizes each of the estimated number of speakers, and tags the speakers included in the audio file.

また、本発明の音声認識システムは、端末と、クラウドが通信接続された音声認識システムにおいて、前記端末は、音声の録音部と、録音あるいは再生した音声ファイルを前記クラウドにアップロードする通信部と、を有し、前記クラウドは、前記音声ファイルに含まれる話者別の話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した前記話者人数のそれぞれの話者を認識し、前記音声ファイルに含まれる話者をタグ付けした情報を前記端末に通知する、ことを特徴とする。 The voice recognition system of the present invention is a voice recognition system in which a terminal and a cloud are communicatively connected. The terminal has a voice recording unit and a communication unit that uploads recorded or played-back audio files to the cloud. The cloud estimates the number of distinct speakers included in the audio file, refers to trained models prepared in advance for individual speakers, recognizes each of the estimated number of speakers, and notifies the terminal of information in which the speakers included in the audio file are tagged.

また、前記端末は、前記クラウドが推定した前記話者人数の情報、および前記話者人数に対応する話者候補をユーザに提示する表示部を備え、前記ユーザによる前記話者候補から前記話者を特定する操作の情報を前記クラウドに送信し、前記クラウドは、前記端末から受信した前記話者候補から前記話者を特定する操作の情報に基づき、前記話者人数のそれぞれの話者を認識し、前記ユーザの操作に基づき話者をタグ付けした情報を前記端末に送信する、ことを特徴とする。 The terminal is also characterized in that it has a display unit that presents to the user information on the number of speakers estimated by the cloud and speaker candidates corresponding to the number of speakers, transmits information on an operation by the user to identify the speaker from the speaker candidates to the cloud, and the cloud recognizes each speaker of the number of speakers based on the information on the operation to identify the speaker from the speaker candidates received from the terminal, and transmits information tagged with the speaker based on the user's operation to the terminal.

また、前記端末の前記制御部は、前記クラウドが推定した前記話者人数を前記表示部によりユーザに提示し、前記クラウドは、前記端末から受信した前記ユーザによる前記話者人数の変更操作に基づき、前記音声ファイルに含まれる話者人数の推定を再度実行した結果を前記端末に送信する、ことを特徴とする。 The control unit of the terminal also presents the number of speakers estimated by the cloud to the user via the display unit, and the cloud transmits to the terminal the result of re-estimating the number of speakers included in the audio file based on the user's operation to change the number of speakers received from the terminal.

また、前記クラウドは、前記端末からアップロードされた前記音声ファイルを保存する保存部を有することを特徴とする。 The cloud also has a storage unit that stores the audio files uploaded from the terminal.

上記構成によれば、音声ファイルに含まれる話者人数を推定後、各話者を具体的に認識でき、音声ファイルに含まれる話者人数と各話者を簡単に知ることができるようになる。 With the above configuration, after estimating the number of speakers included in an audio file, each speaker can be specifically recognized, making it easy to know the number of speakers included in the audio file and each speaker.

本発明によれば、事前設定せずとも音声ファイルに含まれる音声の話者人数および話者を簡単に推定できるという効果を奏する。 The present invention has the effect that both the number of speakers and the identity of each speaker in the audio contained in an audio file can be easily estimated without any prior setting.

図１は、実施の形態にかかる音声認識システムの機能構成図である。 FIG. 1 is a functional configuration diagram of a voice recognition system according to an embodiment.
図２は、音声認識装置のハードウェア構成例を示す図である。 FIG. 2 is a diagram illustrating an example of a hardware configuration of the voice recognition device.
図３は、音声認識にかかる処理例を示すフローチャートである。 FIG. 3 is a flowchart showing an example of a process related to voice recognition.
図４は、音声学習にかかる処理例を示すフローチャートである。 FIG. 4 is a flowchart showing an example of a process for voice learning.
図５は、音声認識に用いるテーブル構造例を示す図表である。 FIG. 5 is a diagram showing an example of a table structure used for voice recognition.
図６は、話者人数推定と話者認識の処理の遷移図である。 FIG. 6 is a transition diagram of processes for estimating the number of speakers and for speaker recognition.
図７は、初回録音時の端末上の表示画面を示す図である。 FIG. 7 shows a display screen on the terminal when recording for the first time.
図８は、録音時の端末上の録音画面を示す図である。 FIG. 8 shows the recording screen on the terminal during recording.
図９は、話者人数の推定後の端末上の表示画面を示す図である。 FIG. 9 is a diagram showing a display screen on the terminal after the number of speakers has been estimated.
図１０は、話者候補の端末上の表示画面を示す図である。 FIG. 10 shows a display screen on the terminal of a speaker candidate.
図１１は、端末上の話者選択の一覧を示す表示画面を示す図である。 FIG. 11 is a diagram showing a display screen showing a list of speaker selections on a terminal.
図１２は、端末上の文字起こしの表示画面を示す図である。 FIG. 12 is a diagram showing a display screen of the transcription on the terminal.
図１３は、音声ファイル再生時の端末上の表示画面を示す図である。 FIG. 13 shows a display screen on the terminal when an audio file is being played back.
図１４は、音声ファイルに含まれる音声の波形例を示す図である。 FIG. 14 is a diagram showing an example of a waveform of an audio contained in an audio file.
図１５は、音声ファイルに含まれる話者のグループ分けを示す図である。 FIG. 15 is a diagram showing groupings of speakers included in an audio file.
図１６は、推定した話者人数の変更を示す図である。 FIG. 16 is a diagram showing changes in the estimated number of speakers.
図１７は、推定した話者人数の変更を示す図である。 FIG. 17 is a diagram showing changes in the estimated number of speakers.
図１８は、端末上の話者候補の追加表示画面を示す図である。 FIG. 18 shows a screen for adding speaker candidates on a terminal.

（実施の形態） (Embodiment)
以下に添付図面を参照して、この発明にかかる音声認識プログラム、音声認識方法、音声認識装置および音声認識システムの好適な実施の形態を詳細に説明する。 Preferred embodiments of a voice recognition program, a voice recognition method, a voice recognition device, and a voice recognition system according to the present invention will be described in detail below with reference to the accompanying drawings.

（システムの概要構成） (System Overview)
図１は、実施の形態にかかる音声認識システムの機能構成図である。音声認識システムは、音声を録音する端末１００と、クラウド１１０とを含む。端末１００は、ＩＣレコーダや、録音アプリを有するスマートフォン、タブレット、ＰＣ等である。以下の説明では、ＩＣレコーダやスマートフォン等の端末１００がマイクから音声を録音する構成を例に説明するが、これに限らず、端末１００は、スマートフォン等による相手との通話を録音する構成とすることもできる。 FIG. 1 is a functional configuration diagram of a voice recognition system according to an embodiment. The voice recognition system includes a terminal 100 that records voice, and a cloud 110. The terminal 100 is an IC recorder, a smartphone having a recording application, a tablet, a PC, or the like. In the following description, a configuration in which the terminal 100 such as an IC recorder or a smartphone records voice from a microphone will be described as an example, but the present invention is not limited to this, and the terminal 100 can also be configured to record a call with a counterpart party via a smartphone or the like.

端末１００は、マイク１０１と、制御部１０５と、キーボード１０６と、ディスプレイ１０７と、を含む。制御部１０５は、録音部１０２と、文字起こし部１０３と、話者タグ付け部１０４と、を含む。録音部１０２は、マイク１０１を介して話者（会議等の複数の参加者等）が発した音声を音声ファイルＤとして保持する。制御部１０５は、話者人数および話者推定の際、音声ファイルＤをクラウド１１０上に送信する。 The terminal 100 includes a microphone 101, a control unit 105, a keyboard 106, and a display 107. The control unit 105 includes a recording unit 102, a transcription unit 103, and a speaker tagging unit 104. The recording unit 102 stores the voices uttered by speakers (such as multiple participants in a conference) via the microphone 101 as audio files D. The control unit 105 transmits the audio files D to the cloud 110 when estimating the number of speakers and the speakers.

文字起こし部１０３は、音声ファイルＤに含まれる音声を音声認識してテキスト等の文字データを生成する。話者タグ付け部１０４は、音声ファイルＤに含まれる音声の話者人数および話者の情報をタグとして音声ファイルＤにタグ付けする。図１のシステム構成例では、話者タグ付け部１０４は、クラウド１１０が話者人数と話者を推定したタグ付けの情報をクラウド１１０から取得し、端末１００上において音声ファイルＤに含まれる話者を特定可能にタグ付けする。 The transcription unit 103 performs speech recognition on the voice included in the audio file D to generate character data such as text. The speaker tagging unit 104 tags the audio file D with the number of speakers and speaker information of the voice included in the audio file D as tags. In the example system configuration of FIG. 1, the speaker tagging unit 104 obtains tagging information from the cloud 110 that estimates the number of speakers and the speakers, and tags the speakers included in the audio file D on the terminal 100 so that they can be identified.

端末１００の制御部１０５は、搭載された各機能を、例えば、ＡＰＩ（ＡｐｐｌｉｃａｔｉｏｎＰｒｏｇｒａｍｍｉｎｇＩｎｔｅｒｆａｃｅ）により呼び出し実行する構成としてもよい。 The control unit 105 of the terminal 100 may be configured to call and execute each of the installed functions, for example, via an API (Application Programming Interface).

クラウド１１０は、複数のＰＣ群、サーバー群、ストレージ群を有し、端末１００とインターネット等のネットワークを介して通信接続される。図１の構成例に示すクラウド１１０は、例えば、ストレージサーバー１２０と、機械学習サーバー１３０と、学習済モデルを格納する学習済モデルデータベース（ＤＢ）１４０と、を含む。 The cloud 110 has a plurality of PC groups, servers, and storage groups, and is communicatively connected to the terminal 100 via a network such as the Internet. The cloud 110 shown in the configuration example of FIG. 1 includes, for example, a storage server 120, a machine learning server 130, and a trained model database (DB) 140 that stores trained models.

ストレージサーバー１２０は、端末１００との間で音声ファイルＤを送受信する。ストレージサーバー１２０は、端末１００から送信された音声ファイルＤを保存部１２１に一時保存する。また、ストレージサーバー１２０は、機械学習サーバー１３０が話者人数と話者を推定した情報を含む音声ファイルＤを保存部１２１に一時保存し、この音声ファイルＤを端末１００に送信する。 The storage server 120 transmits and receives audio files D to and from the terminal 100. The storage server 120 temporarily stores the audio files D transmitted from the terminal 100 in the storage unit 121. The storage server 120 also temporarily stores the audio files D, which include information on the number of speakers and the speakers estimated by the machine learning server 130, in the storage unit 121, and transmits the audio files D to the terminal 100.

機械学習サーバー１３０は、話者人数推定部１３１と、話者認識部１３２の機能を有する。学習済モデルＤＢ１４０には、音声ファイルＤの話者人数と話者を推定するための学習済モデルが保持される。学習済モデルは、音声別の話者の認識情報の学習結果であり、端末１００からの音声認識の要求ごとにアップロードされる音声ファイルＤの学習結果として学習済モデルＤＢ１４０に更新可能に蓄積される。 The machine learning server 130 has the functions of a speaker number estimation unit 131 and a speaker recognition unit 132. The trained model DB 140 holds trained models for estimating the number of speakers in audio file D and who those speakers are. The trained models are the results of learning speaker recognition information for each voice. They are accumulated in the trained model DB 140 in an updatable manner, as the learning results for each audio file D uploaded with a voice recognition request from the terminal 100.

機械学習サーバー１３０の話者人数推定部１３１は、音声ファイルＤを音声認識し、音声ファイルＤに含まれる話者人数を推定する。話者認識部１３２は、話者人数を推定した後の音声ファイルＤに含まれる話者を推定する。話者人数推定部１３１と話者認識部１３２は、学習済モデルＤＢ１４０の学習済モデルにアクセスし、話者人数および話者を推定する。 The speaker number estimation unit 131 of the machine learning server 130 performs voice recognition on the audio file D and estimates the number of speakers included in the audio file D. The speaker recognition unit 132 then estimates which speakers are included in the audio file D once the number of speakers has been estimated. Both the speaker number estimation unit 131 and the speaker recognition unit 132 access the trained models in the trained model DB 140 to estimate the number of speakers and the speakers.

実施の形態では、クラウド１１０（機械学習サーバー１３０）が音声ファイルＤに含まれる話者人数および話者の推定を行い、推定結果を一旦端末１００に送信する。端末１００では、クラウド１１０側で推定した音声ファイルＤの話者人数と話者の情報を画面上に表示する。そして、端末１００でのユーザによる操作により、修正および確定を行う。 In the embodiment, the cloud 110 (machine learning server 130) estimates the number of speakers and the speakers contained in the audio file D, and first sends the estimation result to the terminal 100. The terminal 100 displays on its screen the number of speakers and the speaker information that the cloud 110 estimated for the audio file D. The user then corrects and confirms the result by operating the terminal 100.

このように、実施の形態では、クラウド１１０側で推定した音声ファイルＤの話者人数と話者を、端末１００のユーザが補助的に行う操作により修正あるいは確定する。この修正および確定の操作情報は、端末１００からクラウド１１０（機械学習サーバー１３０）に送信する。これら修正および確定の処理時においては、音声ファイルＤそのものを端末１００とクラウド１１０との間で送受信する必要はなく、話者人数と話者に関する情報に対する修正および確定の情報のみを送信することで、伝送データ量を削減できる。 In this manner, in the embodiment, the number of speakers and speakers of audio file D estimated on the cloud 110 side are corrected or confirmed by an auxiliary operation performed by the user of terminal 100. This correction and confirmation operation information is transmitted from terminal 100 to cloud 110 (machine learning server 130). During these correction and confirmation processes, there is no need to transmit and receive audio file D itself between terminal 100 and cloud 110, and by transmitting only the correction and confirmation information regarding the number of speakers and the information regarding the speakers, the amount of transmitted data can be reduced.
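The reduction in transmitted data described above can be sketched as follows. This is only an illustration under assumptions: the source does not specify a message format, so the JSON layout, the field names (`audio_id`, `speaker_count`, `speakers`) and the function name are all hypothetical. What it shows is that a correction can be expressed in a few dozen bytes that refer to the already uploaded file by its identifier, so the audio itself need not be sent again.

```python
import json

def build_correction_message(audio_id, corrected_speaker_count=None, speaker_names=None):
    """Encode only the user's correction or confirmation, never the audio itself."""
    message = {"audio_id": audio_id}  # refers to the file already held by the cloud
    if corrected_speaker_count is not None:
        message["speaker_count"] = corrected_speaker_count
    if speaker_names:
        # Maps a provisional speaker ID to the name the user selected.
        message["speakers"] = speaker_names
    return json.dumps(message, ensure_ascii=False).encode("utf-8")

payload = build_correction_message(
    "0001", corrected_speaker_count=3, speaker_names={"0001": "Alice"}
)
```

The resulting payload is far smaller than a recorded audio file, which is the effect the paragraph above describes.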

機械学習サーバー１３０は、話者人数の修正時には、音声ファイルＤに対し修正後の話者人数で話者を再度分割する。また、話者の修正時には、音声ファイルＤに対し修正後の話者をタグ付けする。 When the number of speakers is corrected, the machine learning server 130 divides the speakers again for the audio file D according to the corrected number of speakers. When the speaker is corrected, the machine learning server 130 tags the corrected speaker for the audio file D.

このように、実施の形態では、事前準備せずとも、録音後の音声ファイルＤに基づき、話者人数と話者を推定する。そして、推定した話者人数と話者をユーザ操作により修正可能とすることで、音声ファイルＤに含まれる話者人数と話者の推定精度を向上でき、簡単に推定処理できるようになる。 In this way, in the embodiment, the number of speakers and the speakers are estimated based on the recorded audio file D without prior preparation. By allowing the estimated number of speakers and speakers to be modified by user operation, the estimation accuracy of the number of speakers and the speakers included in the audio file D can be improved, and the estimation process can be performed easily.

図１に示した例では、端末１００により録音した音声ファイルＤをクラウド１１０により音声認識する音声認識システムを構成している。これに限らず、図１でクラウド１１０側に配置した話者人数推定と話者認識の機能を端末１００に配置することで、端末１００単独で音声認識装置を構成することもできる。 In the example shown in FIG. 1, a voice recognition system is configured in which the cloud 110 performs voice recognition on the voice file D recorded by the terminal 100. However, it is not limited to this. By arranging the speaker number estimation and speaker recognition functions arranged on the cloud 110 side in FIG. 1 in the terminal 100, it is also possible to configure a voice recognition device on the terminal 100 alone.

図２は、音声認識装置のハードウェア構成例を示す図である。例えば、図１に示す端末１００は、図２に示す構成を有する。端末１００は、ＣＰＵ２０１、ＲＯＭ２０２、ＲＡＭ２０３、外部メモリ２０４、マイク１０１、キーボード１０６、入力インターフェース（Ｉ／Ｆ）２０８、映像Ｉ／Ｆ２０９、ディスプレイ１０７、通信Ｉ／Ｆ２１１、等を含む。各構成部２０１～２１１は、バス２２０によってそれぞれ接続されている。 Fig. 2 is a diagram showing an example of the hardware configuration of a voice recognition device. For example, the terminal 100 shown in Fig. 1 has the configuration shown in Fig. 2. The terminal 100 includes a CPU 201, a ROM 202, a RAM 203, an external memory 204, a microphone 101, a keyboard 106, an input interface (I/F) 208, a video I/F 209, a display 107, a communication I/F 211, etc. The components 201 to 211 are connected to each other via a bus 220.

ＣＰＵ２０１は、端末１００全体の制御を司る制御部の機能を有する。ＲＯＭ２０２は、制御用のブートプログラムを記録している。ＲＡＭ２０３は、ＣＰＵ２０１のワークエリアとして使用される。すなわち、ＣＰＵ２０１は、ＲＡＭ２０３をワークエリアとして使用しながら、ＲＯＭ２０２に記録された各種プログラムを実行することによって、音声認識装置１００の全体の制御を司る。 The CPU 201 functions as a control unit that controls the entire terminal 100. The ROM 202 stores a boot program for control. The RAM 203 is used as a work area for the CPU 201. That is, the CPU 201 controls the entire voice recognition device 100 by executing various programs stored in the ROM 202 while using the RAM 203 as a work area.

外部メモリ２０４は、ＨＤＤやＳＳＤ、ディスク装置、フラッシュメモリ等からなり、ＣＰＵ２０１の制御にしたがってデータを書き込み／読み取り可能に保持する。 The external memory 204 may be a HDD, SSD, disk device, flash memory, etc., and stores data in a writable/readable manner under the control of the CPU 201.

入力Ｉ／Ｆ２０８には、話者の音声を取得するマイク１０１と、文字、数値、各種指示などの入力のための複数のキーを備えたキーボード１０６とが接続され、これらから入力されたデータをＣＰＵ２０１に出力する。 The input I/F 208 is connected to a microphone 101 that captures the speaker's voice and a keyboard 106 that has multiple keys for inputting characters, numbers, various instructions, etc., and outputs data input from these to the CPU 201.

映像Ｉ／Ｆ２０９は、ディスプレイ１０７に接続される。映像Ｉ／Ｆ２０９は、具体的には、例えば、ディスプレイ１０７全体を制御するグラフィックコントローラと、即時表示可能な画像情報を一時的に記録するＶＲＡＭ（ＶｉｄｅｏＲＡＭ）などのバッファメモリと、グラフィックコントローラから出力される画像データに基づいてディスプレイ１０７を制御する制御ＩＣなどによって構成される。 The video I/F 209 is connected to the display 107. Specifically, the video I/F 209 is composed of, for example, a graphics controller that controls the entire display 107, a buffer memory such as a VRAM (Video RAM) that temporarily records image information that can be displayed immediately, and a control IC that controls the display 107 based on image data output from the graphics controller.

ディスプレイ１０７には、アイコン、カーソル、メニュー、ウインドウ、あるいは文字や画像などの各種データが表示される。ディスプレイ１０７としては、例えば、ＴＦＴ液晶ディスプレイ、有機ＥＬディスプレイなどを用いることができる。 Display 107 displays various data such as icons, cursors, menus, windows, or characters and images. For example, a TFT liquid crystal display or an organic EL display can be used as display 107.

通信Ｉ／Ｆ２１１は、ネットワークに接続され、クラウド１１０と通信接続するインターフェースとして機能する。ネットワークとしては、有線あるいは無線接続されるインターネット、公衆回線網や携帯電話網、ＬＡＮ、ＷＡＮなどがある。 The communication I/F 211 is connected to a network and functions as an interface for communication connection with the cloud 110. The network may be the Internet, a public line network, a mobile phone network, a LAN, a WAN, etc., which may be connected via a wired or wireless connection.

図１に示した端末１００は、図２に記載のＲＯＭ２０２、ＲＡＭ２０３、外部メモリ２０４などに記録されたプログラムやデータを用いて、ＣＰＵ２０１が所定のプログラムを実行することによって、端末１００の機能を実現する。また、端末１００がスマートフォンやタブレット等の携帯機器の場合、キーボード１０６と、ディスプレイ１０７はタッチパネルで構成してもよい。 The terminal 100 shown in FIG. 1 realizes the functions of the terminal 100 by the CPU 201 executing a predetermined program using programs and data recorded in the ROM 202, RAM 203, external memory 204, etc. shown in FIG. 2. In addition, if the terminal 100 is a mobile device such as a smartphone or tablet, the keyboard 106 and the display 107 may be configured as a touch panel.

また、図１に記載のクラウド１１０を構成する各サーバー１２０，１３０についても、図２同様の構成を有し、ＣＰＵ２０１が制御部として機能し、全体の処理を司る。 Furthermore, each of the servers 120 and 130 constituting the cloud 110 shown in FIG. 1 has a configuration similar to that shown in FIG. 2, with the CPU 201 functioning as a control unit and managing the overall processing.

図３は、音声認識にかかる処理例を示すフローチャートである。上述した話者人数推定および話者認識にかかる音声認識の処理は、主にクラウド１１０（機械学習サーバー１３０）が行う。 Figure 3 is a flowchart showing an example of processing related to speech recognition. The speech recognition processing related to the above-mentioned speaker number estimation and speaker recognition is mainly performed by the cloud 110 (machine learning server 130).

はじめに、端末１００は、録音した音声ファイルＤをクラウド１１０にアップロードする（ステップＳ３０１）。端末１００は、既に録音されている音声ファイルＤをアップロードしてもよい。この音声ファイルＤは、不特定の話者が録音した音声であり、話者人数も不明な状態である。端末１００は、話者人数と話者を特定するために音声ファイルＤをアップロードする。クラウド１１０は、アップロードされた音声ファイルＤをストレージサーバー１２０の保存部１２１に保存する（ステップＳ３０２）。 First, the terminal 100 uploads a recorded audio file D to the cloud 110 (step S301). The terminal 100 may upload an already recorded audio file D. This audio file D is audio recorded by unspecified speakers, and the number of speakers is unknown. The terminal 100 uploads the audio file D to identify the number of speakers and the speakers. The cloud 110 stores the uploaded audio file D in the storage unit 121 of the storage server 120 (step S302).

次に、クラウド１１０の機械学習サーバー１３０の制御部（話者人数推定部１３１、ＣＰＵ２０１）は、音声ファイルＤに含まれる話者人数を推定する（ステップＳ３０３）。制御部は、音声ファイルＤの音声に対し、推定した話者人数別のユニークなＩＤを付与する。ＩＤ付与により、音声ファイルＤ上において推定した話者別の音声が識別可能となる。 Next, the control unit (speaker number estimation unit 131, CPU 201) of the machine learning server 130 in the cloud 110 estimates the number of speakers included in the audio file D (step S303). The control unit assigns a unique ID to the audio of each estimated speaker in the audio file D. These IDs make the audio of each estimated speaker identifiable within the audio file D.
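The ID assignment described above can be sketched in a few lines. This assumes that the estimation step yields one cluster label per audio segment, which the source does not state explicitly. The function name and the four-digit ID format (modelled on the "0001"-style identifiers used elsewhere in this description) are illustrative choices.

```python
def assign_speaker_ids(cluster_labels):
    """Give each distinct cluster a unique provisional speaker ID, in order of first appearance."""
    ids = {}
    for label in cluster_labels:
        # A new cluster receives the next zero-padded ID; known clusters keep theirs.
        ids.setdefault(label, f"{len(ids) + 1:04d}")
    return [ids[label] for label in cluster_labels]

# Hypothetical cluster labels for five consecutive audio segments.
segment_ids = assign_speaker_ids([2, 2, 0, 1, 0])
estimated_speaker_count = len(set(segment_ids))
```

Here the number of distinct IDs is the estimated number of speakers, and every segment carries the ID of the provisional speaker it is attributed to.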

次に、制御部は、学習済モデルＤＢ１４０の学習済モデルにアクセスし、音声ファイルＤに学習済みの話者モデルが存在するか否かを判断する（ステップＳ３０４）。 Next, the control unit accesses the trained model in trained model DB140 and determines whether a trained speaker model exists for audio file D (step S304).

判断結果、音声ファイルＤに学習済みの話者モデルが存在する場合（ステップＳ３０４：Ｙｅｓ）、制御部は、音声ファイルＤに含まれる、推定した話者人数それぞれの話者を認識する処理を行い（ステップＳ３０５）、ステップＳ３０６の処理に移行する。一方、判断結果、音声ファイルＤに学習済みの話者モデルが存在しない場合（ステップＳ３０４：Ｎｏ）、制御部は、ステップＳ３０６の処理に移行する。 If the determination shows that a trained speaker model exists for audio file D (step S304: Yes), the control unit performs processing to recognize each of the estimated number of speakers contained in audio file D (step S305), and then proceeds to step S306. If, on the other hand, no trained speaker model exists for audio file D (step S304: No), the control unit skips recognition and proceeds directly to step S306.

ステップＳ３０６では、制御部は、ユーザインターフェース（ＵＩ）に話者認識の結果を反映した話者認識画面を生成する（ステップＳ３０６）。制御部は、この話者認識画面を端末１００に送信する。 In step S306, the control unit generates a speaker recognition screen that reflects the results of the speaker recognition on the user interface (UI) (step S306). The control unit transmits this speaker recognition screen to the terminal 100.

これにより、端末１００のディスプレイ１０７上には、話者認識画面が表示される。話者認識画面は、上記処理により音声ファイルＤに含まれる推定した話者人数と、認識した話者の情報（話者候補）と、を有する。端末１００の制御部は、話者認識画面を見たユーザ操作により、話者人数に対するフィードバック（話者レコメンド）をクラウド１１０（機械学習サーバー１３０）に送信する（ステップＳ３０７）。 As a result, a speaker recognition screen is displayed on the display 107 of the terminal 100. The speaker recognition screen has the number of speakers estimated by the above process and information on the recognized speakers (speaker candidates). In response to a user operation viewing the speaker recognition screen, the control unit of the terminal 100 sends feedback on the number of speakers (speaker recommendations) to the cloud 110 (machine learning server 130) (step S307).

このフィードバックにおいて、端末１００を操作するユーザは、話者認識画面上に表示されている推定した話者人数に対する修正および確認と、認識した話者（話者候補）に対する修正および確認の操作を行う。このように、実施の形態では、クラウド１１０側で推定した話者人数と話者候補について、端末１００のユーザによる修正および確認を行う。 In this feedback, the user operating the terminal 100 modifies and confirms the estimated number of speakers displayed on the speaker recognition screen, and modifies and confirms the recognized speakers (speaker candidates). Thus, in the embodiment, the number of speakers and speaker candidates estimated by the cloud 110 are modified and confirmed by the user of the terminal 100.

これにより、クラウド１１０（機械学習サーバー１３０の制御部）は、端末１００のユーザによる修正および確認の操作によって、音声ファイルＤに対する話者を特定し、特定した話者を識別するタグ付けを行う（ステップＳ３０８）。タグの情報は、クラウド１１０から端末１００に送信され、端末１００は、受信したタグの情報を音声ファイルＤに関連付けて保持する。以上の処理により、端末１００は、音声ファイルＤに含まれる話者人数と話者をディスプレイ１０７上に表示することができる。 As a result, the cloud 110 (control unit of the machine learning server 130) identifies the speaker for the audio file D through the correction and confirmation operations by the user of the terminal 100, and performs tagging to identify the identified speaker (step S308). The tag information is transmitted from the cloud 110 to the terminal 100, and the terminal 100 stores the received tag information in association with the audio file D. Through the above processing, the terminal 100 can display the number of speakers and the speakers included in the audio file D on the display 107.
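The flow of steps S304 to S308 can be summarised in a short sketch. It rests on assumptions that are not in the source: each provisional speaker is represented by a feature vector, each trained speaker model by a reference vector, and recognition is a nearest-model lookup. All function and variable names are hypothetical.

```python
from math import dist

def recognize_speakers(centroids, speaker_models):
    """S304/S305: propose a candidate name per provisional speaker ID, if models exist."""
    candidates = {}
    for speaker_id, vector in centroids.items():
        if not speaker_models:  # S304: No, so there is nothing to recognise
            candidates[speaker_id] = None
            continue
        # S305: the closest stored model becomes the speaker candidate.
        candidates[speaker_id] = min(
            speaker_models, key=lambda name: dist(vector, speaker_models[name])
        )
    return candidates

def tag_file(centroids, speaker_models, user_feedback):
    candidates = recognize_speakers(centroids, speaker_models)
    # S306: data behind the speaker recognition screen shown on the terminal.
    screen = {"speaker_count": len(centroids), "candidates": candidates}
    corrections = user_feedback(screen)  # S307: user corrects or confirms
    return {**candidates, **corrections}  # S308: final tags per speaker ID

centroids = {"0001": (0.1, 0.2), "0002": (5.1, 5.0)}
models = {"Alice": (0.0, 0.2)}
tags = tag_file(centroids, models, lambda screen: {"0002": "Bob"})
```

In this usage only a model for "Alice" exists, so both provisional speakers are first proposed as "Alice"; the user's feedback then corrects the second one, which mirrors the role of step S307.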

図４は、音声学習にかかる処理例を示すフローチャートである。クラウド１１０機械学習サーバー１３０の制御部は、図３に示した一つの音声ファイルＤに対する処理ごとに、図４に示す処理を実施し、学習済モデルＤＢ１４０を構築する。 Figure 4 is a flowchart showing an example of processing for voice learning. Each time the processing of Figure 3 is completed for one audio file D, the control unit of the machine learning server 130 in the cloud 110 performs the processing shown in Figure 4, and in this way builds up the trained model DB 140.

はじめに、機械学習サーバー１３０の制御部は、ステップＳ３０８（図３参照）の処理後、該当する音声ファイルＤから話者音源を抽出する（ステップＳ４０１）。図３の処理により、音声ファイルＤに含まれる話者別の音源を特定できる。 First, after processing step S308 (see FIG. 3), the control unit of the machine learning server 130 extracts speaker sound sources from the corresponding audio file D (step S401). The processing in FIG. 3 makes it possible to identify the sound sources for each speaker contained in the audio file D.

これにより、機械学習サーバー１３０は、特定した話者に対する学習を行い（ステップＳ４０２）、学習結果である話者モデルを学習済モデルＤＢ１４０に保存する（ステップＳ４０３）。これにより、音声ファイルＤに含まれる話者ごとの音声を学習でき、学習を繰り返すことで、話者認識の精度を向上できるようになる。 As a result, the machine learning server 130 performs learning on the identified speaker (step S402), and stores the speaker model, which is the learning result, in the trained model DB 140 (step S403). This allows learning of the voice of each speaker contained in the audio file D, and by repeating the learning, it becomes possible to improve the accuracy of speaker recognition.
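Steps S401 to S403 can be sketched as below. The source does not say what form a speaker model takes, so this sketch assumes the simplest possible one: a running mean of the feature vectors attributed to each tagged speaker, stored together with a sample count. The function names, the dictionary standing in for the trained model DB 140, and the example vectors are all hypothetical.

```python
def extract_speaker_sources(segments, labels, tags):
    """S401: group segment features by the tagged speaker name."""
    sources = {}
    for vector, label in zip(segments, labels):
        name = tags.get(label)
        if name is not None:  # untagged speakers contribute nothing to learning
            sources.setdefault(name, []).append(vector)
    return sources

def update_models(model_db, sources):
    """S402-S403: fold the new samples into each speaker's stored mean vector."""
    for name, vectors in sources.items():
        mean, count = model_db.get(name, ((0.0,) * len(vectors[0]), 0))
        total = count + len(vectors)
        new_mean = tuple(
            (m * count + sum(v[i] for v in vectors)) / total
            for i, m in enumerate(mean)
        )
        model_db[name] = (new_mean, total)
    return model_db

segments = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
labels = ["0001", "0001", "0002"]
sources = extract_speaker_sources(segments, labels, {"0001": "Alice", "0002": None})
model_db = update_models({}, sources)                      # first audio file
model_db = update_models(model_db, {"Alice": [(4.0, 4.0)]})  # a later audio file
```

Each further tagged file refines the stored model, which is the sense in which repeated learning can improve recognition accuracy.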

図５は、音声認識に用いるテーブル構造例を示す図表である。これらのテーブル５０１～５０４は、クラウド１１０の制御部が保持し、上記の話者人数の推定および話者認識の処理に用いる。 Figure 5 is a diagram showing an example of a table structure used for speech recognition. These tables 501 to 504 are held by the control unit of the cloud 110 and are used for the above-mentioned speaker number estimation and speaker recognition processing.

図５（ａ）は、端末１００からアップロードされた音声ファイルＤを識別するオーディオテーブル（ａｕｄｉｏｓ）５０１であり、クラウド１１０のストレージサーバー１２０の制御部が保持する。ストレージサーバー１２０の制御部は、オーディオテーブル（ａｕｄｉｏｓ）として、アップロードされる各音声ファイルＤ別の識別子（ｉｄ）を付与して保存する。例えば、ｉｄ「０００１」の音声ファイルＤは「ｆｉｌｅ０００１．ｍｐ３」である。 Figure 5 (a) shows an audio table (audios) 501 that identifies audio files D uploaded from the terminal 100, and is held by the control unit of the storage server 120 of the cloud 110. The control unit of the storage server 120 assigns an identifier (id) to each uploaded audio file D and stores it as an audio table (audios). For example, the audio file D with id "0001" is "file0001.mp3".

図５（ｂ）～（ｄ）は、クラウド１１０の機械学習サーバー１３０の制御部が保持するテーブル５０２～５０４である。機械学習サーバー１３０の制御部は、ストレージサーバー１２０の保存部１２１に保存された音声ファイルＤを読み出し、上述した話者人数の推定および話者認識の処理を行う際にこれらのテーブルを生成および参照する。 Figures 5(b) to (d) show tables 502 to 504 held by the control unit of the machine learning server 130 in the cloud 110. The control unit of the machine learning server 130 reads the audio file D stored in the storage unit 121 of the storage server 120, and generates and references these tables when performing the above-mentioned speaker number estimation and speaker recognition processing.

図５（ｂ）は、話者認識用のテーブル（ａｕｄｉｏ＿ｐｒｅｄｉｃｔｉｏｎｓ）５０２である。このテーブル５０２は、音声ファイルＤ（ａｕｄｉｏ＿ｉｄ）別のｉｄと、ＩＤ別に推定した話者のｉｄ（ｓｐｅａｋｅｒ＿ｉｄ）と、認識精度（ｃｏｎｆｉｄｅｎｃｅ）、の情報を含む。例えば、ｉｄ「０００１」では、音声ファイルＤ「ｆｉｌｅ０００１．ｍｐ３」に含まれる認識した話者（ｓｐｅａｋｅｒ＿ｉｄ）が「０００１」、この話者「０００１」の認識精度（ｃｏｎｆｉｄｅｎｃｅ）が「０．８（８０％の信頼度）」であることを示す。 Figure 5 (b) is a table (audio_predictions) 502 for speaker recognition. This table 502 includes information on the IDs for each audio file D (audio_id), the speaker IDs (speaker_id) estimated for each ID, and the recognition accuracy (confidence). For example, the ID "0001" indicates that the recognized speaker (speaker_id) included in the audio file D "file0001.mp3" is "0001", and the recognition accuracy (confidence) of this speaker "0001" is "0.8 (80% confidence)."

図５（ｃ）は、話者認識用のテーブル（ｓｐｅａｋｅｒｓ）５０３である。このテーブル５０３は、ｉｄ別に認識した話者の名前（ｎａｍｅ）の情報を含む。例えば、ｉｄ「０００１」の話者（名前）は「Ａｌｉｃｅ」である。 Figure 5 (c) is a table (speakers) 503 for speaker recognition. This table 503 contains information on the names of speakers recognized by ID. For example, the speaker (name) of ID "0001" is "Alice."

図５（ｄ）は、音声ファイルＤの推定した話者人数／認識後の話者用のテーブル（ａｕｄｉｏ＿ｓｐｅａｋｅｒｓ）５０４である。このテーブル５０４は、ｉｄ別の音声ファイルＤ（ａｕｄｉｏ＿ｉｄ）と、推定した話者（ｓｐｅａｋｅｒ＿ｉｄ）の情報を含む。例えば、ある一つの音声ファイルＤ（ａｕｄｉｏ＿ｉｄ）「０００１」については、推定した話者（ｓｐｅａｋｅｒ＿ｉｄ）として「０００１」と「０００２」の２名が「ＮＵＬＬ」となっている。この場合、この２名はいずれも「ＮＵＬＬ」であるため話者が具体的に認識されておらず、話者人数が２名として推定のみされた状態が示されている。 Figure 5 (d) is a table (audio_speakers) 504 for the estimated number of speakers/speakers after recognition for audio file D. This table 504 includes information on audio files D (audio_id) by ID and the estimated speakers (speaker_id). For example, for one audio file D (audio_id) "0001", the estimated speakers (speaker_id) for two people, "0001" and "0002", are "NULL". In this case, since both of these people are "NULL", the speakers have not been specifically recognized, and the number of speakers is only estimated to be two.

また、ｉｄ「０００３」には、話者が認識された状態が示され、この場合、ある一つの音声ファイルＤ（ａｕｄｉｏ＿ｉｄ）「０００２」について、１名の話者（ｓｐｅａｋｅｒ＿ｉｄ）「０００１」、すなわち図５（ｃ）の「Ａｌｉｃｅ」が、８０％の信頼度（図５（ｂ）参照）で認識された状態が示されている。また、ａｕｄｉｏ＿ｉｄ「０００３」には話者「Ｃｈａｒｌｉｅ」が９０％の信頼度で存在しているとされ、実際に「Ｃｈａｒｌｉｅ」が認識された状態が示されている。 In addition, id "0003" indicates the state in which a speaker has been recognized. In this case, for one audio file D (audio_id) "0002", one speaker (speaker_id) "0001", i.e. "Alice" in Figure 5(c) is recognized with 80% confidence (see Figure 5(b)). In addition, in audio_id "0003", speaker "Charlie" is considered to exist with 90% confidence, and the state in which "Charlie" has actually been recognized is indicated.

機械学習サーバー１３０の制御部は、話者人数の推定および話者の認識の処理時にこれら図５（ａ）～（ｄ）のテーブルを更新処理する。 The control unit of the machine learning server 130 updates the tables in Figures 5(a) to (d) when estimating the number of speakers and recognizing the speakers.
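
The four tables of FIG. 5 can be pictured as a small relational schema. The following is only a minimal sketch using Python's built-in sqlite3 module. The column types, the `filename` column name, the in-memory database, the second file name "file0002.mp3", and the helper `speaker_summary` are assumptions made for illustration; the source only names the tables and the fields id, audio_id, speaker_id, confidence, and name.

```python
import sqlite3

# In-memory database standing in for the tables held on the cloud side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audios (id TEXT PRIMARY KEY, filename TEXT NOT NULL);
CREATE TABLE speakers (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE audio_predictions (
    id TEXT PRIMARY KEY,
    audio_id TEXT NOT NULL REFERENCES audios(id),
    speaker_id TEXT NOT NULL REFERENCES speakers(id),
    confidence REAL NOT NULL
);
CREATE TABLE audio_speakers (
    id TEXT PRIMARY KEY,
    audio_id TEXT NOT NULL REFERENCES audios(id),
    speaker_id TEXT REFERENCES speakers(id)  -- NULL: estimated only, not yet recognized
);
""")

# Example rows following the description of FIG. 5 (file0002.mp3 is hypothetical).
conn.executemany("INSERT INTO audios VALUES (?, ?)",
                 [("0001", "file0001.mp3"), ("0002", "file0002.mp3")])
conn.execute("INSERT INTO speakers VALUES (?, ?)", ("0001", "Alice"))
conn.execute("INSERT INTO audio_predictions VALUES (?, ?, ?, ?)",
             ("0001", "0001", "0001", 0.8))
conn.executemany("INSERT INTO audio_speakers VALUES (?, ?, ?)",
                 [("0001", "0001", None),     # first estimated speaker, unrecognized
                  ("0002", "0001", None),     # second estimated speaker, unrecognized
                  ("0003", "0002", "0001")])  # recognized as Alice


def speaker_summary(conn, audio_id):
    """Return (estimated speaker count, names of speakers recognized so far)."""
    rows = conn.execute(
        "SELECT s.name FROM audio_speakers a "
        "LEFT JOIN speakers s ON s.id = a.speaker_id "
        "WHERE a.audio_id = ? ORDER BY a.id",
        (audio_id,),
    ).fetchall()
    names = [name for (name,) in rows if name is not None]
    return len(rows), names
```

In this sketch, the number of rows in audio_speakers for a file gives the estimated number of speakers, and rows whose speaker_id is NULL are speakers that have been counted but not yet identified.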

（話者人数推定と話者認識の処理） (Speaker number estimation and speaker recognition processing)
次に、図６～図１１を用いて、実施の形態にかかる音声認識処理を順に説明する。図６は、話者人数推定と話者認識の処理の遷移図である。図６に示す例では、端末１００がスマートフォン等のモバイル機器であり、録音および音声認識機能を有するモバイルアプリ６０１を搭載している。モバイルアプリ６０１は、端末１００の制御部１０５に相当する。クラウド１１０側の機械学習サーバー１３０は、端末１００での音声の初回録音時の処理（ステップＳ６００）と、初回録音後、一人でも録音タグ付けしている場合の処理（ステップＳ６１０）とで異なる処理を行う。 Next, the speech recognition process according to the embodiment will be described in order with reference to FIGS. 6 to 11. FIG. 6 is a transition diagram of the processing of estimating the number of speakers and speaker recognition. In the example shown in FIG. 6, the terminal 100 is a mobile device such as a smartphone, and is equipped with a mobile app 601 having recording and voice recognition functions. The mobile app 601 corresponds to the control unit 105 of the terminal 100. The machine learning server 130 on the cloud 110 side performs different processing for the processing at the time of the first recording of voice on the terminal 100 (step S600) and the processing when at least one person has tagged the recording after the first recording (step S610).

初回録音時の処理（ステップＳ６００）では、クラウド１１０（機械学習サーバー１３０）は、端末１００（モバイルアプリ６０１）から送信された音声ファイルＤに対し、教師なし学習アルゴリズムによる学習を行った後（ステップＳ６０１）、音声ファイルＤに含まれる合計話者数を推定する（ステップＳ６０２）。このステップＳ６０１での話者人数の推定にあたり、教師あり学習アルゴリズムによる学習をおこなうことで、話者人数推定の精度を向上することができる。 In the process for the initial recording (step S600), the cloud 110 (machine learning server 130) performs learning using an unsupervised learning algorithm on the audio file D sent from the terminal 100 (mobile app 601) (step S601), and then estimates the total number of speakers contained in the audio file D (step S602). When estimating the number of speakers in this step S601, learning using a supervised learning algorithm can improve the accuracy of the speaker number estimation.

図７は、初回録音時の端末上の表示画面を示す図である。初回録音時、モバイルアプリ６０１は、端末１００のディスプレイ１０７上に表示する表示画面７００を示す。この初回録音時、モバイルアプリ６０１は、ディスプレイ１０７上に録音開始日時７０１、録音時間７０２、録音の場所７０３、タイトル７０４を表示する。 Figure 7 shows the display screen on the terminal when recording for the first time. When recording for the first time, the mobile app 601 shows a display screen 700 that is displayed on the display 107 of the terminal 100. When recording for the first time, the mobile app 601 displays the recording start date and time 701, recording duration 702, recording location 703, and title 704 on the display 107.

モバイルアプリ６０１は、例えば、録音開始日時７０１は、端末１００が有するタイマから取得し、録音時間７０２は録音開始～録音終了までの時間をタイマ計測により取得し、録音の場所７０３は、端末１００が有するＧＰＳ等の測位部から取得し、タイトル７０４は、端末１００のユーザ操作等により設定する。この初回録音時、音声ファイルＤの話者は認識されておらず、話者７０５の部分は未表示である。 For example, the mobile app 601 obtains the recording start date and time 701 from a timer in the terminal 100, obtains the recording time 702 from timer measurement, which is the time from the start to the end of recording, obtains the recording location 703 from a positioning unit such as a GPS in the terminal 100, and sets the title 704 by user operation of the terminal 100, etc. During this initial recording, the speaker of the audio file D is not recognized, and the speaker 705 portion is not displayed.

図８は、録音時の端末上の録音画面を示す図である。モバイルアプリ６０１は、音声の録音時、図８に示す録音画面８００を端末１００のディスプレイ１０７に表示する。録音画面８００は、録音／停止ボタン８０１、録音時間８０２、録音音声（波形）８０３、録音文字起こし表示部８０４、をそれぞれ表示する。 Figure 8 is a diagram showing the recording screen on the terminal when recording. When recording audio, the mobile app 601 displays the recording screen 800 shown in Figure 8 on the display 107 of the terminal 100. The recording screen 800 displays a record/stop button 801, recording time 802, recorded audio (waveform) 803, and a recording transcription display section 804.

モバイルアプリ６０１は、ユーザによる録音／停止ボタン８０１の操作ごとに録音開始あるいは停止を行う。また、録音開始後の時間を録音時間８０２として表示し、録音時の音声に対応した波形８０３を表示する。録音文字起こし表示部８０４には、上述した文字起こし部１０３が録音した音声からリアルタイムにテキスト文字を生成したものが表示される。 The mobile app 601 starts or stops recording each time the user operates the record/stop button 801. It also displays the time since recording started as recording time 802, and displays a waveform 803 corresponding to the audio being recorded. The recording transcription display section 804 displays text characters that have been generated in real time from the audio recorded by the transcription section 103 described above.

図９は、話者人数の推定後の端末上の表示画面を示す図である。クラウド１１０（機械学習サーバー１３０の話者人数推定部１３１）により話者人数を推定した情報を端末１００のモバイルアプリ６０１が受信した状態での表示画面９００を示す。 Figure 9 shows the display screen on the terminal after the number of speakers has been estimated. It shows the display screen 900 in a state where the mobile app 601 of the terminal 100 has received information on the number of speakers estimated by the cloud 110 (the number-of-speakers estimation unit 131 of the machine learning server 130).

ここで、クラウド１１０側での話者推定により音声ファイルＤに含まれる話者人数が２名であるとする。この場合、モバイルアプリ６０１は、図９に示すように、話者７０５の部分に、推定した話者人数を示す話者数推定表示領域７１１に「二人の話者を推定しました。」と表示する。また、推定した２名分に対応して２つの話者表示領域７１２を表示する。 Here, let us assume that the number of speakers included in the audio file D is two, based on speaker estimation on the cloud 110 side. In this case, the mobile app 601 displays "Two speakers estimated" in the speaker number estimation display area 711 showing the estimated number of speakers in the speaker 705 portion, as shown in FIG. 9. In addition, two speaker display areas 712 are displayed corresponding to the two estimated speakers.

この後、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、学習済みモデルを参照して推定した２名分の話者のそれぞれの話者が誰であるか具体的な話者（名前）を関連付ける。話者認識部１３２は、推定した２名の話者のうち１名について話者の関連付けを行った後、残りの１名についても同様に関連付けを行う。 The cloud 110 (speaker recognition unit 132 of machine learning server 130) then refers to the trained model and associates specific speakers (names) to identify who each of the two estimated speakers is. After associating the speakers for one of the two estimated speakers, the speaker recognition unit 132 similarly associates the remaining speaker.

ここで、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、この話者認識の際、音声ファイルＤに特定の話者が存在している可能性が高い（例えば、信頼度７０％以上）と判定した場合、判定した話者候補を端末１００の話者表示領域７１２に表示させる。 Here, if the cloud 110 (speaker recognition unit 132 of the machine learning server 130) determines during speaker recognition that there is a high probability that a specific speaker is present in the audio file D (e.g., a confidence level of 70% or more), it displays the determined speaker candidate in the speaker display area 712 of the terminal 100.
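
The threshold rule above can be sketched as a simple filter. This is only an illustration: the 70% value is the example given in the text, while the `Prediction` type, the function name, and the ordering by confidence are assumptions.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # "70% or more", the example value given in the text


@dataclass(frozen=True)
class Prediction:
    speaker_name: str
    confidence: float  # 0.0 to 1.0


def candidates_to_show(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Keep predictions at or above the threshold, most confident first."""
    kept = [p for p in predictions if p.confidence >= threshold]
    return sorted(kept, key=lambda p: p.confidence, reverse=True)


# Alice (0.8) and Charlie (0.9) follow the figures in the text; Bob's 0.4 is hypothetical.
example = [Prediction("Alice", 0.8), Prediction("Bob", 0.4), Prediction("Charlie", 0.9)]
shown = candidates_to_show(example)
```

Only the candidates that pass the threshold would be sent to the terminal for display in the speaker display area 712.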

図１０は、話者候補の端末上の表示画面を示す図である。クラウド１１０（機械学習サーバー１３０の話者認識部１３２）からの一人の話者候補（２名の話者人数推定）の通知があった場合の状態を示す。モバイルアプリ６０１は、話者数推定表示領域７１１に「二人の話者を推定しました。名前を設定してください。」と表示する。また、表示画面１０００の話者表示領域７１２のうち、一人目の話者表示領域７１２ａに一人目のＩＤ「０００１」との表示を、具体的な話者候補の内容「もしかしてＡｌｉｃｅさんですか？」に切り替えて表示する。また、確認表示「はい／いいえ」を表示する。また、符号７２０は、ユーザ操作により話者人数を追加するための話者人数追加ボタンである。 Figure 10 is a diagram showing a display screen on the terminal of a speaker candidate. It shows a state when a notification of one speaker candidate (estimated number of speakers: two) is received from the cloud 110 (speaker recognition unit 132 of machine learning server 130). The mobile app 601 displays "Two speakers have been estimated. Please set a name" in the speaker number estimation display area 711. In addition, in the speaker display area 712 of the display screen 1000, the display of the first speaker's ID "0001" in the first speaker display area 712a is switched to the specific content of the speaker candidate "Are you Alice?". In addition, a confirmation display "Yes/No" is displayed. Also, the symbol 720 is a speaker number addition button for adding the number of speakers by a user operation.

このように、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）が話者候補を具体的にタグ付けし、話者表示領域７１２ａに具体的に話者候補（名前）「Ａｌｉｃｅ」を表示する。これにより、端末１００を操作するユーザは、話者候補が正しいか否かを確認することができる。そして、ユーザが表示されている話者候補が正しいと判断し、確認表示「はい」を操作すると、モバイルアプリ６０１は、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）に確認操作の情報を送信し、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、ＩＤ「０００１」の話者が「Ａｌｉｃｅ」であると認識し、音声ファイルＤに話者「Ａｌｉｃｅ」が存在することを認識する。 In this way, the cloud 110 (the speaker recognition unit 132 of the machine learning server 130) specifically tags the speaker candidate and specifically displays the speaker candidate (name) "Alice" in the speaker display area 712a. This allows the user operating the terminal 100 to confirm whether the speaker candidate is correct or not. Then, when the user determines that the displayed speaker candidate is correct and operates the confirmation display "Yes," the mobile app 601 transmits information of the confirmation operation to the cloud 110 (the speaker recognition unit 132 of the machine learning server 130), and the cloud 110 (the speaker recognition unit 132 of the machine learning server 130) recognizes that the speaker of ID "0001" is "Alice" and that the speaker "Alice" exists in the audio file D.

このタグ付けの処理は、図６のステップＳ６１０に相当する。ステップＳ６１０では、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、初回録音時の処理（ステップＳ６００）で話者推定した後、音声ファイルＤに話者が所定の信頼度以上で存在する場合、この話者に対して教師あり学習アルゴリズムによる学習を行った後（ステップＳ６１１）、話者候補を端末１００のユーザにレコメンド（確認操作）する（ステップＳ６１２）処理を行う。 This tagging process corresponds to step S610 in FIG. 6. In step S610, after estimating the speaker in the process at the time of initial recording (step S600), if a speaker is present in the audio file D with a certain degree of reliability or higher, the cloud 110 (speaker recognition unit 132 of the machine learning server 130) performs learning on this speaker using a supervised learning algorithm (step S611), and then performs processing to recommend (confirm) the speaker candidates to the user of the terminal 100 (step S612).

なお、図１０において、表示画面１０００の話者表示領域７１２のうち、二人目の話者表示領域７１２ｂは、話者がＩＤ「０００２」と表示され、二人目がいることを推定したのみの状態が示されている。この二人目の話者表示領域７１２ｂについても、所定以上の信頼度で異なる話者が存在する場合、上記一人目と同様に、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、ユーザに対する話者候補をレコメンドする。 Note that in FIG. 10, the second speaker display area 712b of the speaker display areas 712 on the display screen 1000 shows the speaker as ID "0002", indicating that the presence of a second speaker has only been estimated. For this second speaker display area 712b as well, if a different speaker is present with at least the predetermined confidence, the cloud 110 (the speaker recognition unit 132 of the machine learning server 130) recommends a speaker candidate to the user in the same way as for the first speaker.

図１１は、端末上の話者選択の一覧を示す表示画面を示す図である。図１０の説明において、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）は、話者候補を端末１００に通知するが、音声ファイルＤに含まれ所定の信頼度を有する話者候補の一覧の情報を端末１００に送信してもよい。この場合、モバイルアプリ６０１は、話者選択の一覧の表示画面１１００を表示する。この一覧の表示画面１１００をユーザが確認して複数の話者候補のなかから話者を選択することができる。このほか、図１０の話者表示領域７１２ａに表示された話者候補が異なる場合、ユーザが「いいえ」を選択することで、モバイルアプリ６０１が表示画面１１００を表示し、ユーザが他の話者候補「Ｂｏｂ」、「Ｃｈａｒｌｉｅ」を選択することができる。 FIG. 11 is a diagram showing a display screen showing a list of speaker selections on a terminal. In the description of FIG. 10, the cloud 110 (the speaker recognition unit 132 of the machine learning server 130) notifies the terminal 100 of the speaker candidates, but may also transmit information of a list of speaker candidates included in the audio file D and having a predetermined reliability to the terminal 100. In this case, the mobile app 601 displays a display screen 1100 of the speaker selection list. The user can check this list display screen 1100 and select a speaker from among multiple speaker candidates. In addition, if the speaker candidates displayed in the speaker display area 712a of FIG. 10 are different, the user can select "No" to cause the mobile app 601 to display the display screen 1100, and the user can select other speaker candidates "Bob" and "Charlie".
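
The confirmation flow of FIGS. 10 and 11 can be summarized as a small decision function. This is a hypothetical sketch: the function name, the "yes"/"no" strings, and the returned action labels are assumptions, not part of the described app.

```python
def handle_answer(candidate, answer, all_candidates):
    """Decide the next step after the user answers the candidate prompt.

    "yes" confirms the proposed candidate so it can be tagged; any other
    answer leads to the selection list with the remaining candidates.
    """
    if answer == "yes":
        return ("tag", candidate)
    remaining = [name for name in all_candidates if name != candidate]
    return ("show_list", remaining)


confirmed = handle_answer("Alice", "yes", ["Alice", "Bob", "Charlie"])
rejected = handle_answer("Alice", "no", ["Alice", "Bob", "Charlie"])
```

The second case corresponds to the user selecting "No" in FIG. 10 and then choosing from "Bob" and "Charlie" on the list screen 1100.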

上述したように、実施の形態では、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）側のみの判断で話者認識することなく、話者候補を一旦端末１００に送信し、ユーザにより確認操作する処理を行うことで、話者認識を簡単な操作で精度向上できるようになる。 As described above, in the embodiment, speaker recognition is not performed solely based on the judgment of the cloud 110 (the speaker recognition unit 132 of the machine learning server 130), but speaker candidates are first sent to the terminal 100 and then a process is performed in which the user confirms them, thereby making it possible to improve the accuracy of speaker recognition with a simple operation.

（文字起こし機能） (Transcription function)
ここで、端末１００の文字起こし部１０３の機能について説明する。文字起こし部１０３は、音声を録音あるいは再生しながら音声に対応するテキスト文字を生成する。 Here, a description will be given of the function of the transcription unit 103 of the terminal 100. The transcription unit 103 generates text characters corresponding to the speech while recording or playing back the speech.

図１２は、端末上の文字起こしの表示画面を示す図である。この図には、音声ファイルＤの再生時の状態を示す。端末１００の制御部（モバイルアプリ６０１）は、文字起こし部１０３の機能時、ディスプレイ１０７上に文字起こしの表示画面１２００を表示する。モバイルアプリ６０１は、表示画面１２００上に、録音情報（タイトル、録音日時、録音時間、録音場所）１２０１と、文字起こし内容１２０２、再生位置表示バー１２０３、再生操作（再生／停止、戻し、メモ操作）ボタン１２０４を表示する。 Figure 12 is a diagram showing the display screen of the transcription on the terminal. This figure shows the state when audio file D is being played back. The control unit (mobile app 601) of the terminal 100 displays a transcription display screen 1200 on the display 107 when the transcription unit 103 is functioning. On the display screen 1200, the mobile app 601 displays recording information (title, recording date and time, recording duration, recording location) 1201, transcription content 1202, a playback position display bar 1203, and playback operation (play/stop, rewind, memo operation) buttons 1204.

文字起こし部１０３は、音声の発話のタイミングに連動する形で生成したテキスト文字を記録していき、その文字起こし精度の「自信」を識別可能に表示する。例えば、文字起こし内容１２０２に表示するテキスト文字には、標準の濃度に対し、自信が低い文字を薄く表示し（領域１２１０）、標準よりも自信が高い文字を濃く（領域１２１１）表示する。これにより、ユーザが文字起こし内容１２０２に表示されるテキストの濃度により文字ごとの変換精度を容易に把握できるようになる。また、文字上の再生位置をハイライト（領域１２２１）で表示し、再生位置がわかるよう表示する。 The transcription unit 103 records the generated text characters in conjunction with the timing of speech, and displays the "confidence" of the transcription accuracy in an identifiable manner. For example, in the text characters displayed in the transcription content 1202, characters with less confidence are displayed lighter than the standard darkness (area 1210), and characters with more confidence than the standard are displayed darker (area 1211). This allows the user to easily grasp the conversion accuracy of each character from the darkness of the text displayed in the transcription content 1202. In addition, the playback position on the characters is highlighted (area 1221) so that the playback position can be identified.
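
The confidence-dependent display described above can be sketched as a mapping from a per-word confidence score to a display shade. The two cut-off values (0.5 and 0.9), the shade labels, and the function names are assumptions chosen for illustration; the text only states that low-confidence characters are shown lighter and high-confidence characters darker than the standard.

```python
def shade_for(confidence, low=0.5, high=0.9):
    """Map a transcription confidence (0.0 to 1.0) to a display shade."""
    if confidence < low:
        return "light"     # like area 1210: less confident than standard
    if confidence > high:
        return "dark"      # like area 1211: more confident than standard
    return "standard"


def render(words):
    """Pair each transcribed word with the shade it should be drawn in."""
    return [(text, shade_for(conf)) for text, conf in words]


styled = render([("hello", 0.95), ("wor1d", 0.3), ("today", 0.7)])
```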

このように、文字起こしされた文字は、文字起こしの推定の自信（文字起こしの精度に相当）に合わせて異なる表示形態とすることで、録音後における再生やテキスト検索時の利便性を高めることができる。 In this way, the transcribed text can be displayed in different formats according to the estimated confidence of the transcription (corresponding to the accuracy of the transcription), which can improve the convenience of playing back the recording or searching the text after it has been recorded.

また、上述した話者認識の情報は、この文字起こし機能にも有効に利用することができる。例えば、文字起こしした元の音声ファイルＤの話者の認識結果を表示画面１２００に表示することができる。図示の例では、録音情報１２０１の一部に、音声ファイルＤに対し話者認識後の話者１２３０の情報「Ａｌｉｃｅ」が表示されている。これにより、音声ファイルＤそのものの話者をユーザに通知できるようになる。 The speaker recognition information described above can also be effectively used for this transcription function. For example, the speaker recognition results for the original transcribed audio file D can be displayed on the display screen 1200. In the illustrated example, the speaker 1230 information "Alice" after speaker recognition for audio file D is displayed as part of the recording information 1201. This makes it possible to notify the user of the speaker of audio file D itself.

（音声ファイル再生時の表示） (Display when playing an audio file)
図１３は、音声ファイル再生時の端末上の表示画面を示す図である。端末１００には多数の音声ファイルＤが記憶保持されており、制御部（モバイルアプリ６０１）は、音声ファイルの再生等の際、表示画面１３００上に所望する音声ファイルＤを見つけやすくするための画面表示を行う。例えば、図１３に示すように、端末１００の再生時には、カレンダー１３０１を表示する。カレンダー１３０１上には、録音済みの音声ファイルＤに付与された録音日の部分が識別可能（図示の例では録音日が〇）に表示される。これにより、ユーザは、カレンダー１３０１上から録音日に基づき所望する音声ファイルＤを容易に再生できるようになる。 FIG. 13 is a diagram showing a display screen on the terminal when playing an audio file. A large number of audio files D are stored in the terminal 100, and the control unit (mobile app 601) displays a screen on the display screen 1300 to make it easier to find a desired audio file D when playing an audio file. For example, as shown in FIG. 13, a calendar 1301 is displayed during playback on the terminal 100. On the calendar 1301, the recording date portion assigned to the recorded audio file D is displayed in an identifiable manner (in the illustrated example, the recording date is displayed as a circle). This allows the user to easily play a desired audio file D on the calendar 1301 based on the recording date.

また、不図示であるが、カレンダー１３０１上の録音日の選択により、録音された音声ファイルＤの情報として、上述した話者認識の情報、すなわち、録音された話者「Ａｌｉｃｅ」等をポップアップ等で表示させてもよい。これにより、必要な音声ファイルＤをより簡単に検索できるようになる。 Although not shown, by selecting a recording date on the calendar 1301, the above-mentioned speaker recognition information, i.e., the recorded speaker "Alice" etc., may be displayed as information on the recorded audio file D in a pop-up or the like. This makes it easier to search for the required audio file D.
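
The calendar lookup can be sketched as an index from recording date to the files recorded on that day, together with their recognized speakers for the pop-up. The record layout (dictionary keys `title`, `recorded_on`, `speakers`), the sample titles and dates, and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_calendar_index(recordings):
    """Group recordings by recording date for marking days on a calendar."""
    index = defaultdict(list)
    for rec in recordings:
        index[rec["recorded_on"]].append((rec["title"], rec["speakers"]))
    return dict(index)


recordings = [
    {"title": "Kickoff", "recorded_on": date(2020, 12, 1), "speakers": ["Alice"]},
    {"title": "Review", "recorded_on": date(2020, 12, 1), "speakers": ["Alice", "Bob"]},
    {"title": "Interview", "recorded_on": date(2020, 12, 3), "speakers": []},
]
calendar_index = build_calendar_index(recordings)
```

The keys of the index are the days to mark on the calendar 1301, and the values hold what a pop-up for the selected day could list.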

（話者人数の推定と話者認識の修正例） (Example of speaker count estimation and speaker recognition correction)
次に、図１４～図１８を用いて話者人数の推定と話者認識の修正例について説明する。実施の形態では、音声ファイルＤに含まれる認識したい人の声となる音声区間と、雑音である非音声区間と、を識別する仕組みとして音声区間検出（ＶＡＤ：Ｖｏｉｃｅ Ａｃｔｉｖｉｔｙ Ｄｅｔｅｃｔｉｏｎ）技術を用いる。 Next, examples of estimating the number of speakers and correcting speaker recognition will be described with reference to FIGS. 14 to 18. In the embodiment, voice activity detection (VAD) technology is used as the mechanism for distinguishing, within the audio file D, between speech segments that contain the voice of a person to be recognized and non-speech segments that are noise.

図１４は、音声ファイルに含まれる音声の波形例を示す図である。図１４に示す音声ファイルＤについて、クラウド１１０（機械学習サーバー１３０）が一人の話者として登録された音声と推定すれば、ＶＡＤ抽出した４つの音源Ｓ１～Ｓ４がいずれも同一の人の音声と判断し、次回以降の学習に利用する。 Figure 14 is a diagram showing an example of the waveform of the sound contained in the sound file. If the cloud 110 (machine learning server 130) estimates that the sound file D shown in Figure 14 is the sound registered as a single speaker, it will determine that the four VAD-extracted sound sources S1 to S4 are all the voices of the same person, and will use them for future learning.
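
The source does not specify how VAD is implemented. As one common and very simple approach, the sketch below marks a frame as speech when its mean absolute amplitude reaches a threshold and merges consecutive speech frames into segments, which play the role of the sound sources S1 to S4. The frame length, the threshold, and the function name are assumptions, and a practical VAD would be considerably more robust to noise.

```python
def detect_voice_segments(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of frames judged to be speech.

    A frame counts as speech when its mean absolute amplitude is at least
    `threshold`. Trailing samples that do not fill a whole frame are ignored.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(abs(x) for x in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i * frame_len          # a speech segment begins
        elif start is not None:
            segments.append((start, i * frame_len))  # the segment ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


# Synthetic signal: silence, speech, silence, speech.
signal = [0.0] * 4 + [0.5] * 8 + [0.0] * 4 + [0.6] * 4
voiced = detect_voice_segments(signal, frame_len=4, threshold=0.1)
```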

図１５は、音声ファイルに含まれる話者のグループ分けを示す図である。便宜上、図１５に示す音声の波形Ｓ１～Ｓ４は、図１４と同様としている。ここで、音声ファイルＤについて、クラウド１１０（機械学習サーバー１３０）が話者が二人と推定し、一人に名前「Ａｌｉｃｅ」と話者候補のラベル付けをしたとする。この場合、クラウド１１０（機械学習サーバー１３０）は、ＶＡＤ音源抽出で得られた４つの音源Ｓ１～Ｓ４に対して、クラスタリング処理をすることで音源Ｓ１～Ｓ４を２グループに分ける。 Figure 15 is a diagram showing grouping of speakers included in an audio file. For convenience, the audio waveforms S1 to S4 shown in Figure 15 are the same as those in Figure 14. Here, assume that the cloud 110 (machine learning server 130) estimates that there are two speakers for audio file D and labels one of them as a speaker candidate with the name "Alice". In this case, the cloud 110 (machine learning server 130) performs clustering processing on the four audio sources S1 to S4 obtained by VAD audio source extraction, thereby dividing the audio sources S1 to S4 into two groups.

図１５の例では、グループ１（Ｇ１）が音源Ｓ１，Ｓ４であり、グループ２（Ｇ２）が音源Ｓ２，Ｓ３であったとする。これにより、一方のグループＧ１の話者が「Ａｌｉｃｅ」の音声である確率は５０％となる。なお、４人いると認識すれば、４つのグループにおける「Ａｌｉｃｅ」の音声である確率は２５％となる。 In the example of FIG. 15, assume that group 1 (G1) consists of sound sources S1 and S4, and group 2 (G2) consists of sound sources S2 and S3. Because only one of the two groups can be "Alice", the probability that the speaker of group G1 is "Alice" is 50%. If four speakers had been estimated instead, the probability that any given one of the four groups is the voice of "Alice" would be 25%.

クラウド１１０（機械学習サーバー１３０）は、１００％「Ａｌｉｃｅ」である音声、５０％「Ｂｏｂ」である音声など、信頼度によって学習時に重みづけをした学習データの音源を学習し、この学習済モデルを用いて新たな音声ファイルＤの音源について、近似する音声があるかを判定する。クラウド１１０（機械学習サーバー１３０）は、学習時に、例えば、５０％の信頼度を持つ音源に対しては、その他の学習済モデルが存在する場合は、その学習データの選定の段階から近似判定を行って取得することで、５０％以上の信頼度を得ることができる。 Cloud 110 (machine learning server 130) learns the sound source of the training data that is weighted by reliability during training, such as a voice that is 100% "Alice" and a voice that is 50% "Bob", and uses this trained model to determine whether there is an approximate voice for the sound source of a new audio file D. During training, for example, for a sound source with a reliability of 50%, if there are other trained models, cloud 110 (machine learning server 130) can obtain a reliability of 50% or more by performing an approximation determination from the stage of selecting the training data and acquiring it.
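
One way to picture confidence-weighted learning is a weighted average of feature vectors per speaker, followed by a nearest-model lookup for a new sound source. This is only a sketch under strong assumptions: the source does not describe the model, so the two-dimensional feature vectors, the weighted-centroid "model", the Euclidean distance, and all function names are hypothetical.

```python
import math


def weighted_centroid(examples):
    """Average feature vectors, weighting each by its label confidence.

    `examples` is a list of (feature_vector, weight) pairs, for example
    weight 1.0 for a source that is certainly the speaker and 0.5 for one
    that is the speaker with 50% confidence.
    """
    total = sum(weight for _, weight in examples)
    dim = len(examples[0][0])
    return [sum(vec[d] * weight for vec, weight in examples) / total
            for d in range(dim)]


def closest_speaker(models, features):
    """Return the speaker whose centroid is nearest to `features`."""
    return min(models, key=lambda name: math.dist(models[name], features))


models = {
    "Alice": weighted_centroid([([1.0, 0.0], 1.0), ([3.0, 0.0], 0.5)]),
    "Bob": weighted_centroid([([0.0, 4.0], 0.5)]),
}
guess = closest_speaker(models, [1.5, 0.5])
```

In this sketch a 50%-confidence source pulls the speaker's centroid only half as strongly as a fully confirmed one.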

図１６は、推定した話者人数の変更を示す図である。便宜上、図１６（ａ）に示す音声の波形Ｓ１～Ｓ４は、図１５と同様としている。図１５に示す処理により、クラウド１１０（機械学習サーバー１３０）が一人の話者候補「Ａｌｉｃｅ」のラベル付けを行い、グループ１（Ｇ１）が話者候補「Ａｌｉｃｅ」の音源Ｓ１，Ｓ４であると認識したとする。 Figure 16 is a diagram showing changes in the estimated number of speakers. For convenience, the audio waveforms S1 to S4 shown in Figure 16(a) are the same as those in Figure 15. Assume that the cloud 110 (machine learning server 130) labels one speaker candidate "Alice" through the process shown in Figure 15, and recognizes group 1 (G1) as the sound sources S1 and S4 of speaker candidate "Alice."

ここで、端末１００のユーザに対し、音声ファイルＤに含まれる話者が二人と提示した後、ユーザ操作により３人であると修正された場合、クラウド１１０（機械学習サーバー１３０）は、話者人数の推定について、ＶＡＤ抽出音源のクラスタリングを改めて３人に適応して行う。図示の例では、クラウド１１０（機械学習サーバー１３０）は、音声ファイルＤに対し、ＶＡＤ抽出音源のクラスタリングを３人に適応して行うことで、図１６（ａ）に示すグループ１（Ｇ１）の音源Ｓ４が、図１６（ｂ）に示すように３人目のグループ３（Ｇ３）に変更される。 Here, suppose that the user of the terminal 100 is shown that the audio file D contains two speakers and then corrects this to three by a user operation. In that case, the cloud 110 (machine learning server 130) redoes the speaker number estimation by clustering the VAD-extracted sound sources again, this time for three speakers. In the illustrated example, this re-clustering moves the sound source S4 of group 1 (G1) shown in FIG. 16(a) to group 3 (G3), the group of the third speaker, as shown in FIG. 16(b).

図１７は、推定した話者人数の変更を示す図である。便宜上、図１７（ａ）に示す音声の波形Ｓ１～Ｓ４は、図１５と同様としている。図１５に示す処理により、クラウド１１０（機械学習サーバー１３０）が一人の話者候補「Ａｌｉｃｅ」のラベル付けを行い、グループ１（Ｇ１）が話者候補「Ａｌｉｃｅ」の音源Ｓ１，Ｓ４であると認識したとする。 Figure 17 shows changes to the estimated number of speakers. For convenience, the audio waveforms S1 to S4 shown in Figure 17(a) are the same as those in Figure 15. Assume that the cloud 110 (machine learning server 130) labels one speaker candidate "Alice" through the process shown in Figure 15, and recognizes group 1 (G1) as the sound sources S1 and S4 of speaker candidate "Alice."

ここで、端末１００のユーザに対し、音声ファイルＤに含まれる話者が二人と提示した後、ユーザ操作により一人であると修正された場合、クラウド１１０（機械学習サーバー１３０）は、話者人数の推定について、ＶＡＤ抽出音源のクラスタリングを改めて一人に適応して行う。図示の例では、クラウド１１０（機械学習サーバー１３０）は、音声ファイルＤがすべて一人分の音源として学習することで、図１７（ａ）に示すグループ２（Ｇ２）の音源Ｓ２，Ｓ３、図１７（ｂ）に示すように音声ファイルＤの音源Ｓ１～Ｓ４がすべて同じグループ１（Ｇ１）に変更される。 Here, suppose that the user of the terminal 100 is shown that the audio file D contains two speakers and then corrects this to one by a user operation. In that case, the cloud 110 (machine learning server 130) redoes the speaker number estimation by clustering the VAD-extracted sound sources again, this time for one speaker. In the illustrated example, the cloud 110 (machine learning server 130) learns the entire audio file D as the audio of one person, so that the sound sources S2 and S3 of group 2 (G2) shown in FIG. 17(a) are reassigned and, as shown in FIG. 17(b), all of the sound sources S1 to S4 of the audio file D belong to the same group 1 (G1).
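
The effect of a user correcting the number of speakers can be sketched with a small k-means clustering that is simply rerun with the new count. Everything specific here is an assumption made for illustration: the one-dimensional feature per sound source (for example a pitch-like value), the sample numbers, the deterministic farthest-point initialization, and the function name. The source only states that the VAD-extracted sources are clustered again for the corrected number of speakers.

```python
def cluster_sources(features, speaker_count, iterations=10):
    """Cluster sound sources into `speaker_count` groups with 1-D k-means.

    `features` maps a source name to a single number. Returns a dict from
    group label ("G1", "G2", ...) to the list of source names in it.
    """
    names = list(features)
    values = [features[name] for name in names]

    # Deterministic initialization: start from the first value, then keep
    # adding the value farthest from all centers chosen so far.
    centers = [values[0]]
    while len(centers) < speaker_count:
        centers.append(max(values, key=lambda v: min(abs(v - c) for c in centers)))

    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each source to its nearest center.
        labels = [min(range(speaker_count), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Move each center to the mean of its members.
        for c in range(speaker_count):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)

    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(f"G{label + 1}", []).append(name)
    return groups


# Hypothetical feature values for the four VAD-extracted sources.
sources = {"S1": 100.0, "S2": 200.0, "S3": 210.0, "S4": 130.0}
two_speakers = cluster_sources(sources, 2)    # initial estimate
three_speakers = cluster_sources(sources, 3)  # user corrects to three
one_speaker = cluster_sources(sources, 1)     # user corrects to one
```

With these sample values, the two-speaker result groups S1 with S4 and S2 with S3 as in FIG. 15, the three-speaker rerun moves S4 into its own group as in FIG. 16(b), and the one-speaker rerun places all sources in G1 as in FIG. 17(b).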

図１８は、端末上の話者候補の追加表示画面を示す図である。上記図１０を用いて説明したように、クラウド１１０（機械学習サーバー１３０の話者認識部１３２）からの一人の話者候補（２名の話者人数推定）の通知があった場合の状態の後、ユーザによる話者の追加時の表示画面１８００を示す。 Figure 18 is a diagram showing a display screen for adding speaker candidates on a terminal. As described above with reference to Figure 10, this shows a display screen 1800 when a speaker is added by the user after a notification of one speaker candidate (estimated number of speakers: two) has been received from the cloud 110 (speaker recognition unit 132 of the machine learning server 130).

図１０において、端末１００は、ユーザに対し、２名の話者人数の推定に対応して表示画面１８００に２つの話者表示領域７１２ａ，７１２ｂを表示している。この後、ユーザが音声ファイルＤに含まれる話者が３名であると修正する場合、図１０に示した話者人数追加ボタン７２０を操作することで、クラウド１１０（機械学習サーバー１３０）は、３人目の話者表示領域７１２ｃを追加表示する。 In FIG. 10, the terminal 100 displays two speaker display areas 712a and 712b to the user on the display screen 1800, corresponding to the estimate of two speakers. If the user then wants to correct the number of speakers in the audio file D to three, the user operates the add-speaker button 720 shown in FIG. 10, and the cloud 110 (machine learning server 130) causes a third speaker display area 712c to be additionally displayed.

図１８の表示例では、３人目の話者表示領域７１２ｃ部分には、３人目のＩＤ「０００３」とのみ表示した状態である。なお、二人目の話者表示領域７１２ｂについては、ＩＤ「０００２」と表示した状態であるが、具体的な話者候補「Ｂｏｂ」が提示可能な場合、確認表示「はい／いいえ」とともに表示する。 In the display example of FIG. 18, the third speaker display area 712c displays only the third speaker's ID "0003". The second speaker display area 712b displays the ID "0002", but if a specific speaker candidate "Bob" can be presented, it is displayed together with a confirmation message "Yes/No".

このようにして、クラウド１１０（機械学習サーバー１３０）は、ユーザに対し話者候補を推定した話者人数分だけ提示し、ユーザ操作による話者人数の修正に基づき、音声ファイルＤに含まれる話者人数の推定、およびこの後の話者の認識を精度よく効率的に行えるようになる。 In this way, the cloud 110 (machine learning server 130) presents speaker candidates to the user in the number of estimated speakers, and based on the user's operation to modify the number of speakers, it becomes possible to accurately and efficiently estimate the number of speakers contained in the audio file D and subsequently recognize the speakers.

また、実施の形態による音声認識処理により、個人の権利の保護に有効活用できるようになる。例えば、契約上で弱い立場に立たされる個人に有用であり、俳優などのアーティストや、フリーランスで働く個人など、契約書がまだまだ商習慣として根付いていない現状において、口約束が先行する問題、「（ある事項を互いに）言った／言わない」問題、口約束を忘れられたことにされた約束の反故の問題、等に対応できるようになる。本実施の形態で説明した音声認識処理を用いて会話を録音することで、簡単な覚書や契約書の自動生成が可能となる。契約書の形式としては、基本的には複数個（例えば２０個）の質問回答で生成できるようなパターン化されているものが多く、上述した文字起こし等の簡単な自然言語処理技術で対応できる。 Furthermore, the voice recognition processing according to the embodiment can be effectively used to protect the rights of individuals. For example, it is useful for individuals who are in a weak position in contracts, and in the current situation where contracts have not yet taken root as a business practice, such as for artists such as actors and individuals working freelance, it can deal with problems such as verbal promises, "said/not said (something to each other)" problems, and broken promises when verbal promises are pretended to be forgotten. By recording conversations using the voice recognition processing described in this embodiment, it becomes possible to automatically generate simple memoranda and contracts. As for the format of contracts, many of them are basically patterned so that they can be generated by answering multiple questions (e.g. 20 questions), and can be handled by simple natural language processing techniques such as the transcription described above.

ここで、多くの場合、契約書を相手と確認し合うことさえ憚られる心理的抵抗が強い場面が多いため、録音データはその約束ないし事実を記録するのに重要である。録音データの削除の防止、改竄の防止を保証することが望まれる。例えば、虐待を受けている子供が虐待現場を録音に成功したとしても、その音声が見つかり、故意に削除されてしまっては何の意味もない。 In many cases, there is strong psychological resistance to even confirming the contract with the other party, so audio recordings are important for recording the promises or facts. It is desirable to guarantee that audio recordings cannot be deleted or tampered with. For example, even if an abused child manages to record the scene of the abuse, it is meaningless if the audio is discovered and intentionally deleted.

これに対応して、上記実施の形態では、端末が録音した音声ファイルを端末のみで録音／保存するに限らず、録音した音声ファイルをリアルタイムにクラウドへアップロードし、クラウド側で保存する構成としてもよい。なお、音声ファイルに対するセキュリティ保持や改竄防止のために、クラウド側で音声ファイルを分散保持する構成や、音声ファイルのハッシュ値をブロックチェーンに記録する構成等をおこなってもよい。 In response to this, in the above embodiment, the audio file recorded by the terminal is not limited to being recorded/saved only on the terminal, but the recorded audio file may be uploaded to the cloud in real time and saved on the cloud side. Note that in order to maintain security and prevent tampering with the audio file, the audio file may be distributed and saved on the cloud side, or the hash value of the audio file may be recorded on the blockchain, etc.
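
The hash-based tamper check mentioned above can be sketched with Python's standard hashlib module. The choice of SHA-256 and the function names are assumptions; the source only says that a hash value of the audio file may be recorded on a blockchain, and the recording step itself is not shown here.

```python
import hashlib


def audio_fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the audio bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, recorded_hash: str) -> bool:
    """Check the audio bytes against a hash recorded when the file was saved."""
    return audio_fingerprint(data) == recorded_hash


original = b"stand-in for the bytes of an uploaded audio file"
recorded = audio_fingerprint(original)   # value that would be stored externally
tampered = original + b"!"
```

Because any change to the file changes the digest, comparing against the externally stored value reveals tampering, although it cannot by itself prevent deletion.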

上述した実施の形態によれば、音声認識装置は、音声ファイルに含まれる話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した話者人数のそれぞれの話者を認識し、音声ファイルに含まれる話者をタグ付けする。これにより、音声ファイルに含まれる話者人数と話者を具体的にユーザに提示できるようになる。したがって、多数の音声ファイルのなかから所望する音声を容易に見つけ出すことができるようになる。 According to the above-described embodiment, the speech recognition device estimates the number of speakers included in the speech file, refers to a trained model for each speaker prepared in advance, recognizes each speaker of the estimated number of speakers, and tags the speakers included in the speech file. This makes it possible to specifically present the number of speakers and the speakers included in the speech file to the user. Therefore, it becomes possible to easily find a desired speech from among a large number of speech files.

また、認識の処理は、推定した話者人数の情報、および話者人数に対応する話者候補をユーザに提示し、ユーザによる話者候補から話者を特定する操作に基づき、話者人数のそれぞれの話者を認識し、タグ付けの処理は、ユーザの操作に基づき話者をタグ付けする。このように、装置側が推定した話者候補をユーザにより特定する簡単な操作を加えるだけで、話者をより精度よく特定できるようになる。 The recognition process presents the user with information on the estimated number of speakers and speaker candidates corresponding to the number of speakers, and recognizes each speaker based on the user's operation to identify a speaker from the speaker candidates, and the tagging process tags the speaker based on the user's operation. In this way, by simply adding a simple operation that allows the user to identify the speaker candidates estimated by the device, speakers can be identified with greater accuracy.

また、推定の処理は、推定した話者人数をユーザに提示し、ユーザによる話者人数の変更操作に基づき、音声ファイルに含まれる話者人数の推定を再度実行する。これにより、音声ファイルに含まれる話者人数を簡単なユーザ操作で精度よく推定できるようになる。 The estimation process also presents the estimated number of speakers to the user, and re-estimates the number of speakers included in the audio file based on the user's operation to change the number of speakers. This makes it possible to accurately estimate the number of speakers included in the audio file with simple user operations.

さらに、タグ付け後の話者の情報を学習および蓄積する学習を行い、認識の処理は、学習済モデルに基づき、推定した話者人数のそれぞれの話者を認識する。これにより、学習の繰り返しで音声ファイルに含まれる話者の認識精度を向上できるようになる。 Furthermore, learning is performed to learn and accumulate speaker information after tagging, and the recognition process recognizes each speaker of the estimated number of speakers based on the trained model. This makes it possible to improve the recognition accuracy of speakers contained in audio files by repeating the learning process.

上記音声認識の処理は、端末単体で実施してもよいし、端末とクラウドを用いたシステムで分担処理してもよい。システム構成の場合、端末と、クラウドが通信接続された音声認識システムにおいて、端末は、音声の録音部と、録音あるいは再生した音声ファイルをクラウドにアップロードする通信部と、を有し、クラウドは、音声ファイルに含まれる音声を発した話者人数を推定し、予め用意された話者別の学習済モデルを参照し、推定した話者人数のそれぞれの話者を認識し、音声ファイルに含まれる話者をタグ付けした情報を端末に通知する。このように、音声ファイルに対する音声認識、すなわち上記話者人数の推定と話者の認識にかかる処理をクラウド側で処理することで、端末側の処理負担を軽減しつつ音声認識の精度を向上できるようになる。また、複数の端末の音声認識をクラウド側でまとめて処理できるようになる。また、端末がクラウドとの間でオフライン中に蓄積された複数の音声ファイルを、オンライン時にクラウドにまとめてアップロードし、クラウドが複数のファイルを一括して音声認識する構成とすることもできる。 The voice recognition process may be performed by the terminal alone, or may be shared by a system using the terminal and the cloud. In the case of a system configuration, in a voice recognition system in which a terminal and a cloud are connected for communication, the terminal has a voice recording unit and a communication unit that uploads recorded or played voice files to the cloud, and the cloud estimates the number of speakers who made the voices included in the voice files, refers to a trained model for each speaker prepared in advance, recognizes each speaker of the estimated number of speakers, and notifies the terminal of information tagged with the speakers included in the voice files. In this way, by processing the voice recognition for the voice files, i.e., the processing for estimating the number of speakers and recognizing the speakers, on the cloud side, it is possible to improve the accuracy of the voice recognition while reducing the processing burden on the terminal side. In addition, it is possible to collectively process the voice recognition of multiple terminals on the cloud side. In addition, it is also possible to configure the terminal to collectively upload multiple voice files accumulated while offline between the terminal and the cloud to the cloud when online, and the cloud to collectively voice recognize the multiple files.
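
The offline case at the end of the paragraph above can be sketched as a queue on the terminal side that holds files while offline and uploads them in one batch once a connection is available. The class name, the `upload` callback, and the `online` flag are hypothetical; the actual communication unit is not described at this level of detail.

```python
class UploadQueue:
    """Hold recorded files while offline and upload them together when online."""

    def __init__(self, upload):
        self._upload = upload   # callable that receives a list of file names
        self._pending = []

    def add(self, filename, online):
        """Queue a file; if a connection is available, send everything queued."""
        self._pending.append(filename)
        if online:
            self.flush()

    def flush(self):
        """Upload all pending files as one batch and clear the queue."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._upload(batch)


uploaded_batches = []
queue = UploadQueue(uploaded_batches.append)
queue.add("file0001.mp3", online=False)
queue.add("file0002.mp3", online=False)
queue.add("file0003.mp3", online=True)   # connection is back: one batch is sent
```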

The control unit of the terminal also generates, in real time, the text contained in the audio file while the audio is being recorded or played back. This allows the content of the audio file to be presented concretely to the user. In addition, the number of speakers and the speaker information described above can be presented to the user, making it easy to find the desired audio file.

The cloud also has a storage unit that stores the audio files uploaded from the terminal. Because the audio files are stored externally in this way rather than held only on the terminal, they are protected and their usefulness is improved.

As described above, in the embodiment, voice recognition can be performed on an audio file not only at the time of recording but also at the time of playback. This eliminates cumbersome preparation such as setting the number of speakers before recording, providing a microphone for each speaker, and detecting the direction of each speaker, so that the number of speakers can be estimated and the speakers recognized easily. Furthermore, according to the embodiment, even when repeated recording produces a large number of audio files, the number of speakers and the speakers themselves can be used as search criteria, making it easy to find the desired audio file.
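As a rough illustration of searching audio files by speaker and by number of speakers, the following sketch filters a list of tagged files. The record type `TaggedFile` and the function `search` are hypothetical names introduced here, and the sketch assumes each file already carries the speaker tags produced by recognition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaggedFile:
    """Hypothetical record: an audio file name plus its recognized speaker tags."""
    name: str
    speakers: List[str] = field(default_factory=list)


def search(
    files: List[TaggedFile],
    speaker: Optional[str] = None,
    num_speakers: Optional[int] = None,
) -> List[TaggedFile]:
    """Return the files matching every criterion that was given (None means 'any')."""
    hits = []
    for f in files:
        if speaker is not None and speaker not in f.speakers:
            continue
        if num_speakers is not None and len(f.speakers) != num_speakers:
            continue
        hits.append(f)
    return hits
```

For example, `search(files, speaker="Sato", num_speakers=2)` would return only the two-person recordings in which a speaker tagged "Sato" appears. The name is a made-up example.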

The voice recognition processing described in this embodiment can be realized by executing a program prepared in advance on a computer. The program is recorded on a computer-readable recording medium such as a semiconductor memory, a hard disk, a flexible disk, a CD-ROM, or a DVD, and is executed by being read from the recording medium by the computer. The program may also be distributed via a network such as the Internet.

As described above, the present invention is useful when applied to devices that perform voice recognition, including IC recorders that record and play back audio and smartphones equipped with recording apps.

REFERENCE SIGNS LIST
100 Terminal
101 Microphone
102 Recording unit
103 Transcription unit
104 Speaker tagging unit
105 Control unit
106 Keyboard
107 Display
110 Cloud
120 Storage server
130 Machine learning server
131 Speaker number estimation unit
132 Speaker recognition unit
140 Trained model DB
201 CPU
202 ROM
203 RAM
204 External memory
211 Communication I/F
601 Mobile application
711 Speaker number estimation display area
712 Speaker display area
D Audio file

Claims

1. A voice recognition program that causes a computer to execute a process comprising:
estimating the number of speakers contained in an audio file;
recognizing each of the estimated number of speakers by referring to trained models prepared in advance for each speaker; and
tagging the speakers contained in the audio file,
wherein the estimating includes:
presenting the estimated number of speakers to a user; and
re-estimating the number of speakers contained in the audio file based on an operation by the user to change the number of speakers.

2. The voice recognition program according to claim 1, wherein the process further comprises learning and accumulating the tagged speaker information, and
the recognizing includes recognizing each of the estimated number of speakers based on the trained models.

3. The voice recognition program according to claim 1, wherein the process further comprises generating, in real time, the text contained in the audio file while the audio file is being recorded or played back.

4. A voice recognition method in which a computer executes a process comprising:
estimating the number of speakers contained in an audio file;
recognizing each of the estimated number of speakers by referring to trained models prepared in advance for each speaker; and
tagging the speakers contained in the audio file,
wherein the estimating includes:
presenting the estimated number of speakers to a user; and
re-estimating the number of speakers contained in the audio file based on an operation by the user to change the number of speakers.

5. A voice recognition device comprising a control unit that recognizes the number of speakers and the speakers contained in an audio file,
wherein the control unit:
estimates the number of speakers contained in the audio file;
recognizes each of the estimated number of speakers by referring to trained models prepared in advance for each speaker; and
tags the speakers contained in the audio file, and
wherein the control unit, as the estimation process:
presents the estimated number of speakers to a user; and
re-estimates the number of speakers contained in the audio file based on an operation by the user to change the number of speakers.

6. A voice recognition system in which a terminal and a cloud are communicatively connected,
wherein the terminal includes:
a voice recording unit; and
a communication unit that uploads a recorded or played-back audio file to the cloud,
wherein the cloud:
estimates the number of speakers contained in the audio file;
recognizes each of the estimated number of speakers by referring to trained models prepared in advance for each speaker; and
notifies the terminal of information in which the speakers contained in the audio file are tagged,
wherein the terminal presents the number of speakers estimated by the cloud to a user via a display unit, and
wherein the cloud re-estimates the number of speakers contained in the audio file based on an operation by the user to change the number of speakers, received from the terminal, and transmits the result to the terminal.

7. The voice recognition system according to claim 6, wherein the cloud includes a storage unit that stores the audio file uploaded from the terminal.