JP7828627B2 - Speech recognition device and method, and computer program

Info

Publication number: JP7828627B2
Application number: JP2021134879A
Authority: JP
Inventors: 鵬沈; 恒河井
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2021-08-20
Filing date: 2021-08-20
Publication date: 2026-03-12
Anticipated expiration: 2041-08-20
Also published as: JP2023028902A

Description

この発明は音声認識装置に関し、特に、多言語の発話の音声認識装置に関する。 This invention relates to a speech recognition device, and more particularly to a speech recognition device for multilingual speech.

音声認識技術の普及により、音声の言語が予め分かっている場合、発話単位なら音声をほぼ同時に高い精度を持ってテキストに変換できる技術が存在する。発話単位ではなく、音声の言語が予め分かっていない複数の発話の音声認識をする場合には、さらに音声がどの言語によるものかを自動的に判定する言語識別技術が必要である。例えば後掲の特許文献１には、発話の開始とともに言語の識別を開始し、発話の先頭の短い期間のみを使用して言語の識別を行う技術が開示されている。また、従来は、そのように発話と同時に音声認識を行うシステムの評価は、単語誤り率（ＷＥＲ（ＷｏｒｄＥｒｒｏｒＲａｔｅ））及びリアルタイムファクター（ＲＴＦ（ＲｅａｌＴｉｍｅＦａｃｔｏｒ））が広く使用されている。 With the widespread adoption of speech recognition technology, there are techniques that can convert speech into text with high accuracy, almost simultaneously, on an utterance-by-utterance basis, provided the language of the speech is known in advance. However, when recognizing multiple utterances whose languages are not known in advance, language identification technology is required to automatically determine the language of the speech. For example, Patent Document 1 (described below) discloses a technology that begins language identification at the start of an utterance and uses only the short initial period of the utterance for language identification. Furthermore, conventionally, the evaluation of systems that perform speech recognition simultaneously with utterance has widely used word error rate (WER) and real-time factor (RTF).

Japanese Unexamined Patent Application Publication No. 2020-160374

最近、政治、学術及びビジネスの領域において国際的な会議、講演、質疑応答などが一般的になっている。いわゆるビデオ会議により、複数の国を結び、複数の話者が複数の言語により会議を行う機会も増えている。そうした場を有意義なものにするためには、同時通訳を自動翻訳を使用して行うことが必須である。 Recently, international conferences, lectures, and Q&A sessions have become commonplace in the fields of politics, academia, and business. Video conferencing has increased opportunities to connect multiple countries and allow multiple speakers to participate in meetings in multiple languages. To make such events meaningful, simultaneous interpretation using machine translation is essential.

As a prerequisite for machine translation, the utterances of each speaker must be recognized accurately. In meetings and similar settings, utterances often last for a long time. In addition, long and short utterances frequently alternate, and short utterances often occur in succession. Multiple languages and multiple speakers are involved, and they switch frequently. In such situations, unlike speech recognition for a single speaker, speech segment detection, language identification, and machine translation must all be performed accurately within a short time. Moreover, in long audio, non-speech is often inserted in the middle of an utterance, which makes processing even more difficult. These situations arise not only in simultaneous interpretation but also in tasks such as preparing meeting minutes and creating subtitles.

When the language in use changes (that is, when the speaker changes), the language must be identified within a short time. However, existing language identification technologies have difficulty identifying the language accurately from the limited information available at the beginning of an utterance. A speech recognition device that continuously processes multiple languages identifies the language while the speaker is still speaking and, once the language is identified, promptly displays the speech recognition result or passes it to post-processing such as translation. Furthermore, to perform language identification and speech recognition of multilingual utterances accurately, speech segments must be detected with high accuracy. Existing speech segment detection technologies have low accuracy for short utterances and often fail to detect a change of utterance.

また、このように発話をリアルタイムで音声認識するシステムを評価する際に、従来のようにＷＥＲ及びＲＴＦを使用することが適切かという問題もある。 Furthermore, when evaluating systems that perform real-time speech recognition in this manner, there is the question of whether it is appropriate to use WER and RTF as in the past.

したがってこの発明の目的は、多言語でのリアルタイムの音声認識を、高精度でかつ低レイテンシで行える多言語ライブ音声認識システムを提供することである。 Therefore, the objective of this invention is to provide a multilingual live speech recognition system that can perform real-time speech recognition in multiple languages with high accuracy and low latency.

この発明の他の目的は、多言語ライブ音声認識システムを適切に評価する指標を提供することである。 Another objective of this invention is to provide metrics for appropriately evaluating multilingual live speech recognition systems.

The speech recognition device according to the first aspect of this invention includes: a speech segment detection means for detecting the start and end of a speech segment of a speech signal; an utterance boundary detection means for outputting an utterance segmentation signal indicating an utterance boundary in response to a silent period of a predetermined time or longer following the end of a speech segment detected by the speech segment detection means; a language identification means for identifying, in response to the utterance segmentation signal, the language of the utterance in the immediately preceding speech segment and outputting an identifier of that language; a plurality of speech recognition means for a plurality of mutually different languages, each for performing speech recognition on the speech signal of a speech segment in response to the speech segment detection means detecting the start of that speech segment; and a selection means for selecting and outputting, in response to the identifier, the output of the speech recognition means for the language indicated by the identifier from among the plurality of speech recognition means.

好ましくは、言語識別手段は、発話区間検出手段により発話区間の終了が検出されたことに応答して、当該発話区間の音声信号から、所定長で所定シフト量の部分区間の音声信号を生成する部分区間信号生成手段と、部分区間の各々の音声信号を受け、当該部分区間が複数の言語のいずれに相当するかを表す情報を出力する言語識別モデルと、言語識別モデルが出力する情報に応答して、発話区間の音声信号の言語を決定し当該言語の識別子を出力する言語決定手段とを含む。 Preferably, the language identification means includes: a sub-section signal generation means that generates a sub-section audio signal of a predetermined length and shift amount from the audio signal of the utterance section in response to the detection of the end of an utterance section by the utterance section detection means; a language identification model that receives each audio signal of the sub-section and outputs information indicating which of the multiple languages the sub-section corresponds to; and a language determination means that determines the language of the audio signal of the utterance section and outputs an identifier for that language in response to the information output by the language identification model.
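The sub-section generation described above can be sketched as a sliding window over the utterance. This is a minimal illustration, not the claimed implementation: the function names, the `classify` callable standing in for the language identification model, and the majority vote used as the language determination rule are all assumptions, since the text does not state how the per-window outputs are combined.

```python
from collections import Counter
from typing import Callable, List, Sequence


def split_into_subsegments(
    samples: Sequence[float], window: int, shift: int
) -> List[List[float]]:
    """Cut `samples` into windows of `window` samples taken every `shift` samples."""
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    if len(samples) <= window:
        # Utterance shorter than one window: use it as a single sub-section.
        return [list(samples)]
    last_start = len(samples) - window
    return [list(samples[i:i + window]) for i in range(0, last_start + 1, shift)]


def decide_language(
    samples: Sequence[float],
    classify: Callable[[List[float]], str],  # hypothetical per-window LID model
    window: int,
    shift: int,
) -> str:
    """Classify every sub-section and return the most frequent label.

    Majority voting is an assumption made for this sketch.
    """
    votes = Counter(
        classify(segment)
        for segment in split_into_subsegments(samples, window, shift)
    )
    return votes.most_common(1)[0][0]
```

A real model would more likely output per-language scores that are averaged or otherwise pooled; the vote is only the simplest rule that fits the description.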

より好ましくは、複数の音声認識手段の各々は、当該音声認識手段の言語の音声認識を個別に行う複数の同一言語音声認識手段と、発話区分信号に応答して、複数の言語の音声認識手段の各々について、複数の同一言語音声認識手段の中でアイドリング中である同一言語音声認識手段に音声認識を開始させるための切替手段とを含む。 More preferably, each of the multiple speech recognition means includes a plurality of same-language speech recognition means that individually perform speech recognition for the language of the speech recognition means, and a switching means that, in response to an utterance segmentation signal, causes the same-language speech recognition means that is idling among the plurality of same-language speech recognition means to start speech recognition for each of the plurality of language speech recognition means.
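The switching behaviour described above, in which each language keeps several recognizers and an idle one is started at every utterance segmentation signal, might look like the following sketch. The class and function names are hypothetical, and the "recognizer" is reduced to a busy flag so that only the switching logic is shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AsrWorker:
    """Stand-in for one same-language speech recognition means (hypothetical)."""
    language: str
    busy: bool = False
    utterance_id: Optional[int] = None


@dataclass
class LanguageAsrPool:
    """All same-language workers for a single language."""
    language: str
    workers: List[AsrWorker]

    def start_on_idle(self, utterance_id: int) -> AsrWorker:
        """Assign the next utterance to the first idle worker."""
        for worker in self.workers:
            if not worker.busy:
                worker.busy = True
                worker.utterance_id = utterance_id
                return worker
        raise RuntimeError(f"no idle ASR worker for language {self.language!r}")


def on_utterance_boundary(
    pools: Dict[str, LanguageAsrPool], next_utterance_id: int
) -> Dict[str, AsrWorker]:
    """On an utterance segmentation signal, start an idle worker in every language."""
    return {
        language: pool.start_on_idle(next_utterance_id)
        for language, pool in pools.items()
    }
```

Keeping more than one worker per language lets a worker that is still finishing the previous utterance run to completion while the next utterance is already being recognized.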

Even more preferably, the speech segment detection means includes: a division means for dividing the speech signal into target segments of a predetermined length; a signal addition means for adding, immediately before each target segment, an additional signal that includes at least a silent segment of a first predetermined length; a sounded segment detection means for detecting sounded segments, as distinguished from silent segments, in each target segment to which the additional signal has been added by the signal addition means; and a correction means for correcting the sounded segments by deleting, from the sounded segments detected by the sounded segment detection means, the sounded segments corresponding to the additional signal.
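The correction step can be illustrated with a small sketch. It assumes that the detector reports (start, end) times in seconds measured from the beginning of the padded signal, and it interprets "deleting the sounded segments corresponding to the additional signal" as dropping or trimming whatever falls inside the added portion and then shifting the remainder back onto the stream timeline. The function name and parameters are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def correct_segments(
    detected: List[Segment], pad_seconds: float, window_start: float
) -> List[Segment]:
    """Remove the part of each detected segment that lies in the added signal.

    `pad_seconds` is the total length of the additional signal placed in front
    of the target segment; `window_start` is the stream time at which the
    target segment begins.
    """
    corrected: List[Segment] = []
    for start, end in detected:
        if end <= pad_seconds:
            continue  # lies entirely inside the additional signal
        start = max(start, pad_seconds)  # trim the part inside the added signal
        corrected.append(
            (window_start + start - pad_seconds, window_start + end - pad_seconds)
        )
    return corrected
```

For example, with 0.6 seconds of added signal in front of a target segment that starts at 1.0 seconds, a detection of (0.1, 0.9) on the padded signal is reported as roughly (1.0, 1.3) on the stream timeline.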

この発明の第２の局面に係るコンピュータプログラムは、コンピュータを、音声信号の発話区間を検出する発話区間検出手段と、発話区間検出手段により発話区間が検出されたことに応答して、当該検出された発話区間に対して当該区間の発話の言語を識別するための言語識別手段と、発話区間検出手段により発話区間が検出されたことに応答して、当該発話区間の音声信号に対する音声認識をそれぞれ行うための、互いに異なる複数の言語のための複数の音声認識手段と、言語識別手段による識別結果に応答して、複数の音声認識手段のうち、当該識別結果の示す言語の音声認識手段の出力を選択して出力するための選択手段として機能させる。 The computer program according to the second aspect of this invention causes the computer to function as follows: a speech segment detection means for detecting speech segments of an audio signal; a language identification means for identifying the language of the speech in the detected speech segment in response to the detection of a speech segment by the speech segment detection means; a plurality of speech recognition means for performing speech recognition on the audio signal of the speech segment in response to the detection of a speech segment by the speech segment detection means, for each of several different languages; and a selection means for selecting and outputting the output of the speech recognition means for the language indicated by the identification result from among the plurality of speech recognition means in response to the identification result by the language identification means.

この発明の第３の局面に係る音声認識方法は、コンピュータが、音声信号の発話区間の開始及び終了を検出するステップと、コンピュータが、検出された発話区間の終了後、所定時間以上の無音区間があったことに応答して、発話区切りを示す発話区分信号を出力するステップと、コンピュータが、発話区分信号に応答して、直前の発話区間の発話の言語を識別し言語の識別子を出力するステップと、コンピュータが、発話区間の開始が検出されたことに応答して、当該発話区間の音声信号に対する音声認識を、互いに異なる複数の言語のための複数の音声認識手段により開始させるステップと、コンピュータが、識別子に応答して、複数の音声認識手段のうち、当該識別子の示す言語の音声認識手段の出力を選択して出力するステップとを含む。 A third aspect of this invention relates to a speech recognition method comprising the steps of: a computer detecting the start and end of a speech segment of a speech signal; the computer outputting a speech segmentation signal indicating a speech delimiter in response to a silence period of a predetermined time or longer after the end of the detected speech segment; the computer identifying the language of the utterance in the immediately preceding speech segment and outputting a language identifier in response to the speech segmentation signal; the computer initiating speech recognition of the speech signal of the speech segment using multiple speech recognition means for multiple different languages in response to the detection of the start of a speech segment; and the computer selecting and outputting the output of the speech recognition means for the language indicated by the identifier from among the multiple speech recognition means in response to the identifier.
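The core idea of the method, starting recognition in every language as soon as a speech segment begins and choosing one output only after the language identifier arrives, can be sketched as below. The recognizers and the language identifier are hypothetical callables, and a thread pool stands in for whatever parallel execution the device actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence, Tuple

Recognizer = Callable[[Sequence[float]], str]


def recognize_utterance(
    audio: Sequence[float],
    recognizers: Dict[str, Recognizer],  # one hypothetical recognizer per language
    identify_language: Callable[[Sequence[float]], str],
) -> Tuple[str, str]:
    """Run all recognizers in parallel and return the one selected by the LID."""
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        # Recognition starts for every language before the language is known.
        futures = {
            language: pool.submit(recognizer, audio)
            for language, recognizer in recognizers.items()
        }
        # Language identification runs while the recognizers are working.
        language = identify_language(audio)
        return language, futures[language].result()
```

Because no recognizer has to wait for the language decision, the selected result is available shortly after identification finishes, at the cost of running recognizers whose output is discarded.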

Figure 1 is a functional block diagram of a multilingual live speech recognition device according to a first embodiment of this invention.
Figure 2 shows the external appearance of a computer that implements the multilingual live speech recognition device shown in Figure 1.
Figure 3 is a block diagram showing the hardware configuration of the computer shown in Figure 2.
Figure 4 is a schematic diagram illustrating the method of speech segment detection in the first embodiment of this invention.
Figure 5 is a schematic diagram illustrating the method of correcting detected speech segments in the first embodiment of this invention.
Figure 6 is a schematic diagram illustrating the language identification method in the first embodiment of this invention.
Figure 7 is a schematic diagram illustrating the method of detecting utterance boundaries in the first embodiment.
Figure 8 is a flowchart showing the control structure of a computer program that implements the multilingual live speech recognition processing performed by the multilingual live speech recognition device according to the first embodiment.
Figure 9 is a flowchart showing the control structure of a program that implements preprocessing for VAD (Voice Activity Detection).
Figure 10 is a flowchart showing the control structure of a program that implements correction of the VAD output.
Figure 11 is a flowchart showing the control structure of a program that implements switching of ASR (Automatic Speech Recognition).
Figure 12 is a flowchart showing the control structure of a program that implements control of LID (Language Identification).
Figure 13 is a flowchart showing the control structure of a program that implements the display update process.
Figure 14 shows a time chart for multilingual live speech recognition by a client-server system.
Figure 15 shows the method of calculating latency according to the first embodiment.
Figure 16 shows the test utterances used in the experiment of the first embodiment.
Figure 17 compares the evaluation of the ASR used alone in the experiment with the evaluation of multilingual live speech recognition according to the embodiment.

以下の説明及び図面では、同一の部品には同一の参照番号を付してある。したがって、それらについての詳細な説明は繰返さない。 In the following descriptions and drawings, identical parts are assigned the same reference number. Therefore, detailed descriptions of them will not be repeated.

Part 1. First Embodiment
1. Configuration
(1) Overall Configuration
Figure 1 shows, in block diagram form, the schematic functional configuration of the multilingual live speech recognition device 50 according to the first embodiment of this application. Referring to Figure 1, the multilingual live speech recognition device 50 receives an input 52, which is, for example, a speech signal from a remote location, detects the speech segments in it, identifies the language of each speech segment and performs speech recognition on it, and outputs the speech recognition result as a display 54.

多言語ライブ音声認識装置５０は、入力５２を受けて蓄積する、ＦＩＦＯ（Ｆｉｒｓｔ－Ｉｎ，Ｆｉｒｓｔ－Ｏｕｔ）形式のバッファ７０と、このバッファ７０から音声信号を読み出し、発話区間を検出するための発話区間検出部７２とを含む。入力５２は音声サンプル列を含み、各サンプルにはある時刻を基準とした時間情報が付されている。発話区間検出部７２は、この音声サンプル列に含まれる音声区間を検出し、（開始時刻、終了時刻）のペアからなる発話区間検出信号を出力するためのものである。 The multilingual live speech recognition device 50 includes a FIFO (First-In, First-Out) format buffer 70 that receives and stores input 52, and a speech segment detection unit 72 that reads speech signals from this buffer 70 and detects speech segments. The input 52 includes a sequence of speech samples, each sample having time information based on a specific time. The speech segment detection unit 72 detects speech segments included in this sequence of speech samples and outputs a speech segment detection signal consisting of pairs of (start time, end time).

The multilingual live speech recognition device 50 further includes: an utterance boundary detection unit 76 that, based on the output of the speech segment detection unit 72, regards a silent segment lasting at least a predetermined length (for example, 0.5 seconds is used, but the length is not limited to this and may be longer or shorter) as an utterance boundary and outputs an utterance segmentation signal indicating the utterance boundary; an LID 78 that identifies the language of the speech using the beginning portion of the input speech signal and outputs an identifier of the language; a multilingual ASR processing unit 80 that can perform speech recognition for multiple languages in parallel and that selects and outputs the speech recognition result corresponding to the language identifier output by the LID 78; a control unit 74 that reads, from the speech data stored in the buffer 70, the speech data at the position specified by the information from the VAD 102 indicating the speech segment, and supplies it to the LID 78 and the multilingual ASR processing unit 80; and a result display control unit 82 that receives the speech recognition results output by the multilingual ASR processing unit 80 and, in response to the LID 78 outputting its identification result, outputs the speech recognition result from the multilingual ASR processing unit 80 for the display 54 or to an automatic translation device. The term "utterance boundary" here does not merely indicate the end of a single utterance; it refers to a break between utterances at which the speaker may change. Accordingly, detection of an utterance boundary triggers a switch of the ASR thread (described later) to which the speech data is input. The language identification process for the utterance is also stopped once and then restarted for the next utterance.
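The 0.5-second silence rule described above can be sketched over the (start time, end time) pairs produced by speech segment detection. This is an offline illustration under assumptions: the function name, the optional `stream_end` argument, and the choice to report the boundary at the moment the silence threshold is reached are not taken from the text.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def find_utterance_boundaries(
    segments: List[Segment],
    min_silence: float = 0.5,
    stream_end: Optional[float] = None,
) -> List[float]:
    """Return the times at which an utterance segmentation signal would fire.

    A boundary is reported once the silence after a speech segment has lasted
    `min_silence` seconds. `stream_end`, if given, is the time up to which
    audio has been observed after the final segment.
    """
    boundaries: List[float] = []
    for index, (_, end) in enumerate(segments):
        if index + 1 < len(segments):
            next_start: Optional[float] = segments[index + 1][0]
        else:
            next_start = stream_end
        if next_start is not None and next_start - end >= min_silence:
            boundaries.append(end + min_silence)
    return boundaries
```

A live system would evaluate the same condition incrementally as audio arrives; a shorter `min_silence` reacts faster but splits utterances at brief pauses.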

発話区間検出部７２は、バッファ７０から読み出した音声データに対して、発話区間を精度高く認識するための所定の前処理を行うための前処理部１００と、前処理部１００により前処理された音声データに基づいて音声区間の開始時刻及び終了時刻を検出して発話区間検出信号を出力するためのＶＡＤ１０２とを含む。 The speech segment detection unit 72 includes a preprocessing unit 100 for performing predetermined preprocessing on the audio data read from the buffer 70 to accurately recognize speech segments, and a VAD 102 for detecting the start and end times of the audio segments based on the audio data preprocessed by the preprocessing unit 100 and outputting a speech segment detection signal.

(2) Hardware Configuration
Figure 2 shows the external appearance of a computer for implementing the multilingual live speech recognition device 50 shown in Figure 1. Figure 3 shows the hardware of the computer shown in Figure 2 in block diagram form.

Referring to Figure 2, the multilingual live speech recognition device 50 includes a computer 150 having a DVD (Digital Versatile Disc) drive 182 and a USB (Universal Serial Bus) memory port 186, and a keyboard 154, a mouse 156, and a monitor 152, all connected to the computer 150, for interacting with the user. These are of course only one example of a configuration for cases where user interaction is needed; any general hardware and software usable for user interaction (for example, touch panels and pointing devices in general) can be used.

図３を参照して、コンピュータ１５０は、ＤＶＤドライブ１８２及びＵＳＢメモリポート１８６に加えて、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１７０と、ＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１７２と、ＣＰＵ１７０、ＧＰＵ１７２、ＤＶＤドライブ１８２に接続されたバス１９０と、バス１９０に接続され、コンピュータ１５０のブートアッププログラムなどを記憶するＲＯＭ（Ｒｅａｄ－ＯｎｌｙＭｅｍｏｒｙ）１７６と、バス１９０に接続され、プログラムを構成する命令、システムプログラム、及び作業データなどを記憶するＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）１７８と、バス１９０に接続された不揮発性メモリであるＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）１８０とを含む。ＳＳＤ１８０は、ＣＰＵ１７０及びＧＰＵ１７２が実行するプログラム、並びにＣＰＵ１７０及びＧＰＵ１７２が実行するプログラムが使用するデータなどを記憶するためのものである。コンピュータ１５０はさらに、他端末との通信を可能とするネットワーク１６６への接続を提供するネットワークＩ／Ｆ（Ｉｎｔｅｒｆａｃｅ）１８８とを含む。ＵＳＢメモリポート１８６にはＵＳＢメモリ１６４が着脱可能で、ＵＳＢメモリ１６４とコンピュータ１５０内の各部との通信を提供する。 Referring to Figure 3, the computer 150 includes, in addition to the DVD drive 182 and USB memory port 186, a CPU (Central Processing Unit) 170, a GPU (Graphics Processing Unit) 172, a bus 190 connected to the CPU 170, GPU 172, and DVD drive 182, a ROM (Read-Only Memory) 176 connected to the bus 190 for storing the computer 150's boot-up program, etc., a RAM (Random Access Memory) 178 connected to the bus 190 for storing program instructions, system programs, and work data, etc., and an SSD (Solid State Drive) 180, which is a non-volatile memory connected to the bus 190. The SSD 180 is for storing programs executed by the CPU 170 and GPU 172, as well as data used by those programs. The computer 150 further includes a network interface 188 that provides connection to a network 166 enabling communication with other terminals. A USB memory stick 164 is detachable from the USB memory port 186, providing communication between the USB memory stick 164 and various parts of the computer 150.

上記実施形態では、図１に示すＶＡＤ１０２、ＬＩＤ７８、多言語ＡＳＲ処理部８０の主要部は訓練済モデル、ニューラルネットワーク及びプログラムからなる。 In the above embodiment, the main components of the VAD102, LID78, and multilingual ASR processing unit 80 shown in Figure 1 consist of a trained model, a neural network, and a program.

このコンピュータ１５０を、図１に示す多言語ライブ音声認識装置５０のバッファ７０、制御部７４、発話区切検出部７６、ＬＩＤ７８、多言語ＡＳＲ処理部８０，前処理部１００、ＶＡＤ１０２及び補正部１０４として機能させるためのプログラム及びそれらプログラムが使用するパラメータは、ＤＶＤドライブ１８２に装着されるＤＶＤ１５８に記憶され、ＤＶＤドライブ１８２からＳＳＤ１８０に転送される。又は、これらのプログラムはＵＳＢメモリ１６４に記憶され、ＵＳＢメモリ１６４をＵＳＢメモリポート１８６に装着し、プログラムをＳＳＤ１８０に転送する。又は、このプログラムはネットワーク１６６を通じてコンピュータ１５０に送信されＳＳＤ１８０に記憶されてもよい。 The programs and parameters used by these programs, which cause the computer 150 to function as the buffer 70, control unit 74, utterance segment detection unit 76, LID 78, multilingual ASR processing unit 80, pre-processing unit 100, VAD 102, and correction unit 104 of the multilingual live speech recognition device 50 shown in Figure 1, are stored on a DVD 158 inserted into the DVD drive 182 and transferred from the DVD drive 182 to the SSD 180. Alternatively, these programs may be stored on a USB memory 164, the USB memory 164 is inserted into the USB memory port 186, and the programs are transferred to the SSD 180. Alternatively, these programs may be transmitted to the computer 150 via the network 166 and stored on the SSD 180.

プログラムは実行のときにＲＡＭ１７８にロードされる。もちろん、キーボード１５４、モニタ１５２及びマウス１５６を用いてソースプログラムを入力し、コンパイルした後のオブジェクトプログラムをＳＳＤ１８０に格納してもよい。スクリプト言語の場合には、キーボード１５４などを用いて入力したスクリプトをＳＳＤ１８０に格納してもよい。仮想マシン上で動作するプログラムの場合には、仮想マシンとして機能するプログラムを予めコンピュータ１５０にインストールしておく必要がある。ただし、複数のＡＳＲスレッドによる推論には大量の計算が伴うため、スクリプト言語ではなくコンピュータのネイティブなコードからなるオブジェクトプログラムとして本発明の実施形態の各部を実現する方が好ましい。 The program is loaded into RAM 178 at runtime. Of course, the source program may be input using the keyboard 154, monitor 152, and mouse 156, and the compiled object program may be stored in SSD 180. In the case of a scripting language, the script entered using the keyboard 154, etc., may be stored in SSD 180. For programs running on a virtual machine, a program that functions as a virtual machine must be pre-installed on computer 150. However, since inference using multiple ASR threads involves a large amount of computation, it is preferable to implement each part of the embodiment of the present invention as an object program consisting of the computer's native code rather than a scripting language.

ＣＰＵ１７０は、その内部のプログラムカウンタと呼ばれるレジスタ（図示せず）により示されるアドレスに従ってＲＡＭ１７８からプログラムを読み出して命令を解釈する。ＣＰＵ１７０はさらに、命令の実行に必要なデータを命令により指定されるアドレスに従ってＲＡＭ１７８、ＳＳＤ１８０又はそれ以外の機器から読み出して命令により指定される処理を実行する。ＣＰＵ１７０は、実行結果のデータを、ＲＡＭ１７８、ＳＳＤ１８０、ＣＰＵ１７０内のレジスタなど、プログラムにより指定されるアドレスに格納する。このとき、プログラムカウンタの値もプログラムに従って動作するＣＰＵ１７０により更新される。コンピュータプログラムは、ＤＶＤ１５８から、ＵＳＢメモリ１６４から、又はネットワークを介して、ＲＡＭ１７８に直接にロードしてもよい。なお、ＣＰＵ１７０が実行するプログラムの中で、一部のタスク（主として並列実行可能な数値計算）については、プログラムに含まれる命令により、又はＣＰＵ１７０による命令実行時の解析結果に従って、ＧＰＵ１７２により実行される。 The CPU 170 reads the program from RAM 178 according to the address indicated by an internal register called the program counter (not shown) and interprets the instructions. The CPU 170 then reads the data necessary for executing the instructions from RAM 178, SSD 180, or other devices according to the address specified by the instructions and executes the processing specified by the instructions. The CPU 170 stores the execution result data at an address specified by the program, such as RAM 178, SSD 180, or a register within the CPU 170. At this time, the value of the program counter is also updated by the CPU 170 operating according to the program. The computer program may be loaded directly into RAM 178 from DVD 158, USB memory 164, or via a network. Note that some tasks within the program executed by the CPU 170 (primarily numerical calculations that can be executed in parallel) are executed by the GPU 172 according to the instructions included in the program or according to the analysis results when the CPU 170 executes the instructions.

コンピュータ１５０により上記した実施形態に係る各部の機能を実現するプログラムは、それら機能を実現するようコンピュータ１５０を動作させるように記述され配列された複数の命令を含む。この命令を実行するのに必要な基本的機能のいくつかはコンピュータ１５０上で動作するオペレーティングシステム若しくはサードパーティのプログラム、又はコンピュータ１５０にインストールされる各種ツールキットのモジュールにより提供され、実行時にダイナミックリンクによりオブジェクトプログラムにリンクされる。したがって、このプログラムはこの実施形態のシステム及び方法を実現するのに必要な機能全てを必ずしも含まなくてよい。このプログラムは、命令の中で、所望の結果が得られるように制御されたやり方で適切な機能又は「プログラミング・ツール・キット」の機能を呼出すことにより、上記した各装置及びその構成要素としての動作を実行する命令のみを含んでいればよい。そのためのコンピュータ１５０の動作方法は周知であるので、ここでは繰返さない。 The program that implements the functions of each part according to the above embodiment using computer 150 includes a plurality of instructions written and arranged to operate computer 150 to implement those functions. Some of the basic functions necessary to execute these instructions are provided by an operating system running on computer 150, a third-party program, or modules of various toolkits installed on computer 150, and are linked to the object program via dynamic linking at runtime. Therefore, this program does not necessarily have to include all the functions necessary to implement the system and method of this embodiment. This program only needs to include instructions that execute the operations of each of the above-described devices and their components by calling appropriate functions or functions of the "programming toolkit" in a controlled manner to obtain the desired results. The method of operating computer 150 for this purpose is well known and will not be repeated here.

なお、ＧＰＵ１７２は並列処理を行うことが可能であり、ニューラルネットワークによる処理に伴う多量の計算を同時並列的又はパイプライン的に実行できる。例えばプログラムのコンパイル時にプログラム中で発見された並列的計算要素、又はプログラムの実行時に発見された並列的計算要素は、随時、ＣＰＵ１７０からＧＰＵ１７２に対して発行され、実行され、その結果が直接に、又はＲＡＭ１７８の所定アドレスを介してＣＰＵ１７０に返され、プログラム中の所定の変数に代入される。 Furthermore, the GPU 172 is capable of parallel processing, allowing it to execute large amounts of computations associated with neural network processing simultaneously, in parallel, or via pipeline. For example, parallel computation elements discovered in the program during compilation, or during program execution, are issued from the CPU 170 to the GPU 172 as needed, executed, and the results are returned to the CPU 170 directly or via a predetermined address in RAM 178, and assigned to predetermined variables in the program.

(3) Overview of Processing
A. Detection of Speech Segments
Referring to Figure 4, the preprocessing, the detection of speech segments, and the correction of detected speech segments performed by the preprocessing unit 100, the VAD 102, and the correction unit 104 will be described.

図４（Ａ）に示すように、入力音声が発話２００及び２０２を持つものとする。図４（Ａ）及び後続する同種の各グラフにおいて、縦線は発話の１秒ごとの区切りを示す。すなわち図４（Ａ）には、連続する５つの期間２１０、２１２、２１４、２１６及び２１８が示されており、これら期間はいずれも１秒の長さを持つ。ただし対象とする期間は１秒に限らず、これより長くても短くてもよい。図４（Ａ）によれば、発話２００の先頭が例えば０秒であるものとすると、発話２００は１．３秒程度の期間だけ継続している。発話２０２は１．８秒頃から開始し、４．３秒程度まで継続している。 As shown in Figure 4(A), the input audio is assumed to have utterances 200 and 202. In Figure 4(A) and subsequent similar graphs, vertical lines indicate one-second intervals in the utterances. Specifically, Figure 4(A) shows five consecutive periods 210, 212, 214, 216, and 218, each with a length of one second. However, the period is not limited to one second; it can be longer or shorter. According to Figure 4(A), if we assume that utterance 200 begins at, for example, 0 seconds, then utterance 200 continues for approximately 1.3 seconds. Utterance 202 begins around 1.8 seconds and continues until approximately 4.3 seconds.

図４（Ｂ）に示すように、発話２００のうち、期間２１０には１秒分の有音部分２２０が存在している。この実施形態では、処理の開始後の先頭の発話区間検出を行う場合図４（Ｃ）に示すように、その期間の音声信号の先頭に０．１秒分の無音部分２２２を付加して発話区間検出の対象とする。すなわち図４（Ｃ）に示す例では、有音部分２２０の先頭に無音部分２２２が付加されて発話区間検出処理が行われる。このように先頭に０．１秒分の無音部分をつけることで有音区間の検出の精度が高くなるという効果が得られる。ただしこの付加部分の長さは０．１秒に限るわけではなく、これより長くてもよいし、多少短くてもよい。 As shown in Figure 4(B), within the utterance 200, a 1-second audible portion 220 exists within the period 210. In this embodiment, when detecting the first utterance segment after the start of processing, as shown in Figure 4(C), a 0.1-second silent portion 222 is added to the beginning of the audio signal for that period to be used for utterance segment detection. That is, in the example shown in Figure 4(C), a silent portion 222 is added to the beginning of the audible portion 220 before the utterance segment detection process is performed. By adding a 0.1-second silent portion to the beginning in this way, the accuracy of detecting the audible segment is improved. However, the length of this added portion is not limited to 0.1 seconds; it may be longer or slightly shorter.

Next, as shown in Figure 4(D), in the processing of period 212 of the input speech, the audio signal 230 of this period is the target of speech segment detection. The audio signal 230 includes a sounded portion 232 (the end of utterance 200), a sounded portion 234 (the beginning of utterance 202), and the silent portion between them. In this embodiment, for the second and subsequent periods of the speech segment detection process, that is, from period 212 onward, the audio signal 236 consisting of the last 0.5 seconds of the immediately preceding period (here, period 210; in this example the audio signal 236 is part of a sounded segment) is added to the beginning of the audio signal of the target period (the audio signal 230 in Figure 4(D)), and a silent portion 238 is further added in front of it. In other words, the second speech segment detection is performed on an audio signal in which the silent portion 238, the audio signal 236, and the audio signal 230 are concatenated in that order. The portion of the preceding period that is added immediately before the target period is not limited to half the length of the target period; it may be shorter or longer. If it is shorter, the latency described later becomes shorter, but the detection accuracy for sounded segments is expected to drop slightly. If it is longer, the detection accuracy for sounded segments improves, but the latency becomes longer.

以後、同様の処理が行われる。すなわち、図４（Ｅ）を参照して、期間２１４に対する処理では、音声信号のうち期間２１４の部分である音声信号２４０（発話２０２のうち期間２１４内の部分）と、直前の期間２１２の音声信号２３０のうち、後半の０．５秒分である付加区間２４２と、その直前に付加される無音部分２４４とが処理対象となる。図４（Ｆ）を参照して、期間２１６に対する処理では、音声信号のうち期間２１６の部分（発話２０２の期間２１６内の音声信号２５０）と、直前の音声信号２４０のうち、後半の０．５秒分である付加区間２５２と、さらにその直前に付加される無音部分２５４とが処理対象となる。図４（Ｇ）を参照して、期間２１８に対する処理では、音声信号のうち期間２１８の部分（発話２０２の末尾と後続する無音部分からなる音声信号２６０）と、直前の期間２１６における音声信号２５０の後半の０．５秒分である付加区間２６２と、さらにその直前に付加される無音部分２６４とを含む。 The same processing is performed thereafter. Specifically, referring to Figure 4(E), in the processing for period 214, the audio signal 240 (the portion of the utterance 202 within period 214), which is the portion of the audio signal for period 214, the added section 242, which is the latter 0.5 seconds of the audio signal 230 for the preceding period 212, and the silent portion 244 added immediately before it are the targets of processing. Referring to Figure 4(F), in the processing for period 216, the portion of the audio signal for period 216 (the audio signal 250 of the utterance 202 within period 216), the added section 252, which is the latter 0.5 seconds of the preceding audio signal 240, and the silent portion 254 added immediately before it are the targets of processing. Referring to Figure 4(G), the processing for period 218 includes the portion of the audio signal corresponding to period 218 (the audio signal 260 consisting of the end of the utterance 202 and the subsequent silent portion), an additional section 262 which is the latter half of the audio signal 250 in the preceding period 216 (0.5 seconds), and a silent portion 264 that is added immediately before that.

すなわちこの実施形態では、１回目の処理を除いてＶＡＤ前に、音声信号の中で発話区間検出の対象となる期間だけではなく、直前の期間の音声信号の後半部分と、一定音の無音部分とを付加する。そしてその音声信号をＶＡＤに供する。後述するようにこうすることで短いレイテンシで発話区間を精度高く検出できるようになる。１回目には直前の対象区間が存在しないので、対象区間に無音区間のみを付加して処理対象とする。 That is, in this embodiment, except for the first round of processing, the audio submitted to VAD consists not only of the period targeted for utterance segment detection but also of the latter part of the audio signal from the immediately preceding period and a silent portion of fixed length, both added in front of it. The resulting audio signal is then fed to VAD. As will be described later, this makes it possible to detect utterance segments accurately with low latency. In the first round there is no preceding target section, so only the silent portion is added to the target section before processing.
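The padding described above can be illustrated with a short Python sketch. It is not the implementation of the embodiment: the function name `build_vad_input`, the 16 kHz sampling rate, and the use of zero-valued samples as the silent portion are assumptions introduced here for illustration only.

```python
SAMPLE_RATE = 16_000  # assumed sampling rate; the text does not specify one


def build_vad_input(target, previous=None, tail_sec=0.5, silence_sec=0.1,
                    rate=SAMPLE_RATE):
    """Return silence + tail of the previous period + target period.

    `target` and `previous` are sequences of samples for one period each.
    For the first period (`previous is None`) only the silence is prepended.
    """
    # Silence is modeled as zero-valued samples (an assumption).
    silence = [0.0] * round(silence_sec * rate)
    if previous is None:
        return silence + list(target)
    tail_len = round(tail_sec * rate)
    # Guard against tail_len == 0, because previous[-0:] would be the whole list.
    tail = list(previous[-tail_len:]) if tail_len > 0 else []
    return silence + tail + list(target)


# Two consecutive 1-second periods, filled with marker values for clarity.
first = [1.0] * SAMPLE_RATE
second = [2.0] * SAMPLE_RATE
vad_in_1 = build_vad_input(first)           # 0.1 s silence + 1 s target
vad_in_2 = build_vad_input(second, first)   # 0.1 s + 0.5 s tail + 1 s target
```

With the values of this embodiment, the first VAD input is 1.1 seconds long and every later one is 1.6 seconds long.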

この実施形態では、直前の処理対象の音声信号のうち、後半の０．５秒分を処理対象の音声信号の先頭に付加する。しかしこれはこの発明を限定しない。付加される音声信号の長さは一定である必要はない。さらに先頭に付加される無音部分の長さも一定である必要もない。ただし、実装の容易さに鑑みると、この実施形態のようにこれらの長さを一定とすることが望ましい。また付加される音声信号の長さも０．５秒に限定されるわけではない。ここで、付加される音声信号が０秒より長ければ、何も付加しない場合より精度が向上することが見込めるが、０．５秒より長くても精度の向上には限度があると考えられる。さらに対象となる音声によってもこの長さは変化すると考えられる。したがって、付加される音声信号の長さは０秒より大きく０．５秒以下、望ましくは０．１秒より大きく０．４５秒以下、さらに望ましくは０．２秒より大きく０．４秒以下とすることが望ましい。また、先頭に付加される無音部分の長さも０．１秒には限定されず、それより大きくても小さくてもよい。ただし付加される無音部分は０秒より大きくする必要がある。無音部分の長さがあまりに長くなるとレイテンシが大きくなるため、０．２秒より長くすることは望ましくない。したがって先頭に付加される無音部分は、０秒より長く０．２秒以下、望ましくは０．０５秒より長く０．１５秒以下とすることが望ましい。また付加される無音部分は、直前の音声信号のうち対象区間の先頭に付加される部分の長さより短いことが望ましい。なおこの前処理については後述する。 In this embodiment, the latter 0.5 seconds of the previously processed audio signal are added to the beginning of the audio signal to be processed. However, this does not limit the invention. The length of the added audio signal does not need to be constant. Furthermore, the length of the silent portion added to the beginning does not need to be constant. However, in light of ease of implementation, it is desirable to make these lengths constant as in this embodiment. Also, the length of the added audio signal is not limited to 0.5 seconds. Here, if the added audio signal is longer than 0 seconds, it can be expected that the accuracy will improve compared to not adding anything, but it is thought that there is a limit to the improvement in accuracy even if it is longer than 0.5 seconds. Furthermore, it is thought that this length will also change depending on the audio being processed. Therefore, it is desirable that the length of the added audio signal be greater than 0 seconds and 0.5 seconds or less, preferably greater than 0.1 seconds and 0.45 seconds or less, and even more preferably greater than 0.2 seconds and 0.4 seconds or less. Also, the length of the silent portion added to the beginning is not limited to 0.1 seconds, and may be greater or less than that. 
However, the added silent portion must be greater than 0 seconds. If the length of the silent portion is too long, the latency will increase, so it is undesirable for it to be longer than 0.2 seconds. Therefore, it is desirable for the silent portion added at the beginning to be longer than 0 seconds but 0.2 seconds or less, and preferably longer than 0.05 seconds but 0.15 seconds or less. Furthermore, it is desirable that the added silent portion be shorter than the length of the portion added at the beginning of the target section in the preceding audio signal. This pre-processing will be described later.

こうして期間２１０及び２１２及びそれ以後の各期間の音声信号に対して前処理を行った上で有音区間の検出を行う。この有音区間の検出処理自体は既存の方法のいずれを用いてもよい。 Thus, after preprocessing the audio signals for periods 210 and 212 and each subsequent period, the detection of audible sections is performed. Any existing method may be used for this audible section detection process.

以上のように前処理をした音声信号に対する有音区間の検出を行った場合、本来の対象区間に含まれる有音区間より長い有音区間が検出される可能性がある。したがってＬＩＤを行うに先立ち、余分に検出された有音区間を削除する必要がある。その方法について図５を参照して説明する。 When audible sections are detected in an audio signal preprocessed as described above, the detected audible section may be longer than the audible section actually contained in the original target section. Therefore, before performing language identification (LID), the excess portion of the detected audible section must be removed. This method is explained with reference to Figure 5.

図５（Ａ）は図４（Ａ）に示した音声信号と同じ音声信号の期間２１２から２１６を示す。以下、期間２１４及び２１６に関する発話区間の補正処理について説明する。図５（Ｂ）及び図５（Ｃ）を参照して、期間２１４での有音区間の検出対象は、音声信号の期間２１４の部分（音声信号２４０）である。前述したとおり、音声信号２４０における有音区間の検出のための前処理として、期間２１２における音声信号２３０の付加区間２４２を音声信号２４０の先頭に付加する。さらにその直前には無音部分２４４を付加して有音区間の検出を行う。さらに図５（Ｄ）を参照して、期間２１６における音声信号２５０については、音声信号２４０の後半部分である付加区間２５２を付加し、さらにその前に無音部分２５４を付加して有音区間の検出を行う。したがって、期間２１４の場合、図５（Ｆ）に示されるように、検出される有音部分３００は、音声信号２４０に対応する有音部分３０２に加え、期間２１２の末尾の有音部分２３４（図５（Ｂ））に対応する有音部分３０４を含むことがある。補正では、１秒前の音声信号に対する処理において有音部分２３４として検出され既に制御部７４及び発話区切検出部７６に送信済であれば、有音部分３０４を送信せず、図５（Ｇ）に示すように有音部分３０２のみを制御部７４及び発話区切検出部７６に送信する。有音部分３０４が１秒前の音声信号に対する処理において検出されておらず、制御部７４及び発話区切検出部７６に送信されていなければ、有音部分３０４と有音部分３０２とが制御部７４及び発話区切検出部７６に送信される。 Figure 5(A) shows the same audio signal as shown in Figure 4(A), specifically the period 212 to 216. The following describes the speech section correction process for periods 214 and 216. Referring to Figures 5(B) and 5(C), the target for detecting audible sections in period 214 is the portion of the audio signal corresponding to period 214 (audio signal 240). As previously mentioned, as a preprocessing step for detecting audible sections in audio signal 240, the additional section 242 of audio signal 230 in period 212 is added to the beginning of audio signal 240. Furthermore, a silent section 244 is added immediately before this to detect audible sections. Referring further to Figure 5(D), for audio signal 250 in period 216, the additional section 252, which is the latter half of audio signal 240, is added, and a silent section 254 is added before this to detect audible sections. Therefore, in the case of period 214, as shown in Figure 5(F), the detected audible portion 300 may include not only the audible portion 302 corresponding to the audio signal 240, but also the audible portion 304 corresponding to the audible portion 234 at the end of period 212 (Figure 5(B)). 
During correction, if the audible portion 234 was detected in the processing of the audio signal one second prior and has already been transmitted to the control unit 74 and the speech segment detection unit 76, then the audible portion 304 is not transmitted, and only the audible portion 302 is transmitted to the control unit 74 and the speech segment detection unit 76, as shown in Figure 5(G). If the audible portion 304 was not detected in the processing of the audio signal one second prior and has not been transmitted to the control unit 74 and the speech segment detection unit 76, then both the audible portion 304 and the audible portion 302 are transmitted to the control unit 74 and the speech segment detection unit 76.

同様に、期間２１６の場合、図５（Ｈ）に示すように、検出される有音部分３１０は、音声信号２５０に含まれる有音部分３１２だけでなく、期間２１４の末尾の有音部分３１４も含む。したがって、この有音部分３１４についても有音部分３０４と同様の処理を行う。すなわち、１秒前の処理で有音部分３１４が検出され制御部７４及び発話区切検出部７６に送信されていれば、補正では、図５（Ｉ）に示すように、有音部分３１０から有音部分３１４を削除して有音部分３１２のみとする。１秒前の処理で有音部分３１４が制御部７４及び発話区切検出部７６に送信されていなければ、有音部分３１４が有音部分３１２とともに制御部７４及び発話区切検出部７６に送信される。この例では、図５（Ｇ）に示すように、有音部分３１４に相当する部分が有音部分として検出され送信済である。したがって、有音部分３１４は削除され、有音部分３１２のみが送信される。 Similarly, in the case of period 216, as shown in Figure 5(H), the detected audible portion 310 includes not only the audible portion 312 included in the audio signal 250, but also the audible portion 314 at the end of period 214. Therefore, the same processing as for the audible portion 304 is performed on this audible portion 314. That is, if the audible portion 314 was detected in the processing one second prior and transmitted to the control unit 74 and the speech segment detection unit 76, the correction removes the audible portion 314 from the audible portion 310, leaving only the audible portion 312, as shown in Figure 5(I). If the audible portion 314 was not transmitted to the control unit 74 and the speech segment detection unit 76 in the processing one second prior, the audible portion 314 is transmitted to the control unit 74 and the speech segment detection unit 76 together with the audible portion 312. In this example, as shown in Figure 5(G), the portion corresponding to the audible portion 314 has been detected and transmitted as an audible portion. Therefore, the audible portion 314 is deleted, and only the audible portion 312 is transmitted.
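The correction described with reference to Figure 5 might be sketched as follows. This is a simplified illustration under stated assumptions, not the program of the embodiment: it assumes that VAD returns (start, end) pairs in seconds measured from the beginning of the padded signal, and it tracks what has already been transmitted with a single hypothetical value `sent_until` (the end time of the most recently transmitted audible portion). The function name `correct_segments` is likewise hypothetical.

```python
def correct_segments(vad_segments, period_start, sent_until,
                     tail_sec=0.5, silence_sec=0.1):
    """Map VAD output to absolute time and drop what was already sent.

    vad_segments: (start, end) pairs relative to the padded signal.
    period_start: absolute start time of the target period, in seconds.
    sent_until:   absolute time up to which audible audio was already sent.
    Returns (segments_to_send, updated_sent_until).
    """
    # Time 0 of the padded signal lies before the target period by the
    # length of the prepended tail plus the prepended silence.
    offset = period_start - tail_sec - silence_sec
    corrected = []
    for start, end in vad_segments:
        abs_start = max(start + offset, sent_until)  # trim the sent part
        abs_end = end + offset
        if abs_end > abs_start:                      # something new remains
            corrected.append((abs_start, abs_end))
            sent_until = abs_end
    return corrected, sent_until


# Target period starts at 2.0 s; VAD reports speech over the whole tail
# and target (0.1 s to 1.6 s of the padded signal).
already_sent, _ = correct_segments([(0.1, 1.6)], period_start=2.0, sent_until=2.0)
not_yet_sent, _ = correct_segments([(0.1, 1.6)], period_start=2.0, sent_until=1.0)
```

In the first call the part overlapping the previous period is removed because it was already transmitted; in the second call it is kept and transmitted together with the new part.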

この補正を行うためのプログラム構成については後述する。 The program configuration for performing this correction will be described later.

イ．ＬＩＤによる処理 B. LID Processing
図６を参照して、この実施形態における言語識別処理（ＬＩＤ）について説明する。ここでも、図６（Ａ）に示すように、期間２１０、２１２、２１４及び２１６を例とする。 Referring to Figure 6, the language identification processing (LID) in this embodiment will be described. Here again, as shown in Figure 6(A), periods 210, 212, 214, and 216 are used as examples.

図６（Ａ）に示すような処理をした結果、図６（Ｂ）に示すように、発話２００及び２０２に対応する有音部分３５０及び３５２が検出される。この実施形態では、有音部分３５０が検出されると、それに対応する発話２００の音声信号のうち、先頭から所定の長さ、例えば１．５秒及び音声信号の終端のいずれかが検出されると、先頭からそこまでの音声信号が図１に示すＬＩＤ７８に投入される。図６（Ａ）に示す例では発話２００は１．５秒より短い。したがって発話２００に対応する音声信号３６０（図６（Ｃ））の終端が検出されるとその全体がＬＩＤ７８に投入される。言語の判別自体はそれほど時間を必要としない。したがって、音声信号３６０がＬＩＤ７８に投入された後、わずかな時間遅れで言語の判別が行われ、図６（Ｇ）に示すように発話２００の開始時を基準とすると約１．５秒経過した時点でＬＩＤ判別結果３６２が得られる。 As a result of the processing shown in Figure 6(A), audible portions 350 and 352 corresponding to utterances 200 and 202 are detected, as shown in Figure 6(B). In this embodiment, once the audible portion 350 is detected, the audio signal of the corresponding utterance 200 is fed into the LID 78 shown in Figure 1 as soon as either a predetermined length from its beginning (for example, 1.5 seconds) or the end of the audio signal is detected; the portion fed in runs from the beginning up to that point. In the example shown in Figure 6(A), the utterance 200 is shorter than 1.5 seconds. Therefore, when the end of the audio signal 360 (Figure 6(C)) corresponding to utterance 200 is detected, the entire signal is fed into the LID 78. Language discrimination itself does not require much time. Therefore, after the audio signal 360 is fed into the LID 78, the language is discriminated with only a slight delay, and as shown in Figure 6(G), the LID discrimination result 362 is obtained approximately 1.5 seconds after the start of utterance 200.

一方、発話２００に続く無音区間の後、発話２０２に対応する有音部分３５２として検出が開始されると、図６（Ｄ）に示されるように、発話２０２に対応する音声信号のうち、先頭から１．５秒又は音声信号の終端のいずれか早く検出されるまでの音声信号３７０がＬＩＤ７８に投入される。この例では発話２０２に対応する音声信号の先頭から１．５秒分が、ＬＩＤ７８に投入される。この結果、発話２００の開始時を基準とすると約３．５秒が経過した時点でＬＩＤ判別結果３７２が得られる。 On the other hand, after the silent interval following utterance 200, when detection begins for the audible portion 352 corresponding to utterance 202, as shown in Figure 6(D), the audio signal 370 corresponding to utterance 202, from the beginning until 1.5 seconds or the end of the audio signal, whichever comes first, is fed into the LID 78. In this example, the first 1.5 seconds of the audio signal corresponding to utterance 202 are fed into the LID 78. As a result, the LID discrimination result 372 is obtained approximately 3.5 seconds after the start of utterance 200.

さらに、図６（Ｅ）を参照して、発話２０２の先頭から０．７５秒が経過した時（説明を簡潔にするために「０．７５秒経過時」という。以下同様）にまで発話２０２が終端に達していないと、その０．７５秒経過時から１．５秒及び音声信号の終端までのいずれか早い方が検出されると、０．７５秒経過時からそこまでの音声信号の全体がＬＩＤ７８に投入される。この例では０．７５秒経過時から１．５秒経過した時点では発話２０２の終端には達していない。したがって０．７５秒経過時から２．２５秒経過時までの音声信号３８０がＬＩＤ７８に投入される。その結果、発話２００の開始時刻を基準として約４．２秒経過したときに図６（Ｇ）に示すようにＬＩＤ判別結果３８２が得られる。最後に、図６（Ｆ）を参照して、１．５秒経過時から１．５秒又は発話２０２の終端までのいずれか早いほうが検出されると、１．５秒経過時からその時点（この例では発話２０２の終端）までの音声信号３９０がＬＩＤ７８に投入される。その結果、発話２００の開始時を基準として４．３秒が経過したときに図６（Ｇ）に示すようにＬＩＤ判別結果３９２が得られる。 Furthermore, referring to Figure 6(E), if utterance 202 has not reached its end when 0.75 seconds have elapsed from its beginning (referred to as "the 0.75-second mark" for brevity; the same applies hereafter), then once either 1.5 seconds have elapsed from the 0.75-second mark or the end of the audio signal is detected, whichever comes first, the entire audio signal from the 0.75-second mark up to that point is fed into the LID 78. In this example, the end of utterance 202 has not been reached 1.5 seconds after the 0.75-second mark. Therefore, the audio signal 380 from the 0.75-second mark to the 2.25-second mark is fed into the LID 78. As a result, the LID discrimination result 382 is obtained approximately 4.2 seconds after the start of utterance 200, as shown in Figure 6(G). Finally, referring to Figure 6(F), once either 1.5 seconds have elapsed from the 1.5-second mark or the end of utterance 202 is detected, whichever comes first, the audio signal 390 from the 1.5-second mark to that point (the end of utterance 202 in this example) is fed into the LID 78. As a result, the LID discrimination result 392 is obtained 4.3 seconds after the start of utterance 200, as shown in Figure 6(G).

このようにこの実施形態では、発話区間のうち先頭から初めて一定時間間隔（この例では０．７５秒間隔）を起点として、その時点から所定時間（この例では１．５秒）又は発話区間の終端までのいずれか早く方が検出されるまでの音声信号をＬＩＤ７８に投入し、その結果を得る。その結果、発話が１．５秒程度かそれより短いときにはその発話の終了とほぼ同時にその発話の言語の判別結果が得られ、１．５秒より長いときには、その発話の先頭から１．７秒程度経過したときに最初の言語の判別結果が得られ、その後は０．７５秒程度おきに、判別結果が得られる。ただし発話の終端ではより短い間隔で判別結果が得られることが多い。 Thus, in this embodiment, starting points are set at fixed intervals (0.75 seconds in this example) from the beginning of the utterance section, and for each starting point the audio signal from that point until a predetermined time (1.5 seconds in this example) has elapsed or the end of the utterance section is detected, whichever comes first, is fed into the LID 78 to obtain a result. Consequently, when an utterance is about 1.5 seconds long or shorter, its language discrimination result is obtained almost as soon as the utterance ends. When it is longer than 1.5 seconds, the first discrimination result is obtained about 1.7 seconds after the beginning of the utterance, and further results are obtained roughly every 0.75 seconds thereafter. At the end of an utterance, however, results are often obtained at shorter intervals.
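The windowing rule for LID input can be sketched as a small Python function. The name `lid_windows` is hypothetical, and the sketch assumes that no further window is started once a window has reached the end of the utterance; this matches the examples in Figure 6 (one result for utterance 200, three for utterance 202) but is not stated explicitly in the text. The function works offline on a known utterance length, whereas the embodiment decides each window as the audio arrives.

```python
def lid_windows(utterance_len, hop=0.75, max_len=1.5):
    """Return (start, end) times, relative to the utterance start, of the
    audio windows fed to language identification.

    Windows start every `hop` seconds and last `max_len` seconds or until
    the end of the utterance, whichever comes first.
    """
    if utterance_len <= 0:
        return []
    windows = []
    start = 0.0
    while True:
        end = min(start + max_len, utterance_len)
        windows.append((start, end))
        if end >= utterance_len:   # this window reached the utterance end
            break
        start += hop
    return windows


short_utterance = lid_windows(1.2)   # shorter than 1.5 s: a single window
long_utterance = lid_windows(2.6)    # three overlapping windows
```

For a 2.6-second utterance this yields the windows (0.0, 1.5), (0.75, 2.25), and (1.5, 2.6), which correspond in shape to audio signals 370, 380, and 390.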

このように１発話について言語の判別結果が複数個得られるため、判別結果が互いに矛盾する場合があり得る。そうしたときでも言語を判別する必要がある。そのためにはたとえば判別結果の多数決により最終結果を決定したり、ニューラルネットワークの出力として判別結果とともに得られるスコア（又は確率）の最も高いもの、又は同種のものが複数個ある場合にはその平均が最も高いものを最終結果として決定したりしてもよい。この実施形態では各発話に対して言語ごとに得られるスコアの平均スコアが最も高い言語を最終結果とする。具体的な例については図７を参照して後述する。 As multiple language classification results are obtained for a single utterance, these results may contradict each other. Even in such cases, it is necessary to classify the language. To achieve this, the final result may be determined by, for example, a majority vote of the classification results, or by selecting the language with the highest score (or probability) obtained along with the classification result as the output of the neural network, or, if there are multiple identical results, the language with the highest average score. In this embodiment, the language with the highest average score obtained for each language for each utterance is selected as the final result. A specific example will be described later with reference to Figure 7.

ウ. 音声認識装置の切替 C. Switching of Speech Recognition Devices
前述したように、図１に示す多言語ＡＳＲ処理部８０は、複数の言語に対する音声認識処理を並列で実行可能である。しかしそれだけではなく、多言語ＡＳＲ処理部８０は、各言語についても複数の音声認識処理を実行可能であり、それらを適宜切り替えて動作させる。なぜなら、音声認識処理は多量の計算を必要として時間を要するため、ある発話の音声認識を実行中に、続く発話についても音声認識を並列で実行する必要が生じる可能性があるためである。特にこの実施形態では、ある発話について、複数の音声認識処理部がそれぞれ別の言語とみなして音声認識処理を実行する。音声認識処理では、正しい言語での音声認識でもかなりの演算処理が必要とされ、誤った言語での音声認識ではさらに演算量が増加する。そのため、各言語の音声認識処理部が限定された数しかないと、必要なときに必要な音声認識処理部のいずれかがビジーとなり音声認識処理が実行できなくなる可能性がある。そこでこの実施形態では、音声認識処理部として各言語について３個の音声認識処理部を設け、これらの中でアイドリング中のものを選択して音声認識処理を実行させる。 As mentioned above, the multilingual ASR processing unit 80 shown in Figure 1 is capable of executing speech recognition processing for multiple languages in parallel. However, not only that, the multilingual ASR processing unit 80 is also capable of executing multiple speech recognition processes for each language, and switches between them as appropriate. This is because speech recognition processing requires a large amount of computation and takes time, so it may be necessary to execute speech recognition for subsequent utterances in parallel while speech recognition of one utterance is being performed. In particular, in this embodiment, for a given utterance, multiple speech recognition processing units each treat it as a different language and execute speech recognition processing. Speech recognition processing requires considerable computation even for speech recognition in the correct language, and the amount of computation increases even further for speech recognition in the wrong language. Therefore, if there are only a limited number of speech recognition processing units for each language, it is possible that one of the necessary speech recognition processing units will become busy when needed, making it impossible to execute speech recognition processing. 
Therefore, in this embodiment, three speech recognition processing units are provided for each language, and one of these that is idle is selected to execute speech recognition processing.
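Selecting an idle speech recognition processing unit for each language might look like the following Python sketch. The class name `RecognizerPool` and its methods are hypothetical, and the sketch only tracks which of the three units per language are busy; it does not run any recognition and is not thread-safe as written.

```python
class RecognizerPool:
    """Tracks busy/idle state of the recognizers provided per language."""

    def __init__(self, languages, per_language=3):
        # One busy flag per recognizer; three per language in the embodiment.
        self.busy = {lang: [False] * per_language for lang in languages}

    def acquire(self, lang):
        """Mark the first idle recognizer for `lang` as busy and return its
        index, or return None if all of them are busy."""
        slots = self.busy[lang]
        for idx, in_use in enumerate(slots):
            if not in_use:
                slots[idx] = True
                return idx
        return None

    def release(self, lang, idx):
        """Mark a recognizer as idle again once its utterance is finished."""
        self.busy[lang][idx] = False


pool = RecognizerPool(["ja", "en"])
first_ja = pool.acquire("ja")    # recognizer for the current utterance
second_ja = pool.acquire("ja")   # a second one for the next utterance
pool.release("ja", first_ja)     # the first becomes idle again
reused_ja = pool.acquire("ja")
```

In a multithreaded program the busy flags would need to be protected by a lock; that is omitted here for brevity.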

例えば、図７（Ａ）に示す発話２００及び２０２に対して図７（Ｂ）に示す有音部分３５０及び３５２が検出されたものとする。図６を用いて説明したように、発話２００の終了後にＬＩＤ判別結果３６２が得られる。発話２０２については、ＬＩＤ判別結果３７２、３８２及び３９２がこの順序で得られる。発話２００と発話２０２とが同じ言語の発話であることは一般的に全く保証されない。ここにおいて示す例でも図７（Ｃ）に示すように発話２００に対するＬＩＤ判別結果３６２は日本語（ＪＡ）であり、発話２０２に対するＬＩＤ判別結果３７２、３８２及び３９２はそれぞれ日本語、英語（ＥＮ）、英語である。当該発話区間の既に処理をした部分の平均スコアにより発話２０２の言語は最終的に英語と判別されるが、いずれにせよ発話２００に対する言語の判別結果と発話２０２に対する言語の判別結果との関係は不確実である。したがって、発話２００に対する音声認識と発話２０２に対する音声認識とで、その言語を適切に切り替える必要がある。この実施形態では、図７（Ｄ）に示すように、前の発話（例えば発話２００）の終端が検出された後、０．５秒の無音区間４２２が検出された時点を発話の切替時点である発話区切り４２０とみなし、この前後で使用する音声認識処理部を各言語について切り替える。したがって発話２００及び２０２の言語が仮に同じ言語であっても、間に０．５秒以上の無音区間が存在すれば音声認識処理部は切り替えられる。こうすることで、例えば発話２００に対する音声認識処理に時間がかかったとしても、有音部分３５２が検出されると同時に後続する発話２０２に対する各言語での音声認識処理を開始できるという効果がある。なお、発話後に０．５秒未満の無音区間を挟んで次の発話が開始されたときは、これら２つの発話は一つの発話とみなし、各言語について同一の音声認識処理部が音声認識処理を行う。ここで使用する音声認識処理は、音声認識の結果を逐次的に出力するものである。 For example, suppose that the audible portions 350 and 352 shown in Figure 7(B) are detected for utterances 200 and 202 shown in Figure 7(A). As explained using Figure 6, the LID discrimination result 362 is obtained after the end of utterance 200. For utterance 202, the LID discrimination results 372, 382, and 392 are obtained in this order. It is not generally guaranteed that utterances 200 and utterance 202 are utterances of the same language. In the example shown here, as shown in Figure 7(C), the LID discrimination result 362 for utterance 200 is Japanese (JA), and the LID discrimination results 372, 382, and 392 for utterance 202 are Japanese, English (EN), and English, respectively. The language of utterance 202 is ultimately determined to be English based on the average score of the already processed portion of the utterance section, but in any case, the relationship between the language discrimination result for utterance 200 and the language discrimination result for utterance 202 is uncertain. 
Therefore, the language used must be switched appropriately between the speech recognition of utterance 200 and that of utterance 202. In this embodiment, as shown in Figure 7(D), after the end of the preceding utterance (for example, utterance 200) is detected, the point at which a 0.5-second silent interval 422 has been detected is regarded as the utterance boundary 420, i.e., the point at which utterances are switched, and the speech recognition processing unit used for each language is switched across this boundary. Therefore, even if utterances 200 and 202 happen to be in the same language, the speech recognition processing units are switched if a silent interval of 0.5 seconds or more lies between them. As a result, even if the speech recognition processing for utterance 200 takes a long time, speech recognition of the subsequent utterance 202 in each language can start as soon as the audible portion 352 is detected. Note that if the next utterance starts after a silent interval of less than 0.5 seconds, the two utterances are treated as a single utterance, and the same speech recognition processing unit handles it for each language. The speech recognition processing used here outputs its results incrementally.
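The 0.5-second rule for delimiting utterances can be illustrated offline with a short Python sketch. The function name `group_utterances` is hypothetical, and the sketch assumes the audible portions are already available as a time-ordered list of (start, end) pairs in seconds, whereas the embodiment makes this decision as the audio arrives.

```python
def group_utterances(voiced, min_gap=0.5):
    """Merge audible portions separated by less than `min_gap` seconds.

    A silent interval of `min_gap` seconds or more marks a boundary between
    utterances; a shorter one keeps both portions in the same utterance.
    """
    utterances = []
    for start, end in voiced:
        if utterances and start - utterances[-1][1] < min_gap:
            # Short pause: extend the current utterance.
            utterances[-1] = (utterances[-1][0], end)
        else:
            # First portion, or a pause long enough to start a new utterance.
            utterances.append((start, end))
    return utterances


grouped = group_utterances([(0.0, 1.2), (1.5, 2.0), (2.8, 4.0)])
```

Here the 0.3-second pause is absorbed into one utterance, while the 0.8-second pause produces a boundary, at which a different recognizer would be selected for each language.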

なお、図７（Ｃ）に示すような例では、発話２０２の言語の判別結果としてＬＩＤ判別結果３７２、３８２及び３９２の３つが得られる。ＬＩＤ判別結果３７２が得られた時点ではＬＩＤ判別結果３８２及び３９２はまだ得られていない。したがって、例えばＬＩＤ判別結果３７２により表される言語に対応する音声認識結果が出力される。仮にＬＩＤ判別結果３８２が得られたときにその結果が他の言語であり、かつそのスコアがＬＩＤ判別結果３７２のスコアより高い場合には、発話２０２の言語が変化することになる。そうした場合には、ＬＩＤ判別結果３７２を使用して選択した音声認識結果から、ＬＩＤ判別結果３８２を使用して選択した音声認識結果に途中で出力（画面表示の場合には表示されるテキスト）の言語が変化することになる。ＬＩＤ判別結果３９２が得られたときも同様である。 In the example shown in Figure 7(C), three LID discrimination results are obtained as a result of language discrimination of the utterance 202: LID discrimination results 372, 382, and 392. When LID discrimination result 372 is obtained, LID discrimination results 382 and 392 have not yet been obtained. Therefore, for example, the speech recognition result corresponding to the language represented by LID discrimination result 372 is output. If, when LID discrimination result 382 is obtained, the result is another language and its score is higher than the score of LID discrimination result 372, the language of the utterance 202 will change. In such cases, the language of the output (the text displayed on the screen) will change midway through the speech recognition process, from the speech recognition result selected using LID discrimination result 372 to the speech recognition result selected using LID discrimination result 382. The same applies when LID discrimination result 392 is obtained.

より具体的には、言語の判別は以下のようにして行われる。例えば判別対象の言語が１０言語であるものとする。図７（Ｃ）を参照して、発話２０２に対するＬＩＤ判別結果３７２が得られた時点では、１０言語の各々に対してスコアが１個ずつ得られる。１言語に対するスコアが１個だけなので、ここではそのスコアが最大の言語（例えば日本語（ＪＡ））が選択される。その結果、ＬＩＤ判別結果３７２のときにそれまでの日本語の途中の音声認識結果が得られていれば、ＬＩＤ判別結果３７２の時点で日本語の音声認識結果が出力される。ＬＩＤ判別結果３７２の時点で日本語の音声認識結果が得られていなければ、得られた時点でその音声認識結果が出力される。 More specifically, language discrimination is performed as follows. Suppose, for example, that there are 10 candidate languages. Referring to Figure 7(C), when the LID discrimination result 372 for utterance 202 is obtained, one score is obtained for each of the 10 languages. Since there is only one score per language at this point, the language with the highest score (e.g., Japanese (JA)) is selected. Consequently, if an intermediate Japanese speech recognition result is already available when the LID discrimination result 372 is obtained, that Japanese result is output at that time. If no Japanese speech recognition result is available yet, it is output as soon as it is obtained.

一方、ＬＩＤ判別結果３８２が得られた時点では、各言語について、ＬＩＤ判別結果３７２で得られたスコアとＬＩＤ判別結果３８２で得られたスコアという２個のスコアが存在する。この実施形態では、各言語についてこのように得られた２個のスコアについて言語ごとに平均をとり、その値が最も高い言語を選択する。例えば、ＬＩＤ判別結果３７２の時点では日本語のスコアが最も高かったとしても、ＬＩＤ判別結果３８２の時点で算出された平均スコアでは英語ＥＮが最も高ければ、発話２０２に対する言語の識別結果は日本語から英語に切り替わる。したがって、ＬＩＤ判別結果３８２が得られたときに、音声認識結果は日本語から英語に切り替わる。 On the other hand, at the point when the LID discrimination result 382 is obtained, there are two scores for each language: the score obtained in the LID discrimination result 372 and the score obtained in the LID discrimination result 382. In this embodiment, the average of these two scores obtained for each language is taken, and the language with the highest average score is selected. For example, even if the score for Japanese was highest at the time of the LID discrimination result 372, if the average score calculated at the time of the LID discrimination result 382 shows that English (EN) is the highest, the language identification result for the utterance 202 will switch from Japanese to English. Therefore, when the LID discrimination result 382 is obtained, the speech recognition result switches from Japanese to English.

発話２０２に対する最終的なＬＩＤ判別結果３９２の時点でも同様である。ＬＩＤ判別結果３９２が得られた時点では、各言語について３個のスコアが得られる。各言語について、それら３個のスコアの平均値が算出される。そして平均スコアが最も高い言語が選択される。図７（Ｃ）に示すようにＬＩＤ判別結果３７２、３８２、及び３９２により各言語について得られた平均スコアによる判定結果も英語であるとすれば、英語の音声認識結果が引き続いて出力され、途中の音声認識結果が得られるごとに出力が更新される。 The same applies at the point of obtaining the final LID discrimination result 392 for utterance 202. At the point where the LID discrimination result 392 is obtained, three scores are acquired for each language. For each language, the average of these three scores is calculated. The language with the highest average score is then selected. As shown in Figure 7(C), if the judgment result based on the average scores obtained for each language by the LID discrimination results 372, 382, and 392 is also English, then the English speech recognition result is output continuously, and the output is updated each time an intermediate speech recognition result is obtained.
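The running decision by average score can be sketched in Python as follows. The class name `RunningLanguageVote` and the score values are illustrative assumptions; the sketch also assumes that every LID result carries a score for every candidate language, so that dividing by the number of results gives the per-language average.

```python
from collections import defaultdict


class RunningLanguageVote:
    """Keeps per-language score totals for one utterance."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.count = 0

    def update(self, scores):
        """Add one LID result (language -> score) and return the language
        whose average score over all results so far is highest."""
        for lang, score in scores.items():
            self.totals[lang] += score
        self.count += 1
        return max(self.totals, key=lambda lang: self.totals[lang] / self.count)


vote = RunningLanguageVote()
after_372 = vote.update({"ja": 0.6, "en": 0.4})   # only one score each
after_382 = vote.update({"ja": 0.3, "en": 0.7})   # averages: ja 0.45, en 0.55
after_392 = vote.update({"ja": 0.2, "en": 0.8})
```

With these made-up scores the decision is Japanese after the first result and English after the second and third, mirroring the sequence in Figure 7(C). A new `RunningLanguageVote` would be created at each utterance boundary.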

（４）プログラム構成 (4) Program Structure
ア．全体制御構造 A. Overall Control Structure
以上に説明したように機能するよう、コンピュータを動作させるためのコンピュータプログラムの制御構造を以下に説明する。図８にその全体構成を示す。なお、以下に示すのは上記した多言語ライブ音声認識装置５０の全体をコンピュータハードウェア及びコンピュータプログラムにより実現するものだが、これらの一部又は全体を専用のハードウェアで実現することも可能である。また汎用コンピュータを以下に説明する制御構造に従って動作するようプログラミングすることにより、汎用コンピュータが専用の多言語ライブ音声認識装置として機能する。 The control structure of the computer program for operating the computer in order to function as described above is described below. Figure 8 shows the overall structure. The following shows how the entire multilingual live speech recognition device 50 described above is realized by computer hardware and computer programs, but it is also possible to realize some or all of these with dedicated hardware. Furthermore, by programming a general-purpose computer to operate according to the control structure described below, the general-purpose computer can function as a dedicated multilingual live speech recognition device.

図８を参照して、このプログラムは、ＶＡＤスレッド、ＬＩＤスレッド及び複数言語の各々について複数設けられる音声認識スレッドの各々を起動するステップ４５０と、これら各スレッドとのコネクションを確立するステップ４５２と、音声認識の対象となる音声信号の入力を開始するステップ４５４とを含む。ここでいうスレッドとは、１つのプロセス配下において、アドレス空間を共有して動作する複数のプログラムのことをいう。 Referring to Figure 8, this program includes step 450, which starts each of the VAD thread, the LID thread, and the multiple speech recognition threads provided for each of the multiple languages; step 452, which establishes a connection with each of these threads; and step 454, which starts inputting the speech signal to be recognized. Here, a thread refers to multiple programs operating under a single process, sharing an address space.

このプログラムはさらに、入力された音声信号をバッファに蓄積し、所定量（この実施形態では１秒分）の音声信号がバッファに蓄積されるたびに、その音声信号に図４を参照して前述したような前処理を行ってＶＡＤスレッドに出力する処理を実行するステップ４５６と、ステップ４５６に続き、ＶＡＤの出力から図５を参照して前述したような有音区間の補正処理を行って有音区間の判別出力を得るステップ４５８と、ステップ４５８により得られた有音区間の判別出力に基づいて、有音区間の終了後、０．５秒以上の無音区間があればそれを発話区切りとして検出するステップ４６０と、ステップ４６０における処理の結果に応じて制御の流れを分岐させるステップ４６２と、ステップ４６２の判定が肯定のときに、各言語のＡＳＲについて、その中でアイドリング中のスレッドを次の有音区間の音声信号の音声認識用に切り替えるステップ４６６とを含む。 This program further includes: step 456, which stores the input audio signal in a buffer and, each time a predetermined amount (1 second in this embodiment) of audio signal is stored in the buffer, performs preprocessing on the audio signal as described above with reference to Figure 4 and outputs it to the VAD thread; step 458, following step 456, which performs audible section correction processing as described above with reference to Figure 5 from the VAD output to obtain an output for determining the audible section; step 460, which, based on the output for determining the audible section obtained in step 458, detects any silent section of 0.5 seconds or more after the end of the audible section as a speech segment; step 462, which branches the control flow according to the result of the processing in step 460; and step 466, which, if the determination in step 462 is affirmative, switches the idle thread within the ASR for speech recognition of the next audible section's audio signal for each language.

このプログラムはさらに、ステップ４６２の判定が否定のとき、及びステップ４６２の判定が肯定でかつステップ４６６の処理が終了したときに実行され、多言語ＡＳＲ処理部８０の中で切り替えられたＡＳＲスレッドの各々、及びＬＩＤ７８に音声データを提供するステップ４６８と、ステップ４６８の後、ＡＳＲの各スレッド及びＬＩＤ７８でそれぞれの処理を実行するステップ４７０と、ＬＩＤ７８による言語識別の結果が得られるまで待機するステップ４７２と、言語識別の結果が得られた後、各言語のＡＳＲの各スレッドの出力のうち、ＬＩＤ７８により識別された言語のＡＳＲのスレッドによる音声認識結果をそれまでの表示に変えて表示して制御をステップ４５６に戻すステップ４７４とを含む。 This program further includes step 468, which is executed when the determination in step 462 is negative, and when the determination in step 462 is positive and the processing in step 466 is completed, and which provides audio data to each of the ASR threads switched within the multilingual ASR processing unit 80 and to the LID 78; step 470, which executes the respective processes in each ASR thread and LID 78 after step 468; step 472, which waits until the language identification result from LID 78 is obtained; and step 474, which, after the language identification result is obtained, displays the audio recognition result from the ASR thread of the language identified by LID 78 from the output of each ASR thread for each language, changing the previous display, and returns control to step 456.

イ．ＶＡＤ B. VAD
（ア）前処理 (a) Preprocessing
図９は、ＶＡＤのうちの前処理を実現するプログラムの制御構造を示すフローチャートである。図９を参照して、このプログラムは、変数ｉにゼロを代入するステップ５００と、入力データを読みバッファ［ｉ］に蓄積するステップ５０２と、ステップ５０２の結果、バッファに蓄積された音声データが１秒分になるまでステップ５０２を繰り返すステップ５０４とを含む。この実施形態では、バッファとしてバッファ［０］及びバッファ［１］の２つを少なくとも準備し、ｉの値が偶数のとき（ｉ％２＝０）のときにはバッファ［０］に、ｉの値が奇数のとき（ｉ％２＝１）のときにはバッファ［１］に、それぞれ音声データを蓄積する。もちろんこのような方法ではなく他の方法で音声データの蓄積を行ってもよい。例えばリングバッファに音声データを蓄積し、そのうちの１秒分の音声データをその開始位置から読み出すようにしてもよい。音声データがリングバッファの容量を超えるときには、リングバッファ内に蓄積されている音声データを上書きしていけばよい。なお記号「％」はモジュロ演算子を表す。 Figure 9 is a flowchart showing the control structure of a program that implements the preprocessing part of VAD. Referring to Figure 9, this program includes step 500, which assigns zero to the variable i; step 502, which reads the input data and stores it in buffer [i]; and step 504, which repeats step 502 until the audio data stored in the buffer amounts to one second. In this embodiment, at least two buffers, buffer [0] and buffer [1], are prepared; audio data is stored in buffer [0] when the value of i is even (i % 2 = 0) and in buffer [1] when the value of i is odd (i % 2 = 1). Of course, audio data may be stored in some other way. For example, audio data may be stored in a ring buffer, and one second of audio data may be read out from its start position. When the audio data exceeds the capacity of the ring buffer, the audio data already stored in the ring buffer is simply overwritten. The symbol "%" represents the modulo operator.

このプログラムはさらに、ステップ５０４の判定が肯定になったこと、すなわち１秒分の音声データがバッファに蓄積されたことに応答して実行され、その音声データの先頭に０．１秒分の無音区間を付加するステップ５０６と、ステップ５０６により処理された音声データをＶＡＤに提供するステップ５０８と、変数ｉの値に１を加算するステップ５１０と、入力される音声データを読み、バッファ［ｉ％２］に蓄積するステップ５１２と、バッファ［ｉ％２］に１秒分の音声データが蓄積されるまでステップ５１２を繰り返すステップ５１４とを含む。ステップ５１２及び５１４は、ステップ５０２及び５０４の処理と実質的に同じものである。ただしステップ５０２ではバッファ［０］にデータが蓄積されるのに対し、ステップ５１２では変数ｉの値が奇数か偶数かによりバッファ［１］とバッファ［０］とが切り替えて使用される。 This program further includes step 506, which is executed in response to the determination in step 504 becoming affirmative, i.e., when one second of audio data has been accumulated in the buffer, and which adds a 0.1-second silent interval to the beginning of that audio data; step 508, which provides the audio data processed in step 506 to the VAD; step 510, which adds 1 to the value of variable i; step 512, which reads the incoming audio data and accumulates it in buffer [i%2]; and step 514, which repeats step 512 until one second of audio data has been accumulated in buffer [i%2]. Steps 512 and 514 are substantially the same as steps 502 and 504. However, whereas step 502 accumulates data in buffer [0], step 512 alternates between buffer [1] and buffer [0] depending on whether the value of variable i is odd or even.

This program further includes step 516, which is executed when it is determined in step 514 that one second of audio data has accumulated in buffer [i%2], and which prepends the latter 0.5 seconds of audio data in buffer [(i-1)%2] to the audio data in buffer [i%2]; step 518, which further prepends a 0.1-second silent section, as in step 506; and step 520, which provides the audio data of buffer [i%2] processed in steps 516 and 518 to the VAD and returns control to step 510. In response to receiving this audio data, the VAD detects voiced sections starting from the beginning of the audio data. Each voiced section is output from the VAD as a [start time, end time] pair.
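The preprocessing loop of Figure 9 can be illustrated with the following minimal sketch. It is not the program of the embodiment: the 16 kHz sampling rate, the function name `vad_inputs`, and the use of Python lists of samples are assumptions made only for illustration.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000          # assumed sampling rate (not stated in the text)
CHUNK = SAMPLE_RATE           # 1 second of samples
OVERLAP = SAMPLE_RATE // 2    # latter 0.5 second of the previous chunk
PAD = SAMPLE_RATE // 10       # 0.1 second of leading silence


def vad_inputs(samples: Iterable[float]) -> Iterator[List[float]]:
    """Yield the padded audio that would be handed to the VAD per 1-second chunk."""
    buffers: List[List[float]] = [[], []]             # buffer[0] and buffer[1]
    i = 0                                             # step 500
    for s in samples:
        buffers[i % 2].append(s)                      # steps 502 / 512
        if len(buffers[i % 2]) < CHUNK:               # steps 504 / 514
            continue
        current = buffers[i % 2]
        if i == 0:
            framed = [0.0] * PAD + current            # step 506
        else:
            tail = buffers[(i - 1) % 2][-OVERLAP:]    # step 516
            framed = [0.0] * PAD + tail + current     # step 518
        yield framed                                  # steps 508 / 520
        i += 1                                        # step 510
        buffers[i % 2] = []                           # start refilling the other buffer
```

The first yielded block is 1.1 seconds long and every later block is 1.6 seconds long, which matches the layout the correction step described later has to undo.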

(b) Correction
The output of the VAD relates to the audio data after the preprocessing shown in Figure 9. Except for the first second of audio data, this audio data contains not only the audio data of the actual target section but also the latter 0.5 seconds of audio data from the immediately preceding target section. In addition, a 0.1-second silent section is added to the beginning of the audio data. The VAD output may therefore include voiced sections other than those of the actual target section. If such a voiced section has already been sent to the processing buffer, it is removed from the VAD output in the correction process described below; otherwise, it is sent as is to the control unit 74 and the speech segment detection unit 76.

Referring to Figure 10, this program includes step 550, which receives the VAD output; step 552, which deletes any part of a voiced section indicated by the VAD output received in step 550 that lies before the target one second (i.e., in the 0.1-second silent part); and step 554, which outputs the pairs of start time and end time of the voiced sections remaining after the processing in step 552. Specifically, in step 552, if a voiced section happens to be detected within the added 0.1-second silent section, it is deleted. If a voiced section continues from the added silent section into the one-second target section, its start time is corrected to the beginning of the target one second. Furthermore, any silent section shorter than 0.5 seconds within the one-second target section is treated as a voiced section.

This program further includes step 556, which is executed after step 554 and receives the VAD output in the same manner as step 550; step 558, which corrects the VAD output received in step 556; and step 560, which outputs the voiced-section detection data corrected in step 558 and returns control to step 556. In step 558, among the voiced sections detected in the preceding 0.5 seconds of audio data prepended to the target one-second section, those already sent to the control unit 74 and the speech segment detection unit 76 in the previous round of processing are deleted. Any voiced sections detected within the 0.1-second silent section prepended before that are also deleted.

By performing this correction, voiced sections can be detected for each one-second section of the audio data.
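As a rough illustration of the rules applied in step 552 to the first one-second window, consider the sketch below. The function name, the tuple representation of a voiced section, and the choice to keep times on the padded timeline are all assumptions; the duplicate removal of step 558 is not covered.

```python
from typing import List, Tuple

Segment = Tuple[float, float]   # (start, end) in seconds on the padded timeline

PAD = 0.1       # leading silence added in preprocessing
MIN_GAP = 0.5   # silences shorter than this are treated as voiced


def correct_first_window(segments: List[Segment]) -> List[Segment]:
    """Apply the step-552 rules to VAD output for the first window (a sketch)."""
    kept: List[Segment] = []
    for start, end in sorted(segments):
        if end <= PAD:
            continue                     # lies wholly inside the added silence
        start = max(start, PAD)          # move the start to the head of the target second
        if kept and start - kept[-1][1] < MIN_GAP:
            # Bridge a short silence by merging with the previous voiced section.
            kept[-1] = (kept[-1][0], max(kept[-1][1], end))
        else:
            kept.append((start, end))
    return kept
```

For example, the input `[(0.02, 0.08), (0.05, 0.4), (0.6, 0.9)]` collapses into a single voiced section from 0.1 to 0.9 seconds.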

C. ASR Switching
In this embodiment, multiple ASR threads are launched for each language targeted for speech recognition. By switching among these ASR threads for each utterance, the speech recognition of the next utterance can run in parallel even if the speech recognition of one utterance takes a long time. Processing is expected to take particularly long when speech is recognized in a language that does not match the language of the audio data. It is therefore desirable to prepare multiple ASR threads for each of the languages, as in this embodiment, and to switch between them at utterance boundaries. In this embodiment, when the VAD detects a silent section of 0.5 seconds or more following a voiced section, the ASR thread is switched at that point, and the subsequent audio data is given to the newly selected ASR thread.

Referring to Figure 11, the program for ASR switching includes step 600, which executes step 602, described below, for each of the languages that may be processed.

Step 602 includes step 610, which selects the ASR thread following the one previously selected for that language; step 612, which determines whether the selected ASR thread is idle and branches the control flow accordingly; step 616, which attempts to select the next ASR thread when the determination in step 612 is negative; step 618, which branches the control flow according to whether any of the candidate threads is free; step 620, which creates a new thread when the determination in step 618 is negative, i.e., when no free thread remains; and step 621, which switches the input destination of the audio data to the newly created ASR thread. When the determination in step 618 is affirmative, that thread is selected and control returns to step 612. After step 621, control proceeds to step 622, described later.

This program further includes step 614, which switches the input destination of the audio data to the selected ASR thread when the determination in step 612 is affirmative; step 622, which is executed after step 614 and after step 620, and which stores the thread ID of the newly selected ASR thread in association with the utterance whose audio data is being input to that thread; and step 624, which changes the state of the ASR thread selected in step 622 to busy.
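The selection logic of Figure 11 for a single language can be sketched as follows. This models only the idle/busy bookkeeping, not real threads, and the class and attribute names (`AsrWorker`, `LanguagePool`, `utterance_to_worker`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AsrWorker:
    worker_id: int
    busy: bool = False


@dataclass
class LanguagePool:
    workers: List[AsrWorker] = field(default_factory=list)
    last: int = -1                                   # index selected last time
    utterance_to_worker: Dict[int, int] = field(default_factory=dict)

    def switch(self, utterance_id: int) -> AsrWorker:
        """Pick the worker that receives the next utterance (steps 610 to 624)."""
        n = len(self.workers)
        chosen: Optional[int] = None
        for offset in range(1, n + 1):               # steps 610, 616, 618: round robin
            idx = (self.last + offset) % n
            if not self.workers[idx].busy:           # step 612: is it idle?
                chosen = idx
                break
        if chosen is None:                           # step 620: no idle worker left
            self.workers.append(AsrWorker(worker_id=n))
            chosen = n
        worker = self.workers[chosen]
        self.last = chosen
        self.utterance_to_worker[utterance_id] = worker.worker_id   # step 622
        worker.busy = True                                          # step 624
        return worker
```

In a real system the worker would set `busy` back to `False` when recognition of its utterance completes; here the caller does that explicitly.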

For a given language, at least two ASR threads, and preferably three or more, should be launched. With at least two ASR threads, the two can be operated alternately. In that case, if speech recognition of the next utterance must begin while the two ASR threads are still recognizing the two preceding utterances, a new ASR thread is created and assigned to the new audio data. If three or more ASR threads become busy, further threads may be launched.

Note that the upper limit on the number of ASR threads launched for each language and the upper limit on the number of target languages depend on the performance of the computer on which this program runs, and therefore cannot be specified in general terms.

D. Control of LID
In this embodiment, LID is performed under the following control. As already described, the LID processing itself is performed by a language identification model consisting of a trained neural network. Here, we describe the control structure of the program that determines when, and with what audio data, this neural network is invoked to obtain LID results.

Referring to Figure 12, this program includes step 700, which waits until a voiced section to be processed is detected and, when one is detected, reads the data of that voiced section from memory; step 702, which sets the read start position to the beginning of the audio data read in step 700; step 704, which reads audio data from the read start position for either 1.5 seconds or the time remaining until the end of the voiced section, whichever is shorter; step 706, which feeds the read audio data into the language identification model; step 708, which advances the read start position by 0.7 seconds; step 710, which determines whether the read start position set in step 708 is before the end time of the voiced section and, if so, returns control to step 704; and step 712, which is executed when the determination in step 710 is negative, i.e., when processing has reached the end of the current voiced section, and which advances the processing target to the next voiced section and returns control to step 700.
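The windowing performed by steps 702 to 710 can be sketched as below. Times are held as integer milliseconds to avoid floating-point drift; that representation and the function name `lid_windows` are assumptions made for illustration.

```python
from typing import List, Tuple

WINDOW_MS = 1500   # maximum audio length fed to the LID model per call
SHIFT_MS = 700     # advance of the read start position between calls


def lid_windows(start_ms: int, end_ms: int) -> List[Tuple[int, int]]:
    """Return the (start, end) ranges fed to the LID model for one voiced section."""
    windows: List[Tuple[int, int]] = []
    pos = start_ms                                            # step 702
    while pos < end_ms:                                       # step 710
        windows.append((pos, min(pos + WINDOW_MS, end_ms)))   # steps 704 and 706
        pos += SHIFT_MS                                       # step 708
    return windows
```

For a 2-second voiced section this yields three windows, `(0, 1500)`, `(700, 2000)`, and `(1400, 2000)`, so the later windows are shorter than 1.5 seconds because they are cut off at the end of the section.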

E. Display Update
The speech recognition results are displayed by a program having the following control structure. Referring to Figure 13, this program includes step 750, which branches the control flow according to whether the LID has output a language identification result; step 752, which is executed when the determination in step 750 is affirmative and stores in memory the language ID of the language indicated by the LID output as information representing the current language; and step 754, which is executed both after step 752 and when the determination in step 750 is negative, and which displays the recognition result being output by the ASR thread that corresponds to the current language ID, among the multiple ASR threads operating on the current utterance, and returns control to step 750.

2. Operation
(1) Startup
Referring to Figure 8, in step 450, the VAD thread, the LID thread, and each of the multiple speech recognition threads provided for each of the languages are started. These correspond to the speech interval detection unit 72, the LID 78, and the multilingual ASR processing unit 80 shown in Figure 1, respectively. In step 452, the main routine (corresponding to the control unit 74 in Figure 1) establishes connections with each of these threads.

In step 454, when the input 52, which is audio data, is provided, the buffer 70 begins accumulating that audio data. In step 456, the process shown in Figure 9 is executed. That is, the audio data is accumulated in buffer [0] (step 502).

(2) Preprocessing for the first input
Once the first second of audio data has accumulated (the determination in step 504 is affirmative), a 0.1-second silent section is added to the beginning of that audio data (step 506), and the audio data is fed into the VAD (step 508). In response to this input, the VAD 102 begins outputting information representing voiced sections (pairs of start time and end time).

In step 510, the storage destination for the audio data is changed to buffer [1]. Thereafter, the processing for the second and subsequent inputs is executed.

(3) Preprocessing for the second and subsequent inputs
For the second and subsequent inputs, the input audio data is accumulated in buffer [0] or buffer [1] in step 512. When one second of audio data has accumulated (the determination in step 514 is affirmative), in step 516 the latter 0.5 seconds of the audio data held in the other buffer (buffer [(i-1)%2]), i.e., the buffer not currently used for accumulation, is prepended to the audio data held in the current buffer (buffer [i%2]). In step 518, a 0.1-second silent section is further prepended. In step 520, this audio data is fed into the VAD.

This operation is then repeated for as long as audio data continues to be input.

(4) Correction of VAD output
The VAD 102 shown in Figure 1 receives the audio data preprocessed by the preprocessing unit 100 and outputs [start time, end time] pairs indicating the voiced sections within that audio data. This output may include voiced sections detected in the silent section added by the preprocessing and in the latter half of the audio data of the immediately preceding period. By executing the program whose control structure is shown in Figure 10, the correction unit 104 deletes any of these voiced sections that have already been sent to the control unit 74 and the speech segment detection unit 76.

(5) Control of LID and switching of ASR
Referring to Figure 1, the control unit 74 provides the audio data of the voiced sections detected by the VAD 102 to both the LID 78 and the multilingual ASR processing unit 80. The LID 78 is controlled by the program having the control structure shown in Figure 12. For each input voiced section, it reads audio data with a length of 1.5 seconds and a shift of 0.7 seconds and feeds it into the language identification model. This processing continues for each voiced section until the end of that section. The length of the audio fed into the language identification model is at most 1.5 seconds; if the utterance ends sooner, the audio extends only to the end of the utterance.

The language identification model performs language identification on the 1.5-second audio data described above and outputs the result once it is obtained. If further voiced-section audio is input, the language identification model begins processing that audio data. Because this processing is repeated, as shown in Figure 6, a language identification result is output at least once for each voiced section, and two or more times if the voiced section is long. The multilingual ASR processing unit 80 receives this language identification result and displays the speech recognition result for the corresponding language.

The multilingual ASR processing unit 80 operates as follows. In this embodiment, three ASR threads are created for each language. The control unit 74 inputs the audio data to the ASR processing unit (corresponding to an ASR thread) for each language within the multilingual ASR processing unit 80. At the start of the program, one of the ASR threads for each language is selected as the input destination for the audio data. In other words, the first voiced section is processed, for each language, by the ASR thread selected as the input destination. At this point, the language of the speech is unknown. Each ASR thread that receives the audio data performs speech recognition on the assumption that the audio is in the language it handles. Once a language identification result is obtained, the multilingual ASR processing unit 80 selects, from among the outputs of the ASR threads for the various languages, the output of the ASR thread for the language specified by that result. In response to the LID 78 outputting an identification result, the result display control unit 82 displays the speech recognition result then being output from the multilingual ASR processing unit 80 on a display device as the display 54 in Figure 1, or transmits it as text data to a remote computer.

If a voiced section is long, the language identification result may be output several times, as shown in Figures 6(C) to 6(G). Because the ASR threads output speech recognition results incrementally, the output is updated each time. If the results of multiple language identifications differ from one another, as with the language identification results for the voiced section 352 in Figure 7(C), the speech recognition result for the language selected by majority vote, or by the identification result with the highest confidence, is output. For example, when the language is determined by majority vote, in the example shown in Figure 7(C) the speech recognition result is first displayed in Japanese and then switches to English partway through.
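The majority-vote variant can be sketched as a running tally over the successive LID results for one voiced section. The text does not say how ties are resolved; keeping the currently displayed language on a tie is an assumption of this sketch, as are the function name and the language codes.

```python
from collections import Counter
from typing import Iterable, List, Optional


def running_majority(lid_results: Iterable[str]) -> List[str]:
    """Return the language selected after each successive LID result."""
    counts: Counter = Counter()
    current: Optional[str] = None
    shown: List[str] = []
    for lang in lid_results:
        counts[lang] += 1
        # Switch only when the new language strictly overtakes the current one
        # (assumed tie-breaking rule: the displayed language is kept on a tie).
        if current is None or counts[lang] > counts[current]:
            current = lang
        shown.append(current)
    return shown
```

With the sequence `["ja", "en", "en"]`, the selection is Japanese after the first two results and English after the third, which reproduces the kind of mid-utterance switch described for Figure 7(C).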

When the speech segment detection unit 76 shown in Figure 1 detects a 0.5-second silent section following a voiced section, it sends an utterance boundary detection signal to the multilingual ASR processing unit 80. On receiving this signal, the multilingual ASR processing unit 80 switches, for each language, the input destination of the audio data from the ASR thread that has been active until then to another, idle ASR thread. At this point it is quite likely that two ASR threads are running. However, the preceding ASR thread becomes idle once its speech recognition completes. In this way, among the multiple ASR threads for each language, the thread that receives the audio signal is switched each time an utterance boundary is detected.
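A simplified, offline view of the boundary rule is sketched below: consecutive voiced sections are grouped into one utterance unless they are separated by at least 0.5 seconds of silence. The embodiment operates on streaming data, so the batch form, the function name, and the list representation are assumptions for illustration only.

```python
from typing import List, Tuple

Segment = Tuple[float, float]   # (start, end) of a voiced section in seconds
BOUNDARY_SILENCE = 0.5          # silence length that marks an utterance boundary


def split_utterances(voiced: List[Segment]) -> List[List[Segment]]:
    """Group time-ordered voiced sections into utterances."""
    utterances: List[List[Segment]] = []
    for seg in voiced:
        previous_end = utterances[-1][-1][1] if utterances else None
        if previous_end is not None and seg[0] - previous_end < BOUNDARY_SILENCE:
            utterances[-1].append(seg)    # same utterance: the gap is too short
        else:
            utterances.append([seg])      # boundary: an ASR switch would occur here
    return utterances
```

Each new inner list corresponds to a point at which the input destination would be switched to another ASR thread.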

Section 2 Evaluation
(1) Latency Calculation Method
To convert an utterance into text immediately after it is spoken, whether for display or as input to automatic translation as in the above embodiment, the time from when the utterance is spoken until it is recognized and converted into text, i.e., the latency, must be kept as short as possible. However, conventional speech recognition systems have been evaluated by per-utterance latency, accuracy (word error rate: WER), speed (real-time factor: RTF), and the like. These metrics are useful for evaluating the performance of a speech recognition system, but they are insufficient for evaluating so-called live speech recognition, in which multiple consecutive utterances are recognized continuously while they are being spoken.

For example, when latency is evaluated per utterance, longer utterances produce longer latency, so the influence of utterance length must be removed. Moreover, if the audio data of an utterance is aligned with its transcription in advance and the speech recognition result for that audio data is compared with the prior alignment to obtain a metric, misrecognition and similar factors make the calculation error large.

In live speech recognition, low latency is an important goal. The time required for segmentation, language identification, speaker identification, speech recognition, and so on all affects latency, and these effects compound one another. So that the evaluation reflects the system as a whole and the effect of improvements to each technology, this embodiment calculates latency on a word basis using the temporary recognition output that the speech recognition engine emits during speech recognition. The temporary recognition output here is the provisional speech recognition result that the speech recognition engine outputs externally each time it has recognized a speech signal of a certain duration.

For example, referring to Figure 14, consider the case where a speech signal from the client 800 undergoes live speech recognition on the server 802. At the start of the live speech recognition process, the client 800 sends a pause signal 810, which represents a pause, to the server 802, and then sends data 812, 814, 816, and 820 to the server 802 at regular intervals. The server 802 performs speech recognition on this data and, each time it has recognized a speech signal of a certain duration within the signal being processed, sends temporary recognition results 818, 822, 826, and 834 to the client 800. In the example shown in Figure 14, the server 802 also sends the final recognition result 836 to the client 800 when recognition is complete. Meanwhile, the client 800 sends the audio data of the utterance that follows data 820 to the server 802 as data 828, 830, and 832. The latency calculation proposed here uses temporary recognition results such as the temporary recognition result 818.

Note that the server 802 may send to the client 800 not only temporary recognition results such as the temporary recognition result 818 but also what is called a partial recognition result 824. Like the temporary recognition result 818, the partial recognition result 824 is a kind of provisional recognition result, but it is output only when the speech recognition engine satisfies internal conditions that cannot be controlled from outside.

In this embodiment, the latency of live speech recognition is measured using the recognition results that the speech recognition engine outputs in this way during the speech recognition process. The following method assumes that the ASR outputs the utterance time of each word as one of its functions, and uses this to calculate the latency as follows.

(a) From the words output in the temporary speech recognition results (temporary recognition results) and the partial speech recognition results, exclude those already recognized in the previous temporary recognition result.

(b) Sum the times from utterance to recognition result output for those words, and divide by the number of words.

(c) The latency is the value obtained by subtracting the value from (b) above from the output time of the speech recognition result.

(2) Experimental Evaluation
Referring to Figure 15, the latency calculation is explained using speech recognition of the utterance "please wait here. I'm going to look for a rope or something." as an example. The first ASR output is for "please wait here." Including the silent section at the beginning and the silent section at the end, the utterance times of the respective items are 0.00, 0.60, 0.81, 1.17, and 1.35 seconds, and the output time of the speech recognition result is 3.50 seconds. To calculate the latency, the values for the three words shown with double underlines (0.60, 0.81, and 1.17 seconds) are added together and the sum is divided by the number of words (3), giving 0.86. This value is subtracted from the result output time of 3.50 seconds, which yields 2.64. This is the word-based latency of the speech recognition processing for the first utterance.

The second ASR output is the speech recognition result for the whole of the utterance given above. Its result output time is 4.32 seconds. From this output, the underlined words that were already output the first time are excluded, the values for the remaining nine words shown with double underlines are added together, the sum is divided by the number of words (9), and the result (approximately 2.24) is subtracted from the result output time of 4.32. This yields 2.08, which is the word-based latency for the second sentence of the utterance.
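The arithmetic of the first worked example above can be checked with a few lines of Python. The function name and argument names are hypothetical; only the numbers 0.60, 0.81, 1.17, and 3.50 come from the text.

```python
from typing import Sequence


def word_latency(new_word_times: Sequence[float], output_time: float) -> float:
    """Word-based latency: result output time minus the mean time of the new words."""
    if not new_word_times:
        raise ValueError("no newly recognized words in this output")
    mean_time = sum(new_word_times) / len(new_word_times)
    return output_time - mean_time


# First ASR output: (0.60 + 0.81 + 1.17) / 3 = 0.86, and 3.50 - 0.86 = 2.64.
first = word_latency([0.60, 0.81, 1.17], 3.50)
```

Because the mean is taken only over words not present in the previous output, each output contributes a latency figure that is independent of how long the utterance has already run.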

Note that here the latency is calculated using only the temporary recognition results. Needless to say, the latency may also be calculated using the final recognition result.

Figure 17 shows the evaluation of the live speech recognition processing according to the above embodiment using the latency calculated by this method, in comparison with other metrics, and Figure 16 shows an example of the utterances used for the measurement.

In the table shown in Figure 17, the upper row shows the evaluation of ASR alone, and the lower row shows the evaluation of the live speech recognition processing. The LID values indicate accuracy. The evaluation values for English and Japanese (four cells) are all WER. The second row of the column labeled "Latency" is the value measured for the above embodiment.

As can be seen from Figure 17, live speech recognition shows higher LID accuracy than ASR alone. For both English and Japanese, the live speech recognition processing performed slightly worse than ASR alone, which is unaffected by language identification. Latency cannot be measured in the evaluation of ASR alone; it can be measured only for live processing.

Section 3 Modifications
In the above embodiment, the same number of ASR threads is used for each language. However, this invention is not limited to such an embodiment. A different number of ASR threads may be created for each language. In addition, each thread is created at the start of the program, but this is not limiting; each thread may instead be created when it becomes necessary. For example, if all the threads created for a certain language are busy when the next utterance needs to be recognized, a new thread may be created rather than interrupting the processing of one of the busy threads.

Furthermore, in the above embodiment, the results of the live speech recognition processing are displayed on the screen slightly after the utterance. However, the invention is not limited to this embodiment. For example, the speech recognition results may be transmitted to a remote computer, or input to a live automatic translation device whose translation results are output in some form. It is also possible, for example, to process multiple multilingual utterance streams obtained at multiple venues separately in multiple processes running on a single server, and to translate each result into multiple languages for output. In the same way as for language identification, introducing a multi-speaker identification model makes it possible to attach speaker information, not just language, to the speech recognition results.

In the above embodiment, an utterance boundary is regarded as detected when there is a silent portion of 0.5 seconds or more. However, the invention is not limited to such an embodiment. For example, an utterance boundary may also be regarded as detected when a change in the language of the utterance is detected based on the result of language identification, or when a change of speaker is detected using the result of speaker identification.
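The three boundary conditions mentioned above (silence of 0.5 seconds or more, a change of language, and a change of speaker) might be combined as in the following sketch. The `Frame` structure, its per-frame `language` and `speaker` labels, and the function name `find_boundaries` are assumptions made for illustration. The embodiment does not specify at what granularity language and speaker identification results become available.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    """Hypothetical analysis frame with optional LID / speaker labels."""
    duration: float                  # seconds
    voiced: bool
    language: Optional[str] = None
    speaker: Optional[str] = None


def find_boundaries(frames: List[Frame],
                    min_silence: float = 0.5) -> List[Tuple[float, str]]:
    """Return (time, reason) pairs for each detected utterance boundary."""
    boundaries = []
    t = 0.0                # start time of the current frame
    silence = 0.0          # length of the current run of silence
    in_utterance = False
    prev_lang = prev_spk = None
    for f in frames:
        if f.voiced:
            if in_utterance:
                if prev_lang and f.language and f.language != prev_lang:
                    boundaries.append((t, "language"))
                elif prev_spk and f.speaker and f.speaker != prev_spk:
                    boundaries.append((t, "speaker"))
            in_utterance = True
            silence = 0.0
            prev_lang = f.language or prev_lang
            prev_spk = f.speaker or prev_spk
        else:
            silence += f.duration
            if in_utterance and silence >= min_silence:
                # Reported at the moment the silence threshold is reached.
                boundaries.append((t + f.duration, "silence"))
                in_utterance = False
                prev_lang = prev_spk = None
        t += f.duration
    return boundaries
```

In this sketch a silence shorter than `min_silence` inside an utterance does not produce a boundary, while a change of language or speaker produces one even when there is no silence at all.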

In the above embodiment, multiple threads are run on a single CPU in a single computer system. However, the invention is not limited to such an embodiment. With a CPU that has multiple cores, the above processing can be realized by creating threads separately on each core. Furthermore, by using multiple mutually independent CPUs, the multilingual live speech recognition system according to the above embodiment can also be realized as a distributed system.

[Note]
Other aspects of the above embodiment are noted below.

(1) The information output by the language identification model may include a plurality of scores, each indicating the probability that the audio signal of a sub-interval is in a respective one of the plurality of languages, and the language determination means may include condition verification means for outputting, in response to the language identification model having output the scores for each of the sub-intervals, the identifier of the language, among the plurality of languages, whose score satisfies a predetermined condition.

(2) The condition verification means may output the identifier of the language whose score, as output by the language identification model for each of the sub-intervals, was the maximum in the largest number of sub-intervals.

(3) The condition verification means may output the identifier of the language for which the average of the scores output by the language identification model for each of the sub-intervals is highest.
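The two conditions in notes (2) and (3) can give different answers for the same scores. The following sketch illustrates both under the assumption that the language identification model returns, for every sub-interval, a dictionary mapping each language identifier to a score; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Scores = Dict[str, float]   # language identifier -> score for one sub-interval


def language_by_vote(per_subinterval: List[Scores]) -> str:
    """Note (2): language that had the top score in the most sub-intervals."""
    if not per_subinterval:
        raise ValueError("at least one sub-interval is required")
    winners = [max(scores, key=scores.get) for scores in per_subinterval]
    return Counter(winners).most_common(1)[0][0]


def language_by_mean(per_subinterval: List[Scores]) -> str:
    """Note (3): language with the highest score averaged over sub-intervals."""
    if not per_subinterval:
        raise ValueError("at least one sub-interval is required")
    totals: Dict[str, float] = {}
    for scores in per_subinterval:
        for lang, score in scores.items():
            totals[lang] = totals.get(lang, 0.0) + score
    n = len(per_subinterval)
    return max(totals, key=lambda lang: totals[lang] / n)


example = [
    {"ja": 0.55, "en": 0.45},
    {"ja": 0.55, "en": 0.45},
    {"ja": 0.05, "en": 0.95},
]
print(language_by_vote(example))   # ja wins two of the three sub-intervals
print(language_by_mean(example))   # en has the higher average score
```

Voting ignores how confident the model was in each sub-interval, whereas averaging lets a single very confident sub-interval outweigh several marginal ones.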

(4) The signal adding means may include: tail portion storage means for storing, for at least the immediately preceding target section, a tail portion of a second predetermined length at its end; first section processing means for adding a silent section of a first predetermined length to the beginning of the first target section and outputting the result; and subsequent section processing means for, for each target section after the first, adding to the beginning of that target section the tail portion of the second predetermined length of the immediately preceding target section stored in the tail portion storage means, further adding a silent section of the first predetermined length immediately before that tail portion, and outputting the result. The second predetermined length may be longer than the first predetermined length and shorter than the length of the target section.
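The arrangement in note (4) might be sketched as below, with audio represented as plain lists of samples. The function names, the use of sample counts for the two predetermined lengths, and the way `correct_span` shifts a detected span back onto the original section are assumptions for illustration, not details taken from the embodiment.

```python
from typing import List, Optional, Tuple


def add_context(sections: List[List[float]],
                silence_len: int,
                tail_len: int) -> List[Tuple[List[float], int]]:
    """Prefix each target section with [silence] or [silence + previous tail].

    Returns (padded_section, prefix_length) pairs.
    """
    if silence_len <= 0 or tail_len <= 0:
        raise ValueError("both lengths must be positive")
    silence = [0.0] * silence_len          # first predetermined length
    stored_tail: Optional[List[float]] = None
    padded = []
    for section in sections:
        if stored_tail is None:
            prefix = silence               # first target section
        else:
            prefix = silence + stored_tail # later target sections
        padded.append((prefix + list(section), len(prefix)))
        stored_tail = list(section[-tail_len:])  # second predetermined length
    return padded


def correct_span(span: Tuple[int, int],
                 prefix_len: int) -> Optional[Tuple[int, int]]:
    """Drop the part of a detected span that lies in the added prefix."""
    start, end = span
    if end <= prefix_len:
        return None                        # span lies entirely in the prefix
    return (max(0, start - prefix_len), end - prefix_len)
```

For example, with `silence_len=2` and `tail_len=3`, the second of two four-sample sections is preceded by two zero samples followed by the last three samples of the first section, and a span detected at samples 1 to 7 of that padded signal is corrected to samples 0 to 2 of the original section.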

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of those claims.

50 Multilingual live speech recognition device
70 Buffer
72 Speech segment detection unit
74 Control unit
76 Utterance boundary detection unit
78 LID
80 Multilingual ASR processing unit
82 Result display control unit
100 Preprocessing unit
102 VAD
104 Correction unit
200, 202 Utterance
210, 212, 214, 216, 218 Period
220, 232, 234, 300, 302, 304, 310, 312, 314, 350, 352 Sounded portion
222, 238, 244, 254, 264 Silent portion
230, 236, 240, 250, 260, 360, 370, 380, 390 Audio signal
242, 252, 262 Added section
362, 372, 382, 392 LID result
420 Utterance boundary
422 Silent section
812, 814, 816, 820, 828, 830, 832 Data
818, 822, 826, 834 Temporary recognition result
824 Partial recognition result
836 Final recognition result

Claims

A speech recognition device comprising:
speech segment detection means for detecting the start and end of a speech segment in an audio signal;
speech segmentation detection means for outputting a speech segmentation signal indicating an utterance boundary in response to the end of a speech segment detected by the speech segment detection means and the presence of a silent period of a predetermined time or longer;
language identification means for identifying, in response to the speech segmentation signal, the language of the utterance in the immediately preceding speech segment and outputting an identifier of that language;
speech recognition means including a plurality of speech recognition processing means for a plurality of different languages, for performing speech recognition on the audio signal of the speech segment in response to the detection of the start of the speech segment by the speech segment detection means; and
selection means for selecting and outputting, in response to the identifier, the output of the speech recognition processing means for the language indicated by the identifier from among the plurality of speech recognition processing means,
wherein the speech segment detection means includes:
division means for dividing the audio signal into target sections of a predetermined length;
signal adding means for adding, immediately before each of the target sections, an additional signal that includes at least a silent section of a first predetermined length;
sounded section detection means for detecting, as distinguished from a silent section, a sounded section included in the target section to which the additional signal has been added by the signal adding means; and
correction means for correcting the sounded section by deleting, from the sounded section detected by the sounded section detection means, the part corresponding to the additional signal.

The speech recognition device according to claim 1, wherein the language identification means includes:
sub-segment signal generation means for generating, in response to the detection of the end of a speech segment by the speech segment detection means, sub-segment audio signals of a predetermined length and a predetermined shift amount from the audio signal of the speech segment;
a language identification model that receives the audio signal of each of the sub-segments and outputs information indicating which of the plurality of languages the sub-segment is in; and
language determination means for determining, in response to the information output by the language identification model, the language of the audio signal of the speech segment and outputting an identifier of that language.

The speech recognition device according to claim 1 or 2, further comprising speaker identification means for identifying, in response to the speech segmentation signal, the speaker of the utterance in the immediately preceding speech segment and attaching information about the speaker to the output of the speech recognition means.

A computer program that causes a computer to function as:
speech segment detection means for detecting a speech segment in an audio signal;
speech segmentation detection means for outputting a speech segmentation signal indicating an utterance boundary in response to the end of a speech segment detected by the speech segment detection means and the presence of a silent period of a predetermined time or longer;
language identification means for identifying, in response to the speech segmentation signal, the language of the utterance in the immediately preceding speech segment and outputting an identifier of that language;
speech recognition means including a plurality of speech recognition processing means for a plurality of different languages, for performing speech recognition on the audio signal of the speech segment in response to the detection of the speech segment by the speech segment detection means; and
selection means for selecting and outputting, in response to the identifier, the output of the speech recognition processing means for the language indicated by the identifier from among the plurality of speech recognition processing means,
wherein the speech segment detection means includes:
division means for dividing the audio signal into target sections of a predetermined length;
signal adding means for adding, immediately before each of the target sections, an additional signal that includes at least a silent section of a first predetermined length;
sounded section detection means for detecting, as distinguished from a silent section, a sounded section included in the target section to which the additional signal has been added by the signal adding means; and
correction means for correcting the sounded section by deleting, from the sounded section detected by the sounded section detection means, the part corresponding to the additional signal.

A speech recognition method comprising:
a step in which a computer detects the start and end of a speech segment in an audio signal;
a step in which the computer outputs, in response to the end of the detected speech segment, a speech segmentation signal indicating an utterance boundary;
a language identification step in which the computer, in response to the speech segmentation signal, identifies the language of the utterance in the immediately preceding speech segment and outputs an identifier of that language;
a step in which the computer, in response to the detection of the start of the speech segment, starts speech recognition of the audio signal of the speech segment using speech recognition processing means for each of a plurality of different languages; and
a selection step in which the computer, in response to the identifier, selects and outputs the output of the speech recognition processing means for the language indicated by the identifier from among the plurality of speech recognition processing means,
wherein the step of outputting the speech segmentation signal includes:
a step of dividing the audio signal into target sections of a predetermined length;
a signal addition step of adding, immediately before each of the target sections, an additional signal that includes at least a silent section of a first predetermined length;
a sounded section detection step of detecting, as distinguished from a silent section, a sounded section included in the target section to which the additional signal has been added in the signal addition step; and
a correction step of correcting the sounded section by deleting, from the sounded section detected in the sounded section detection step, the part corresponding to the additional signal.