JP7498231B2

JP7498231B2 - Information processing device, voice recognition support method, and voice recognition support program

Info

Publication number: JP7498231B2
Application number: JP2022134286A
Authority: JP
Inventors: 武飯野
Original assignee: NEC Personal Computers Ltd
Current assignee: NEC Personal Computers Ltd
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2024-06-11
Anticipated expiration: 2042-08-25
Also published as: JP2024031012A

Description

本発明は、情報処理装置、音声認識支援方法、及び音声認識支援プログラムに関するものである。 The present invention relates to an information processing device, a voice recognition assistance method, and a voice recognition assistance program.

従来、音声認識エンジンを複数のクライアント端末で共有する音声認識システムが知られている。音声認識システムとして、家庭内ネットワークを介して複数のクライアント端末と音声認識サーバとを接続する狭域型の音声認識システム、クラウドサーバに音声認識エンジンを搭載し、インターネット回線等のネットワークを介して音声認識を行うクラウド型の音声認識システム等がある。 Conventionally, there are known voice recognition systems in which a voice recognition engine is shared by multiple client terminals. Examples of voice recognition systems include narrow-area voice recognition systems that connect multiple client terminals and a voice recognition server via a home network, and cloud-based voice recognition systems that have a voice recognition engine installed on a cloud server and perform voice recognition via a network such as the Internet.

このような音声認識システムでは、入力音声データから音声が発話された区間を検出する音声区間検出（ＶＡＤ：ＶｏｉｃｅＡｃｔｉｖｉｔｙＤｅｔｅｃｔｉｏｎ）の機能をクライアント端末に搭載し、検出された音声区間の音声データのみをサーバに送信する技術が提案されている（例えば、特許文献１参照）。 In such a voice recognition system, a technology has been proposed in which a client terminal is equipped with a voice activity detection (VAD) function that detects speech intervals from input voice data, and only the voice data of the detected speech intervals is sent to the server (see, for example, Patent Document 1).

特許第４４２５０５５号公報 Japanese Patent No. 4425055

従来の音声区間検出は、音があるか否かを検出するものであったため、音声区間検出の出力をそのまま音声認識エンジンに入力すると、雑音や発話の途切れなどによって、音声認識が正確にできない可能性があった。 Conventional voice activity detection detects whether or not there is sound, so if the output of the voice activity detection is directly input to a speech recognition engine, there is a risk that the speech recognition will not be accurate due to noise or interruptions in speech.

例えば、図１５に示すように、雑音が生じた場合に、音声と雑音とを区別できないため、雑音の音声信号が音声区間として検出されて、音声認識エンジンに入力される。この場合、無駄な音声認識をしてしまうこととなる。また、ユーザが「夏休みに帰省中」という発話を行った際に、「夏休みに」と「帰省中」の間に音声の途切れが生じた場合、「なつやすみに」という音声データと「きせいちゅう」という音声データとが個別に検出されて、音声認識エンジンに入力されることとなる。この場合、連続した一つの音声として捉えることができず、文脈を把握できないため、「夏休みに寄生虫」といったように誤った音声認識がされるおそれがある。 For example, as shown in FIG. 15, when noise occurs, speech and noise cannot be distinguished, so the noise is detected as a voice section and input to the speech recognition engine, which then performs needless recognition. Also, when a user says "natsuyasumi ni kiseichū" (夏休みに帰省中, "home for the summer vacation") and a break occurs between "natsuyasumi ni" and "kiseichū", the two fragments are detected as separate pieces of voice data and input to the speech recognition engine individually. Because they are not treated as one continuous utterance, the context is lost, and the second fragment may be misrecognized as its homophone, for example as 夏休みに寄生虫 ("parasites during the summer vacation").

本発明は、このような事情に鑑みてなされたものであって、音声認識の精度を向上させることのできる情報処理装置、音声認識支援方法、及び音声認識支援プログラムを提供することを目的とする。 The present invention has been made in consideration of the above circumstances, and aims to provide an information processing device, a voice recognition assistance method, and a voice recognition assistance program that can improve the accuracy of voice recognition.

本発明の第一態様は、音声データから音が発生している音声区間を検出する音声区間検出部と、前記音声区間検出部の出力信号を平滑化する平滑化処理部と、停止中状態において、前記平滑化処理部から出力された平滑化信号が発話開始閾値以上である状態を予め設定されている発話開始判定時間維持した場合に、発話中状態と判定する発話中判定部と、発話中状態において、前記平滑化信号が発話停止閾値以下である状態を予め設定されている発話停止判定時間維持した場合に、停止中状態と判定する停止中判定部と、発話中状態と判定された発話開始時刻及び停止中状態と判定された発話停止時刻に基づいて発話区間を決定する音声データ管理部とを具備する情報処理装置である。 The first aspect of the present invention is an information processing device that includes: a voice section detection unit that detects, from voice data, voice sections in which sound is present; a smoothing processing unit that smooths an output signal of the voice section detection unit; a speech determination unit that, in the stopped state, determines the speaking state when the smoothed signal output from the smoothing processing unit remains at or above a speech start threshold for a preset speech start determination time; a stop determination unit that, in the speaking state, determines the stopped state when the smoothed signal remains at or below a speech stop threshold for a preset speech stop determination time; and a voice data management unit that determines a speech section based on the speech start time at which the speaking state was determined and the speech stop time at which the stopped state was determined.

本発明の第二態様は、音声データから音が発生している音声区間を検出する音声区間検出部と、前記音声区間検出部の出力信号が発話開始閾値以上である状態を予め設定されている発話開始判定時間維持した場合に、発話中状態と判定する発話中判定部と、前記出力信号が発話停止閾値以下である状態を予め設定されている発話停止判定時間維持した場合に、停止中状態と判定する停止中判定部と、発話中状態と判定された発話開始時刻及び停止中状態と判定された発話停止時刻に基づいて発話区間を決定する音声データ管理部とを具備し、前記音声データ管理部は、前記発話開始時刻及び予め設定された発話開始余裕時間に基づいて発話区間の開始時を決定し、前記発話停止時刻及び予め設定された発話停止余裕時間に基づいて発話区間の終了時を決定する情報処理装置である。 A second aspect of the present invention is an information processing device that includes: a voice section detection unit that detects, from voice data, voice sections in which sound is present; a speech determination unit that determines the speaking state when an output signal of the voice section detection unit remains at or above a speech start threshold for a preset speech start determination time; a stop determination unit that determines the stopped state when the output signal remains at or below a speech stop threshold for a preset speech stop determination time; and a voice data management unit that determines a speech section based on the speech start time at which the speaking state was determined and the speech stop time at which the stopped state was determined, wherein the voice data management unit determines the start of the speech section based on the speech start time and a preset speech start margin time, and determines the end of the speech section based on the speech stop time and a preset speech stop margin time.

本発明の第三態様は、音声データから音が発生している音声区間を検出する音声区間検出工程と、前記音声区間検出工程の出力信号を平滑化する平滑化処理工程と、停止中状態において、前記平滑化処理工程から出力された平滑化信号が発話開始閾値以上である状態を予め設定されている発話開始判定時間維持した場合に、発話中状態と判定する発話中判定工程と、発話中状態において、前記平滑化信号が発話停止閾値以下である状態を予め設定されている発話停止判定時間維持した場合に、停止中状態と判定する停止中判定工程と、発話中状態と判定された発話開始時刻及び停止中状態と判定された発話停止時刻に基づいて発話区間を決定する音声データ管理工程とをコンピュータが実行する音声認識支援方法である。 The third aspect of the present invention is a voice recognition assistance method in which a computer executes: a voice section detection step of detecting, from voice data, voice sections in which sound is present; a smoothing processing step of smoothing an output signal of the voice section detection step; a speech determination step of, in the stopped state, determining the speaking state when the smoothed signal output from the smoothing processing step remains at or above a speech start threshold for a preset speech start determination time; a stop determination step of, in the speaking state, determining the stopped state when the smoothed signal remains at or below a speech stop threshold for a preset speech stop determination time; and a voice data management step of determining a speech section based on the speech start time at which the speaking state was determined and the speech stop time at which the stopped state was determined.

本発明の第四態様は、コンピュータに上記音声認識支援方法を実行させるための音声認識支援プログラムである。 A fourth aspect of the present invention is a speech recognition assistance program for causing a computer to execute the above-mentioned speech recognition assistance method.

本発明によれば、音声認識の精度を向上させることができるという効果を奏する。 The present invention has the effect of improving the accuracy of voice recognition.

本発明の第１実施形態に係る音声認識システムの全体構成を概略的に示したシステム構成図である。 FIG. 1 is a system configuration diagram schematically showing the overall configuration of a voice recognition system according to a first embodiment of the present invention.
本発明の第１実施形態に係るクライアント端末のハードウェア構成の一例を示した図である。 FIG. 2 is a diagram showing an example of the hardware configuration of a client terminal according to the first embodiment of the present invention.
本発明の第１実施形態に係るクライアント端末が有する音声認識支援機能の一例を示した機能構成図である。 FIG. 3 is a functional configuration diagram showing an example of the voice recognition assistance function of the client terminal according to the first embodiment of the present invention.
本発明の第１実施形態に係る音声検出部から出力されるＶＡＤ信号の一例を示した図である。 FIG. 4 is a diagram showing an example of a VAD signal output from the voice detection unit according to the first embodiment of the present invention.
本発明の第１実施形態に係る平滑化処理部から出力される平滑化信号の一例を示した図である。 FIG. 5 is a diagram showing an example of a smoothed signal output from the smoothing processing unit according to the first embodiment of the present invention.
本発明の第１実施形態に係る状態管理部の動作を説明するための図である。 FIG. 6 is a diagram for explaining the operation of the state management unit according to the first embodiment of the present invention.
本発明の第１実施形態に係る状態管理部における発話状態遷移図である。 FIG. 7 is a speech state transition diagram for the state management unit according to the first embodiment of the present invention.
本発明の第１実施形態に係る音声データ管理部の動作を説明するための図である。 FIG. 8 is a diagram for explaining the operation of the voice data management unit according to the first embodiment of the present invention.
本発明の第１実施形態に係るリングバッファの一例を示した図である。 FIG. 9 is a diagram showing an example of a ring buffer according to the first embodiment of the present invention.
本発明の第１実施形態に係るサーバが有する音声認識機能の一例を示した機能構成図である。 FIG. 10 is a functional configuration diagram showing an example of the voice recognition function of the server according to the first embodiment of the present invention.
本発明の第２実施形態に係るクライアント端末が有する音声認識支援機能の一例を示した機能構成図である。 FIG. 11 is a functional configuration diagram showing an example of the voice recognition assistance function of a client terminal according to a second embodiment of the present invention.
本発明の第２実施形態に係る音声認識支援部の動作を説明するための図である。 FIG. 12 is a diagram for explaining the operation of the voice recognition support unit according to the second embodiment of the present invention.
本発明の第３実施形態に係るクライアント端末が有する音声認識支援機能の一例を示した機能構成図である。 FIG. 13 is a functional configuration diagram showing an example of the voice recognition assistance function of a client terminal according to a third embodiment of the present invention.
本発明の第３実施形態に係る音声認識支援部の動作を説明するための図である。 FIG. 14 is a diagram for explaining the operation of the voice recognition support unit according to the third embodiment of the present invention.
従来の発話区間の決定方法について説明するための図である。 FIG. 15 is a diagram for explaining a conventional method for determining a speech section.

〔第１実施形態〕
図１は、本発明の第１実施形態に係る音声認識システムの全体構成を概略的に示したシステム構成図である。図１に示すように、音声認識システム１は、複数のクライアント端末（情報処理装置）１０とサーバ５０とを備えている。複数のクライアント端末１０とサーバ５０とは、通信ネットワーク２を介して接続可能に構成されている。通信ネットワーク２の一例として、インターネット、Ｂｌｕｅｔｏｏｔｈ（登録商標）、Ｗｉ－Ｆｉ、無線ＬＡＮ、有線ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）等が挙げられる。図１では、２台のクライアント端末１０を図示しているが、サーバ５０に接続されるクライアント端末１０の台数は特に限られない。 First Embodiment
Fig. 1 is a system configuration diagram showing an outline of the overall configuration of a voice recognition system according to a first embodiment of the present invention. As shown in Fig. 1, the voice recognition system 1 includes a plurality of client terminals (information processing devices) 10 and a server 50. The plurality of client terminals 10 and the server 50 are configured to be connectable via a communication network 2. Examples of the communication network 2 include the Internet, Bluetooth (registered trademark), Wi-Fi, a wireless LAN, and a wired LAN (Local Area Network). Although two client terminals 10 are shown in Fig. 1, the number of client terminals 10 connected to the server 50 is not particularly limited.

クライアント端末１０は、例えば、ノートＰＣ、デスクトップ型ＰＣ、タブレット端末、スマートフォン等の情報処理装置である。
図２は、本実施形態に係るクライアント端末１０のハードウェアの概略構成の一例を示した図である。図２に示すように、クライアント端末１０は、例えば、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１１、メインメモリ１２、二次記憶装置１３、外部インターフェース１４、通信インターフェース１５、入力デバイス１６、ディスプレイ１７、マイクロフォン（マイク）１８、スピーカ１９を備えている。これら各部は、バスを介して直接または間接的に接続されている。 The client terminal 10 is, for example, an information processing device such as a notebook PC, a desktop PC, a tablet terminal, or a smartphone.
Fig. 2 is a diagram showing an example of a schematic configuration of hardware of the client terminal 10 according to this embodiment. As shown in Fig. 2, the client terminal 10 includes, for example, a CPU (Central Processing Unit) 11, a main memory 12, a secondary storage device 13, an external interface 14, a communication interface 15, an input device 16, a display 17, a microphone 18, and a speaker 19. These components are directly or indirectly connected via a bus.

ＣＰＵ１１は、例えば、バスを介して接続された二次記憶装置１３に格納されたＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）によりクライアント端末１０全体の制御を行うとともに、二次記憶装置１３に格納された各種プログラムを実行することにより後述するような各種処理を実行する。ＣＰＵ１１は、１つ又は複数設けられており、互いに協働して処理を実現してもよい。 The CPU 11 controls the entire client terminal 10 using, for example, an OS (Operating System) stored in a secondary storage device 13 connected via a bus, and executes various processes, as described below, by executing various programs stored in the secondary storage device 13. One or more CPUs 11 may be provided, and they may work together to realize processes.

メインメモリ１２は、例えば、キャッシュメモリ、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の書き込み可能なメモリで構成され、ＣＰＵ１１の実行プログラムの読み出し、実行プログラムによる処理データの書き込み等を行う作業領域として利用される。 The main memory 12 is composed of writable memory such as cache memory and RAM (Random Access Memory), and is used as a working area for reading the execution program of the CPU 11, writing processing data by the execution program, etc.

二次記憶装置１３は、非一時的なコンピュータ読み取り可能な記録媒体（ｎｏｎ－ｔｒａｎｓｉｔｏｒｙｃｏｍｐｕｔｅｒｒｅａｄａｂｌｅｓｔｏｒａｇｅｍｅｄｉｕｍ）である。二次記憶装置１３は、例えば、磁気ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、半導体メモリなどである。二次記憶装置１３の一例として、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）フラッシュメモリなどが挙げられる。二次記憶装置１３は、例えば、Ｗｉｎｄｏｗｓ（登録商標）、ｉＯＳ（登録商標）、Ａｎｄｒｏｉｄ（登録商標）等のクライアント端末１０全体の制御を行うためのＯＳ、ＢＩＯＳ（ＢａｓｉｃＩｎｐｕｔ／ＯｕｔｐｕｔＳｙｓｔｅｍ）、周辺機器類をハードウェア操作するための各種デバイスドライバ、各種アプリケーションソフトウェア、及び各種データやファイル等を格納する。また、二次記憶装置１３には、各種処理を実現するためのプログラムや、各種処理を実現するために必要とされる各種データが格納されている。二次記憶装置１３は、複数設けられていてもよく、各二次記憶装置１３に上述したようなプログラムやデータが分割されて格納されていてもよい。 The secondary storage device 13 is a non-transitory computer readable storage medium. The secondary storage device 13 is, for example, a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, etc. Examples of the secondary storage device 13 include a ROM (Read Only Memory), a HDD (Hard Disk Drive), a SSD (Solid State Drive), a flash memory, etc. The secondary storage device 13 stores, for example, an OS for controlling the entire client terminal 10, such as Windows (registered trademark), iOS (registered trademark), Android (registered trademark), a BIOS (Basic Input/Output System), various device drivers for operating the hardware of peripheral devices, various application software, and various data and files. The secondary storage device 13 also stores programs for implementing various processes and various data required for implementing various processes. A plurality of secondary storage devices 13 may be provided, and the above-mentioned programs and data may be divided and stored in each secondary storage device 13.

外部インターフェース１４は、外部機器と接続するためのインターフェースである。外部機器の一例として、外部モニタ、ＵＳＢメモリ、外付けＨＤＤ、外付けカメラ等が挙げられる。なお、図２に示した例では、外部インターフェース１４は、１つしか図示されていないが、複数の外部インターフェース１４を備えていてもよい。外部インターフェース１４は、例えば、接続される機器に応じてそれぞれ適切な入出力端子およびインターフェースを備えている。 The external interface 14 is an interface for connecting to an external device. Examples of external devices include an external monitor, a USB memory, an external HDD, and an external camera. Note that in the example shown in FIG. 2, only one external interface 14 is shown, but multiple external interfaces 14 may be provided. The external interfaces 14 each have appropriate input/output terminals and interfaces depending on the device to be connected, for example.

通信インターフェース１５は、ネットワークに接続して他の装置と通信を行い、情報の送受信を行うためのインターフェースとして機能する。例えば、通信インターフェース１５は、有線又は無線により他の装置と通信を行う。無線通信として、Ｂｌｕｅｔｏｏｔｈ（登録商標）、Ｗｉ－Ｆｉ、移動通信システム（３Ｇ、４Ｇ、５Ｇ、６Ｇ、ＬＴＥ等）、無線ＬＡＮなどの回線を通じた通信が挙げられる。有線通信の一例として、有線ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）などの回線を通じた通信が挙げられる。 The communication interface 15 functions as an interface for connecting to a network to communicate with other devices and to send and receive information. For example, the communication interface 15 communicates with other devices by wire or wirelessly. Examples of wireless communication include communication through lines such as Bluetooth (registered trademark), Wi-Fi, mobile communication systems (3G, 4G, 5G, 6G, LTE, etc.), and wireless LAN. An example of wired communication is communication through lines such as a wired LAN (Local Area Network).

入力デバイス１６は、例えば、キーボード、マウス、タッチパッド等、ユーザがクライアント端末１０に対して指示を与えるためのユーザインターフェースである。 The input device 16 is a user interface, such as a keyboard, mouse, or touchpad, that allows the user to give instructions to the client terminal 10.

ディスプレイ１７は、例えば、液晶ディスプレイ、有機ＥＬ（Ｅｌｅｃｔｒｏｌｕｍｉｎｅｓｃｅｎｃｅ）ディスプレイ等である。また、ディスプレイ１７は、タッチパネルが重畳されたタッチパネルディスプレイでもよい。
マイク１８は、音をアナログ信号に変換し、音声信号として出力する。
スピーカは、音声信号を音として出力する。 The display 17 is, for example, a liquid crystal display, an organic electroluminescence (EL) display, etc. Furthermore, the display 17 may be a touch panel display on which a touch panel is superimposed.
The microphone 18 converts the sound into an analog signal and outputs it as an audio signal.
The speaker outputs the audio signal as sound.

サーバ５０は、情報処理装置であり、ＣＰＵ、メインメモリ、二次記憶装置等を備えている。なお、主なハードウェア構成については、上述したクライアント端末１０とほぼ同様であり、公知であるため、詳細な説明は省略する。 The server 50 is an information processing device and includes a CPU, main memory, secondary storage device, etc. Note that the main hardware configuration is almost the same as that of the client terminal 10 described above and is publicly known, so a detailed description will be omitted.

図３は、本実施形態に係るクライアント端末１０が有する音声認識支援機能の一例を示した機能構成図である。 Figure 3 is a functional configuration diagram showing an example of a voice recognition support function possessed by the client terminal 10 according to this embodiment.

以下に説明する各種機能を実現するための一連の処理は、一例として、プログラム（例えば、音声認識支援プログラム）の形式で二次記憶装置１３に記憶されており、このプログラムをＣＰＵ１１がメインメモリ１２に読み出して、情報の加工・演算処理を実行することにより、各種機能が実現される。なお、プログラムは、二次記憶装置１３に予めインストールされている形態や、他のコンピュータ読み取り可能な記憶媒体に記憶された状態で提供される形態、有線又は無線による通信手段を介して配信される形態等が適用されてもよい。コンピュータ読み取り可能な記憶媒体とは、磁気ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、半導体メモリ等である。 As an example, a series of processes for realizing the various functions described below is stored in the secondary storage device 13 in the form of a program (e.g., a voice recognition assistance program), and the CPU 11 reads this program into the main memory 12 and executes information processing and arithmetic processing to realize the various functions. Note that the program may be pre-installed in the secondary storage device 13, provided in a state stored in another computer-readable storage medium, or distributed via wired or wireless communication means. Examples of computer-readable storage media include magnetic disks, magneto-optical disks, CD-ROMs, DVD-ROMs, and semiconductor memories.

図３に示すように、クライアント端末１０は、音声認識支援部２０を備えている。音声認識支援部２０は、例えば、音声入力部２１、音声区間検出部２２、及び補正フィルタ部２３を備えている。 As shown in FIG. 3, the client terminal 10 includes a voice recognition support unit 20. The voice recognition support unit 20 includes, for example, a voice input unit 21, a voice segment detection unit 22, and a correction filter unit 23.

音声入力部２１は、例えば、マイク１８から入力された音声信号を一定区間（例えば、数十ｍｓ）で区切ることにより１フレーム毎の音声データとし、時刻情報と共に出力する。なお、本実施形態では、マイク１８から音声信号が入力される場合を例示しているが、音声信号は、これに限られない。音声信号は、例えば、外部装置から受信した音声信号であってもよいし、記録媒体等から読み出した音声信号でもよい。 The audio input unit 21 divides the audio signal input from the microphone 18 into certain intervals (e.g., several tens of ms) to generate audio data for each frame, and outputs the data together with time information. Note that, although the present embodiment illustrates a case in which an audio signal is input from the microphone 18, the audio signal is not limited to this. The audio signal may be, for example, an audio signal received from an external device, or an audio signal read from a recording medium, etc.
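
The framing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Frame` type, the function name `split_into_frames`, the 16 kHz sample rate, and the decision to drop a trailing partial frame are all assumptions; only the idea of fixed-length frames (tens of milliseconds) carrying time information comes from the text.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass
class Frame:
    start_ms: int              # time information attached to the frame
    samples: Sequence[float]   # audio samples belonging to this frame


def split_into_frames(signal: Sequence[float], sample_rate: int = 16000,
                      frame_ms: int = 20) -> Iterator[Frame]:
    """Cut a sampled signal into consecutive fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    # Assumption: a trailing partial frame is simply dropped in this sketch.
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        yield Frame(start_ms=i * 1000 // sample_rate,
                    samples=signal[i:i + frame_len])
```

Each yielded frame would then be passed both to the voice activity detector and to the voice data management unit, as the description states.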

音声データは、音声区間検出部２２及び補正フィルタ部２３の音声データ管理部３３に出力される。
音声区間検出部（ＶＡＤ：ＶｏｉｃｅＡｃｔｉｖｉｔｙＤｅｔｅｃｔｉｏｎ）２２は、音声入力部２１から出力された音声データから音が発生している音声区間を検出する。音声区間検出部２２は、例えば、音声データの１フレーム毎に、音声の有無を判定し、音声がある場合には「１」を、音声がない場合には「０」の出力信号を出力する。なお、音声区間検出（ＶＡＤ）については、様々な手法が提案されており、それらの公知の技術を適宜採用すればよい。 The voice data is output to the voice activity detection unit 22 and to the voice data management unit 33 of the correction filter unit 23.
The voice activity detection (VAD) unit 22 detects, from the voice data output by the voice input unit 21, the voice sections in which sound is present. For example, the voice activity detection unit 22 determines the presence or absence of voice for each frame of voice data and outputs "1" when voice is present and "0" when it is not. Various methods have been proposed for voice activity detection (VAD), and any of these known techniques may be adopted as appropriate.
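
The per-frame 1/0 output can be illustrated with a deliberately simple stand-in detector. The description leaves the VAD method open ("known techniques"), so the energy threshold used here is an assumption chosen only for illustration, as are the names `frame_energy` and `vad_flag` and the threshold value.

```python
from typing import Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def vad_flag(samples: Sequence[float], energy_threshold: float = 1e-3) -> int:
    """Return 1 if the frame is judged to contain sound, otherwise 0.

    Assumption: a plain energy threshold stands in for whichever VAD
    technique an actual system would use.
    """
    return 1 if frame_energy(samples) >= energy_threshold else 0
```

A detector of this kind reacts to any sufficiently loud frame, which is exactly why the description goes on to filter its output rather than use it directly.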

音声区間検出部２２の出力信号は時刻情報と関連付けられて補正フィルタ部２３に出力される。以下、音声区間検出部２２から出力される信号を「ＶＡＤ信号」という。図４に、ＶＡＤ信号の一例を示す。図４において、横軸は時間、縦軸は出力を示している。音声が検出された区間については「１」、音声が検出されていない区間については「０」の出力とされる。 The output signal of the voice activity detection unit 22 is associated with time information and output to the correction filter unit 23. Hereinafter, the signal output from the voice activity detection unit 22 is referred to as the "VAD signal." Figure 4 shows an example of a VAD signal. In Figure 4, the horizontal axis represents time and the vertical axis represents output. The output is "1" for sections where voice is detected, and "0" for sections where voice is not detected.

補正フィルタ部２３は、例えば、平滑化処理部３１、状態管理部３２、及び音声データ管理部３３を備えている。
平滑化処理部３１は、例えば、音声区間検出部２２からのＶＡＤ信号を平滑化する。平滑化の一例として、移動平均（例えば、単純移動平均、加重移動平均、指数移動平均等）が挙げられる。本実施形態では、平滑化処理として、単純移動平均を用いる場合を例示して説明する。 The correction filter unit 23 includes, for example, a smoothing processing unit 31, a state management unit 32, and a voice data management unit 33.
The smoothing processing unit 31 smooths, for example, the VAD signal from the voice activity detection unit 22. One example of smoothing is a moving average (for example, a simple moving average, a weighted moving average, or an exponential moving average). In this embodiment, a case where a simple moving average is used as the smoothing process will be described as an example.

移動平均の平滑区間は、例えば、音声入力部２１によって設定される１フレームの時間長によって適宜決定することが可能である。一例として、１フレームの時間長が２０ｍｓの場合、移動平均の平滑区間は１００ｍｓ以上３００ｍｓ以下の範囲で設定される。
図５に、図４に示したＶＡＤ信号に対応する平滑化信号の一例を示す。このように、平滑化処理を行うことで、「１」、「０」の離散的な信号を連続的な信号として表すことが可能となる。 The smoothing section of the moving average can be appropriately determined, for example, based on the time length of one frame set by the voice input unit 21. As an example, when the time length of one frame is 20 ms, the smoothing section of the moving average is set in the range of 100 ms to 300 ms.
Fig. 5 shows an example of a smoothed signal corresponding to the VAD signal shown in Fig. 4. In this way, by performing smoothing processing, it is possible to express discrete signals of "1" and "0" as a continuous signal.
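
A simple moving average over the VAD flags might look like the sketch below. The function name `smooth_vad` and the default 200 ms window are illustrative assumptions (the text gives a 100 ms to 300 ms range for 20 ms frames). The sketch also assumes that frames before the start of the stream count as 0, so the average always divides by the full window length.

```python
from collections import deque
from typing import Iterable, List


def smooth_vad(vad: Iterable[int], frame_ms: int = 20,
               window_ms: int = 200) -> List[float]:
    """Simple moving average over the most recent window_ms of VAD flags."""
    n = max(1, window_ms // frame_ms)   # window length in frames
    window = deque(maxlen=n)
    total = 0
    out = []
    for flag in vad:
        if len(window) == n:
            total -= window[0]          # oldest flag is about to be evicted
        window.append(flag)
        total += flag
        out.append(total / n)           # missing history counts as 0
    return out
```

With a 100 ms window and 20 ms frames, a single isolated "1" raises the smoothed value only to 0.2, while a sustained run of "1" frames ramps it up to 1.0.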

状態管理部３２は、平滑化処理部３１から出力される平滑化信号に基づいて、発話中状態及び停止中状態などの発話状態を管理する。
状態管理部３２は、発話中判定部３４及び停止中判定部３５を備えている。
発話中判定部３４は、停止中状態において、平滑化処理部３１から出力された平滑化信号が発話開始閾値以上である状態を所定期間（発話開始判定時間）維持した場合に、発話中状態と判定する。
停止中判定部３５は、発話中状態において、平滑化信号が発話停止閾値以下である状態を所定時間（発話停止判定時間）維持した場合に、停止中状態と判定する。
ここで、発話停止閾値は、発話開始閾値よりも高い値に設定されている。
発話開始判定時間及び発話停止判定時間は、適宜設定することが可能であるが、いずれも２００ｍｓ以上５００ｍｓ未満の範囲で設定するとよい。 The state management unit 32 manages the speech state, such as the speaking state and the stopped state, based on the smoothed signal output from the smoothing processing unit 31.
The state management unit 32 includes a speech determination unit 34 and a stop determination unit 35.
In the stopped state, the speech determination unit 34 determines the speaking state when the smoothed signal output from the smoothing processing unit 31 remains at or above the speech start threshold for a predetermined period (the speech start determination time).
In the speaking state, the stop determination unit 35 determines the stopped state when the smoothed signal remains at or below the speech stop threshold for a predetermined period (the speech stop determination time).
Here, the speech stop threshold is set to a value higher than the speech start threshold.
The speech start determination time and speech stop determination time can be set appropriately, but it is preferable to set both in the range of 200 ms or more and less than 500 ms.

例えば、図６に示すように、時刻ｔ１，ｔ２，ｔ３においては、ＶＡＤ信号が「１」を示しているが、平滑化信号は発話開始閾値以下とされているため、発話中とは判定されない。時刻ｔ４において、平滑化信号が発話開始閾値以上となり、その状態が発話開始判定時間維持されると、発話中判定部３４は、発話中状態と判定する（時刻ｔ５）。 For example, as shown in FIG. 6, at times t1, t2, and t3 the VAD signal indicates "1", but the smoothed signal is below the speech start threshold, so speech is not determined to be in progress. At time t4 the smoothed signal rises to or above the speech start threshold, and once this state has been maintained for the speech start determination time, the speech determination unit 34 determines the speaking state (time t5).

また、図６において、時刻ｔ６において、平滑化信号が発話停止閾値以下となっているが、その状態が発話停止判定時間維持されていないことから、停止中状態と判定されない。一方、時刻ｔ７において、平滑化信号が発話停止閾値以下となり、その状態が発話停止判定時間維持されると、停止中判定部３５は、停止中状態と判定する（時刻ｔ８）。 In FIG. 6, the smoothed signal also falls to or below the speech stop threshold at time t6, but because this state is not maintained for the speech stop determination time, the stopped state is not determined. At time t7, by contrast, the smoothed signal falls to or below the speech stop threshold and this state is maintained for the speech stop determination time, so the stop determination unit 35 determines the stopped state (time t8).

状態管理部３２は、発話開始と判定した時刻である発話開始時刻と発話停止と判定した時刻である発話停止時刻とを音声データ管理部３３に出力する。例えば、図６に例示した平滑化信号については、発話開始時刻として時刻ｔ５が、発話停止時刻として時刻ｔ８が出力される。 The state management unit 32 outputs the speech start time, which is the time when it is determined that the speech has started, and the speech stop time, which is the time when it is determined that the speech has stopped, to the voice data management unit 33. For example, for the smoothed signal illustrated in FIG. 6, time t5 is output as the speech start time, and time t8 is output as the speech stop time.

図７は、状態管理部３２における発話状態遷移図である。図７に示すように、平滑化信号が発話開始閾値未満の状態（Ｓ１）から発話開始閾値以上の状態（Ｓ２）になり、かつ、この状態が発話開始判定時間維持されると、停止中状態から発話中状態に遷移する。一方、平滑化信号が発話開始閾値以上になっても、その状態が発話開始判定時間維持されなかった場合には、状態Ｓ１に再び戻る。 FIG. 7 is a speech state transition diagram for the state management unit 32. As shown in FIG. 7, when the smoothed signal goes from below the speech start threshold (S1) to at or above it (S2) and this state is maintained for the speech start determination time, the state transitions from the stopped state to the speaking state. If the smoothed signal reaches the speech start threshold but the state is not maintained for the speech start determination time, the state returns to S1.

発話中状態において、平滑化信号が発話停止閾値を超えている状態（Ｓ３）から発話停止閾値以下の状態になり（Ｓ４）、かつ、この状態が発話停止判定時間維持されると、発話中状態から停止中状態に遷移する。一方、平滑化信号が発話停止閾値以下になっても、その状態が発話停止判定時間維持されなかった場合には、状態Ｓ３に再び戻る。 In the speaking state, when the smoothed signal goes from above the speech stop threshold (S3) to at or below it (S4) and this state is maintained for the speech stop determination time, the state transitions from the speaking state to the stopped state. If the smoothed signal falls to or below the speech stop threshold but the state is not maintained for the speech stop determination time, the state returns to S3.
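
The transitions among S1 to S4 can be sketched as a small state machine that counts how long the relevant threshold condition has held. The class and function names, the 20 ms frame length, and the 200 ms hold times are illustrative assumptions; threshold values are left to the caller. A transition is reported at the start time of the frame on which the hold time is reached, which is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

STOPPED, SPEAKING = "stopped", "speaking"


@dataclass
class StateManager:
    start_threshold: float
    stop_threshold: float
    start_hold_ms: int = 200   # speech start determination time (assumed value)
    stop_hold_ms: int = 200    # speech stop determination time (assumed value)
    frame_ms: int = 20
    state: str = STOPPED
    held_ms: int = 0           # how long the current condition has held

    def update(self, smoothed: float) -> str:
        """Feed one smoothed value; return the state after this frame."""
        if self.state == STOPPED:
            condition, hold = smoothed >= self.start_threshold, self.start_hold_ms
        else:
            condition, hold = smoothed <= self.stop_threshold, self.stop_hold_ms
        # S2/S4: condition holds, keep counting; otherwise fall back to S1/S3.
        self.held_ms = self.held_ms + self.frame_ms if condition else 0
        if self.held_ms >= hold:
            self.state = SPEAKING if self.state == STOPPED else STOPPED
            self.held_ms = 0
        return self.state


def transitions(values: List[float], sm: StateManager) -> List[Tuple[int, str]]:
    """Return (time_ms, new_state) for every state change."""
    events, prev = [], sm.state
    for i, value in enumerate(values):
        cur = sm.update(value)
        if cur != prev:
            events.append((i * sm.frame_ms, cur))
            prev = cur
    return events
```

Resetting the counter whenever the condition breaks is what makes a brief excursion across a threshold return to S1 or S3 instead of changing the speech state.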

図３に戻り、音声データ管理部３３は、発話開始時刻及び発話停止時刻に基づいて発話区間を決定し、発話区間における音声データをサーバ５０に送信する。
例えば、音声データ管理部３３は、発話開始時刻及び予め設定された発話開始余裕時間に基づいて発話区間の開始時を決定する。具体的には、音声データ管理部３３は、発話開始時刻よりも発話開始余裕時間早い時刻を発話区間の開始時として決定する。 Returning to FIG. 3, the voice data management unit 33 determines an utterance section based on an utterance start time and an utterance stop time, and transmits voice data in the utterance section to the server 50.
For example, the voice data management unit 33 determines the start time of the speech section based on the speech start time and a preset speech start margin. Specifically, the voice data management unit 33 determines the start time of the speech section to be a time that is earlier than the speech start time by the speech start margin.

また、音声データ管理部３３は、例えば、発話停止時刻及び予め設定された発話停止余裕時間に基づいて発話区の終了時を決定する。具体的には、音声データ管理部３３は、発話停止時刻よりも発話停止余裕時間早い時刻を発話区間の終了時として決定する。 The voice data management unit 33 also determines the end of the speech section based on, for example, the speech stop time and a predetermined speech stop margin. Specifically, the voice data management unit 33 determines the end of the speech section to be a time that is earlier than the speech stop time by the speech stop margin.

例えば、図８に示すように、発話開始時刻として時刻ｔ５が、発話停止時刻として時刻ｔ８が状態管理部３２から出力された場合、音声データ管理部３３は、発話区間の開始時として時刻Ｔｓを、発話区間の終了時として時刻Ｔｅを決定する。これにより、発話区間はＴｓ～Ｔｅの期間に決定される。なお、図８に示した発話区間は、発話停止余裕時間が正の値を取る場合を例示している。 For example, as shown in FIG. 8, when time t5 is output as the speech start time and time t8 is output as the speech stop time from the state management unit 32, the voice data management unit 33 determines time Ts as the start time of the speech section and time Te as the end time of the speech section. As a result, the speech section is determined to be the period from Ts to Te. Note that the speech section shown in FIG. 8 illustrates a case where the speech stop margin time is a positive value.

発話開始余裕時間及び発話停止余裕時間は、後述するサーバ５０に搭載された音声認識部の仕様に応じて設定すればよく、認識させたい音声データの前後に必要なバッファの長さによって決定される。
発話開始余裕時間は、例えば、４００ｍｓ以上１０００ｍｓ以下の範囲で設定されるとよい。
発話停止余裕時間は、例えば、－４００ｍｓ以上５０ｍｓ以下の範囲で設定されるとよい。ここで、発話停止余裕時間が負の値を取る場合、発話区間の終了時が発話停止時刻よりも遅いことを意味する。 The speech start margin and speech stop margin may be set according to the specifications of a voice recognition unit mounted on the server 50, which will be described later, and are determined by the length of the buffer required before and after the voice data to be recognized.
The speech start margin may be set, for example, in the range of 400 ms to 1000 ms.
The speech stop margin may be set, for example, in the range of -400 ms to 50 ms. If the speech stop margin is a negative value, it means that the end of the speech section is later than the speech stop time.
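
The margin arithmetic above reduces to two subtractions. In the sketch below, the function name `speech_section` and the default margins are illustrative assumptions (picked from within the ranges given in the text), and clamping the start to 0 is an added safeguard that the description does not mention.

```python
from typing import Tuple


def speech_section(start_time_ms: int, stop_time_ms: int,
                   start_margin_ms: int = 400,
                   stop_margin_ms: int = -200) -> Tuple[int, int]:
    """Return (Ts, Te); both margins are subtracted from the judged times."""
    ts = max(0, start_time_ms - start_margin_ms)   # assumption: clamp at 0
    te = stop_time_ms - stop_margin_ms             # negative margin moves Te later
    return ts, te
```

For a speech start time of 1000 ms and a speech stop time of 3000 ms, a start margin of 400 ms and a stop margin of -200 ms give a section from 600 ms to 3200 ms, while a stop margin of 50 ms would end the section at 2950 ms.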

音声データ管理部３３は、例えば、停止中状態の場合には、音声入力部２１から出力された音声データを一定のリングバッファで管理する。図９にリングバッファの概念図を示す。音声データ管理部３３は、新しい音声データが入力される度に、リングバッファのポインタを移動させ、新しい音声データを所定のメモリ領域に格納する。 For example, when in a stopped state, the audio data management unit 33 manages the audio data output from the audio input unit 21 in a fixed ring buffer. Figure 9 shows a conceptual diagram of a ring buffer. Each time new audio data is input, the audio data management unit 33 moves the pointer of the ring buffer and stores the new audio data in a specified memory area.
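
A fixed-size ring buffer with a moving write pointer, as described above, can be sketched like this. The class name `RingBuffer` and its methods are hypothetical; the capacity would in practice need to cover at least the speech start margin plus the speech start determination time, which is an inference from the text rather than something it states.

```python
from typing import Any, List, Optional


class RingBuffer:
    """Fixed-capacity buffer that overwrites the oldest frame when full."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[Any]] = [None] * capacity
        self._next = 0    # write pointer: slot for the next frame
        self._count = 0   # number of valid frames currently stored

    def push(self, frame: Any) -> None:
        self._slots[self._next] = frame
        self._next = (self._next + 1) % len(self._slots)   # advance pointer
        self._count = min(self._count + 1, len(self._slots))

    def snapshot(self) -> List[Any]:
        """Stored frames ordered from oldest to newest."""
        start = (self._next - self._count) % len(self._slots)
        return [self._slots[(start + i) % len(self._slots)]
                for i in range(self._count)]
```

Because memory use stays constant no matter how long the stopped state lasts, the buffer can keep recording the most recent audio so that the frames just before a detected speech start are still available.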

また、音声データ管理部３３は、停止中状態から発話中状態に遷移した場合には、発話区間の開始時以降におけるリングバッファの音声データを可変長バッファに格納する。そして、発話中状態においては、音声データ管理部３３は、新たな音声データが入力される毎にその音声データを可変長バッファに追加格納する。そして、発話中状態から停止中状態に遷移した場合には、可変長バッファ内の先頭から発話区間の終了時までの音声データを発話区間における音声データとして読み出し、通信ネットワーク２を介してサーバ５０（図１参照）に送信する。 When the voice data management unit 33 transitions from the stopped state to the speaking state, it stores the voice data in the ring buffer from the start of the speech section onwards in the variable length buffer. In the speaking state, the voice data management unit 33 adds new voice data to the variable length buffer each time the data is input. When the voice data management unit 33 transitions from the speaking state to the stopped state, it reads out the voice data from the beginning of the variable length buffer to the end of the speech section as voice data for the speech section, and transmits it to the server 50 (see Figure 1) via the communication network 2.
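
The hand-off between the ring buffer and the variable-length buffer can be sketched as below. All names are hypothetical, `collections.deque` with `maxlen` stands in for the ring buffer, and frames are modelled as `(start time in ms, payload)` tuples. The sketch treats a frame as inside the section when its start time is before the section end, and it only returns frames already buffered, so a section end later than the stop time (negative stop margin) would need extra waiting logic that is not shown.

```python
from collections import deque
from typing import Deque, List, Tuple

TimedFrame = Tuple[int, bytes]   # (frame start time in ms, audio payload)


class VoiceDataManager:
    def __init__(self, ring_frames: int = 50) -> None:
        self.ring: Deque[TimedFrame] = deque(maxlen=ring_frames)  # while stopped
        self.utterance: List[TimedFrame] = []    # variable-length buffer
        self.speaking = False

    def on_frame(self, frame: TimedFrame) -> None:
        """Store a new frame in whichever buffer matches the current state."""
        (self.utterance if self.speaking else self.ring).append(frame)

    def on_speech_start(self, section_start_ms: int) -> None:
        """Stopped -> speaking: carry over ring-buffer frames from Ts onward."""
        self.utterance = [f for f in self.ring if f[0] >= section_start_ms]
        self.ring.clear()
        self.speaking = True

    def on_speech_stop(self, section_end_ms: int) -> List[TimedFrame]:
        """Speaking -> stopped: return frames from the head up to Te."""
        section = [f for f in self.utterance if f[0] < section_end_ms]
        self.utterance = []
        self.speaking = False
        return section
```

The list returned by `on_speech_stop` is what would be sent to the server as the voice data of the speech section.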

Fig. 10 is a functional configuration diagram showing an example of the functions provided by the server 50. As shown in Fig. 10, the server 50 includes a request management unit 51 and a voice recognition unit 52. The request management unit 51 manages the voice data received via the communication network 2 in the order of reception. The voice recognition unit 52 performs voice recognition on the voice data in the order of reception, and transmits the text data of the recognition result to the corresponding client terminal 10. Publicly known technologies may be adopted as appropriate for these voice recognition functions of the server 50.

Next, the operation of the voice recognition support unit 20 according to this embodiment will be briefly described.
First, the voice signal from the microphone 18 or the like is divided at predetermined time intervals into voice data for each frame, and voice activity detection (VAD) is performed on each frame. The VAD signal is then smoothed to obtain a smoothed signal.

The speech state is then determined using the smoothed signal. For example, when the smoothed signal remains equal to or greater than the speech start threshold for the speech start determination time, the state is determined to be the speaking state. When the smoothed signal remains equal to or less than the speech stop threshold for the speech stop determination time, the state is determined to be the stopped state. The speech section is then determined using the speech start time at which the speaking state was determined, the speech stop time at which the stopped state was determined, the speech start margin, and the speech stop margin, and the voice data of the speech section is transmitted to the server 50.
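The state determination summarized above can be sketched as a two-state machine. The sketch assumes that the determination times are expressed as frame counts and that the smoothed signal is a list of per-frame values; the function name and parameter names are hypothetical.

```python
def detect_utterances(smoothed, start_threshold, stop_threshold,
                      start_frames, stop_frames):
    """Return a list of (speech_start_index, speech_stop_index) pairs.

    start_frames / stop_frames stand in for the speech start and speech stop
    determination times, measured in frames (an assumption of this sketch).
    """
    speaking = False
    run = 0              # consecutive frames satisfying the current condition
    start_index = None
    events = []
    for i, value in enumerate(smoothed):
        if not speaking:
            # Stopped state: wait for the signal to stay at or above the
            # speech start threshold for start_frames frames.
            run = run + 1 if value >= start_threshold else 0
            if run >= start_frames:
                speaking, run, start_index = True, 0, i
        else:
            # Speaking state: wait for the signal to stay at or below the
            # speech stop threshold for stop_frames frames.
            run = run + 1 if value <= stop_threshold else 0
            if run >= stop_frames:
                events.append((start_index, i))
                speaking, run = False, 0
    return events
```

Because the run counter resets whenever the condition is broken, a short dip in the signal during speech or a short burst of noise during silence does not by itself cause a state transition.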

The server 50, for example, performs voice recognition on the speech section voice data received from each client terminal 10 in the order of reception, and transmits the voice recognition result to the corresponding client terminal 10.

As described above, according to the information processing device, voice recognition support method, and voice recognition support program of this embodiment, the speech section is determined using a smoothed signal obtained by smoothing the VAD signal, so an appropriate speech section can be determined even when noise is input or when a break occurs during the user's speech. This reduces wasted voice recognition, such as recognizing voice data that consists only of noise, and improves the efficiency of voice recognition. In addition, the voice data of a series of utterances can be recognized as a single block of voice data. As a result, the accuracy of voice recognition can be improved.

In the first embodiment described above, the smoothing processing unit 31 obtains the smoothed signal using a moving average, but the smoothing processing is not limited to a moving average. For example, the smoothed signal may be obtained using a regression model. Specifically, a regression model is obtained by performing regression analysis on a fixed number of past output signals from the voice activity detection unit 22, counted back from the current time. The regression model is preferably a linear model. The current value is then calculated using the obtained regression model.
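A minimal sketch of the regression-based alternative is shown below. It fits a straight line by ordinary least squares to the most recent VAD outputs and evaluates the line at the current frame. Whether the window includes the current frame, and how the first frames with a short history are handled, are assumptions of this sketch; the function name is hypothetical.

```python
def regression_smooth(vad, window):
    """Smooth a VAD sequence with a sliding linear regression.

    For each frame, a line is fitted to the last `window` samples
    (including the current one) and evaluated at the current frame.
    """
    out = []
    for t in range(len(vad)):
        ys = vad[max(0, t - window + 1): t + 1]
        n = len(ys)
        if n < 2:
            # Not enough history to fit a line: pass the value through.
            out.append(float(ys[-1]))
            continue
        mean_x = (n - 1) / 2
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in range(n))
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
        slope = sxy / sxx
        # Evaluate the fitted line at the newest sample (x = n - 1).
        out.append(mean_y + slope * ((n - 1) - mean_x))
    return out
```

Unlike a moving average, the fitted value can fall slightly outside the range of the input, so thresholds tuned for a moving average may need to be revisited.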

[Second Embodiment]
Next, a voice recognition system according to a second embodiment of the present invention will be described with reference to the drawings. In the first embodiment described above, fixed values are used as the speech start threshold and the speech stop threshold. This embodiment differs in that the speech start threshold and the speech stop threshold are set dynamically based on the voice data.
In the following, configurations common to the first embodiment are denoted by the same reference numerals and their description is omitted; the differences are mainly described.

Fig. 11 is a functional configuration diagram showing an example of the voice recognition support function of the client terminal 10a according to this embodiment. As shown in Fig. 11, the correction filter unit 23a of the voice recognition support unit 20a includes a threshold setting unit 36.
The threshold setting unit 36 obtains a second smoothed signal by smoothing the VAD signal output from the voice activity detection unit 22 over a smoothing section (second smoothing section) longer than the smoothing section (first smoothing section) of the smoothing processing unit 31, and sets the speech start threshold and the speech stop threshold using this second smoothed signal. The speech start threshold and the speech stop threshold may be set to the same value or to different values. For example, the second smoothed signal may be used as both the speech start threshold and the speech stop threshold. Alternatively, the second smoothed signal may be set as the speech start threshold, and a signal obtained by increasing the second smoothed signal by a predetermined value may be set as the speech stop threshold.
The second smoothing section is set, for example, in the range of 300 ms to 900 ms.
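The threshold setting can be sketched as follows, assuming that both smoothing sections are moving averages measured in frames and that the thresholds are produced as per-frame sequences. The function names and the `stop_offset` parameter (standing in for the "predetermined value") are hypothetical.

```python
def moving_average(signal, window):
    """Causal moving average; early frames average over the samples available."""
    out = []
    total = 0.0
    for i, v in enumerate(signal):
        total += v
        if i >= window:
            total -= signal[i - window]  # drop the sample leaving the window
        out.append(total / min(i + 1, window))
    return out


def dynamic_thresholds(vad, second_window, stop_offset=0.0):
    """Derive per-frame thresholds from the second (longer) smoothed signal.

    The start threshold is the second smoothed signal itself; the stop
    threshold is that signal increased by stop_offset (0.0 makes them equal).
    """
    second = moving_average(vad, second_window)
    start_thresholds = list(second)
    stop_thresholds = [v + stop_offset for v in second]
    return start_thresholds, stop_thresholds
```

In use, the first smoothed signal would be computed with a shorter window, for example `moving_average(vad, first_window)` with `first_window < second_window`, and compared frame by frame against these thresholds.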

When the speech start threshold and the speech stop threshold have been set by the threshold setting unit 36 in this manner, they are output to the state management unit 32. The state management unit 32 then manages the speech state using the speech start threshold and the speech stop threshold set by the threshold setting unit 36.

Fig. 12 shows an example of a speech section determined by the voice recognition support unit 20a according to this embodiment. Fig. 12 illustrates the case in which the second smoothed signal is used as both the speech start threshold and the speech stop threshold, that is, the case in which the two thresholds have the same value.

In this way, according to the information processing device, voice recognition support method, and voice recognition support program of this embodiment, the speech start threshold and the speech stop threshold are set dynamically based on the VAD signal, so the speech section can be determined in accordance with the user environment (how often noise and breaks in speech occur). This improves the accuracy of estimating the speech section, and an improvement in the accuracy of voice recognition can be expected.

[Third Embodiment]
Next, a voice recognition system according to a third embodiment of the present invention will be described with reference to the drawings. In the first and second embodiments described above, the speech state is managed using a smoothed signal obtained by smoothing the VAD signal output from the voice activity detection unit 22. This embodiment differs in that the speech state is managed using the VAD signal directly, without smoothing.
In the following, configurations common to the first embodiment are denoted by the same reference numerals and their description is omitted; the differences are mainly described.

Fig. 13 is a functional configuration diagram showing an example of the voice recognition support function of the client terminal 10b according to this embodiment. As shown in Fig. 13, the correction filter unit 23b of the voice recognition support unit 20b has a configuration in which the smoothing processing unit 31 is omitted.
With this configuration, the state management unit 32 determines the speaking state and the stopped state using the VAD signal output from the voice activity detection unit 22 and the preset speech start threshold and speech stop threshold. Specifically, the speaking determination unit 34 of the state management unit 32 determines that the state is the speaking state when the VAD signal remains equal to or greater than the speech start threshold for a predetermined speech start determination time, in other words, when the output of the VAD signal remains "1" for the speech start determination time. The stop determination unit 35 determines that the state is the stopped state when the VAD signal remains equal to or less than the speech stop threshold for a predetermined speech stop determination time, in other words, when the output of the VAD signal remains "0" for the speech stop determination time.

The voice data management unit 33 then determines the speech section based on the speech start time and the speech stop time detected by the state management unit 32. Through this voice recognition support processing, a speech section such as that shown in Fig. 14 is determined.

In this way, according to the information processing device, voice recognition support method, and voice recognition support program of this embodiment, the speech state is managed using the VAD signal directly, without smoothing it, so the speech section can be estimated with a simple configuration. Furthermore, an appropriate speech section can be determined even when the input voice signal contains noise or when a break occurs partway through the speech.

The present invention has been described above using embodiments, but the technical scope of the present invention is not limited to the scope described in the above embodiments. Various modifications or improvements can be made to the above embodiments without departing from the gist of the invention, and forms with such modifications or improvements are also included in the technical scope of the present invention. The above embodiments may also be combined as appropriate.

For example, each of the above embodiments describes the case in which the correction filter units 23, 23a, and 23b are provided in the client terminals 10, 10a, and 10b, but the invention is not limited to this. For example, the correction filter units 23, 23a, and 23b may be provided in the server 50.

1: Voice recognition system
2: Communication network
10: Client terminal (information processing device)
10a: Client terminal (information processing device)
10b: Client terminal (information processing device)
11: CPU
12: Main memory
13: Secondary storage device
14: External interface
15: Communication interface
16: Input device
17: Display
18: Microphone (mic)
19: Speaker
20: Voice recognition support unit
20a: Voice recognition support unit
20b: Voice recognition support unit
21: Voice input unit
22: Voice activity detection unit
23: Correction filter unit
23a: Correction filter unit
23b: Correction filter unit
31: Smoothing processing unit
32: State management unit
33: Voice data management unit
34: Speaking determination unit
35: Stop determination unit
36: Threshold setting unit
50: Server
51: Request management unit
52: Voice recognition unit

Claims

a voice activity detection unit that detects, from voice data, a voice section in which sound is generated;
a smoothing processing unit that smooths an output signal of the voice activity detection unit;
a speaking determination unit that, in a stopped state, determines the state to be a speaking state when a state in which the smoothed signal output from the smoothing processing unit is equal to or greater than a speech start threshold is maintained for a preset speech start determination time;
a stop determination unit that, in the speaking state, determines the state to be the stopped state when a state in which the smoothed signal is equal to or less than a speech stop threshold is maintained for a preset speech stop determination time;
and a voice data management unit that determines a speech section based on a speech start time at which the speaking state was determined and a speech stop time at which the stopped state was determined.

The information processing device according to claim 1, wherein the voice data management unit determines the start time of a speech section based on the speech start time and a preset speech start margin, and determines the end time of a speech section based on the speech stop time and a preset speech stop margin.

The information processing device according to claim 1, further comprising a threshold setting unit that sets the speech start threshold and the speech stop threshold using the smoothed signal.

The information processing device according to claim 3, wherein the threshold setting unit smoothes the output signal of the voice activity detection unit over a section longer than the smoothing section of the smoothing processing unit, and uses this smoothed signal to set the speech start threshold and the speech stop threshold.

The information processing device according to claim 1, wherein the smoothing processing unit smoothes the output signal of the voice activity detection unit using a moving average.

a voice activity detection step of detecting, from voice data, a voice section in which sound is generated;
a smoothing processing step of smoothing an output signal of the voice activity detection step;
a speaking determination step of determining, in a stopped state, that the state is a speaking state when a state in which the smoothed signal output from the smoothing processing step is equal to or greater than a speech start threshold is maintained for a preset speech start determination time;
a stop determination step of determining, in the speaking state, that the state is the stopped state when a state in which the smoothed signal is equal to or less than a speech stop threshold is maintained for a preset speech stop determination time;
and a voice data management step of determining a speech section based on a speech start time at which the speaking state was determined and a speech stop time at which the stopped state was determined.

A speech recognition support program for causing a computer to execute the speech recognition support method according to claim 6.