JP6320962B2

JP6320962B2 - Speech recognition system, speech recognition method, program

Info

Publication number: JP6320962B2
Application number: JP2015061831A
Authority: JP
Inventors: 川瀬 智子; 小林 和則; 大室 仲
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2015-03-25
Filing date: 2015-03-25
Publication date: 2018-05-09
Anticipated expiration: 2035-03-25
Also published as: JP2016180914A

Description

The present invention relates to a speech recognition system that includes a client device, a plurality of speech recognition server devices, and a management unit, and to a speech recognition method and a program.

Patent Literature 1, for example, discloses a conventional server-client speech recognition method. In that method, a noise model is generated from the signal of a non-speech section observed at the client device, and the client device and the speech recognition server device each generate a common noise-superimposed speech model from the noise model and a speech model. The client device encodes the features of the input speech to be recognized on the basis of the noise-superimposed speech model and transmits the code to the speech recognition server device, which converts the received code back into features on the basis of the same noise-superimposed speech model. Because a noise-superimposed speech model based on the noise model is generated each time, this method has the advantage that it can handle speech recognition under a variety of noise conditions.

In the speech recognition method of Patent Literature 2, a similarity (called the noise similarity) is calculated between the signal of a noise section (a section that is not a speech section) at the time of speech input and the noise signal that was superimposed when a noise-superimposed speech model was created. A noise-superimposed speech model whose similarity is at or above a predetermined value is used as the probability model for speech recognition. If no such model exists, a noise-superimposed speech model adapted to the noise signal is created from a speech model stored in advance and the signal of the noise section, and is used as the probability model for speech recognition. This method has the advantage that it does not require advanced processing from the client device.

Patent Literature 1: Japanese Patent No. 4769121
Patent Literature 2: Japanese Patent No. 4242320

In the method of Patent Literature 1, if it takes time to go from observing the noise to generating the noise-superimposed speech model and using it for recognition, the noise characteristics at the time of observation may differ from those at the time of recognition, which can degrade speech recognition performance. Generating the noise-superimposed speech model quickly enough for recognition demands high processing capability from the client device. On the speech recognition server side as well, the workload temporarily increases in order to create the noise-superimposed speech model.

If the method of Patent Literature 2 is applied to a server-client speech recognition system used by a large number of users simultaneously, a single speech recognition server device must store noise-superimposed speech models for many kinds of noise in order to handle the noise-section signals of all users, which complicates model management. Alternatively, the workload of the speech recognition server device temporarily increases each time a noise-section signal is received from a client device, because a noise-superimposed speech model must be created.

Accordingly, an object of the present invention is to provide a speech recognition system that achieves high performance at a low introduction cost without requiring advanced processing from the client device.

The speech recognition system of the present invention includes a client device, a plurality of speech recognition server devices, and a management unit.

The client device includes a transmission unit. The transmission unit transmits an input acoustic signal, or a signal derived from the acoustic signal, to the speech recognition server device selected on the basis of the sound collection condition of that signal. Each speech recognition server device includes a setting storage unit and a usage rate transmission unit. The setting storage unit stores settings for speech recognition in advance. The usage rate transmission unit transmits to the management unit information on the usage rate, which is the proportion of cases in which client devices used the server device itself as the transmission destination. The management unit performs at least one of the following operations: transmitting to the client device a threshold for the sound collection condition that has been updated on the basis of the usage rate, and transmitting information on a setting that has been updated on the basis of the usage rate to a speech recognition server device identified on the basis of the usage rate.

According to the speech recognition system of the present invention, high performance can be achieved at a low introduction cost without requiring advanced processing from the client device.

Block diagram showing the configuration of the speech recognition system of Embodiment 1.
Block diagram showing the configuration of the speech recognition server device of Embodiment 1.
Block diagram showing the configuration of the management unit of Embodiment 1.
Sequence diagram showing the speech recognition operation of the speech recognition system of Embodiment 1.
Sequence diagram showing the information update operation of the speech recognition system of Embodiment 1.
Diagram showing the proportion of actual use (usage rate) before the settings are updated in an application example.
Diagram showing the proportion of actual use (usage rate) after the settings are updated in an application example.
Block diagram showing the configuration of the speech recognition system of Embodiment 2.
Block diagram showing the configuration of the speech recognition server device of Embodiment 2.
Block diagram showing the configuration of the management unit of Embodiment 2.
Sequence diagram showing the information update operation of the speech recognition system of Embodiment 2.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.

In the following description, the uttered signal that is the target of speech recognition is called the speech signal, and a signal picked up with the speech signal mixed with other signals, such as a background noise signal, is called the acoustic signal.

An overview of the speech recognition system of Embodiment 1 follows. As described above, if the acoustic signals input to client devices are distributed among a plurality of speech recognition server devices according to their sound collection conditions, and each speech recognition server device performs speech recognition, high performance can be achieved at a low introduction cost without requiring advanced processing from the client devices. However, the sound collection conditions under which client devices will use the system are unknown at the planning stage, so traffic may become concentrated on, or sparse at, particular speech recognition server devices. That is, if the system is used very frequently under a certain sound collection condition, the load concentrates on the speech recognition server device selected for that condition. Conversely, if the system is used very rarely under a certain sound collection condition, the speech recognition server device selected for that condition is hardly used. Processing delays caused by local load concentration are a disadvantage to users. A speech recognition server device that is hardly used is wasted equipment, which is a disadvantage to the system operator. Moreover, if user usage trends change after the system goes into operation and these problems newly arise, the system cannot follow the change, resulting in the same disadvantages to users and to the system operator.

Therefore, this embodiment discloses a speech recognition system that monitors the usage rate of each speech recognition server device and changes the settings of the speech recognition server devices or of the client devices according to that usage rate, so that load does not concentrate on a particular speech recognition server device and the performance of the speech recognition system as a whole is improved.
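As a rough illustration of the monitoring idea, the following sketch computes a usage rate per server from request counts and flags servers that look overloaded or idle. The function names and the bounds `low` and `high` are hypothetical and are not taken from the description; how the management unit actually decides on an update is not specified at this point in the text.

```python
def usage_rates(request_counts):
    """Map each server index to its share of all requests (0.0 if no requests)."""
    total = sum(request_counts.values())
    if total == 0:
        return {server: 0.0 for server in request_counts}
    return {server: count / total for server, count in request_counts.items()}


def flag_imbalance(rates, low=0.05, high=0.5):
    """Return (overloaded, idle) server indices; low/high are assumed bounds."""
    overloaded = sorted(s for s, r in rates.items() if r > high)
    idle = sorted(s for s, r in rates.items() if r < low)
    return overloaded, idle


rates = usage_rates({1: 70, 2: 28, 3: 2})
overloaded, idle = flag_imbalance(rates)
```

In this sketch, servers flagged as overloaded or idle would be the candidates for a threshold or setting update.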

The configuration of the speech recognition system of this embodiment is described below with reference to FIGS. 1, 2, and 3. FIG. 1 is a block diagram showing the configuration of the speech recognition system 1 of this embodiment. FIG. 2 is a block diagram showing the configuration of the speech recognition server device 21-n of this embodiment. FIG. 3 is a block diagram showing the configuration of the management unit 30 of this embodiment.

As shown in FIG. 1, the speech recognition system 1 of this embodiment includes a client device 10, a plurality of speech recognition server devices 21-1, ..., 21-n, ..., 21-N (N is an integer satisfying N ≥ 2, and n is an integer satisfying 1 ≤ n ≤ N), and a management unit 30. Although FIG. 1 shows only one client device 10, a plurality of client devices 10 are assumed to exist. The speech recognition server devices 21-1, ..., 21-n, ..., 21-N are collectively called the speech recognition server device group 20. The client device 10 and the speech recognition server device group 20 are connected via a network so that they can communicate wirelessly or by wire. The management unit 30 may be configured as a standalone piece of hardware (a device), in which case it may be called the management device 30. In that case, the client device 10, the speech recognition server device group 20, and the management unit 30 (management device 30) are connected via a network so that they can communicate wirelessly or by wire. Alternatively, the management unit 30 may be a component of the client device 10 or a component of any of the speech recognition server devices in the speech recognition server device group 20.

As shown in FIG. 1, the client device 10 includes a sound collection condition extraction unit 11, a threshold storage unit 111, a selection unit 12, a transmission destination storage unit 121, a signal processing unit 13, a transmission unit 14, a reception unit 15, a presentation unit 16, and a transmission destination change unit 17. As shown in FIG. 2, every speech recognition server device in the speech recognition server device group 20 (represented here by 21-n) includes an acoustic signal reception unit 21A, a speech recognition unit 21B, a recognition result transmission unit 21C, a usage rate transmission unit 21D, a setting information reception unit 21E, a setting update unit 21F, and a setting storage unit 21G. As shown in FIG. 3, the management unit 30 (management device 30) includes a usage rate reception unit 30A, a setting information update unit 30B, a setting information transmission unit 30C, a transmission destination information update unit 30D, a transmission destination information transmission unit 30E, a transmission destination storage unit 30G, and a setting storage unit 30F.

The speech recognition operation of the system is described below with reference to FIG. 4. FIG. 4 is a sequence diagram showing the speech recognition operation of the speech recognition system 1 of this embodiment. First, the sound collection condition extraction unit 11 extracts the sound collection condition of the input acoustic signal (S11). The selection unit 12 selects, on the basis of the extracted sound collection condition, the speech recognition server device (for example, the speech recognition server device 21-1) to which the corresponding acoustic signal is to be sent (S12). The relationship between sound collection conditions and destination speech recognition server devices is assumed to be stored in advance in the transmission destination storage unit 121.

<Sound collection conditions>
A sound collection condition can be, for example, a threshold-based condition on at least one of the following features: a feature related to the S/N ratio (the ratio of the magnitude of the speech signal to the magnitude of the background noise signal), a feature related to distortion of the acoustic signal, a feature related to the spectral shape of the background noise signal, and a feature related to the magnitude of the background noise signal. The thresholds are assumed to be stored in advance in the threshold storage unit 111.

The background noise signal is the signal observed by the microphone during a fixed period immediately before the uttered speech or target sound is input. The magnitude of the background noise signal is the average of its power spectrum over a fixed period. The spectral shape of the background noise signal is the set of components in each band of its spectrum and their variation over time. The S/N ratio of the speech signal to the background noise signal is the ratio of the magnitude of the speech signal contained in the acoustic signal during input of the uttered speech (target sound) to the magnitude of the background noise signal. As the speech signal, one can use the power spectrum obtained by subtracting the fixed-period average of the background noise power spectrum from the power spectrum of the acoustic signal over a fixed period during input of the uttered speech (target sound). The magnitude of the speech signal is the average of the power spectrum of the speech signal over a fixed period during input of the uttered speech (target sound).
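The definitions above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that per-frame power spectra are already available as lists of floats, that a scalar magnitude is obtained by averaging over both frames and frequency bins, and that negative values after subtraction are floored at zero. The names `mean_spectrum` and `snr_db` are hypothetical.

```python
import math


def mean_spectrum(frames):
    """Average a list of per-frame power spectra bin by bin."""
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]


def snr_db(noise_frames, input_frames):
    """Estimate the S/N ratio in dB from per-frame power spectra.

    noise_frames: spectra observed just before the utterance.
    input_frames: spectra observed while the utterance is being input.
    """
    noise_avg = mean_spectrum(noise_frames)
    noise_level = sum(noise_avg) / len(noise_avg)
    if noise_level <= 0.0:
        raise ValueError("background noise level must be positive")
    # Speech spectrum = observed spectrum minus average noise spectrum.
    # Flooring at 0 is an assumption made for this sketch.
    speech_frames = [
        [max(p - n, 0.0) for p, n in zip(frame, noise_avg)]
        for frame in input_frames
    ]
    speech_avg = mean_spectrum(speech_frames)
    speech_level = sum(speech_avg) / len(speech_avg)
    if speech_level == 0.0:
        return float("-inf")
    return 10.0 * math.log10(speech_level / noise_level)
```

The result is a power ratio expressed in dB, so a ratio of 1 corresponds to 0 dB and a ratio of 100 to 20 dB.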

Distortion of the acoustic signal refers to clipping in the microphone element, the microphone amplifier circuit, or the A/D converter caused by an excessively large acoustic input. Sections in which the input signal level has an amplitude at or above a predetermined threshold are detected, and the proportion of time they occupy is calculated. A high proportion means large distortion, and a low proportion means small distortion. If the amplitude never reaches the threshold, the signal can be regarded as free of distortion. The threshold is set to match the clipping level of the microphone element, the circuit, and the A/D conversion.
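A minimal sketch of this distortion measure follows. It assumes uniformly sampled integer samples, so that the fraction of samples at or above the clipping level approximates the proportion of time; the function name `clipping_ratio` is hypothetical, and the value 30000 is only the illustrative level for 16-bit samples that the description uses in its later example.

```python
def clipping_ratio(samples, clip_level):
    """Fraction of samples whose absolute amplitude is at or above clip_level."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return clipped / len(samples)


# Four 16-bit samples, two of which reach the assumed clipping level.
ratio = clipping_ratio([0, 30000, -32768, 100], clip_level=30000)
```

A ratio of 0.0 corresponds to the "free of distortion" case in the text.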

<Sound collection condition extraction unit 11 (S11), selection unit 12 (S12)>
Examples of the operations (S11, S12) of the sound collection condition extraction unit 11 and the selection unit 12 follow. The sound collection condition extraction unit 11, for example, extracts a feature representing the sound collection condition from the input acoustic signal and assigns the acoustic signal to a group (for example, a code representing the sound collection condition) according to the value of that feature.

Next, as shown in Table 1, the selection unit 12 selects the speech recognition server device (for example, the speech recognition server device 21-1) to which the corresponding acoustic signal is to be sent, on the basis of the relationship between the group (a code representing the sound collection condition) and the index (a code representing the destination speech recognition server device) (S12).

The feature x can be, for example, the S/N ratio (the ratio of the magnitude of the speech signal contained in the acoustic signal to the magnitude of the background noise signal), the presence or frequency of distortion in the acoustic signal, the spectral shape of the background noise signal, or the magnitude of the background noise signal.

When the feature x is the S/N ratio, thresholds are set, for example, as θ₁ = 0 dB, θ₂ = 10 dB, θ₃ = 20 dB, and so on. If x = 5 dB, the sound collection condition extraction unit 11 extracts group 2 as the sound collection condition, and the selection unit 12 selects index 2.

When the feature x is distortion of the acoustic signal, x is defined, for example, as the proportion of time within 0.5 seconds during which the absolute amplitude of a signal quantized at a bit depth of 16 bits is 30000 or more. With a threshold set as, for example, θ₁ = 0.8, if x = 0 the sound collection condition extraction unit 11 extracts group 1 as the sound collection condition and the selection unit 12 selects index 1, and if x = 0.9 the sound collection condition extraction unit 11 extracts group 2 and the selection unit 12 selects index 2.

When the feature x is the spectral shape of the background noise signal, the magnitude of the background noise signal is, for example, evaluated separately by frequency band or duration as x₁, x₂, ..., x_m (m is an integer satisfying m ≥ 2). The sound collection condition extraction unit 11 extracts a group from the combination of evaluation results, and the selection unit 12 selects the corresponding index. Another way to use the spectral shape of the background noise signal as a feature is to store models of several types of background noise signal and classify the background noise of the input signal as one of those models. The types of background noise signal are, for example, white noise, pink noise, and burst noise. In this method, a group is assigned to each model in advance, and the group is determined by the model into which the background noise of the input signal is classified.

When the feature x is the magnitude of the background noise signal, thresholds are set, for example, as θ₁ = 40 dBA, θ₂ = 55 dBA, θ₃ = 70 dBA, and so on. If x = 50 dBA, the sound collection condition extraction unit 11 extracts group 2 as the sound collection condition, and the selection unit 12 selects index 2. Here, dBA is the unit of a noise level in dB measured with the frequency weighting that accounts for human hearing (A-weighting).
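The threshold examples for the S/N ratio, distortion, and background noise level are all consistent with a simple rule: the group number is one plus the number of thresholds that the feature value has reached. The sketch below illustrates that reading. Treating a value exactly equal to a threshold as belonging to the higher group is an assumption, and `GROUP_TO_INDEX` is a hypothetical stand-in for Table 1, which is not reproduced in this text.

```python
from bisect import bisect_right


def group_for(x, thresholds):
    """Return the 1-based group for feature value x.

    thresholds must be sorted in ascending order (theta_1, theta_2, ...).
    """
    return bisect_right(thresholds, x) + 1


# Hypothetical group -> destination server index mapping (stand-in for Table 1).
GROUP_TO_INDEX = {1: 1, 2: 2, 3: 3, 4: 4}


def select_index(x, thresholds):
    """Map a feature value to the index of the destination server."""
    return GROUP_TO_INDEX[group_for(x, thresholds)]
```

With the values from the text, `group_for(5, [0, 10, 20])`, `group_for(0.9, [0.8])`, and `group_for(50, [40, 55, 70])` each evaluate to group 2, and `group_for(0, [0.8])` evaluates to group 1.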

<Signal processing unit 13 (S13)>
When the extracted sound collection condition satisfies a predetermined condition, the signal processing unit 13 applies signal processing to the corresponding acoustic signal (S13). Specifically, the signal processing unit 13 processes the acoustic signal so that its S/N ratio and background noise magnitude fall within the range of features that the speech recognition server device, determined from the sound collection condition extracted by the sound collection condition extraction unit 11, assumes for its recognition targets. For example, under a sound collection condition with an S/N ratio near 1, that is, near 0 dB, the speech signal and the background noise signal are of comparable magnitude, and using such an acoustic signal as it is for speech recognition tends to degrade performance. Therefore, when the sound collection condition extraction unit 11 extracts a condition with an S/N ratio near 0 dB, the signal processing unit 13 applies signal processing that suppresses the background noise signal in the acoustic signal. When the sound collection condition extraction unit 11 extracts a condition with an S/N ratio near 100, that is, near 20 dB, the signal processing unit 13 may adaptively suppress the background noise signal according to the S/N ratio, as in the 0 dB case, or may perform no suppression at all. Under other sound collection conditions as well, the signal processing unit 13 adaptively applies signal processing to the acoustic signal on the basis of the result extracted by the sound collection condition extraction unit 11.

An example of the operation (S13) of the signal processing unit 13 follows. In speech recognition, the input speech is often corrected by signal processing as a preprocessing step. Acoustic characteristics that should be handled by preprocessing include, for example, additive noise and multiplicative noise. Additive noise is a signal observed additively with the speech signal, such as ambient noise present in the speech input environment. Multiplicative noise is noise (distortion) arising from acoustic characteristics such as microphone characteristics and spatial transfer characteristics; in the time waveform it is observed as a convolution with the original speech waveform, and in the spectrum it appears as a multiplicative distortion. Examples of speech recognition processing that addresses additive noise include methods that suppress the noise in noisy speech before recognition, such as the noise suppression method based on spectral subtraction disclosed in paragraph [0005] of Reference Patent Document 1 and the noise suppression method based on the Wiener filter method (hereinafter, the WF method) disclosed in paragraph [0007] of the same document.
(Reference Patent Document 1: Japanese Patent No. 4464797)
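For orientation, the textbook form of power spectral subtraction can be sketched as below. This is a generic illustration and not necessarily the method disclosed in the cited reference; the over-subtraction factor `alpha` and the spectral floor `floor` are common tuning parameters introduced here as assumptions.

```python
def spectral_subtraction(frame_power, noise_power, alpha=1.0, floor=0.01):
    """Subtract an estimated noise power spectrum from one frame.

    frame_power: power spectrum of the noisy frame, one value per bin.
    noise_power: estimated noise power spectrum, same length.
    alpha: over-subtraction factor (assumed parameter).
    floor: fraction of the noisy power kept as a lower bound, which
           avoids negative power values (assumed parameter).
    """
    cleaned = []
    for p, n in zip(frame_power, noise_power):
        cleaned.append(max(p - alpha * n, floor * p))
    return cleaned


result = spectral_subtraction([10.0, 1.0], [2.0, 2.0])
```

In the second bin the noise estimate exceeds the observed power, so the floor value is used instead of a negative result.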

An example of speech recognition processing that addresses multiplicative noise in addition to additive noise is the method of Reference Patent Document 1, in which a noise-superimposed speech model is generated by superimposing a noise model on a speech model from which the influence of multiplicative noise has been removed, and the model is then updated on the basis of multiplicative features. Another is the method of the invention of Reference Patent Document 2, in which the noise model is also normalized on the basis of multiplicative noise features before a normalized noise-superimposed speech model is generated.
(Reference Patent Document 2: Japanese Patent No. 5200080)

Noise suppression is the typical signal processing performed by the signal processing unit 13. Signal processing other than noise suppression may also be used, for example AGC (Automatic Gain Control), CMN (Cepstrum Mean Normalization), or an equalizer.

<AGC>
Automatic Gain Control (AGC) is a process that detects the input signal level from the short-time average power or short-time average amplitude of the input speech signal and adjusts the gain of the speech input stage so that the difference between the input signal level and the optimum level (target value) becomes small. AGC prevents the speech waveform after A/D conversion from becoming too small or too large, which would make the speech features unclear. AGC is disclosed, for example, in paragraph [0001] of Reference Patent Document 3.
(Reference Patent Document 3: Japanese Patent No. 3588555)
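The gain adjustment described for AGC can be sketched as follows. This is a simplified illustration, not the circuit of the cited reference: it measures the level as the RMS of one frame (the square root of the short-time average power) and moves the gain a fraction `step` of the way toward the gain that would reach the target. The parameters `step` and `max_gain` are assumptions.

```python
import math


def agc_gain(frame, target_rms, current_gain=1.0, step=0.1, max_gain=10.0):
    """Return an updated gain that moves the frame level toward target_rms.

    frame: input samples before any gain is applied.
    step: smoothing factor between 0 and 1 (assumed parameter).
    max_gain: upper bound on the gain (assumed parameter).
    """
    if not frame:
        return current_gain
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0.0:
        return current_gain  # silence: leave the gain unchanged
    desired = target_rms / rms
    new_gain = current_gain + step * (desired - current_gain)
    return min(max(new_gain, 0.0), max_gain)
```

Calling the function once per frame and feeding the returned gain back in as `current_gain` gives a gain that tracks slow level changes without reacting abruptly to single frames.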

＜CMN＞
Cepstrum Mean Normalization(CMN)とは、音声認識の特徴量であるケプストラムにおいて、入力音声信号の長時間ケプストラム平均を求め、各フレームの入力音声のケプストラムから長時間ケプストラム平均をさし引く処理である。CMNは、マイクロホンの特性、マイクロホンの位置、部屋の形状に代表される乗算性ひずみの影響を軽減するために用いられる。CMNについては、例えば参考特許文献１の段落［００１０］に開示されている。
<CMN>
Cepstrum Mean Normalization (CMN) operates on the cepstrum, a feature quantity used for speech recognition: it computes the long-term cepstrum average of the input speech signal and subtracts that average from the cepstrum of each frame of the input speech. CMN is used to reduce the influence of multiplicative distortion, typified by the characteristics of the microphone, the position of the microphone, and the shape of the room. CMN is disclosed, for example, in paragraph [0010] of Reference Patent Document 1.
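As a minimal sketch of the subtraction described above, the following assumes that the cepstral vectors have already been computed and that the long-term average is taken over all frames of one utterance; the function name `cepstral_mean_normalize` and the sample values are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean cepstrum from every frame.

    frames: non-empty list of cepstral vectors (lists of floats of equal length).
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Two toy frames, and the same frames with a constant cepstral offset added.
# A fixed channel (microphone, room) appears as such an offset in the cepstrum.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[1.5, 1.0], [3.5, 5.0]]  # clean + [0.5, -1.0]

normalized_clean = cepstral_mean_normalize(clean)
normalized_shifted = cepstral_mean_normalize(shifted)
```

Both inputs normalize to the same vectors, which is why CMN reduces the influence of a fixed multiplicative distortion.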

なお、クライアント装置１０の信号処理部１３でCMNを実施する場合、クライアント装置１０から音声認識サーバ装置へは、音声認識のための音響信号に由来する信号として、CMN適用後のMFCC(メル周波数ケプストラム)が送信されることとしておけば、音声認識サーバ装置で再度ケプストラム分析する処理を省くことができる。 When the signal processing unit 13 of the client device 10 performs CMN, the client device 10 may transmit the MFCC (mel-frequency cepstrum) obtained after CMN to the speech recognition server device as the signal derived from the acoustic signal for speech recognition. The speech recognition server device can then omit performing cepstrum analysis again.

＜イコライザ＞
イコライザとは、入力音声信号のゲインを周波数帯域ごとに調整する処理である。例えば音声入力用のマイクロホンの音響特性が平坦でないことが予めわかっていれば、イコライザを経由することで、音響特性を改善したうえで収音することができる。イコライザについては、例えば参考特許文献４の段落［００１０］、［００１６］に開示されている。
（参考特許文献４：特許第２８６５２６８号公報）
<Equalizer>
An equalizer adjusts the gain of the input audio signal for each frequency band. For example, if it is known in advance that the acoustic characteristics of the microphone used for voice input are not flat, passing the signal through an equalizer allows sound to be collected with improved acoustic characteristics. Equalizers are disclosed, for example, in paragraphs [0010] and [0016] of Reference Patent Document 4.
(Reference Patent Document 4: Japanese Patent No. 2865268)

次に、送信部１４は、抽出された収音条件に対応する音声認識サーバ装置（ステップＳ１２で選択された音声認識サーバ装置）に、音響信号または音響信号に由来する信号を送信する（Ｓ１４）。このとき、送信部１４は、ステップＳ１３の信号処理がされていない場合と信号処理がされた場合とで送信先を異ならせて、信号処理がされていない音響信号、または信号処理がされた音響信号を送信するものとする。また、ステップＳ１２で選択された音声認識サーバ装置とは関係なくステップＳ１３の信号処理が実施されたか否かだけで、異なる音声認識サーバ装置のうちのいずれかの送信先を決定しても良い。なお、音響信号に由来する信号とは、音響信号の特徴量を表す信号、ステップＳ１３における信号処理を施した音響信号などを指す。また送信部１４は、音響信号または音響信号に由来する信号を送信する際に、収音条件（グループ）やそのしきい値、信号処理部１３における信号処理の有無に関する情報を音声認識サーバ装置に送信しても良い。音声認識サーバ装置は収音条件（グループ）やそのしきい値、や信号処理の有無から、どのような収音条件または信号処理条件において当該音声認識サーバ装置が選択されたかを記録することが可能になる。 Next, the transmission unit 14 transmits the acoustic signal, or a signal derived from the acoustic signal, to the speech recognition server device corresponding to the extracted sound collection condition (the speech recognition server device selected in step S12) (S14). Here, the transmission unit 14 uses different transmission destinations depending on whether or not the signal processing of step S13 has been performed, and transmits either the unprocessed acoustic signal or the processed acoustic signal. Alternatively, the transmission destination may be chosen from among the different speech recognition server devices solely according to whether the signal processing of step S13 was performed, independently of the speech recognition server device selected in step S12. A signal derived from the acoustic signal means, for example, a signal representing a feature quantity of the acoustic signal, or an acoustic signal to which the signal processing of step S13 has been applied. When transmitting the acoustic signal or the signal derived from it, the transmission unit 14 may also send the speech recognition server device information on the sound collection condition (group), its threshold value, and whether signal processing was performed in the signal processing unit 13. From this information, the speech recognition server device can record under which sound collection condition or signal processing condition it was selected.

音声認識サーバ装置２１−１、…、２１−ｎ、…、２１−Ｎの音響信号受信部２１Ａは、クライアント装置１０から音響信号または音響信号に由来する信号を受信する（Ｓ２１Ａ）。音響信号または音響信号に由来する信号を受信した音声認識サーバ装置（例えば音声認識サーバ装置２１−１）の音声認識部２１Ｂは、音声認識処理を実行する（Ｓ２１Ｂ）。 The acoustic signal receiving units 21A of the voice recognition server devices 21-1, ..., 21-n, ..., 21-N receive an acoustic signal or a signal derived from the acoustic signal from the client device 10 (S21A). The speech recognition unit 21B of the speech recognition server device (for example, the speech recognition server device 21-1) that has received the acoustic signal or the signal derived from the acoustic signal executes speech recognition processing (S21B).

＜音声認識処理（Ｓ２１Ｂ）＞
ステップＳ２１Ｂの音声認識処理は、例えば以下のように実行される。音声認識部２１Ｂは、一文章や一単語の発話を文字列に変換する。音声認識部２１Ｂは、音声特徴量として音声のパワーやその変化量、MFCC(メル周波数ケプストラム、Mel-Frequency Cepstrum Coefficient)やその動的変化量を用いる。音声認識部２１Ｂは、統計的な音響モデルや言語モデルを用いて単語列を探索する。
<Speech recognition processing (S21B)>
The speech recognition processing of step S21B is executed, for example, as follows. The speech recognition unit 21B converts the utterance of one sentence or one word into a character string. As speech features, the speech recognition unit 21B uses the power of the speech and its variation, and the MFCC (Mel-Frequency Cepstrum Coefficient) and its dynamic variation. The speech recognition unit 21B searches for a word string using a statistical acoustic model and a language model.

ステップＳ２１Ｂの音声認識処理を実行後、認識結果送信部２１Ｃは、音声認識結果をクライアント装置１０に送信する（Ｓ２１Ｃ）。クライアント装置１０の受信部１５は、音声認識結果を受信する（Ｓ１５Ａ）。クライアント装置１０の呈示部１６は、受信した音声認識結果を呈示する（Ｓ１６）。 After executing the speech recognition process in step S21B, the recognition result transmitting unit 21C transmits the speech recognition result to the client device 10 (S21C). The receiving unit 15 of the client device 10 receives the voice recognition result (S15A). The presentation unit 16 of the client device 10 presents the received voice recognition result (S16).

以下、図５を参照して本実施例の音声認識システム１の情報更新動作について説明する。図５は、本実施例の音声認識システム１の情報更新動作を示すシーケンス図である。まず、全ての音声認識サーバ装置の利用率送信部２１Ｄは、利用率に関する情報を管理部３０に送信する（Ｓ２１Ｄ）。ある音声認識サーバ装置の利用率は、例えばクライアント装置１０が当該音声認識サーバ装置を送信先として利用した回数やデータ送信量などを、当該音声認識サーバ装置を利用した全てのクライアント装置１０について累計し所定時間で除算した割合と定義される。ステップＳ２１Ｄは、予め設定した時刻に定期的に実行されてもよいし、予め設定した所定時間経過ごとに実行されてもよい。利用率送信部２１Ｄは、利用率そのものではなく利用率に関する情報を送ってもよい。利用率に関する情報とは、例えば各クライアント装置で入力された音響信号または音響信号に由来する信号のデータについての、単位時間あたりのデータ受信量（送信量）や、音声認識サーバ装置で認識処理した際のCPU時間などである。単位時間あたりのデータ受信量（送信量）やCPU時間は、単独では利用率そのものを表す情報ではないが、管理部３０は他の音声認識サーバ装置からも同様の受信量（送信量）やCPU時間を取得し、管理部３０がこれらの受信量（送信量）やCPU時間を集計することにより単位時間あたりの利用率を求めることができる。従って、単位時間あたりのデータ受信量（送信量）やCPU時間などは利用率に関する情報に分類される。管理部３０の利用率受信部３０Ａは、音声認識サーバ装置群２０から利用率に関する情報を受信する（Ｓ３０Ａ）。管理部３０の設定情報更新部３０Ｂは、利用率に関する情報から求めた利用率に基づいて特定される音声認識サーバ装置（ここでは２１−ｎに代表させる）の設定に関する情報（設定情報）を更新する（Ｓ３０Ｂ）。更新前の設定情報は、管理部３０の設定記憶部３０Ｆに記憶されているものとし、設定情報更新部３０Ｂは更新された設定情報を、設定記憶部３０Ｆに記憶、または上書き記憶する。 Hereinafter, the information update operation of the speech recognition system 1 of the present embodiment will be described with reference to FIG. 5. FIG. 5 is a sequence diagram showing the information update operation of the speech recognition system 1 of the present embodiment. First, the utilization rate transmission units 21D of all the speech recognition server devices transmit information on the utilization rate to the management unit 30 (S21D). The utilization rate of a given speech recognition server device is defined, for example, as the number of times the client devices 10 used that speech recognition server device as a transmission destination, or the amount of data transmitted to it, accumulated over all client devices 10 that used the device and divided by a predetermined time. Step S21D may be executed periodically at a preset time, or each time a preset period elapses. The utilization rate transmission unit 21D may send information on the utilization rate rather than the utilization rate itself.
The information on the utilization rate is, for example, the amount of data received (transmitted) per unit time for the acoustic signals, or signals derived from them, input at each client device, or the CPU time spent on recognition processing in the speech recognition server device. The amount of data received (transmitted) per unit time and the CPU time do not by themselves represent the utilization rate, but the management unit 30 obtains the same quantities from the other speech recognition server devices and can determine the utilization rate per unit time by aggregating them. Accordingly, the amount of data received (transmitted) per unit time, the CPU time, and the like are classified as information on the utilization rate. The utilization rate receiving unit 30A of the management unit 30 receives the information on the utilization rate from the speech recognition server device group 20 (S30A). The setting information update unit 30B of the management unit 30 updates the information on the settings (setting information) of the speech recognition server device (represented here by 21-n) identified on the basis of the utilization rate obtained from the information on the utilization rate (S30B). The setting information before the update is assumed to be stored in the setting storage unit 30F of the management unit 30, and the setting information update unit 30B stores the updated setting information in the setting storage unit 30F, or overwrites the stored information with it.
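The utilization rate definition above (an amount accumulated over all client devices and divided by a predetermined time) can be sketched as follows. The log format, the identifiers, and the function name `usage_per_hour` are assumptions made for illustration; the description does not fix any particular record layout or time unit.

```python
from collections import defaultdict


def usage_per_hour(log_entries, window_hours):
    """Accumulate received bytes per server over all clients, then divide by the window.

    log_entries: iterable of (server_id, client_id, received_bytes) tuples
    (a hypothetical format). window_hours: length of the observation window.
    """
    totals = defaultdict(float)
    for server_id, _client_id, received_bytes in log_entries:
        totals[server_id] += received_bytes
    return {server_id: total / window_hours for server_id, total in totals.items()}


# Hypothetical log covering a two-hour window.
log = [
    ("21-1", "client-a", 300),
    ("21-1", "client-b", 100),
    ("21-3", "client-a", 200),
]
rates = usage_per_hour(log, window_hours=2)
```

Counting requests instead of bytes, or summing CPU time, would follow the same pattern with a different accumulated quantity.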

＜設定＞
ここで、設定とは各音声認識サーバ装置の設定記憶部２１Ｇに記憶される情報であって、音声認識に利用する音響モデル、言語モデル、当該音響モデル、当該言語モデルを用いた認識動作に関する動作設定、音声認識に用いる他のパラメータ、その他音声認識に際して予め決めておく設定全般を指す。各設定は、ある収音条件に特化して高い認識性能を有するように調整されているものとし、各音声認識サーバ装置は互いに異なる設定を有しているか、あるいは数台で同じ設定を共有しているものとする。設定記憶部２１Ｇには、複数の設定を記憶しておくこともできる。この場合、各音声認識サーバ装置は、設定記憶部２１Ｇに記憶された設定のうちの一つをアクティブな設定として予め選択しているものとする。
<Setting>
Here, a setting is information stored in the setting storage unit 21G of each speech recognition server device. It covers the acoustic model and language model used for speech recognition, the operation settings for the recognition operation that uses that acoustic model and language model, other parameters used for speech recognition, and any other settings determined in advance for speech recognition. Each setting is assumed to be tuned to give high recognition performance for a particular sound collection condition, and the speech recognition server devices either have mutually different settings or share the same setting among several devices. The setting storage unit 21G can also store a plurality of settings. In this case, each speech recognition server device is assumed to have selected one of the settings stored in the setting storage unit 21G in advance as the active setting.

前述の設定情報更新部３０Ｂは、利用率が低い音声認識サーバ装置向けの設定情報を、利用率が高い音声認識サーバ装置の設定と同じになるように更新してもよい。これにより、負荷が集中している音声認識サーバ装置と設定を共有する音声認識サーバ装置が増えることとなるため、負荷の集中が緩和される。また、前述の設定情報更新部３０Ｂは、利用率が高い音声認識サーバ装置向けの設定情報を、利用率が低い音声認識サーバ装置の設定と同じになるように更新してもよい。これにより、該当する音声認識サーバ装置に対するトラフィックを一時的に減少させることができる。ただし、この場合は負荷が集中する要因が取り除かれたわけではないため、他の音声認識サーバ装置の設定情報を更新することにより、負荷の集中を緩和する措置が別途必要となる。 The setting information update unit 30B described above may update the setting information for the voice recognition server device having a low usage rate so as to be the same as the setting of the voice recognition server device having a high usage rate. This increases the number of voice recognition server devices that share settings with the voice recognition server device on which the load is concentrated, thus reducing the load concentration. Further, the setting information update unit 30B described above may update the setting information for the voice recognition server device having a high usage rate so as to be the same as the setting of the voice recognition server device having a low usage rate. Thereby, the traffic with respect to the applicable speech recognition server apparatus can be reduced temporarily. However, in this case, since the factor that concentrates the load is not removed, a measure to alleviate the concentration of the load is required by updating the setting information of other voice recognition server devices.

次に、管理部３０の設定情報送信部３０Ｃは、利用率が高い、または利用率が低い音声認識サーバ装置（一つ以上、複数でも可）に対して前述の設定情報を送信する（Ｓ３０Ｃ）。例えば、ステップＳ３０Ｂにおいて利用率が低い音声認識サーバ装置向けに設定情報を更新した場合、設定情報送信部３０Ｃは、当該利用率が低い音声認識サーバ装置に対して、当該設定情報を送信する（Ｓ３０Ｃ）。反対に、ステップＳ３０Ｂにおいて利用率が高い音声認識サーバ装置向けに設定情報を更新した場合、設定情報送信部３０Ｃは、当該利用率が高い音声認識サーバ装置に対して、当該設定情報を送信する（Ｓ３０Ｃ）。 Next, the setting information transmission unit 30C of the management unit 30 transmits the setting information described above to one or more speech recognition server devices having a high or a low utilization rate (S30C). For example, when the setting information has been updated in step S30B for a speech recognition server device having a low utilization rate, the setting information transmission unit 30C transmits the setting information to that speech recognition server device (S30C). Conversely, when the setting information has been updated in step S30B for a speech recognition server device having a high utilization rate, the setting information transmission unit 30C transmits the setting information to that speech recognition server device (S30C).

設定情報送信部３０Ｃは、最高の利用率、最低の利用率に該当する音声認識サーバ装置の何れか（双方でも良い）に対して前述の設定情報を送信してもよい（Ｓ３０Ｃ）。典型的には設定情報送信部３０Ｃは、ステップＳ３０Ｂにおいて利用率が最低となる音声認識サーバ装置に対して、利用率が最高となる音声認識サーバ装置と設定が共有されるように更新した設定情報を送信することが考えられる。あるいは設定情報送信部３０Ｃは、ステップＳ３０Ｂにおいて利用率が最高となり、負荷が集中している音声認識サーバ装置に対して、利用率が最低、あるいは利用率が平均となる音声認識サーバ装置と設定が共有されるように更新した設定情報を送信することが考えられる。この場合は当該音声認識サーバ装置に対する一時的なトラフィック増大の回避が目的である。 The setting information transmission unit 30C may transmit the setting information described above to either (or both) of the speech recognition server devices having the highest and the lowest utilization rate (S30C). Typically, the setting information transmission unit 30C transmits, to the speech recognition server device having the lowest utilization rate, setting information updated in step S30B so that its setting is shared with the speech recognition server device having the highest utilization rate. Alternatively, the setting information transmission unit 30C may transmit, to the speech recognition server device that has the highest utilization rate and on which the load is concentrated, setting information updated in step S30B so that its setting is shared with a speech recognition server device having the lowest or an average utilization rate. In this case, the purpose is to avoid a temporary increase in traffic to that speech recognition server device.

音声認識サーバ装置２１−ｎの設定情報受信部２１Ｅは、管理部３０から設定情報を受信する（Ｓ２１Ｅ）。前述したように、音声認識サーバ装置２１−ｎの設定記憶部２１Ｇに、音声認識に関する設定が予め複数記憶されている場合、設定更新部２１Ｆは、受信した設定情報に基づいて記憶された複数の設定のうちの一つを（アクティブな設定として）選択することで設定を更新する（Ｓ２１Ｆ）。 The setting information receiving unit 21E of the voice recognition server device 21-n receives setting information from the management unit 30 (S21E). As described above, when a plurality of settings related to voice recognition are stored in advance in the setting storage unit 21G of the voice recognition server device 21-n, the setting update unit 21F stores the plurality of settings stored based on the received setting information. The setting is updated by selecting one of the settings (as an active setting) (S21F).

次に、管理部３０の送信先情報更新部３０Ｄは、設定情報送信部３０Ｃが設定情報を送信した場合に、これに併せて送信先情報を更新する（Ｓ３０Ｄ）。前述したように、送信先情報とは、収音条件と送信先となる音声認識サーバ装置との関係に関する情報である。送信先情報とは、収音条件と送信先音声認識サーバ装置とを結びつける情報といってもよい。更新前の送信先情報は、管理部３０の送信先情報記憶部３０Ｇに記憶されているものとし、送信先情報更新部３０Ｄは更新された送信先情報を、送信先情報記憶部３０Ｇに記憶、または上書き記憶する。管理部３０の送信先情報送信部３０Ｅは、更新された送信先情報をクライアント装置１０に送信する（Ｓ３０Ｅ）。 Next, when the setting information transmission unit 30C transmits the setting information, the transmission destination information update unit 30D of the management unit 30 updates the transmission destination information along with this (S30D). As described above, the transmission destination information is information related to the relationship between the sound collection condition and the voice recognition server device that is the transmission destination. The transmission destination information may be referred to as information that links the sound collection condition and the transmission destination speech recognition server apparatus. It is assumed that the transmission destination information before update is stored in the transmission destination information storage unit 30G of the management unit 30, and the transmission destination information update unit 30D stores the updated transmission destination information in the transmission destination information storage unit 30G. Or memorize by overwriting. The transmission destination information transmission unit 30E of the management unit 30 transmits the updated transmission destination information to the client device 10 (S30E).

クライアント装置１０の受信部１５は、管理部３０から送信先情報を受信する（Ｓ１５Ｂ）。クライアント装置１０において、更新前の送信先情報は、送信先記憶部１２１に記憶されている。送信先変更部１７は、管理部３０から受信した送信先情報に基づいて記憶された送信先情報を変更する（Ｓ１７）。選択部１２は、入力された音響信号の収音条件とステップＳ１７において変更された送信先情報に基づいて、対応する音響信号の送信先となる音声認識サーバ装置を選択する（Ｓ１２）。 The receiving unit 15 of the client device 10 receives the transmission destination information from the management unit 30 (S15B). In the client device 10, transmission destination information before update is stored in the transmission destination storage unit 121. The transmission destination changing unit 17 changes the stored transmission destination information based on the transmission destination information received from the management unit 30 (S17). The selection unit 12 selects a voice recognition server device that is a transmission destination of the corresponding acoustic signal based on the sound pickup condition of the input acoustic signal and the transmission destination information changed in step S17 (S12).

＜本システムの適用例＞
以下、本システムの適用例について説明する。まず事前学習により収音条件を抽出するためのしきい値を決定しておく。同様に、音声認識サーバ装置群２０の音響モデルのパラメータを決定しておく。具体的には、サーバの管理者が予め音響モデルを含む認識動作設定を複数通り学習しておき、学習結果を音声認識サーバ装置群２０に保存する。この例では、音声認識サーバ装置が１０台用意されているものとし、１０台の音声認識サーバ装置のうち２台ずつに収音条件に対応した５種類の認識動作設定（設定Ａ、Ｂ、Ｃ、Ｄ、Ｅと呼称する）を保存しておくものとする。設定Ａが保存された２台の音声認識サーバ装置を音声認識サーバ装置２１−１、２１−２と呼称する。同様に、各２台ずつ設定Ｂ、Ｃ、Ｄ、Ｅを保持する音声認識サーバ装置を、音声認識サーバ装置２１−３と２１−４、２１−５と２１−６、２１−７と２１−８、２１−９と２１−１０と呼称する。これらの音声認識サーバ装置をまとめて呼称する際には、前述と同様に音声認識サーバ装置群２０と呼ぶ。
<Application example of this system>
Hereinafter, an application example of this system will be described. First, the threshold values for extracting the sound collection condition are determined by prior learning. Similarly, the parameters of the acoustic models of the speech recognition server device group 20 are determined. Specifically, the server administrator trains a plurality of recognition operation settings, each including an acoustic model, in advance and stores the training results in the speech recognition server device group 20. In this example, ten speech recognition server devices are assumed to be available, and five kinds of recognition operation settings corresponding to sound collection conditions (referred to as settings A, B, C, D, and E) are each stored on two of the ten devices. The two speech recognition server devices in which setting A is stored are referred to as speech recognition server devices 21-1 and 21-2. Similarly, the pairs of speech recognition server devices holding settings B, C, D, and E are referred to as speech recognition server devices 21-3 and 21-4, 21-5 and 21-6, 21-7 and 21-8, and 21-9 and 21-10, respectively. When these speech recognition server devices are referred to collectively, they are called the speech recognition server device group 20, as before.

音声認識サーバ装置群２０は特定の音声レベル、雑音レベル、雑音の定常性の音声入力に対して特化した動作設定を保持している。例えば、音声認識サーバ装置２１−１、２１−２は雑音が低いレベルで混入した音声を用いて作成した音響モデルＡと音響モデルＡに適した動作設定（設定Ａ）を保持する。音声認識サーバ装置２１−３、２１−４は雑音が中程度のレベルで混入した音声を用いて作成した音響モデルＢと音響モデルＢに適した動作設定（設定Ｂ）を保持する。音声認識サーバ装置２１−５、２１−６は非定常雑音が混入した音声を用いて作成した音響モデルＣと音響モデルＣに適した動作設定（設定Ｃ）を保持する。音声認識サーバ装置２１−７、２１−８は雑音が高いレベルで混入した音声を用いて作成した音響モデルＤと音響モデルＤに適した動作設定を保持する（設定Ｄ）。音声認識サーバ装置２１−９、２１−１０は信号レベルが低い音声を用いて作成した音響モデルＥと音響モデルＥに適した動作設定（設定Ｅ）を保持する。音声認識サーバ装置と、これらに保持される設定の関係を下表に示す。 The voice recognition server device group 20 holds operation settings specialized for a voice input of a specific voice level, noise level, and noise steadiness. For example, the speech recognition server devices 21-1 and 21-2 hold the acoustic model A created using speech mixed at a low noise level and operation settings (setting A) suitable for the acoustic model A. The speech recognition server devices 21-3 and 21-4 hold the acoustic model B created using speech mixed with a medium level of noise and operation settings (setting B) suitable for the acoustic model B. The speech recognition server devices 21-5 and 21-6 hold the acoustic model C created using speech mixed with non-stationary noise and operation settings (setting C) suitable for the acoustic model C. The speech recognition server devices 21-7 and 21-8 hold the acoustic model D created using speech mixed with a high level of noise and operation settings suitable for the acoustic model D (setting D). The speech recognition server devices 21-9 and 21-10 hold an acoustic model E created using speech with a low signal level and operation settings (setting E) suitable for the acoustic model E. The table below shows the relationship between the voice recognition server device and the settings held in these servers.

クライアント装置１０の収音条件抽出部１１は、入力された音響信号から計算した音声レベル、雑音レベル、雑音の定常性などから、音響信号の収音条件を抽出する（Ｓ１１）。選択部１２は、抽出された収音条件に基づいて、最適な音声認識動作設定を保持する音声認識サーバ装置を選択する（Ｓ１２）。信号処理部１３は、抽出された収音条件に従い、音響信号に混入した雑音成分を抑圧する（Ｓ１３）。送信部１４は、雑音が抑圧された音響信号を、ステップＳ１２で選択された音声認識サーバ装置に送信する（Ｓ１４）。 The sound collection condition extraction unit 11 of the client device 10 extracts the sound collection condition of the acoustic signal from the voice level, noise level, noise continuity, and the like calculated from the input acoustic signal (S11). The selection unit 12 selects a speech recognition server device that holds the optimal speech recognition operation setting based on the extracted sound collection conditions (S12). The signal processing unit 13 suppresses a noise component mixed in the acoustic signal according to the extracted sound collection condition (S13). The transmission unit 14 transmits the acoustic signal in which noise is suppressed to the voice recognition server device selected in step S12 (S14).

音声認識サーバ装置群２０の音響信号受信部２１Ａは、クライアント装置１０から音響信号を受信する（Ｓ２１Ａ）と例えば下表のような通信ログを自装置の所定の記憶領域に保存する。 When the acoustic signal receiving unit 21A of the voice recognition server device group 20 receives the acoustic signal from the client device 10 (S21A), for example, a communication log as shown in the table below is stored in a predetermined storage area of the own device.

音声認識部２１Ｂは、設定記憶部２１Ｇに保持された何れかのアクティブな設定（設定Ａ、Ｂ、Ｃ、Ｄ、Ｅの何れか）に従って、音声認識処理を実行する（Ｓ２１Ｂ）。認識結果送信部２１Ｃは、音声認識結果をクライアント装置１０に送信する（Ｓ２１Ｃ）。 The voice recognition unit 21B executes voice recognition processing according to any active setting (any one of settings A, B, C, D, and E) held in the setting storage unit 21G (S21B). The recognition result transmission unit 21C transmits the voice recognition result to the client device 10 (S21C).

音声認識サーバ装置群２０の利用率送信部２１Ｄは、例えば1週間に1回、通信ログを管理部３０に送信する。管理部３０は、音声認識サーバ装置群２０から受信した音響信号のデータサイズを総計する。例えば、受信したデータサイズが音声認識サーバ装置群２０全体で１０ＧＢ、音声認識サーバ装置２１−１、２１−２で５ＧＢ、音声認識サーバ装置２１−３、２１−４で２ＧＢ、音声認識サーバ装置２１−５、２１−６で１．５ＧＢ、音声認識サーバ装置２１−７、２１−８で１ＧＢ、音声認識サーバ装置２１−９、２１−１０で０．５ＧＢであったとする。図６は、本適用例における設定更新前の利用実績の割合（利用率）を示す図である。 The utilization rate transmission unit 21D of the voice recognition server device group 20 transmits the communication log to the management unit 30, for example, once a week. The management unit 30 totals the data sizes of the acoustic signals received from the voice recognition server device group 20. For example, the received data size is 10 GB for the entire speech recognition server device group 20, 5 GB for the speech recognition server devices 21-1 and 21-2, 2 GB for the speech recognition server devices 21-3 and 21-4, and the speech recognition server device 21. -5, 21-6 is 1.5 GB, the speech recognition server device 21-7, 21-8 is 1 GB, and the speech recognition server device 21-9, 21-10 is 0.5 GB. FIG. 6 is a diagram illustrating a ratio (utilization rate) of a usage record before setting update in the application example.

設定更新後に設定Ａを保持する音声認識サーバ装置の台数$N_A$は、例えば式(1)で計算することができる。
$$N_A = \left\lfloor \frac{N_{all} \cdot D_A}{D_{all}} + 0.5 \right\rfloor \quad (1)$$
$N_{all}$は音声認識サーバ装置の総数、$D_{all}$は全音声認識サーバ装置で受信したデータサイズの合計、$D_A$は設定Ａを保持する音声認識サーバ装置２１−１、２１−２で受信したデータサイズの合計である。$\lfloor \cdot \rfloor$は床関数であり、$\lfloor x + 0.5 \rfloor$は$x$の四捨五入を意味する。設定更新後に設定Ｂ、Ｃ、Ｄを保持する音声認識サーバ装置の台数$N_B$、$N_C$、$N_D$は式(1)の添え字ＡをＢ、Ｃ、Ｄに書き換えて得られる。設定更新後に設定Ｅを保持する音声認識サーバ装置の台数$N_E$は、例えば式(2)で計算することができる。
$$N_E = N_{all} - (N_A + N_B + N_C + N_D) \quad (2)$$
The number $N_A$ of speech recognition server devices that hold setting A after the setting update can be calculated, for example, by Expression (1).
$$N_A = \left\lfloor \frac{N_{all} \cdot D_A}{D_{all}} + 0.5 \right\rfloor \quad (1)$$
$N_{all}$ is the total number of speech recognition server devices, $D_{all}$ is the total data size received by all the speech recognition server devices, and $D_A$ is the total data size received by the speech recognition server devices 21-1 and 21-2 holding setting A. $\lfloor \cdot \rfloor$ is the floor function, and $\lfloor x + 0.5 \rfloor$ means rounding $x$ to the nearest integer. The numbers $N_B$, $N_C$, and $N_D$ of speech recognition server devices that hold settings B, C, and D after the setting update are obtained by replacing the subscript A in Expression (1) with B, C, and D. The number $N_E$ of speech recognition server devices that hold setting E after the setting update can be calculated, for example, by Expression (2).
$$N_E = N_{all} - (N_A + N_B + N_C + N_D) \quad (2)$$
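Expressions (1) and (2) can be checked with a short sketch using the received data sizes given in this application example (5, 2, 1.5, 1, and 0.5 GB for settings A to E, with ten devices in total). The function name `allocate_servers` and its parameters are hypothetical, and the choice of which setting receives the remainder is passed in explicitly as an assumption.

```python
import math


def allocate_servers(n_all, data_by_setting, remainder_setting):
    """Number of devices per setting after the update.

    Expression (1) is applied to every setting except remainder_setting,
    which receives whatever is left according to Expression (2).
    """
    d_all = sum(data_by_setting.values())
    counts = {}
    for setting, data in data_by_setting.items():
        if setting != remainder_setting:
            # floor(x + 0.5) rounds x to the nearest integer, as in Expression (1).
            counts[setting] = math.floor(n_all * data / d_all + 0.5)
    counts[remainder_setting] = n_all - sum(counts.values())
    return counts


# Received data sizes in GB from the application example.
received_gb = {"A": 5.0, "B": 2.0, "C": 1.5, "D": 1.0, "E": 0.5}
counts = allocate_servers(10, received_gb, remainder_setting="E")
```

Because every setting other than the last is rounded independently, the remainder given by Expression (2) is not guaranteed to be non-negative for arbitrary inputs; a practical implementation would presumably need to handle that case.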

図６の例ではN_A=5、N_B=2、N_C=2、N_D=1、N_E=0となり、設定更新後の音声認識サーバ装置の利用率は図７のように表される。管理部３０の設定情報更新部３０Ｂは、利用率が低い音声認識サーバ装置２１−８、２１−９、２１−１０の設定情報を、利用率が高い音声認識サーバ装置２１−１、２１−２の設定（設定Ａ）と同じになるように更新する（Ｓ３０Ｂ）。管理部３０の設定情報送信部３０Ｃは、利用率が低い音声認識サーバ装置２１−８、２１−９、２１−１０に対して前述の設定情報を送信する（Ｓ３０Ｃ）。 In the example of FIG. 6, N_A = 5, N_B = 2, N_C = 2, N_D = 1, and N_E = 0, and the utilization rates of the speech recognition server devices after the setting update are as shown in FIG. 7. The setting information update unit 30B of the management unit 30 updates the setting information of the speech recognition server devices 21-8, 21-9, and 21-10, which have low utilization rates, so that it becomes the same as the setting (setting A) of the speech recognition server devices 21-1 and 21-2, which have high utilization rates (S30B). The setting information transmission unit 30C of the management unit 30 transmits this setting information to the speech recognition server devices 21-8, 21-9, and 21-10 (S30C).

音声認識サーバ装置２１−８、２１−９、２１−１０の設定情報受信部２１Ｅは、管理部３０から設定情報を受信する（Ｓ２１Ｅ）。前述したように、音声認識サーバ装置２１−８、２１−９、２１−１０の設定記憶部２１Ｇに、音声認識に関する設定が予め複数記憶されている場合、設定更新部２１Ｆは、受信した設定情報に基づいて記憶された複数の設定のうちの一つ（この例では設定Ａ）をアクティブな設定として選択することで設定を更新する（Ｓ２１Ｆ）。 The setting information receiving unit 21E of the voice recognition server devices 21-8, 21-9, and 21-10 receives the setting information from the management unit 30 (S21E). As described above, when a plurality of settings relating to voice recognition are stored in advance in the setting storage unit 21G of the voice recognition server devices 21-8, 21-9, and 21-10, the setting update unit 21F receives the received setting information. The setting is updated by selecting one of the plurality of settings stored based on the setting (setting A in this example) as an active setting (S21F).

管理部３０の送信先情報送信部３０Ｅは、上述の設定更新に伴って更新された送信先情報をクライアント装置１０に送信する（Ｓ３０Ｅ）。上述の適用例のように、同一の設定の音声認識サーバ装置が２台以上ある場合、クライアント装置１０の端末IDによって送信先となる同一の設定を持つ音声認識サーバ装置のうちいずれかのIPアドレスが送信先になるよう変更し、同じクライアント装置１０からは同一の音声認識サーバ装置にデータを送信させてもよい。また上述の例における、設定Ｅのように、更新後に当該設定を保持する音声認識サーバの装置が０台になる場合があるため、あらかじめ代替として似た設定情報を指定しておく。例えば、設定Ｅの音声認識サーバ装置の代替として設定Ｄの音声認識サーバ装置をあらかじめ指定しておき、設定Ｄの音声認識サーバ装置のIPアドレスを対応付けておく。 The transmission destination information transmission unit 30E of the management unit 30 transmits the transmission destination information, updated along with the setting update described above, to the client device 10 (S30E). When two or more speech recognition server devices have the same setting, as in the application example above, the transmission destination may be changed according to the terminal ID of the client device 10 so that it becomes the IP address of one of the speech recognition server devices having that setting, and the same client device 10 may thus be made to transmit data to the same speech recognition server device. In addition, because the number of speech recognition server devices holding a given setting may become zero after the update, as with setting E in the example above, similar setting information is designated in advance as a substitute. For example, the speech recognition server device of setting D is designated in advance as the substitute for the speech recognition server device of setting E, and the IP address of the speech recognition server device of setting D is associated with it.
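The destination choice described above (a fixed server per terminal ID, with a pre-designated substitute when no device holds the requested setting) might be sketched as follows. The description does not specify how the terminal ID is mapped to a server, so the modulo rule, the integer terminal ID, the function name `pick_server`, and the example IP addresses are all assumptions.

```python
def pick_server(setting, terminal_id, servers_by_setting, substitute):
    """Return the IP address a client should send to for the given setting.

    servers_by_setting: setting name -> list of IP addresses (possibly empty).
    substitute: setting name -> substitute setting designated in advance.
    """
    visited = {setting}
    candidates = servers_by_setting.get(setting, [])
    while not candidates:
        setting = substitute.get(setting)
        if setting is None or setting in visited:
            raise LookupError("no speech recognition server available")
        visited.add(setting)
        candidates = servers_by_setting.get(setting, [])
    # The same terminal ID always maps to the same server (assumed modulo rule).
    return candidates[terminal_id % len(candidates)]


# Hypothetical addresses for the state after the update (setting E has no devices).
servers = {
    "A": ["192.0.2.1", "192.0.2.2", "192.0.2.8", "192.0.2.9", "192.0.2.10"],
    "B": ["192.0.2.3", "192.0.2.4"],
    "C": ["192.0.2.5", "192.0.2.6"],
    "D": ["192.0.2.7"],
    "E": [],
}
substitute = {"E": "D"}

destination_a = pick_server("A", 7, servers, substitute)
destination_e = pick_server("E", 7, servers, substitute)
```

In this sketch a request for setting E is redirected to the single device holding setting D, mirroring the substitute designation in the text.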

本実施例の音声認識システム１によれば、管理部３０が音声認識サーバ装置の利用率を監視し、当該利用率に応じて音声認識サーバ装置の設定を変更することによって、特定の音声認識サーバ装置に負荷が集中しないように運用することができ、音声認識システム１全体の利用性能（パフォーマンス）を高めることができる。 According to the voice recognition system 1 of the present embodiment, the management unit 30 monitors the usage rate of the voice recognition server device, and changes the setting of the voice recognition server device according to the usage rate, thereby enabling a specific voice recognition server. The operation can be performed so that the load is not concentrated on the apparatus, and the utilization performance (performance) of the entire speech recognition system 1 can be improved.

以下、クライアント装置に設定された収音条件抽出のためのしきい値を変更することによって、実施例１と同様の効果を達成した実施例２の音声認識システムについて説明する。まず図８、図９、図１０を参照して本実施例の音声認識システムの構成について説明する。図８は、本実施例の音声認識システム４の構成を示すブロック図である。図９は、本実施例の音声認識サーバ装置５１−ｎの構成を示すブロック図である。図１０は、本実施例の管理部６０の構成を示すブロック図である。 Hereinafter, the speech recognition system according to the second embodiment that achieves the same effect as that of the first embodiment by changing the threshold value for extracting the sound pickup conditions set in the client device will be described. First, the configuration of the speech recognition system according to the present embodiment will be described with reference to FIGS. FIG. 8 is a block diagram showing the configuration of the voice recognition system 4 of the present embodiment. FIG. 9 is a block diagram illustrating a configuration of the voice recognition server device 51-n according to the present embodiment. FIG. 10 is a block diagram illustrating a configuration of the management unit 60 of the present embodiment.

図８に示すように、本実施例の音声認識システム４は、クライアント装置４０と、複数の音声認識サーバ装置５１−１、…、５１−ｎ、…、５１−Ｎと、管理部６０を含む。図８においてクライアント装置４０は１台のみ図示したが、実施例１と同様クライアント装置４０は複数台存在するものとする。音声認識サーバ装置５１−１、…、５１−ｎ、…、５１−Ｎをまとめて呼称する際には、音声認識サーバ装置群５０と呼ぶ。クライアント装置４０と音声認識サーバ装置群５０は、実施例１と同様、ネットワークを介し、無線または有線で通信可能に接続されている。管理部６０は、単独のハードウェア（装置）として構成されてもよく、これを管理装置６０と呼んでもよい。この場合、クライアント装置４０と音声認識サーバ装置群５０と管理部６０（管理装置６０）はネットワークを介して、無線または有線で通信可能に接続される。実施例１同様、管理部６０は、クライアント装置４０内の構成要件であってもよいし、音声認識サーバ装置群５０内の何れかの音声認識サーバ装置内の構成要件であってもよい。 As shown in FIG. 8, the speech recognition system 4 of this embodiment includes a client device 40, a plurality of speech recognition server devices 51-1,..., 51-n, 51 -N, and a management unit 60. . Although only one client device 40 is illustrated in FIG. 8, it is assumed that there are a plurality of client devices 40 as in the first embodiment. When the voice recognition server devices 51-1,..., 51-n,..., 51-N are collectively called, they are referred to as a voice recognition server device group 50. As in the first embodiment, the client device 40 and the voice recognition server device group 50 are connected to be communicable wirelessly or wired via a network. The management unit 60 may be configured as a single hardware (device), and may be referred to as a management device 60. In this case, the client device 40, the voice recognition server device group 50, and the management unit 60 (management device 60) are connected to be communicable wirelessly or wired via a network. As in the first embodiment, the management unit 60 may be a configuration requirement in the client device 40 or a configuration requirement in any of the voice recognition server devices in the voice recognition server device group 50.

As shown in FIG. 8, the client device 40 includes a sound collection condition extraction unit 11, a threshold storage unit 111, a selection unit 12, a transmission destination storage unit 121, a signal processing unit 13, a transmission unit 14, a reception unit 15, a presentation unit 16, and a threshold changing unit 47. It is the same as the client device 10 of the first embodiment except that the transmission destination changing unit 17 is replaced by the threshold changing unit 47. As shown in FIG. 9, the speech recognition server device 51-n (taken as representative) includes an acoustic signal reception unit 21A, a speech recognition unit 21B, a recognition result transmission unit 21C, a utilization rate transmission unit 21D, and a setting storage unit 21G. It is the same as the speech recognition server device 21-n of the first embodiment except that the setting information reception unit 21E and the setting update unit 21F are absent. As shown in FIG. 10, the management unit 60 (management device 60) includes a utilization rate reception unit 30A, a threshold update unit 60B, a threshold transmission unit 60C, and a threshold storage unit 60D. The utilization rate reception unit 30A of this embodiment is the same as that of the first embodiment. The speech recognition operation of the speech recognition system 4 of this embodiment (S11 to S14, S21A to S21C, S15A, S16) is exactly the same as in the first embodiment, so its description is omitted.

The information update operation of the speech recognition system 4 of this embodiment is described below with reference to FIG. 11. FIG. 11 is a sequence diagram showing the information update operation of the speech recognition system 4 of this embodiment. Steps S21D and S30A are executed as in the first embodiment. Next, the threshold update unit 60B of the management unit 60 updates the threshold of the sound collection condition based on the utilization rate described above (S60B). This threshold is the one used for extracting the sound collection condition; the aforementioned θ₁ and θ₂ are examples. The threshold update unit 60B updates the threshold according to the same policy as in the first embodiment. That is, it eases load concentration by adjusting the threshold so that the utilization rate of a speech recognition server device with a low utilization rate rises. Alternatively, it temporarily reduces traffic to a speech recognition server device with a high utilization rate by adjusting the threshold so that the utilization rate of that device falls. The pre-update threshold is assumed to be stored in the threshold storage unit 60D of the management unit 60, and the threshold update unit 60B stores the updated threshold in the threshold storage unit 60D, or overwrites the stored value with it.
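The threshold update of step S60B can be illustrated with a minimal sketch. The text does not give a concrete update rule, so everything below is an assumption made for illustration: a single scalar feature is split by ordered thresholds (such as θ₁ < θ₂) into consecutive ranges, range k is served by server k, and each threshold is shifted in proportion to the utilization gap between the two servers on either side of it. The function name `update_thresholds` and the `step` parameter are hypothetical.

```python
from typing import List


def update_thresholds(
    thresholds: List[float],
    utilization: List[float],
    step: float = 1.0,
) -> List[float]:
    """Shift each threshold so the less-used neighbouring server gains range.

    Assumption (not from the source): thresholds[k] is the boundary between
    the feature range of server k (below) and server k + 1 (above).
    """
    if len(utilization) != len(thresholds) + 1:
        raise ValueError("need exactly one more server than thresholds")

    updated = []
    for k, theta in enumerate(thresholds):
        # Positive gap: the server below the boundary is used more heavily,
        # so lower the boundary to shrink its range and widen the next one.
        gap = utilization[k] - utilization[k + 1]
        updated.append(theta - step * gap)

    # Keep the thresholds in non-decreasing order after the shift.
    for k in range(1, len(updated)):
        updated[k] = max(updated[k], updated[k - 1])
    return updated


if __name__ == "__main__":
    # Server 0 handles half of all requests; servers 1 and 2 a quarter each.
    print(update_thresholds([10.0, 20.0], [0.5, 0.25, 0.25], step=4.0))
```

In this sketch only the first threshold moves in the example call, because only the first pair of neighbouring servers differs in utilization. A real system would likely also bound the step size so that thresholds do not oscillate between updates.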

Next, the threshold transmission unit 60C of the management unit 60 transmits the threshold updated in step S60B to the client device 40 (S60C). The reception unit 15 of the client device 40 receives the threshold from the management unit 60 (S15C). The threshold changing unit 47 of the client device 40 changes the threshold stored in the threshold storage unit 111 based on the received threshold (S47). The sound collection condition extraction unit 11 then extracts the sound collection condition of the input acoustic signal using the changed threshold (S11).
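The client-side steps S47 and S11 can be sketched as follows. This is an illustrative sketch only: the source does not define the feature that the thresholds are compared against, so a single scalar feature and integer condition indices are assumed here, and the names `ThresholdStore`, `change_thresholds`, and `extract_condition` are hypothetical stand-ins for the threshold storage unit 111, the threshold changing unit 47, and the sound collection condition extraction unit 11.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ThresholdStore:
    """Stand-in for the threshold storage unit 111 (initial values are made up)."""
    thresholds: List[float] = field(default_factory=lambda: [10.0, 20.0])


def change_thresholds(store: ThresholdStore, received: Sequence[float]) -> None:
    """S47: replace the stored thresholds with those received from the management unit."""
    if list(received) != sorted(received):
        raise ValueError("thresholds must be in non-decreasing order")
    store.thresholds = list(received)


def extract_condition(store: ThresholdStore, feature: float) -> int:
    """S11: map a scalar feature to a condition index using the stored thresholds.

    Assumption: index 0 means feature < θ₁, index 1 means θ₁ <= feature < θ₂,
    and so on. bisect_right counts the thresholds that are <= feature.
    """
    return bisect_right(store.thresholds, feature)


if __name__ == "__main__":
    store = ThresholdStore()
    before = extract_condition(store, 9.5)
    change_thresholds(store, [9.0, 20.0])  # thresholds received in S15C
    after = extract_condition(store, 9.5)
    print(before, after)
```

With the assumed values, the same input moves from condition 0 to condition 1 once the first threshold is lowered, which is how a threshold change alone can redirect traffic to a different server without touching the transmission destination information.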

According to the speech recognition system 4 of this embodiment, changing the threshold for sound collection condition extraction set in the client device 40 makes it possible to operate the system so that load does not concentrate on a specific speech recognition server device, which improves the performance of the speech recognition system 4 as a whole.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above, the data needed in the processing of this program, and the like. (Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device.) Data obtained by the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for processing each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. In addition, the processes described in the above embodiments need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to this computer, the computer may execute processing according to the received program in sequence. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

A speech recognition system including a client device, a plurality of speech recognition server devices, and a management unit, wherein
the client device includes
a transmission unit that transmits an input acoustic signal, or a signal derived from the acoustic signal, to a speech recognition server device selected based on a sound collection condition,
each of the speech recognition server devices includes
a setting storage unit that stores in advance settings relating to speech recognition, and
a utilization rate transmission unit that transmits, to the management unit, information on a utilization rate, which is the rate at which the client device has used that speech recognition server device as a transmission destination, and
the management unit
performs at least one of a first operation and a second operation, where the first operation is an operation of transmitting, to the client device, a threshold used for extraction of the sound collection condition, the threshold having been updated so as to reduce bias in the utilization rate among the speech recognition server devices, and the second operation is an operation of transmitting, to a speech recognition server device whose utilization rate is biased compared with the other speech recognition server devices, information on the setting updated so as to reduce the bias in the utilization rate, and transmitting, to the client device, transmission destination information, which is information on the relationship between the sound collection condition and the speech recognition server device serving as the transmission destination, updated so as to correspond to the updated information on the setting.

The speech recognition system according to claim 1, wherein
the setting storage unit stores in advance a plurality of settings relating to speech recognition,
each of the speech recognition server devices includes
a setting update unit that updates the setting by selecting one of the plurality of stored settings based on setting information, which is information on the setting updated based on the utilization rate, and
the management unit includes
a setting information transmission unit that transmits the setting information to one or more speech recognition server devices whose utilization rate is high, or whose utilization rate is low, compared with the utilization rates of the other speech recognition server devices.

The speech recognition system according to claim 1, wherein
the setting storage unit stores in advance a plurality of settings relating to speech recognition,
each of the speech recognition server devices includes
a setting update unit that updates the setting by selecting one of the plurality of stored settings based on setting information, which is information on the setting updated based on the utilization rate, and
the management unit includes
a setting information transmission unit that transmits the setting information to one or more speech recognition server devices whose utilization rate is the highest, or whose utilization rate is the lowest, compared with the utilization rates of the other speech recognition server devices.

The speech recognition system according to claim 2 or 3, wherein
the management unit includes
a transmission destination information update unit that, when the setting information transmission unit transmits the setting information, updates transmission destination information, which is information on the relationship between a sound collection condition and the speech recognition server device serving as the transmission destination, and
a transmission destination information transmission unit that transmits the updated transmission destination information to the client device, and
the client device includes
a transmission destination storage unit that stores the transmission destination information,
a transmission destination changing unit that changes the stored transmission destination information based on the transmission destination information received from the management unit, and
a selection unit that selects the speech recognition server device serving as the transmission destination of an input acoustic signal based on the sound collection condition of that acoustic signal and the changed transmission destination information.

The speech recognition system according to claim 1, wherein
the client device includes
a sound collection condition extraction unit that extracts the sound collection condition of the input acoustic signal, and
a threshold changing unit that changes the threshold used for extraction of the sound collection condition based on the threshold of the sound collection condition updated based on the utilization rate.

A speech recognition method executed by a client device, a plurality of speech recognition server devices, and a management unit, wherein
the client device
executes a step of transmitting an input acoustic signal, or a signal derived from the acoustic signal, to a speech recognition server device selected based on a sound collection condition,
each of the speech recognition server devices
stores in advance settings relating to speech recognition, and
executes a step of transmitting, to the management unit, information on a utilization rate, which is the rate at which the client device has used that speech recognition server device as a transmission destination, and
the management unit
executes at least one of a first step and a second step, where the first step is an operation of transmitting, to the client device, a threshold used for extraction of the sound collection condition, the threshold having been updated so as to reduce bias in the utilization rate among the speech recognition server devices, and the second step is an operation of transmitting, to a speech recognition server device whose utilization rate is biased compared with the other speech recognition server devices, information on the setting updated so as to reduce the bias in the utilization rate, and transmitting, to the client device, transmission destination information, which is information on the relationship between the sound collection condition and the speech recognition server device serving as the transmission destination, updated so as to correspond to the updated information on the setting.

A program for causing a computer to function as a speech recognition server device included in the speech recognition system according to any one of claims 1 to 3.

A program for causing a computer to function as a client device included in the speech recognition system according to claim 4 or 5.