JP6500375B2

JP6500375B2 - Voice processing apparatus, voice processing method, and program

Info

Publication number: JP6500375B2
Application number: JP2014187535A
Authority: JP
Inventors: 山本 仁 (Hitoshi Yamamoto); 越仲 孝文 (Takafumi Koshinaka)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-09-16
Filing date: 2014-09-16
Publication date: 2019-04-17
Anticipated expiration: 2034-09-16
Also published as: JP2016061824A

Description

The present invention relates to a speech processing apparatus, a speech processing method, and a program that recognize attribute information, such as a speaker's individuality or the language spoken, from a speech signal.

Speech processing apparatuses are known that extract, from a speech signal, acoustic features representing the individuality needed to identify the speaker who produced the speech, or acoustic features representing the language the speech conveys. Also known are speaker recognition apparatuses, which use these acoustic features to estimate the speaker from a speech signal, and language recognition apparatuses, which estimate the language.

A speaker recognition apparatus that uses this kind of speech processing apparatus evaluates the similarity between the acoustic features that the speech processing apparatus extracted from the speech signal and speaker models, each of which expresses the speaker dependence of how acoustic features tend to appear, and selects a speaker based on that evaluation. For example, the speaker recognition apparatus selects the speaker whose speaker model was evaluated as most similar. If the acoustic features of the input speech signal are distorted, because some types of sound are missing or because noise is mixed in, they differ from the acoustic features held by the speaker model, and the accuracy of speaker recognition may decrease. The following prior art documents describe techniques for suppressing this decrease in accuracy.

The prior art documents describe techniques that adjust the decision criterion for speaker recognition based on characteristics of the speech signal input to the speaker recognition apparatus.

Non-Patent Document 1 describes a technique that suppresses speaker recognition errors by measuring the duration of the speech signal as a characteristic representing the quantity of the speech signal and, according to that value, attaching to the speaker recognition result a reliability that indicates how trustworthy the result is. Patent Document 1 describes a technique that suppresses the decrease in speaker recognition accuracy by calculating, as characteristics representing the diversity of the speech signal, the ratio of voiced to unvoiced sounds and the proportion of repeated utterance segments in the speech signal, and using these values as the reliability of the speaker recognition result to shift the decision threshold for speaker recognition.

Patent Document 2 describes a technique that calculates the distance between voice qualities using a plurality of voice quality features held in a database (DB) and weights calculated from an input, calculates the coordinates of each voice quality in a voice quality space using those distances, and displays speaker attribute information for each voice quality at the calculated coordinates on a display unit. Patent Document 3 describes an apparatus for speaker authentication, which determines from the feature values of a speaker's voice whether the speaker is legitimate, that is, whether the speaker is an authorized user registered in advance. For each unit interval obtained by dividing the registration interval needed to determine a mixture model, the apparatus sequentially calculates a feature vector and updates the mixture model.

Patent documents: JP 2007-156422 A; JP 2008-233759 A; International Publication No. 2008/149547

Non-patent document: W. M. Campbell, D. A. Reynolds, J. P. Campbell, and K. J. Brady, "Estimating and Evaluating Confidence for Forensic Speaker Recognition," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

However, the techniques described in the prior art documents have the problem that, because the reliability mentioned above is not obtained appropriately, the decrease in speaker recognition accuracy cannot always be sufficiently suppressed.

The technique described in Non-Patent Document 1 assumes that the longer the speech signal input to the speaker recognition apparatus, the fewer the omissions and the smaller the bias in the types of sound it contains, and therefore the more reliable the speaker recognition result. However, the technique does not explicitly evaluate the types of sound contained in the speech signal, so it is not appropriate when, for example, the same words are repeated throughout a long speech signal.

The technique described in Patent Document 1 sets the decision threshold for speaker recognition to different values according to how diverse the sounds in the input speech signal are. However, the technique calculates only the characteristics of the speech signal input to the speaker recognition apparatus and adjusts the decision criterion according to a single, predetermined standard. Consequently, when the diversity characteristics of the input speech signal differ from the diversity characteristics of the speech signals of each speaker used to train each speaker model, the decision criterion may not be adjusted appropriately. This problem can arise, for example, when not enough training speech can be obtained for a speaker model.

The technique described in Patent Document 2 calculates the distance between voice qualities using a plurality of voice quality features held in a database (DB) and weights calculated from an input. It then uses those distances to calculate the coordinates of each voice quality in a voice quality space and displays speaker attribute information for each voice quality at the calculated coordinates on a display unit, but it does not calculate a reliability for the displayed coordinates. The technique described in Patent Document 3 sequentially calculates a feature vector and updates the mixture model for each unit interval obtained by dividing the registration interval needed to determine the mixture model, but it does not calculate a reliability for the calculated feature vectors or the updated mixture model.

An object of the present invention is to solve the above problems and provide a speech processing apparatus that suppresses the decrease in speaker recognition accuracy by appropriately obtaining the reliability of the speaker recognition result.

A speech processing apparatus according to one aspect of the present invention includes: acoustic diversity calculation means for calculating, based on a speech signal representing speech, an acoustic diversity representing the degree of variation in the types of sound contained in the speech signal; and acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the speech signal and the acoustic diversity of another speech signal serving as a reference, an acoustic reliability representing the degree to which the speech signal is suitable as a target for recognizing specific attribute information from a speech signal.

A speech processing method according to one aspect of the present invention calculates, based on a speech signal representing speech, an acoustic diversity representing the degree of variation in the types of sound contained in the speech signal, and calculates, based on the degree of difference between the acoustic diversity of the speech signal and the acoustic diversity of another speech signal serving as a reference, an acoustic reliability representing the degree to which the speech signal is suitable as a target for recognizing specific attribute information from a speech signal.

A program according to one aspect of the present invention causes a computer to execute: an acoustic diversity calculation process of calculating, based on a speech signal representing speech, an acoustic diversity representing the degree of variation in the types of sound contained in the speech signal; and an acoustic reliability calculation process of calculating, based on the degree of difference between the acoustic diversity of the speech signal and the acoustic diversity of another speech signal serving as a reference, an acoustic reliability representing the degree to which the speech signal is suitable as a target for recognizing specific attribute information from a speech signal.

By appropriately obtaining the reliability of the speaker recognition result, the present invention can suppress the decrease in speaker recognition accuracy.

FIG. 1 is a block diagram showing the configuration of the speech processing apparatus 100 according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the speech processing apparatus 100 according to the first embodiment of the present invention.
FIG. 3 is a diagram illustrating the relationship between the duration of a speech signal and the acoustic reliability calculated by the acoustic reliability calculation unit 14.
FIG. 4 is a block diagram showing the configuration of the speaker recognition apparatus 200 according to the second embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the speaker recognition apparatus 200 according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the speaker model learning apparatus 300 according to the third embodiment of the present invention.
FIG. 7 is a flowchart showing the operation of the speaker model learning apparatus 300 according to the third embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the speech processing apparatus 400 according to the fourth embodiment of the present invention.

Embodiments of the speech processing apparatus and related apparatuses, including a speaker recognition apparatus, are described below with reference to the drawings. Components given the same reference numeral in the embodiments operate in the same way, so repeated description may be omitted.

First Embodiment
FIG. 1 is a block diagram of the speech processing apparatus 100 according to the first embodiment. The speech processing apparatus 100 includes a speech model storage unit 11, an acoustic diversity calculation unit 12, an acoustic diversity storage unit 13, and an acoustic reliability calculation unit 14.

The acoustic diversity calculation unit 12 accepts a speech signal. Here, accepting means, for example, receiving the signal from an external device or being handed a processing result by another processing apparatus or another program.

The speech model storage unit 11 stores one or more speech models. A speech model holds the information needed to calculate, for a speech signal, a numerical value indicating how well the signal fits the model. For example, when the speech model is a Gaussian mixture model (GMM), the speech processing apparatus 100 can calculate the occurrence probability of a speech signal from the means, variances, and mixture coefficients of the GMM.

A speech model stored in the speech model storage unit 11 is trained on training speech signals according to a general optimization criterion such as the maximum likelihood criterion. Unlike the training of the acoustic models commonly used in speech recognition, training a speech model does not require linguistic information (such as words) describing the content of the training speech signals. The speech model storage unit 11 may store two or more speech models, each trained on a separate subset of the training speech signals, divided for example by speaker gender (male and female) or by recording environment (indoor and outdoor).

The acoustic diversity calculation unit 12 accepts a speech signal, refers to one or more speech models stored in the speech model storage unit 11, and calculates an acoustic diversity representing the types of sound contained in the speech signal. It then outputs the result together with the accepted speech signal. A type of sound is represented, for example, as a sound class, that is, a group of sounds obtained by automatically grouping (clustering) speech signals according to their similarity. Here, output means, for example, transmission to an external device or handing the processing result to another processing apparatus or another program. Output also covers, for example, showing the result on a display, projecting it with a projector, and printing it on a printer.

An example of how the acoustic diversity calculation unit 12 calculates the acoustic diversity V(x) of a speech signal x is as follows. When the speech model is a GMM, each component distribution of the GMM represents a different sound. The acoustic diversity calculation unit 12 therefore obtains, for the speech signal x, the posterior probability of each component distribution of the GMM serving as the speech model. The posterior probability P_i(x) of the i-th component distribution of the GMM can be calculated by the following equation [Equation 1].

$$P_i(x) = \frac{w_i \, N(x \mid \theta_i)}{\sum_j w_j \, N(x \mid \theta_j)}$$

Here, N denotes the probability density function of a Gaussian distribution, θ_i denotes the parameters (mean and variance) of the i-th component distribution of the GMM, and w_i denotes the mixture coefficient of the i-th component distribution. P_i(x) expresses the degree to which the speech signal x belongs to the i-th component distribution of the GMM. The acoustic diversity V(x) is constructed as a vector whose elements are P_i(x). For example, when the GMM serving as the speech model has four components, the acoustic diversity is defined as V(x) = (P_1(x), P_2(x), P_3(x), P_4(x)).
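The posterior-based construction of V(x) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the component distributions are taken to have diagonal covariance, the speech signal is represented as a list of feature vectors (frames), and the per-frame posteriors are averaged to obtain one vector for the whole signal. The averaging step, the function names, and the toy parameter values are all assumptions made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density N(x | theta) of a diagonal-covariance Gaussian at feature vector x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def component_posteriors(x, weights, means, variances):
    """P_i(x) = w_i N(x|theta_i) / sum_j w_j N(x|theta_j) for one feature vector."""
    joint = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def acoustic_diversity(frames, weights, means, variances):
    """V(x) as the frame-averaged posterior vector (averaging is an assumption)."""
    acc = [0.0] * len(weights)
    for frame in frames:
        post = component_posteriors(frame, weights, means, variances)
        for i, p in enumerate(post):
            acc[i] += p
    return [a / len(frames) for a in acc]

# Toy 4-component model over 1-dimensional features (illustrative values only).
weights = [0.25, 0.25, 0.25, 0.25]
means = [[-3.0], [-1.0], [1.0], [3.0]]
variances = [[1.0], [1.0], [1.0], [1.0]]
frames = [[-3.1], [-2.9], [1.2], [3.0]]

V = acoustic_diversity(frames, weights, means, variances)
print(V)  # four non-negative values that sum to 1
```

In practice the densities would normally be evaluated in the log domain to avoid underflow with high-dimensional features; the direct form is kept here so that the code mirrors the posterior formula.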

Another example of how the acoustic diversity calculation unit 12 calculates the acoustic diversity V(x) of a speech signal x is as follows. When the speech model is a GMM, the acoustic diversity calculation unit 12 divides the speech signal x into a time series of short-time speech signals {x_1, x_2, ..., x_T} and, for each short-time speech signal x_t, finds the index of the component distribution that maximizes its occurrence probability, i = argmax_i N(x_t | θ_i). Let C_i(x) be the number of times the i-th component distribution of the GMM is selected; this count can be interpreted as expressing the degree to which the speech signal x belongs to the i-th component distribution. The acoustic diversity V(x) is constructed as a vector whose elements are C_i(x) or C_i(x) / Σ_j C_j(x). For example, when the GMM serving as the speech model has four components, the acoustic diversity is defined as V(x) = (C_1(x), C_2(x), C_3(x), C_4(x)).
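The counting variant can be sketched as below. It assumes diagonal-covariance component distributions and treats each feature vector as one short-time speech signal x_t; the function names and the toy values are hypothetical and chosen only for illustration.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
               for xd, m, v in zip(x, mean, var))

def count_diversity(frames, means, variances, normalize=True):
    """Count how often each component is the argmax of N(x_t | theta_i).

    Returns C_i(x), or C_i(x) / sum_j C_j(x) when normalize is True.
    """
    counts = [0] * len(means)
    for frame in frames:
        scores = [log_gaussian(frame, m, v) for m, v in zip(means, variances)]
        best = max(range(len(scores)), key=lambda i: scores[i])
        counts[best] += 1
    if not normalize:
        return counts
    total = sum(counts)
    return [c / total for c in counts]

# Toy 4-component model over 1-dimensional features (illustrative values only).
means = [[-3.0], [-1.0], [1.0], [3.0]]
variances = [[1.0], [1.0], [1.0], [1.0]]
frames = [[-3.2], [-2.8], [0.9], [3.1], [2.7]]

raw_counts = count_diversity(frames, means, variances, normalize=False)
V = count_diversity(frames, means, variances)
print(raw_counts, V)
```

With equal variances the most likely component is simply the one with the nearest mean, so the second component is never selected for these frames and its element of V(x) is zero.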

The acoustic diversity calculation unit 12 may calculate the acoustic diversity of speech signals obtained by segmenting the accepted speech signal. For example, it may divide the accepted speech signal into segments of a fixed duration and calculate the acoustic diversity of each resulting segment. Alternatively, it may work in step with the arrival of the speech signal and, when the duration of the signal exceeds a predetermined value, calculate the acoustic diversity of the speech signal accepted up to that point.
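The two segmentation strategies can be illustrated with a short sketch. To keep it self-contained, each frame is assumed to have already been assigned a sound-class index (for instance by the argmax rule over component distributions), and the diversity of a segment is taken to be the normalized class histogram. The helper names and the choice to re-evaluate every `step` frames are assumptions for the example.

```python
def fixed_segments(labels, segment_len):
    """Split a label sequence into consecutive segments of segment_len frames.

    The final segment may be shorter than segment_len.
    """
    return [labels[i:i + segment_len]
            for i in range(0, len(labels), segment_len)]

def growing_prefixes(labels, step):
    """Yield everything received so far each time another `step` frames arrive."""
    for end in range(step, len(labels) + 1, step):
        yield labels[:end]

def label_histogram(labels, num_classes):
    """Normalized count of each sound class, used here as the diversity vector."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical per-frame sound-class indices for an 8-frame signal.
labels = [0, 0, 0, 0, 1, 2, 3, 1]

per_segment = [label_histogram(seg, 4) for seg in fixed_segments(labels, 4)]
per_prefix = [label_histogram(pre, 4) for pre in growing_prefixes(labels, 4)]
print(per_segment)
print(per_prefix)
```

The per-segment vectors describe each stretch of the signal on its own, whereas the prefix vectors describe the whole signal received so far and therefore change more slowly as frames accumulate.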

When the acoustic diversity calculation unit 12 refers to two or more speech models stored in the speech model storage unit 11, it may use as the acoustic diversity the weighted sum of the two or more acoustic diversities calculated from the respective speech models.
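A weighted sum of per-model diversities might look like the following sketch. It assumes that every speech model yields a diversity vector of the same length and that the model weights are supplied by the caller; how the weights are chosen is not specified in the text, so the equal weights below are purely illustrative.

```python
def combine_diversities(diversities, model_weights):
    """Element-wise weighted sum of diversity vectors from several speech models."""
    if len(diversities) != len(model_weights):
        raise ValueError("one weight is required per diversity vector")
    dim = len(diversities[0])
    if any(len(v) != dim for v in diversities):
        raise ValueError("all diversity vectors must have the same length")
    return [sum(w * v[i] for w, v in zip(model_weights, diversities))
            for i in range(dim)]

# Hypothetical diversities from a male-speech model and a female-speech model.
v_male = [0.7, 0.1, 0.1, 0.1]
v_female = [0.1, 0.3, 0.3, 0.3]

V = combine_diversities([v_male, v_female], [0.5, 0.5])
print(V)
```

If the inputs are probability vectors and the weights sum to 1, the combined vector also sums to 1, which keeps it usable with distribution-based measures such as the Kullback-Leibler divergence.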

The acoustic diversity calculation unit 12 may also work from a speech model that is constructed as a mixture model with probability distributions as its component distributions and that represents the occurrence probability distribution of speech signals. In that case it calculates the likelihood of the speech signal under each component distribution of the speech model and calculates the acoustic diversity of the speech signal from those likelihoods, or from the values obtained by normalizing each likelihood by the sum of the likelihoods over all component distributions.

As described above, the acoustic diversity calculation unit 12 calculates the acoustic diversity based on the speech models stored in the speech model storage unit 11. Because the speech model is represented by one or a small number of GMMs, it has fewer parameters than the models used in typical speech recognition, which have a separate GMM for each unit of linguistic information, such as a phoneme. For the same reason, calculating the acoustic diversity requires less computation than typical speech recognition. The acoustic diversity calculation unit 12 can therefore calculate the acoustic diversity faster than the technique described in Patent Document 1, which uses linguistic information.

The acoustic diversity storage unit 13 stores information representing one or more acoustic diversities. It stores, in advance, acoustic diversities calculated for one or more speech signals by the same method as the acoustic diversity calculation unit 12 described above. For example, when the speaker models of a speaker recognition apparatus are trained, the acoustic diversity storage unit 13 stores the acoustic diversity calculated for each speech signal used to train each speaker model, in association with the corresponding speaker. For example, when three speaker models are trained from the speech signals x_A, x_B, and x_C of three speakers (A, B, and C), the speech processing apparatus 100 calculates the acoustic diversity of each speech signal and stores information expressed as {A, V(x_A)}, {B, V(x_B)}, {C, V(x_C)}.

The acoustic reliability calculation unit 14 accepts the speech signal and its acoustic diversity output by the acoustic diversity calculation unit 12, refers to the acoustic diversities stored in the acoustic diversity storage unit 13, and calculates the acoustic reliability of the speech signal. It outputs the result together with the speech signal and the acoustic diversity output by the acoustic diversity calculation unit 12.

An example of how the acoustic reliability calculation unit 14 calculates the acoustic reliability R(x) of a speech signal x is as follows. Suppose that the acoustic diversity V(x) of the speech signal x is a multinomial distribution (vector) whose elements are the values P_i(x) calculated by [Equation 1] above. Suppose also that V_A = V(x_A), one of the acoustic diversities stored in the acoustic diversity storage unit 13, is a multinomial distribution of the same form. For example, V_A is the acoustic diversity of the speech signal used to train the speaker model of speaker A. The acoustic reliability is constructed from the degree of difference between the two multinomial distributions V(x) and V_A. The acoustic reliability calculation unit 14 defines the acoustic reliability using any measure that expresses the degree of difference between V(x) and V_A, for example the Kullback-Leibler divergence or the cosine similarity. The Kullback-Leibler divergence is a measure of the difference between two probability distributions. For example, when the acoustic reliability is calculated according to the following equation, it takes a positive value, and the smaller the difference between the two multinomial distributions, the larger the value.
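One way to turn the Kullback-Leibler divergence into a score with the stated properties (positive, and larger when the two distributions differ less) is sketched below. The mapping R = exp(-KL) is an assumption chosen for illustration and is not taken from the text; the small constant `eps`, which guards against division by zero, and the function names are likewise hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two probability vectors."""
    kl = sum(pi * math.log((pi + eps) / (qi + eps))
             for pi, qi in zip(p, q) if pi > 0.0)
    return max(kl, 0.0)  # clamp tiny negative values caused by eps smoothing

def reliability_from_kl(v_input, v_ref):
    """Illustrative reliability: exp(-KL), which lies in (0, 1]."""
    return math.exp(-kl_divergence(v_input, v_ref))

# Hypothetical diversity of speaker A's training speech and of two input signals.
v_ref = [0.25, 0.25, 0.25, 0.25]
v_close = [0.30, 0.20, 0.25, 0.25]
v_far = [0.97, 0.01, 0.01, 0.01]

r_same = reliability_from_kl(v_ref, v_ref)
r_close = reliability_from_kl(v_close, v_ref)
r_far = reliability_from_kl(v_far, v_ref)
print(r_same, r_close, r_far)
```

An input whose sound types match the reference exactly scores 1, and the score falls toward 0 as the input concentrates on sound types that the reference signal covers differently.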

Alternatively, the acoustic reliability calculation unit 14 may calculate the entropy of the acoustic diversity V(x) of the speech signal x and use it as the acoustic reliability. In this case the acoustic reliability R(x) = -Σ_i V_i(x) log V_i(x) takes a non-negative value and is largest when the types of sound contained in the speech signal x are uniformly distributed.
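The entropy form is straightforward to compute. The sketch below assumes V(x) is a probability vector and uses the natural logarithm; terms with zero probability are skipped, following the usual convention that 0 log 0 = 0. The function name and example vectors are illustrative.

```python
import math

def entropy_reliability(v):
    """R(x) = -sum_i V_i(x) log V_i(x), skipping zero-probability elements."""
    return -sum(p * math.log(p) for p in v if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # all four sound types equally present
skewed = [0.85, 0.05, 0.05, 0.05]    # one sound type dominates
single = [1.0, 0.0, 0.0, 0.0]        # only one sound type present

r_uniform = entropy_reliability(uniform)
r_skewed = entropy_reliability(skewed)
r_single = entropy_reliability(single)
print(r_uniform, r_skewed, r_single)
```

For K sound types the maximum is log K, reached by the uniform vector, and the value drops to zero when only one sound type is present.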

The acoustic reliability calculation unit 14 may calculate the acoustic reliability of speech signals obtained by segmenting the accepted speech signal. It may segment the speech signal in the same way as the acoustic diversity calculation unit 12 or in a different way.

When the acoustic reliability calculation unit 14 refers to two or more acoustic diversities stored in the acoustic diversity storage unit 13, it may output each of the two or more results calculated from the respective acoustic diversities.
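Producing one result per stored diversity can be sketched with a dictionary keyed by speaker, mirroring the stored pairs {A, V(x_A)}, {B, V(x_B)}, {C, V(x_C)}. Cosine similarity is used here as the measure and its value is used directly as the reliability; that direct use, the dictionary layout, and the numbers are assumptions for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two diversity vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)

def reliabilities_per_speaker(v_input, stored):
    """Return one reliability per stored (speaker, diversity) pair."""
    return {speaker: cosine_similarity(v_input, v_ref)
            for speaker, v_ref in stored.items()}

# Hypothetical contents of the acoustic diversity storage.
stored = {
    "A": [0.40, 0.30, 0.20, 0.10],
    "B": [0.10, 0.10, 0.40, 0.40],
    "C": [0.25, 0.25, 0.25, 0.25],
}
v_input = [0.40, 0.30, 0.20, 0.10]

scores = reliabilities_per_speaker(v_input, stored)
print(scores)
```

Because the input in this example has the same sound-type profile as speaker A's training speech, the score for A is the highest of the three; a comparison against speaker B's model would be flagged as less trustworthy.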

As described above, the acoustic diversity calculated by the acoustic diversity calculation unit 12 expresses the diversity of a speech signal, that is, the degree of variation in the types of sound it contains. For example, given two speech signals of the same length, one consisting of the same words repeated and the other not, the acoustic diversity shows that the two signals contain different types of sound. A speech signal that repeats the same words has little variation in the types of sound, so its acoustic diversity is low. A speech signal that does not repeat the same words has large variation in the types of sound, so its acoustic diversity is high. Based on the acoustic diversity, the acoustic reliability calculation unit 14 calculates a higher acoustic reliability for speech signals with less bias and fewer omissions in the types of sound, and so can assign a high acoustic reliability to speech signals suited to speaker recognition.

The technique described in Patent Document 1 uses, as characteristics representing the diversity of the sounds in a speech signal, the ratio of voiced to unvoiced sounds and the proportion of repeated utterance segments. Calculating these characteristics, however, requires estimating the symbols corresponding to the speech (voiced sounds, unvoiced sounds, words, and so on). The speech recognition technology used to estimate such symbols generally takes considerable computation time, which makes it unsuitable for high-speed speaker recognition.

In contrast, the acoustic reliability calculation unit 14 according to the first embodiment calculates a measure of the degree of difference between the acoustic diversity of the accepted speech signal, calculated by the acoustic diversity calculation unit 12, and the acoustic diversity of another speech signal, stored in the acoustic diversity storage unit 13, and obtains the acoustic reliability from that measure. For example, it calculates the acoustic reliability from the difference in acoustic diversity between the speech signal accepted by the speaker recognition apparatus and the speech signal used to train the speaker model. When the similarity between an accepted speech signal and a speaker model is evaluated, the similarity is more reliable if it is based on types of sound that are contained in the training speech signal of the speaker model than if it is based on types of sound that are not. The acoustic reliability calculation unit 14 can therefore calculate the acoustic reliability accurately even when a training speech signal of sufficient length cannot be obtained for the speaker model and the types of sound in the training speech signal are biased or incomplete.

The acoustic reliability calculation unit 14 may calculate the acoustic reliability for segmented speech signals obtained by segmenting the speech signal in any manner. The speech processing apparatus 100 according to the first embodiment can thereby calculate the acoustic reliability more accurately than the method of Non-Patent Document 1, which is based on the duration of the speech signal. The acoustic reliability is explained with reference to FIG. 3. The graph of FIG. 3 shows the relationship between the duration of a speech signal and the acoustic diversity of the speech signal at that point. Graph (1) in FIG. 3 corresponds to the technique of Non-Patent Document 1 and shows that the acoustic reliability is 1 when the duration of the speech signal is 30. With that technique, the curve is the same as graph (1) for any speech signal. With the technique according to one aspect of the present invention, by contrast, the shape of the curve depends on the diversity of the sounds contained in the speech signal. When the diversity of the speech signal is high, the acoustic reliability is high even while the duration is still short; for example, it may reach 1 at a duration of 20, as in graph (2) of FIG. 3. When the beginning of the speech signal has low diversity, the acoustic reliability does not become high until the duration is long; for example, it may reach 1 only at a duration of 40, as in graph (3) of FIG. 3. By calculating the acoustic reliability of a speech signal from its acoustic diversity in this way, the speech processing apparatus 100 of the first embodiment can obtain the acoustic reliability more accurately than the technique of Non-Patent Document 1.

The speech model storage unit 11 and the acoustic diversity storage unit 13 are preferably non-volatile recording media, but can also be realized with volatile recording media.

How the speech model comes to be stored in the speech model storage unit 11 does not matter. For example, the speech model may be stored in the speech model storage unit 11 via a recording medium, a speech model transmitted over a communication line or the like may be stored there, or a speech model entered through an input device may be stored there. The same applies to the acoustic diversity storage unit 13.

The acoustic diversity calculation unit 12 and the acoustic reliability calculation unit 14 can be realized with, for example, an arithmetic device and memory. The processing procedures of the acoustic diversity calculation unit 12 and the other units are realized, for example, in software, and that software is recorded on a recording medium such as a ROM. They may also be realized in hardware (dedicated circuits).

(Operation of the Speech Processing Apparatus 100 in the First Embodiment)
Next, the operation of the speech processing apparatus 100 according to the first embodiment is described with reference to the flowchart of FIG. 2. FIG. 2 is a flowchart showing the operation of the speech processing apparatus 100.

The speech processing apparatus 100 receives one or more speech signals from outside and passes them to the acoustic diversity calculation unit 12 (step S101). For each received speech signal, the acoustic diversity calculation unit 12 refers to the one or more speech models stored in the speech model storage unit 11 and calculates the acoustic diversity (step S102). The acoustic reliability calculation unit 14 refers to each received speech signal and its acoustic diversity and, as needed, to the one or more acoustic diversities stored in the acoustic diversity storage unit 13, and calculates the acoustic reliability (step S103). When reception of speech signals from outside has finished (Yes in step S104), the speech processing apparatus 100 ends the series of processes. When it has not finished (No in step S104), processing returns to step S101.

This completes the operation of the speech processing apparatus 100 according to the first embodiment.

(Effects of the First Embodiment)
As described above, the speech processing apparatus 100 of the first embodiment uses a fast calculation method that requires no linguistic information: it calculates the acoustic reliability of a speech signal from its acoustic diversity. The speech processing apparatus 100 can therefore assign a low acoustic reliability to a speech signal when the kinds of sound it contains are biased or missing, or when its acoustic diversity differs greatly from that of the speech signal used to train the speaker model. As a result, the speech processing apparatus 100 of the first embodiment can suppress the output of speaker recognition results for speech signals that are unsuited to speaker recognition, and can thus suppress degradation of speaker recognition accuracy.

<Second Embodiment>
FIG. 4 is a block diagram of the speaker recognition apparatus 200 in the second embodiment. The speaker recognition apparatus 200 includes a speech section detection unit 21, a speech processing unit 22, a speaker model storage unit 23, a speaker recognition calculation unit 24, and a speaker recognition output unit 25.

The speaker recognition apparatus 200 in this embodiment is one example of an attribute recognition apparatus that recognizes specific attribute information from a speech signal, and this embodiment is applicable to attribute recognition apparatuses in general. Another concrete example of an attribute recognition apparatus is a language recognition apparatus. A speaker recognition apparatus recognizes information indicating the speaker who uttered the speech signal. A language recognition apparatus recognizes information indicating the language conveyed by the speech signal. This embodiment is thus applicable to both speaker recognition apparatuses and language recognition apparatuses.

The speech section detection unit 21 receives a speech signal, detects the speech sections it contains, segments the signal accordingly, and outputs the resulting segmented speech signals. Here, receiving means, for example, reception from an external device or the handing over of a processing result from another processing device or another program. The speech section detection unit 21 may, for example, judge a section of the speech signal in which the volume stays below a predetermined value for a certain period to be silence, and treat the portions before and after that section as different speech sections.
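The silence rule mentioned above can be sketched as follows. This is a minimal illustration rather than the detector used in the apparatus: the frame-based mean-square energy, the function name `segment_by_silence`, and all parameter names are assumptions introduced here.

```python
import numpy as np

def segment_by_silence(signal, frame_len, energy_threshold, min_silence_frames):
    """Return (start, end) sample index pairs of detected speech sections.

    Assumption: a run of at least `min_silence_frames` low-energy frames
    counts as silence and separates two speech sections.
    """
    x = np.asarray(signal, dtype=float)
    n_frames = len(x) // frame_len
    # Mean-square energy of each non-overlapping frame.
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    loud = (frames ** 2).mean(axis=1) >= energy_threshold

    segments, start, quiet_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                # Close the section at the first frame of the quiet run.
                end = i - quiet_run + 1
                segments.append((start * frame_len, end * frame_len))
                start, quiet_run = None, 0
    if start is not None:
        # Signal ended inside a section; drop any short trailing quiet run.
        segments.append((start * frame_len, (n_frames - quiet_run) * frame_len))
    return segments
```

A pause shorter than `min_silence_frames` frames does not split a section, which mirrors the "stays below a predetermined value for a certain period" condition in the text.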

The speech processing unit 22 receives the one or more speech signals output by the speech section detection unit 21, performs speech processing equivalent to that of the speech processing apparatus 100 of the first embodiment to calculate the acoustic reliability of each speech signal, and outputs the resulting acoustic reliability. The configuration and operation of the speech processing unit 22 may be the same as those of the speech processing apparatus 100 in the first embodiment. For example, the speech processing unit 22 may be the speech processing apparatus 100.

The speaker model storage unit 23 stores one or more speaker models. A speaker model holds the information needed to assign to a speech signal a numerical value (score) representing how well the signal fits the model. For example, when the speaker model is a Gaussian mixture distribution, the speaker recognition apparatus 200 can calculate the probability of occurrence of the speech signal as the score from the means, variances, and mixture coefficients of the mixture. The Gaussian mixture distribution for each speaker is one optimized according to a general criterion such as the maximum likelihood criterion or the maximum a posteriori criterion, using speech signals to which speaker IDs are attached as teacher labels. Here, a speaker ID is an identifier for identifying a speaker.
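To make the scoring step concrete, the sketch below computes an average per-frame log-likelihood of a feature sequence under a Gaussian mixture with diagonal covariances. The diagonal-covariance restriction, the use of the log domain, and the name `gmm_score` are assumptions made for this illustration; the description only states that the score is derived from the means, variances, and mixture coefficients.

```python
import numpy as np

def gmm_score(features, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    features: (T, D), weights: (K,), means and variances: (K, D).
    """
    x = np.asarray(features, dtype=float)[:, None, :]      # (T, 1, D)
    mu = np.asarray(means, dtype=float)[None, :, :]        # (1, K, D)
    var = np.asarray(variances, dtype=float)[None, :, :]   # (1, K, D)
    log_w = np.log(np.asarray(weights, dtype=float))       # (K,)
    # Log density of every frame under every Gaussian component.
    log_gauss = -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=2)
    joint = log_w + log_gauss                               # (T, K)
    # Numerically stable log-sum-exp over the components.
    m = joint.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(joint - m).sum(axis=1))
    return float(log_lik.mean())
```

Scoring the same features against each stored speaker model and sorting the results gives the kind of ranked list that an identification task needs.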

The speaker recognition calculation unit 24 receives the speech signals output by the speech section detection unit 21, refers to the one or more speaker models stored in the speaker model storage unit 23, calculates how well the speech signal fits each speaker model, derives a speaker recognition result, and outputs it to the speaker recognition output unit 25.

When the purpose of the speaker recognition apparatus 200 is speaker identification, the speaker recognition result output by the speaker recognition calculation unit 24 takes the form of, for example, a list of speaker IDs ordered by the scores calculated from the respective speaker models. When the purpose is speaker verification, the speaker recognition result is, for example, accept/reject decision information based on the score calculated from the speaker model of the speaker to be verified.

The speaker recognition output unit 25 receives the speaker recognition result output by the speaker recognition calculation unit 24 and the acoustic reliability output by the speech processing unit 22, modifies the speaker recognition result as needed, and outputs it to the outside. Here, output means, for example, transmission to an external device or the handing over of a processing result to another processing device or another program. Output is also a concept that includes display on a display, projection with a projector, printing on a printer, and the like.

One example of how the speaker recognition output unit 25 creates the speaker recognition result it outputs is as follows. When the purpose of the speaker recognition apparatus 200 is speaker identification, the speaker recognition result received from the speaker recognition calculation unit 24 is, as described above, a list of speaker IDs. For each speaker ID in this result, the speaker recognition output unit 25 refers to the acoustic reliability calculated from the acoustic diversity of the speech signal and the acoustic diversity of the speech signal used to train the speaker model of that speaker ID. The speaker recognition output unit 25 then, for example, removes a speaker ID from the speaker recognition result when its acoustic reliability is lower than a predetermined value. Alternatively, the speaker recognition output unit 25 includes in the speaker recognition result a value obtained by recalculation, for example by multiplying the acoustic reliability by a constant coefficient and adding the product to the score. The speaker recognition output unit 25 takes the speaker recognition result that includes this score, that is, the speaker recognition score, as the new speaker recognition result, and may output this new speaker recognition result.
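The two identification-side adjustments described above (dropping low-reliability speaker IDs, or adding a scaled reliability to the score) can be sketched in one small function. The function name `adjust_identification`, the parameter names, the default values, and the decision to re-sort after rescoring are assumptions made for this illustration.

```python
def adjust_identification(ranked, reliability, threshold=0.5, coeff=0.0):
    """Adjust a ranked identification result using per-speaker reliabilities.

    ranked: list of (speaker_id, score), best first.
    reliability: dict mapping speaker_id to an acoustic reliability.
    Assumption: threshold and coeff are tunable and not taken from the source.
    """
    adjusted = []
    for speaker_id, score in ranked:
        r = reliability[speaker_id]
        if r < threshold:
            # Option 1: drop speakers whose reliability is too low.
            continue
        # Option 2: add the reliability, scaled by a constant, to the score.
        adjusted.append((speaker_id, score + coeff * r))
    # Re-rank, since rescoring may change the order.
    adjusted.sort(key=lambda item: item[1], reverse=True)
    return adjusted
```

Calling it with `coeff=0.0` applies only the filtering; calling it with `threshold=0.0` applies only the rescoring.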

Another example of how the speaker recognition output unit 25 creates the speaker recognition result it outputs is as follows. When the purpose of the speaker recognition apparatus 200 is speaker verification, the received speaker recognition result is, as described above, accept/reject decision information. For this result, the speaker recognition output unit 25 refers to the acoustic reliability calculated from the acoustic diversity of the speech signal and the acoustic diversity of the speech signal used to train the speaker model of the speaker to be verified. For example, when the acoustic reliability is lower than a predetermined value, the speaker recognition output unit 25 outputs the acoustic reliability together with the accept/reject information as the speaker recognition result. Alternatively, when the acoustic reliability is lower than a predetermined value, the speaker recognition output unit 25 outputs, in place of the accept/reject information, information indicating that verification is not possible.
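The verification-side handling can be sketched in the same spirit. The result format (a small dictionary), the label strings, and the `replace` switch between the two behaviours are all hypothetical choices made for this example.

```python
def adjust_verification(accepted, reliability, threshold=0.5, replace=False):
    """Adjust a verification decision using the acoustic reliability.

    Assumption: the result is a dict; the field names are illustrative only.
    """
    decision = "accept" if accepted else "reject"
    if reliability >= threshold:
        # Reliable enough: pass the decision through unchanged.
        return {"decision": decision}
    if replace:
        # Low reliability, variant 2: report that verification is not possible.
        return {"decision": "undecidable"}
    # Low reliability, variant 1: attach the reliability to the decision.
    return {"decision": decision, "reliability": reliability}
```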

The speaker model storage unit 23 is preferably a non-volatile recording medium, but can also be realized with a volatile recording medium.

How the speaker model comes to be stored in the speaker model storage unit 23 does not matter. For example, the speaker model may be stored in the speaker model storage unit 23 via a recording medium, a speaker model transmitted over a communication line or the like may be stored there, or a speaker model entered through an input device may be stored there.

The speech section detection unit 21, the speech processing unit 22, the speaker recognition calculation unit 24, and the speaker recognition output unit 25 can be realized with, for example, an arithmetic processing device and memory. The processing procedures of the speech section detection unit 21 and the other units are realized, for example, in software, and that software is recorded on a recording medium such as a ROM. They may also be realized in hardware (dedicated circuits).

(Operation of the Speaker Recognition Apparatus 200 in the Second Embodiment)
Next, the operation of the speaker recognition apparatus 200 is described with reference to the flowchart of FIG. 5. FIG. 5 is a flowchart showing the operation of the speaker recognition apparatus 200.

The speech section detection unit 21 receives a speech signal from outside, segments it by speech section detection, and provides the one or more segmented speech signals to the speech processing unit 22 and the speaker recognition calculation unit 24 (step S201). The speech processing unit 22 applies the speech processing of the speech processing apparatus 100 of the first embodiment (steps S101 to S104) to the received segmented speech signals, calculates the acoustic reliability, and provides it to the speaker recognition output unit 25 (step S202). The speaker recognition calculation unit 24 performs the speaker recognition calculation on the received segmented speech signals with reference to the one or more speaker models stored in the speaker model storage unit 23, and provides the speaker recognition result to the speaker recognition output unit 25 (step S203). The speaker recognition output unit 25 receives the speaker recognition result and the acoustic reliability for the speech signal, modifies the speaker recognition result as needed, outputs the result to the outside, and ends the series of processes (step S204).

As described above, the speaker recognition apparatus 200 of this embodiment operates such that, after the speaker recognition calculation unit 24 has performed the speaker recognition calculation, the speaker recognition output unit 25 modifies the speaker recognition result according to the acoustic reliability.

(Effects of the Second Embodiment)
As described above, the speaker recognition apparatus 200 of the second embodiment uses the speech processing apparatus of the first embodiment (the speech processing unit 22 in this embodiment) to calculate, for an input speech signal, an acoustic reliability based on its acoustic diversity. It can thereby assign a low acoustic reliability to a speech signal when the kinds of sound it contains are biased or missing, or when its acoustic diversity differs greatly from that of the speech signal used to train the speaker model. As a result, the speaker recognition apparatus 200 of the second embodiment can suppress the output of speaker recognition results for speech signals that are unsuited to speaker recognition, and can thus suppress degradation of speaker recognition accuracy.

Note that with the technique described in Non-Patent Document 1, the longer the speech signal, the longer the computation time for speaker recognition and the response time until the speaker recognition result is output, so the technique is not suited to use cases that require high-speed processing. In contrast, the speaker recognition apparatus 200 of the second embodiment assigns a low acoustic reliability to a speech signal when, for example, the kinds of sound it contains are biased or missing, or when its acoustic diversity differs greatly from that of the speech signal used to train the speaker model. The speaker recognition apparatus 200 can therefore suppress the output of speaker recognition results for speech signals that are unsuited to speaker recognition, and is suitable even for use cases that require high-speed processing.

(Modification of the Second Embodiment)
As a form similar to the second embodiment, there is a configuration in which the acoustic reliability calculated by the speech processing unit 22 is provided to the speaker recognition calculation unit 24. In this configuration, the speaker recognition calculation unit 24 decides whether to perform the speaker recognition calculation based on the acoustic reliability, for example by starting the calculation only once the acoustic reliability exceeds a predetermined value. The speaker recognition apparatus 200 in this similar form therefore has the advantage that computation for speaker recognition can be skipped.

<Third Embodiment>
FIG. 6 is a block diagram of the speaker model learning device 300 in the third embodiment. The speaker model learning device 300 includes a speech section detection unit 31, a speech processing unit 32, a speech model storage unit 33, a speaker model storage unit 34, a speaker model learning unit 35, and a speech signal input request unit 36.

Like the speech section detection unit 21 of the speaker recognition apparatus 200 of the second embodiment, the speech section detection unit 31 receives a speech signal via an input unit, detects the speech sections it contains, segments the signal, and outputs the resulting segmented speech signals. The configuration and operation of the speech section detection unit 31 are the same as those of the speech section detection unit 21 in the second embodiment.

The speech processing unit 32 receives the one or more speech signals output by the speech section detection unit 31, performs speech processing equivalent to that of the speech processing apparatus 100 of the first embodiment, and stores the acoustic diversity of the speech signal in the acoustic diversity storage unit within the speech processing apparatus 100. It also calculates the acoustic reliability of the speech signal and outputs the resulting acoustic reliability. The configuration and operation of the speech processing unit 32 are the same as those of the speech processing apparatus 100 in the first embodiment and the speech processing unit 22 in the second embodiment. The speech processing unit 32 is, for example, the speech processing apparatus 100.

Like the speech model storage unit 11 of the first embodiment, the speech model storage unit 33 stores one or more speech models. A speech model is, for example, a Gaussian mixture distribution. The configuration and operation of the speech model storage unit 33 are the same as those of the speech model storage unit 11 in the first embodiment.

Like the speaker model storage unit 23 of the second embodiment, the speaker model storage unit 34 stores one or more speaker models. A speaker model is, for example, a Gaussian mixture distribution. The configuration and operation of the speaker model storage unit 34 are the same as those of the speaker model storage unit 23 in the second embodiment.

The speaker model learning unit 35 receives one or more speech signals, refers to the speech model stored in the speech model storage unit 33, creates a speaker model using both, and stores it in the speaker model storage unit 34. When the speech model and the speaker model are both Gaussian mixture distributions, the speaker model is one optimized according to a general criterion such as the maximum likelihood criterion or the maximum a posteriori criterion, with the speech model as the initial model and the speaker ID as the teacher label of the speech signal.
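One common way to realize "start from the speech model and optimize under a maximum a posteriori criterion" is relevance-factor MAP adaptation of the component means. The sketch below shows that technique under several assumptions: diagonal covariances, adaptation of the means only, a single adaptation pass, and a hypothetical `relevance` parameter. The source does not say that this particular procedure is used.

```python
import numpy as np

def map_adapt_means(features, weights, means, variances, relevance=16.0):
    """Adapt GMM means toward speaker data (means-only MAP adaptation).

    features: (T, D); weights: (K,); means and variances: (K, D).
    Assumption: `relevance` controls how strongly the prior means are kept.
    """
    x = np.asarray(features, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float)
    # Component posteriors for every frame, computed in the log domain.
    diff = x[:, None, :] - mu[None, :, :]
    log_p = np.log(w) - 0.5 * (np.log(2.0 * np.pi * var)[None] + diff ** 2 / var[None]).sum(axis=2)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)          # (T, K)
    # Zeroth- and first-order sufficient statistics.
    n = post.sum(axis=0)                             # (K,)
    ex = (post.T @ x) / np.maximum(n, 1e-12)[:, None]
    # Interpolate between the data mean and the prior mean.
    alpha = (n / (n + relevance))[:, None]
    return alpha * ex + (1.0 - alpha) * mu
```

Components that see little speaker data stay close to the speech model's means, while well-observed components move toward the speaker's data.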

The speech signal input request unit 36 receives the acoustic reliability that the speech processing unit 32 calculated for the speech signal input to the speaker model learning device 300, calculates request information representing how much further speech signal input is needed, and outputs it to the outside. Here, output means, for example, transmission to an external device or the handing over of a processing result to another processing device or another program. Output is also a concept that includes display on a display, projection with a projector, printing on a printer, and the like.

For example, when the acoustic reliability received from the speech processing unit 32 is higher than a predetermined value, the speech signal input request unit 36 outputs request information indicating that speech signal input should stop. By outputting this request information and informing the speaker that input may be stopped, the speech signal input request unit 36 can reduce the burden on the speaker of providing more speech than necessary.

For example, when the acoustic reliability received from the speech processing unit 32 is lower than a predetermined value, the speech signal input request unit 36 outputs request information indicating that speech signal input should continue. By outputting this request information and presenting a display that prompts the speaker to keep speaking, the speech signal input request unit 36 can prevent the speaker from stopping the input too early.
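The stop/continue rule of the two preceding paragraphs can be sketched as a threshold test. The names `request_for` and `segments_needed`, the threshold value, and the treatment of a reliability exactly equal to the threshold (not specified in the text) are assumptions made for this example.

```python
def request_for(reliability, threshold=0.8):
    """Return the request information for one acoustic reliability value."""
    # Assumption: a reliability equal to the threshold still asks for more input.
    return "stop" if reliability > threshold else "continue"

def segments_needed(reliabilities, threshold=0.8):
    """Count how many segments are consumed before the first 'stop' request.

    reliabilities: the reliability after each additional segment, in order.
    Returns None if the reliability never exceeds the threshold.
    """
    for count, r in enumerate(reliabilities, start=1):
        if request_for(r, threshold) == "stop":
            return count
    return None
```

For a speaker whose reliability rises quickly, `segments_needed` returns a small count, which corresponds to enrolment ending early.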

The speech section detection unit 31 receives the request information output by the speech signal input request unit 36. The speech section detection unit 31 may, for example, stop the speech section detection process when it receives request information indicating that speech signal input should stop. With this configuration, the speaker model learning device 300 can acquire a speech signal suited to speaker model learning with the minimum necessary duration, and can therefore create an appropriate speaker model with little computation.

The speaker model learning unit 35 may also, for example, refrain from starting speaker model learning when the speech section detection unit 31 has received request information indicating that speech signal input should continue. With this configuration, the speaker model learning device 300 avoids the computation that the speaker model learning unit 35 would otherwise spend on speaker model learning with a speech signal unsuited to it.

The speech model storage unit 33 and the speaker model storage unit 34 are preferably non-volatile recording media, but can also be realized with volatile recording media.

How the speech model comes to be stored in the speech model storage unit 33 does not matter. For example, the speech model may be stored in the speech model storage unit 33 via a recording medium, a speech model transmitted over a communication line or the like may be stored there, or a speech model entered through an input device may be stored there. The same applies to the speaker model storage unit 34.

The voice section detection unit 31, the voice processing unit 32, the speaker model learning unit 35, and the voice signal input request unit 36 can be realized by, for example, a processor and memory. The processing procedures of the voice section detection unit 31 and the other units are realized by, for example, software, and that software is recorded on a recording medium such as a ROM. They may also be realized by hardware (dedicated circuits).

(Operation of the speaker model learning device 300 in the third embodiment)
Next, the operation of the speaker model learning device 300 is described with reference to FIG. 7, which is a flowchart showing that operation.

The voice section detection unit 31 receives a voice signal from outside, segments the received voice signal by detecting voice sections, and outputs one or more segmented voice signals (step S301). The voice processing unit 32 applies the voice processing of the voice processing apparatus 100 of the first embodiment (steps S101 to S104) to the received voice signal, calculates the acoustic diversity and stores it in the acoustic diversity storage unit within the voice processing apparatus 100, and calculates and outputs the acoustic reliability (step S302). The voice signal input request unit 36 receives the acoustic reliability and outputs to the outside request information indicating a request for voice signal input (step S303). If the voice section detection unit 31 has received (detected) request information, output by the voice signal input request unit 36, indicating that voice section detection should stop (Yes in step S304), processing returns to step S301. If the voice section detection unit 31 has not received such request information (No in step S304), the speaker model learning unit 35 refers to the voice model stored in the voice model storage unit 33, learns a speaker model from the received voice signal, stores it in the speaker model storage unit 34, and ends the series of processes (step S304).

In addition, when the speaker model learning device 300 detects that voice input to the voice section detection unit 31 has ended, or that the voice section detection unit 31 has received request information indicating that voice section detection should stop, it completes the processing up to step S304 for the voice signal received up to that point and then ends the series of processes.
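The overall flow of FIG. 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function `collect_then_train`, the callbacks `reliability_of` and `train`, and the fixed `threshold` are all hypothetical names introduced here. The sketch also assumes one particular reading of the branch at step S304, namely that learning is deferred while further input is being requested, and that nothing is trained if input ends before the signal becomes suitable.

```python
from typing import Callable, Iterable, List, Optional


def collect_then_train(
    segments: Iterable[List[float]],
    reliability_of: Callable[[List[float]], float],
    train: Callable[[List[float]], str],
    threshold: float = 0.5,
) -> Optional[str]:
    """Accumulate segmented speech until it is reliable enough, then train.

    All names and the threshold rule are assumptions of this sketch.
    """
    received: List[float] = []
    for segment in segments:                    # S301: next segmented signal
        received.extend(segment)
        reliability = reliability_of(received)  # S302: acoustic reliability
        if reliability < threshold:             # S303/S304: keep requesting input
            continue
        return train(received)                  # learn and store the speaker model
    return None  # input ended before the signal became suitable (assumed behavior)
```

For example, with a toy reliability that grows with the amount of audio received, training starts only once enough segments have arrived, which is how the device avoids spending computation on unsuitable input.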

(Effects of the third embodiment)
As described above, the speaker model learning device 300 of the third embodiment uses the voice processing apparatus of the first embodiment to calculate, for an input voice signal, an acoustic reliability based on the acoustic diversity, and can request further voice signal input when the types of sounds contained in the voice signal are biased or missing. As a result, a speaker model can be learned from a voice signal suitable for speaker recognition, and a voice processing apparatus that suppresses degradation of speaker recognition accuracy can be configured.

＜Fourth Embodiment＞
The configuration of the voice processing apparatus 400 in the fourth embodiment is described with reference to the drawings. FIG. 8 is a block diagram showing the configuration of the voice processing apparatus 400 in the fourth embodiment.

The voice processing apparatus 400 in the fourth embodiment includes an acoustic diversity calculation unit 12 and an acoustic reliability calculation unit 14. The acoustic diversity calculation unit 12 calculates, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds in the voice signal. The acoustic reliability calculation unit 14 calculates, based on the degree of difference between the acoustic diversity of the voice signal calculated by the acoustic diversity calculation unit 12 and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal.
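As an illustration of what the acoustic diversity calculation unit 12 might compute, the following Python sketch follows the mixture-model description given in the supplementary notes: the likelihood of each frame under every element distribution is normalized by the sum over all element distributions. Several points are assumptions of this sketch rather than statements from the text: features are one-dimensional, each element distribution is a Gaussian given as a hypothetical `(weight, mean, variance)` tuple, per-frame values are averaged into one vector, and a uniform vector is returned when every likelihood underflows to zero.

```python
import math
from typing import List, Sequence, Tuple

# One Gaussian element distribution: (weight, mean, variance). Hypothetical layout.
Component = Tuple[float, float, float]


def component_posteriors(x: float, gmm: Sequence[Component]) -> List[float]:
    """Weighted likelihood of x per component, normalized by the sum over components."""
    weighted = [
        w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for (w, m, v) in gmm
    ]
    total = sum(weighted)
    if total == 0.0:
        # Assumed fallback: the frame is far from every component.
        return [1.0 / len(gmm)] * len(gmm)
    return [p / total for p in weighted]


def acoustic_diversity(frames: Sequence[float], gmm: Sequence[Component]) -> List[float]:
    """Average the per-frame normalized likelihoods into one diversity vector."""
    if not frames:
        raise ValueError("at least one frame is required")
    sums = [0.0] * len(gmm)
    for x in frames:
        for i, p in enumerate(component_posteriors(x, gmm)):
            sums[i] += p
    return [s / len(frames) for s in sums]
```

A signal whose frames all fall near a single component yields a vector concentrated on that component, which is the kind of bias in sound types that the acoustic reliability is meant to expose.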

The voice processing apparatus 400 having the above configuration calculates, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds in the voice signal. Based on the degree of difference between the acoustic diversity of the voice signal calculated by the acoustic diversity calculation unit 12 and the acoustic diversity of another voice signal serving as a reference, the voice processing apparatus 400 calculates an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal. Therefore, the voice processing apparatus 400 can suppress degradation of speaker recognition accuracy by appropriately determining the reliability of the speaker recognition result.
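The step from a degree of difference to an acoustic reliability can be sketched in Python as follows. The supplementary notes name the Kullback-Leibler divergence as one possible measure; the rest is assumed for illustration only: the divergence is taken from the input signal's diversity to the reference diversity, zero entries are floored at a small `eps`, and the difference is mapped to a reliability with `exp(-difference)` so that identical diversities give 1.0. The function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-10) -> float:
    """KL(p || q) for two discrete distributions, with entries floored at eps."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)
        qi = max(qi, eps)
        total += pi * math.log(pi / qi)
    return total


def acoustic_reliability(diversity: Sequence[float], reference: Sequence[float]) -> float:
    """Map the difference between two diversity vectors to a value in (0, 1].

    The exp(-KL) mapping is an assumption of this sketch, not taken from the text.
    """
    if len(diversity) != len(reference):
        raise ValueError("diversity vectors must have the same length")
    difference = max(0.0, kl_divergence(diversity, reference))
    return math.exp(-difference)
```

With a uniform reference `[0.5, 0.5]`, an input diversity of `[0.6, 0.4]` scores close to 1, while `[1.0, 0.0]` scores about 0.5, so a signal that covers only one kind of sound is rated as less suitable for recognition.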

Note that the acoustic diversity calculation unit 12 and the acoustic reliability calculation unit 14 in the first to third embodiments correspond to the acoustic diversity calculation unit 12 and the acoustic reliability calculation unit 14 in the fourth embodiment and provide equivalent functions.

The present invention has been described above using embodiments, but it is not limited to those embodiments. Various changes that a person skilled in the art can understand may be made to the configuration and details of the present invention within its scope. That is, the present invention is not limited to the above embodiments, various modifications are possible, and it goes without saying that these are also included within the scope of the present invention.

Some or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to them.

[Supplementary Note 1]
A voice processing apparatus comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds in the voice signal; and
acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal.

[Supplementary Note 2]
The voice processing apparatus according to Supplementary Note 1, wherein
the acoustic reliability calculation means calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal based on a measure representing the distance between two probability distributions.

[Supplementary Note 3]
The voice processing apparatus according to Supplementary Note 1 or 2, wherein
the acoustic diversity calculation means calculates, based on a voice model that is configured as a mixture model whose element distributions are probability distributions and that represents the appearance probability distribution of voice signals, the likelihood of the voice signal with respect to each element distribution of the voice model, and calculates the acoustic diversity of the voice signal based on the likelihood or on a value obtained by normalizing the likelihood by the sum of the likelihoods over all element distributions.

[Supplementary Note 4]
The voice processing apparatus according to any one of Supplementary Notes 1 to 3, wherein
the acoustic reliability calculation means calculates the acoustic reliability of the voice signal based on signals obtained by segmenting the voice signal.

[Supplementary Note 5]
The voice processing apparatus according to any one of Supplementary Notes 1 to 4, wherein
the acoustic reliability calculation means calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal based on the Kullback-Leibler divergence or the cosine similarity.

[Supplementary Note 6]
The voice processing apparatus according to any one of Supplementary Notes 1 to 5, further comprising
voice signal input request means for outputting to the outside request information representing the degree to which voice signal input is required, wherein
the voice signal input request means determines, based on the acoustic reliability of the voice signal, whether to output the request information to the outside and whether to stop voice signal input.

[Supplementary Note 7]
A voice processing apparatus comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
attribute recognition means for recognizing specific attribute information from the voice signal when the acoustic diversity satisfies a predetermined condition.

[Supplementary Note 8]
A voice processing apparatus comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal; and
attribute recognition means for recognizing specific attribute information from the voice signal, wherein
the attribute recognition means starts attribute recognition processing when the acoustic reliability of the voice signal satisfies a predetermined condition.

[Supplementary Note 9]
The voice processing apparatus according to Supplementary Note 8, comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal;
attribute recognition means for recognizing specific attribute information from the voice signal; and
recognition result output means for outputting a recognition result of the attribute recognition processing, wherein
the recognition result output means, based on the acoustic reliability, calculates the recognition result as a score, which is numerical information representing the degree to which the voice signal matches the voice model, or outputs the recognition result.

[Supplementary Note 10]
The voice processing apparatus according to Supplementary Note 8 or 9, wherein
the specific attribute information is information indicating the speaker who uttered the voice signal or the language of the voice signal.

[Supplementary Note 11]
A voice processing method comprising:
calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal.

[Supplementary Note 12]
The voice processing method according to Supplementary Note 11, wherein
the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal is calculated based on a measure representing the distance between two probability distributions.

[Supplementary Note 13]
The voice processing method according to Supplementary Note 11 or 12, wherein,
based on a voice model that is configured as a mixture model whose element distributions are probability distributions and that represents the appearance probability distribution of voice signals, the likelihood of the voice signal with respect to each element distribution of the voice model is calculated, and the acoustic diversity of the voice signal is calculated based on the likelihood or on a value obtained by normalizing the likelihood by the sum of the likelihoods over all element distributions.

[Supplementary Note 14]
The voice processing method according to any one of Supplementary Notes 11 to 13, wherein
the acoustic reliability of the voice signal is calculated based on signals obtained by segmenting the voice signal.

[Supplementary Note 15]
The voice processing method according to any one of Supplementary Notes 11 to 14, wherein
the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal is calculated based on the Kullback-Leibler divergence or the cosine similarity.

[Supplementary Note 16]
The voice processing method according to any one of Supplementary Notes 11 to 15, further comprising:
outputting to the outside request information representing the degree to which voice signal input is required; and
determining, based on the acoustic reliability of the voice signal, whether to output the request information to the outside and whether to stop voice signal input.

[Supplementary Note 17]
A voice processing apparatus characterized by:
calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
recognizing specific attribute information from the voice signal when the acoustic diversity satisfies a predetermined condition.

[Supplementary Note 18]
A voice processing method comprising:
calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal;
recognizing specific attribute information from the voice signal; and
determining whether to start attribute recognition processing when the acoustic reliability of the voice signal satisfies a predetermined condition.

[Supplementary Note 19]
The voice processing method according to Supplementary Note 18, comprising:
calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal;
recognizing specific attribute information from the voice signal;
outputting a recognition result of the attribute recognition processing; and
based on the acoustic reliability, calculating the recognition result as a score, which is numerical information representing the degree to which the voice signal matches the voice model, or outputting the recognition result.

[Supplementary Note 20]
The voice processing method according to Supplementary Note 18 or 19, wherein
the specific attribute information is information indicating the speaker who uttered the voice signal or the language of the voice signal.

[Supplementary Note 21]
A program causing a computer to execute:
acoustic diversity calculation processing for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
acoustic reliability calculation processing for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal.

[Supplementary Note 22]
The program according to Supplementary Note 21, wherein
the acoustic reliability calculation processing calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal based on a measure representing the distance between two probability distributions.

[Supplementary Note 23]
The program according to Supplementary Note 21 or 22, wherein
the acoustic diversity calculation processing calculates, based on a voice model that is configured as a mixture model whose element distributions are probability distributions and that represents the appearance probability distribution of voice signals, the likelihood of the voice signal with respect to each element distribution of the voice model, and calculates the acoustic diversity of the voice signal based on the likelihood or on a value obtained by normalizing the likelihood by the sum of the likelihoods over all element distributions.

[Supplementary Note 24]
The program according to any one of Supplementary Notes 21 to 23, wherein
the acoustic reliability calculation processing calculates the acoustic reliability of the voice signal based on signals obtained by segmenting the voice signal.

[Supplementary Note 25]
The program according to any one of Supplementary Notes 21 to 24, wherein
the acoustic reliability calculation processing calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signal based on the Kullback-Leibler divergence or the cosine similarity.

[Supplementary Note 26]
The program according to any one of Supplementary Notes 21 to 25, further comprising
voice signal input request processing for outputting to the outside request information representing the degree to which voice signal input is required, wherein
the voice signal input request processing determines, based on the acoustic reliability of the voice signal, whether to output the request information to the outside and whether to stop voice signal input.

[Supplementary Note 27]
A voice processing apparatus characterized by causing a computer to execute:
acoustic diversity calculation processing for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
attribute recognition processing for recognizing specific attribute information from the voice signal when the acoustic diversity satisfies a predetermined condition.

[Supplementary Note 28]
A program causing a computer to execute:
acoustic diversity calculation processing for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
acoustic reliability calculation processing for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal; and
attribute recognition processing for recognizing specific attribute information from the voice signal, wherein
the attribute recognition processing is started when the acoustic reliability of the voice signal satisfies a predetermined condition.

[Supplementary Note 29]
The program according to Supplementary Note 28, comprising:
acoustic diversity calculation processing for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal;
acoustic reliability calculation processing for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of another voice signal serving as a reference, an acoustic reliability representing the degree to which the voice signal is suitable as a target for recognizing specific attribute information from a voice signal;
attribute recognition processing for recognizing specific attribute information from the voice signal; and
recognition result output processing for outputting a recognition result of the attribute recognition processing, wherein
the recognition result output processing, based on the acoustic reliability, calculates the recognition result as a score, which is numerical information representing the degree to which the voice signal matches the voice model, or outputs the recognition result.

[Supplementary Note 30]
The program according to Supplementary Note 28 or 29, wherein
the specific attribute information is information indicating the speaker who uttered the voice signal or the language of the voice signal.

Note that, in each aspect of the present invention, when information about a user is acquired or used, this shall be done lawfully.

Description of Symbols
10 Information processing apparatus
11 Voice model storage unit
12 Acoustic diversity calculation unit
13 Acoustic diversity storage unit
14 Acoustic reliability calculation unit
21 Voice section detection unit
22 Voice processing unit
23 Speaker model storage unit
24 Speaker recognition calculation unit
25 Speaker recognition output unit
31 Voice section detection unit
32 Voice processing unit
33 Voice model storage unit
34 Speaker model storage unit
35 Speaker model learning unit
36 Voice signal input request unit
41 Acoustic diversity calculation unit
42 Acoustic reliability calculation unit
100 Voice processing apparatus
200 Speaker recognition device
300 Speaker model learning device
400 Voice processing apparatus

Claims

A voice processing apparatus comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing a voice, an acoustic diversity representing the degree of variation in the types of sounds contained in the voice signal; and
acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity calculated from other voice signals used for learning a speaker model for recognizing specific attribute information, an acoustic reliability representing the degree to which the voice signal is suitable for recognizing the attribute information.

The voice processing apparatus according to claim 1, wherein
the acoustic reliability calculation means calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signals based on a measure of the distance between two probability distributions.

The voice processing apparatus according to claim 1 or 2, wherein
the acoustic diversity calculation means calculates, based on a voice model that is configured as a mixture model whose element distributions are probability distributions and that represents the appearance probability distribution of voice signals, the likelihood of the voice signal with respect to each element distribution of the voice model, and calculates the acoustic diversity of the voice signal based on the likelihood or on a value obtained by normalizing the likelihood by the sum of the likelihoods over all element distributions.

The voice processing apparatus according to any one of claims 1 to 3, wherein
the voice signal is segmented at a predetermined time, and
the acoustic reliability calculation means calculates an acoustic reliability for each of the segmented voice signals.

The voice processing apparatus according to any one of claims 1 to 4, wherein
the acoustic reliability calculation means calculates the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity of the other voice signals based on the Kullback-Leibler divergence or the cosine similarity.

The voice processing apparatus according to any one of claims 1 to 5, further comprising
voice signal input request means for outputting to the outside request information representing the degree to which voice signal input is required, wherein
the voice signal input request means determines, based on the acoustic reliability of the voice signal, whether to output the request information to the outside.

A voice processing device comprising:
acoustic diversity calculation means for calculating, based on a voice signal representing speech, an acoustic diversity that indicates the degree of variation in the types of sound included in the voice signal;
acoustic reliability calculation means for calculating, based on the difference between the acoustic diversity of the voice signal and the acoustic diversity calculated from other voice signals used for learning a speaker model for recognizing specific attribute information, an acoustic reliability that represents the degree to which the voice signal is suitable for recognizing the attribute information;
attribute recognition means for recognizing the specific attribute information from the voice signal; and
recognition result output means for outputting a recognition result of the attribute recognition means,
wherein the recognition result output means outputs, based on the acoustic reliability, the recognition result calculated as a score representing the degree to which the voice signal matches the specific attribute information, or outputs the recognition result in accordance with the acoustic reliability.
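
The two output alternatives in claim 7 might be realized as in the following sketch. Scaling the score by the reliability ("weight") and withholding the result below a threshold ("gate") are assumed interpretations, as are the function name output_result and the default threshold.

```python
def output_result(score, reliability, mode="weight", threshold=0.5):
    """Output a recognition result while taking the acoustic reliability into account.

    mode="weight": scale the matching score by the reliability (assumed scheme).
    mode="gate":   output the score only if the reliability reaches the threshold.
    """
    if mode == "weight":
        return {"score": score * reliability, "reliability": reliability}
    if mode == "gate":
        if reliability < threshold:
            return None  # result withheld because the input is unreliable
        return {"score": score, "reliability": reliability}
    raise ValueError("mode must be 'weight' or 'gate'")

weighted = output_result(0.8, 0.5)                  # score scaled by reliability
withheld = output_result(0.8, 0.3, mode="gate")     # below threshold
passed = output_result(0.8, 0.9, mode="gate")       # above threshold
```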

Acoustic diversity calculation means for calculating, based on a voice signal representing speech, an acoustic diversity that indicates the degree of variation in the types of sound included in the voice signal;
acoustic reliability calculation means for calculating, based on the degree of difference between the acoustic diversity of the voice signal and the acoustic diversity calculated from other voice signals used for learning a speaker model for recognizing specific attribute information, an acoustic reliability that represents the degree to which the voice signal is suitable for recognizing the specific attribute information from the voice signal; and
an attribute recognition unit that recognizes the specific attribute information from the voice signal;
The voice processing apparatus according to claim 1, wherein the attribute recognition unit starts attribute recognition processing when the acoustic reliability of the voice signal satisfies a predetermined condition.

A voice processing method comprising:
calculating, based on a voice signal representing speech, an acoustic diversity representing the degree of variation in the types of sound included in the voice signal; and
calculating, based on the difference between the acoustic diversity of the voice signal and the acoustic diversity calculated from other voice signals used for learning a speaker model for recognizing specific attribute information, an acoustic reliability indicating the degree to which the voice signal is suitable as a target for recognizing the attribute information.

A program for causing a computer to execute:
acoustic diversity calculation processing for calculating, based on a voice signal representing speech, an acoustic diversity representing the degree of variation in the types of sound included in the voice signal; and
acoustic reliability calculation processing for calculating, based on the difference between the acoustic diversity of the voice signal and the acoustic diversity calculated from other voice signals used for learning a speaker model for recognizing specific attribute information, an acoustic reliability indicating the degree to which the voice signal is suitable for recognizing the attribute information.