JP6676009B2

JP6676009B2 - Speaker determination device, speaker determination information generation method, and program

Info

Publication number: JP6676009B2
Application number: JP2017123592A
Authority: JP
Inventors: 歩相名神山; 厚志安藤; 哲小橋川; 裕司青野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2017-06-23
Filing date: 2017-06-23
Publication date: 2020-04-08
Anticipated expiration: 2037-06-23
Also published as: JP2019008131A

Description

The present invention relates to a speaker determination device that determines speakers from an audio signal, a speaker determination information generation method, and a program.

In recent years, companies have placed growing importance on customer satisfaction. A company's service counter, where customer requests can be heard directly, is an important customer touchpoint for customer satisfaction. There is demand to collect marketing information from the voices of customers who visit the counter. In contact person training aimed at improving customer satisfaction, there is also demand to analyze voices according to customer attributes (gender, age group, etc.) and to analyze the response quality of contact persons.

To meet these demands, one conceivable method is to place a microphone at the counter and record audio, determine the sections in which the contact person and the customer are speaking, apply speech recognition or the like to the determined contact person and customer speech, analyze customer requests based on frequently occurring requests and keywords, and thereby collect marketing information. To estimate customer attributes, conventional attribute estimation techniques (techniques for estimating gender, age group, etc.) can be applied to the sections in which the customer is speaking.

Speaker diarization can be applied to determine the sections in which the customer is speaking. Speaker diarization is a technique for determining "who spoke when" from audio such as a dialogue. Applying it makes it possible to determine the sections in which a contact person is speaking and the sections in which a customer is speaking. An outline of conventional speaker diarization is given below.
(i) The voices of a plurality of contact persons are registered in advance in the speaker diarization system as contact person voices.
(ii) Unnecessary noise and silence are removed from the recorded audio signal by a speech section detection technique such as VAD (voice activity detection), and voice sections are detected.
(iii) For each detected voice section, the similarity to each registered contact person is computed. If the similarity is greater than or equal to a predetermined threshold, the section is determined to be contact person speech; if it is below the threshold, the speaker is not registered, so the section is determined to be customer speech.
(iv) For each voice section determined in (iii) to be contact person speech, the contact person with the highest similarity is determined to be the speaker of that section.
(v) For the voice sections determined in (iii) to be customer speech, speaker clustering is performed over all of those sections to determine the voice sections of each speaker. Speaker clustering is a technique that obtains a vector representing speaker characteristics (an i-vector or the like) for each utterance, plots the vectors in a space, and determines that utterances whose vectors are close to one another belong to the same speaker.

Non-Patent Document 1: Takahiro Oku et al., "Speaker Diarization for Dialogue Speech Recognition," NHK STRL R&D, No. 147, September 2014, pp. 37-44.
Non-Patent Document 2: M. Fujimoto, K. Ishizuka, and T. Nakatani, "A Voice Activity Detection Based on the Adaptive Integration of Multiple Speech Features and a Signal Decision Scheme," ICASSP 2008, pp. 4441-4444, 2008.
Non-Patent Document 3: N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 788-798, 2011.
Non-Patent Document 4: D. Pelleg and A. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," Proc. of ICML, pp. 724-734, 2000.

Conversations at a service counter involve a plurality of customers and a plurality of contact persons. With conventional speaker diarization, phoneme bias, short utterances, and the like can cause the following kinds of speaker misjudgment.
(a) An utterance by one contact person is misjudged as an utterance by another contact person.
(b) An utterance by a customer is misjudged as an utterance by a contact person who is not serving that customer.
(c) An utterance by one customer is misjudged as an utterance by another customer.
Accordingly, an object of the present invention is to provide a speaker determination device with improved speaker determination accuracy.

The speaker determination device of the present invention determines speakers from an audio signal in which a conversation between a contact person and a customer is recorded. It includes a similarity calculation unit, a speaker primary determination unit, a speaker secondary determination unit, and a speaker clustering unit.

The similarity calculation unit calculates the similarity between the speaker feature of each divided voice section, obtained by dividing the voice sections of the audio signal into segments of a predetermined time length, and the speaker feature generated in advance for each contact person.

The speaker primary determination unit generates, from the similarity, primary determination information representing the speaker ID of each divided voice section.

The speaker secondary determination unit generates secondary determination information as follows. The neighboring speaker is the speaker who best fits a predetermined number of divided voice sections before or after a given divided voice section. When the similarity between the speaker feature of the neighboring speaker and the speaker feature of the given divided voice section satisfies a predetermined condition, the speaker ID of the neighboring speaker is used as the secondary determination information of the given divided voice section.

The speaker clustering unit clusters the set of customer speaker features, that is, the speaker features of the divided voice sections whose secondary determination information indicates a customer, generates customer speaker IDs, and generates tertiary determination information.

According to the speaker determination device of the present invention, speaker determination accuracy is improved.

FIG. 1 is a block diagram showing the configuration of the speaker determination device of the first embodiment.
FIG. 2 is a flowchart showing the operation of the speaker determination device of the first embodiment.
FIG. 3 is a diagram showing an example of dividing a voice section detected from an audio signal into segments of a predetermined time length.
FIG. 4 is a block diagram showing the configuration of the speaker secondary determination unit of the first embodiment.
FIG. 5 is a flowchart showing the operation of the speaker secondary determination unit of the first embodiment.
FIG. 6 is a block diagram showing the configuration of the speaker clustering unit of the first embodiment.
FIG. 7 is a flowchart showing the operation of the speaker clustering unit of the first embodiment.
FIG. 8 is a block diagram showing the configuration of the speaker determination device of the second embodiment.
FIG. 9 is a flowchart showing part of the operation of the speaker determination device of the second embodiment.
FIG. 10 is a block diagram showing the configuration of the speaker clustering unit of the second embodiment.
FIG. 11 is a flowchart showing the operation of the speaker clustering unit of the second embodiment.
FIG. 12 is a block diagram showing the configuration of the speaker determination device of the third embodiment.
FIG. 13 is a flowchart showing part of the operation of the speaker determination device of the third embodiment.
FIG. 14 is a flowchart showing the operation of the speaker clustering unit of the third embodiment.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and redundant description is omitted.

The configuration of the speaker determination device of the first embodiment is described below with reference to FIG. 1. The speaker determination device 1 of this embodiment includes a voice division unit 11, a speaker feature extraction unit 12, a contact person speaker feature storage unit 13A, a similarity calculation unit 13, a speaker primary determination unit 14, a speaker secondary determination unit 15, a speaker clustering unit 16, and a speaker utterance section determination unit 17.

The audio signal s(t) shown as the input in FIG. 1 is an audio signal in which a conversation between a contact person and a customer is recorded; s(t) is the amplitude at sample time t (t = 0, 1, ..., S-1) when the sampling frequency is f_s [Hz]. The operation of each component is described below with reference to FIG. 2.

<Voice division unit 11>
The voice division unit 11 acquires the audio signal s(t) and outputs time information T_k (S11). Its operation is described further with reference to FIG. 3. As shown in that figure, the voice division unit 11 first detects voice sections from the audio signal s(t). It then divides each voice section into segments of a predetermined time length (σ seconds, for example σ = 2). The voice sections can be detected by a conventional method such as that of Non-Patent Document 2.

Each segment obtained by dividing a voice section into the predetermined time length σ is called a divided voice section. The time information T_k is the start time of the k-th divided voice section (k = 1, ..., K, where K is an arbitrary natural number representing the total number of divided voice sections).
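The division step can be illustrated with a minimal sketch. It assumes voice sections are already available as (start, end) pairs in seconds (the VAD itself is not shown), and it keeps a trailing remainder shorter than σ as its own divided voice section; the text does not say how such a remainder is handled, so that choice, like the function name `split_voice_sections`, is an assumption.

```python
def split_voice_sections(voice_sections, sigma=2.0):
    """Return the start time T_k of every divided voice section.

    voice_sections: list of (start, end) pairs in seconds (assumed VAD output).
    sigma: predetermined time length of one divided voice section.
    """
    start_times = []
    for start, end in voice_sections:
        n = 0
        # Step through the section in sigma-second strides; a final piece
        # shorter than sigma is kept (an assumption of this sketch).
        while start + n * sigma < end:
            start_times.append(start + n * sigma)
            n += 1
    return start_times


sections = [(0.0, 5.0), (8.0, 10.0)]
T = split_voice_sections(sections, sigma=2.0)
print(T)
```

With these inputs the first voice section yields three divided voice sections and the second yields one.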

<Speaker feature extraction unit 12>
The speaker feature extraction unit 12 obtains, for each divided voice section, a vector representing the characteristics of the speaker (S12). This vector is hereinafter called the speaker feature. The speaker feature is extracted by a conventional technique, for example as an i-vector (Non-Patent Document 3). The speaker feature extracted from the k-th divided voice section, which starts at T_k, is denoted I_k. To keep the explanation easy to follow, the description below focuses on the k-th divided voice section where appropriate.

<Contact person speaker feature storage unit 13A>
A speaker feature J_m for each contact person is generated and registered in advance in the contact person speaker feature storage unit 13A. Here m is the speaker ID of a contact person, with m = 1, ..., M, and M is the total number of registered contact persons (an arbitrary natural number).

<Similarity calculation unit 13>
The similarity calculation unit 13 calculates the similarity S_k = (S_k1, S_k2, S_k3, ..., S_kM) using the speaker feature I_k and the speaker features J_m (S13). S_km is the similarity between the speaker feature I_k of the k-th divided voice section and the speaker feature J_m of the m-th contact person registered in advance in the contact person speaker feature storage unit 13A, and is computed using the Euclidean distance, the cosine distance, or the like.
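As an illustration of step S13, the sketch below computes S_k for one divided voice section. It assumes a cosine-based score in which larger values mean a closer match, since the later threshold tests treat larger values as better; the function names and the toy two-dimensional vectors are hypothetical, and real speaker features such as i-vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_vector(I_k, registered):
    """Return S_k = (S_k1, ..., S_kM) for one divided voice section.

    I_k: speaker feature of the k-th divided voice section.
    registered: list of contact person features J_1, ..., J_M.
    """
    return [cosine_similarity(I_k, J_m) for J_m in registered]


J = [[1.0, 0.0], [0.0, 1.0]]           # two registered contact persons (toy data)
S_k = similarity_vector([3.0, 4.0], J)
print(S_k)
```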

<Speaker primary determination unit 14>
The speaker primary determination unit 14 generates primary determination information SP_k from the similarity S_k (S14). SP_k represents the speaker ID of the k-th divided voice section and takes a value in {1, 2, 3, ..., m, ..., M} ∪ {0}; SP_k = 0 indicates that the speaker is a customer rather than a contact person. SP_k is obtained from the following equation using a threshold δ.

$$SP_k = \begin{cases} \arg\max_{m} S_{km} & \text{if } \max_{m} S_{km} \ge \delta \\ 0 & \text{otherwise} \end{cases}$$

That is, when the maximum of the similarities S_km (m = 1, ..., M) is greater than or equal to the predetermined threshold δ, the speaker ID attaining that maximum is used as the primary determination information SP_k. When the maximum is less than δ, no contact person matches, and speaker ID 0, which indicates a customer, is used as SP_k. When the similarity calculation unit 13 uses the cosine distance, a threshold δ near 0.5 is suitable.
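The primary determination rule can be sketched in a few lines. The function name `primary_determination` is hypothetical, the similarity list is indexed so that position m-1 holds S_km, and when two contact persons tie for the maximum this sketch returns the lower ID, which is a choice the text does not specify.

```python
def primary_determination(S_k, delta=0.5):
    """Return SP_k for one divided voice section.

    S_k: similarities to contact persons 1..M (list index m-1 holds S_km).
    Returns the 1-based ID of the best-matching contact person, or 0
    (customer) when even the best similarity is below delta.
    """
    best = max(range(len(S_k)), key=lambda m: S_k[m])
    if S_k[best] >= delta:
        return best + 1
    return 0


print(primary_determination([0.6, 0.8]))   # best match is contact person 2
print(primary_determination([0.1, 0.3]))   # nobody reaches delta: customer
```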

<Speaker secondary determination unit 15>
The determination by the speaker primary determination unit 14 may be wrong because of hesitations in speech, phoneme bias, and the like. The speaker secondary determination unit 15 therefore corrects the primary determination information SP_k and generates secondary determination information SP'_k (S15).

The configuration and operation of the speaker secondary determination unit 15 are described with reference to FIGS. 4 and 5. As shown in FIG. 4, the speaker secondary determination unit 15 includes a neighboring speaker detection unit 151 and a secondary determination information generation unit 152.

The speaker who best fits (that is, who has the most determination results among) a predetermined number L of divided voice sections before and after a given divided voice section of interest (here, as above, the k-th divided voice section) is called the neighboring speaker, and the speaker ID of the neighboring speaker is denoted m^ (written $\hat{m}$ in equations).

The neighboring speaker detection unit 151 detects the speaker ID of the neighboring speaker as follows (S151).

$$\hat{m} = \arg\max_{m} \sum_{l=k-L}^{k+L} \tau(SP_l - m), \qquad \tau(x) = \begin{cases} 1 & (x = 0) \\ 0 & (x \ne 0) \end{cases}$$

That is, the speaker ID that applies (τ(0) = 1) the greatest number of times among the (k-L)-th to (k+L)-th divided voice sections is obtained as the speaker ID of the neighboring speaker. The neighboring speaker may instead be obtained from only the L divided voice sections before the k-th divided voice section, or from only the L divided voice sections after it. In that case, the range of the summation in the above equation is changed accordingly.
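A minimal sketch of neighboring speaker detection as a windowed majority vote follows. Several details are assumptions rather than statements from the text: indices are 0-based, the window is clipped at the ends of the recording, only contact person IDs (m ≥ 1) are counted because a similarity S_km exists only for registered contact persons, ties go to the ID seen first in the window, and 0 is returned when the window contains no contact person. The `mode` parameter and the function name are hypothetical.

```python
from collections import Counter


def neighboring_speaker(SP, k, L, mode="both"):
    """Return the neighboring speaker ID for section k (0-based index).

    SP: list of primary determination results SP_k (0 means customer).
    mode: "both" uses sections k-L..k+L, "before" uses k-L..k-1,
          "after" uses k+1..k+L.
    """
    if mode == "before":
        lo, hi = k - L, k - 1
    elif mode == "after":
        lo, hi = k + 1, k + L
    else:
        lo, hi = k - L, k + L
    lo, hi = max(lo, 0), min(hi, len(SP) - 1)   # clip to the recording
    # Count only contact person IDs (assumption of this sketch).
    votes = Counter(sp for sp in SP[lo:hi + 1] if sp >= 1)
    if not votes:
        return 0
    return votes.most_common(1)[0][0]


SP = [1, 1, 0, 1, 2, 2, 2]
print(neighboring_speaker(SP, k=2, L=2))                 # window [1, 1, 0, 1, 2]
print(neighboring_speaker(SP, k=3, L=3, mode="after"))   # window [2, 2, 2]
```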

Next, the secondary determination information generation unit 152 generates the secondary determination information SP'_k using the similarity S_km^ between the speaker feature of the neighboring speaker and the speaker feature of the k-th divided voice section (S152). Specifically, the secondary determination information generation unit 152 generates SP'_k as follows.

$$SP'_k = \begin{cases} \hat{m} & \text{if } S_{k\hat{m}} \ge \delta' \\ 0 & \text{otherwise} \end{cases}$$

That is, when the similarity S_km^ is greater than or equal to a predetermined threshold δ', the speaker ID of the neighboring speaker is used as the secondary determination information SP'_k; when it is less than δ', speaker ID 0, which corresponds to a customer, is used as SP'_k. The threshold δ' may be set to a value greater than or equal to the δ used in step S14 to make the determination stricter. The neighboring speaker may also be determined using the similarities S_k within a fixed window before and after the section of interest.
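The secondary determination rule can be sketched as follows. The function name is hypothetical, the default δ' = 0.6 is only an illustrative value chosen to be at least the δ = 0.5 suggested for step S14, and treating a neighboring speaker ID of 0 as "no contact person nearby" is an assumption of this sketch.

```python
def secondary_determination(S_k, m_hat, delta_prime=0.6):
    """Return SP'_k for one divided voice section.

    S_k: similarities to contact persons 1..M (list index m-1 holds S_km).
    m_hat: neighboring speaker ID (0 when no contact person is nearby).
    The section is attributed to the neighboring speaker only if its own
    similarity to that speaker reaches delta_prime; otherwise it is a customer.
    """
    if m_hat >= 1 and S_k[m_hat - 1] >= delta_prime:
        return m_hat
    return 0


# Section resembles contact person 2, and 2 is also the neighboring speaker.
print(secondary_determination([0.45, 0.70], m_hat=2))
# Neighboring speaker is 1, but the section does not resemble 1 closely enough.
print(secondary_determination([0.45, 0.70], m_hat=1))
```

Note that the rule depends only on the neighboring speaker and the section's similarity to that speaker, so a section whose primary result was a different contact person can be reassigned.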

<Speaker clustering unit 16>
In the following description, the speaker feature of a divided voice section whose secondary determination information indicates a customer (SP'_k = 0) is called a customer speaker feature.

The speaker clustering unit 16 clusters the set of customer speaker features, groups together the divided voice sections considered to be spoken by the same customer, generates and assigns customer speaker IDs, and generates tertiary determination information SP''_k (S16).

The configuration and operation of the speaker clustering unit 16 are described with reference to FIGS. 6 and 7. As shown in FIG. 6, the speaker clustering unit 16 includes a customer speaker feature set generation unit 161, a feature cluster generation unit 162, and a tertiary determination information generation unit 163.

The customer speaker feature set generation unit 161 generates the set V = {I_k | SP'_k = 0} of customer speaker features to be clustered (S161).

The feature cluster generation unit 162 clusters V by a general clustering method such as k-means and generates the set of feature clusters V_1, V_2, ..., V_H (H is the total number of feature clusters) (S162).

The tertiary determination information generation unit 163 generates the tertiary determination information SP''_k (S163), for example as follows.

$$SP''_k = \begin{cases} -h & \text{if } SP'_k = 0 \text{ and } I_k \in V_h \\ SP'_k & \text{otherwise} \end{cases}$$

Here h (h = 1, 2, ..., H) is the index of the feature cluster V_h to which I_k belongs. Customer speaker IDs are negative so that they can be distinguished from the speaker IDs of contact persons.

The k-means method used for the clustering in step S162 requires the number of customers to be given in advance as the number of feature clusters. When the number of customers is unknown, the x-means method using BIC, as in Non-Patent Document 4, can be used instead.
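Steps S161 to S163 can be sketched with a small, dependency-free k-means. This is a simplified illustration under stated assumptions: the initial centroids are passed in explicitly (which also fixes the number of customers H), a fixed number of Lloyd iterations is run instead of a convergence test, and the names `kmeans` and `tertiary_determination` are hypothetical.

```python
def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns a 0-based cluster index per point."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda h: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[h])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for h in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == h]
            if members:
                centroids[h] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def tertiary_determination(SP2, features, initial_centroids):
    """Cluster customer sections (SP'_k = 0) and give them negative IDs."""
    customer_idx = [k for k, sp in enumerate(SP2) if sp == 0]
    V = [features[k] for k in customer_idx]            # set V of step S161
    labels = kmeans(V, [list(c) for c in initial_centroids])
    SP3 = list(SP2)
    for k, h in zip(customer_idx, labels):
        SP3[k] = -(h + 1)                              # customer ID is -h
    return SP3


SP2 = [1, 0, 0, 2, 0]
features = [[9.0, 9.0], [0.0, 0.0], [0.1, 0.0], [9.0, 8.0], [5.0, 5.0]]
SP3 = tertiary_determination(SP2, features, [[0.0, 0.0], [5.0, 5.0]])
print(SP3)
```

Contact person sections keep their positive IDs, while the three customer sections are split into customers -1 and -2.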

<Speaker utterance section determination unit 17>
The speaker utterance section determination unit 17 combines each speaker with the corresponding divided voice section as the final processing result. For example, it combines each speaker with the divided voice section corresponding to that speaker using the following equation, and generates and outputs speaker determination information D_k (S17).

$$D_k = \{SP''_k,\ T_k,\ T_{k+1} - 1\}$$

According to the speaker determination device 1 of this embodiment, when the speaker of a divided voice section is determined, the similarities of a predetermined number of divided voice sections before or after that section are used in addition to that of the section itself, so speaker determination accuracy is improved.
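The output record D_k can be illustrated with a short sketch. It assumes the times are integer sample indices, so that T_{k+1} - 1 is the last sample of section k, and it takes an explicit `end_time` for the final section because T_{K+1} is not defined in the text; both are assumptions, as is the function name.

```python
def speaker_determination_info(SP3, T, end_time):
    """Return D_k = (SP''_k, T_k, T_{k+1} - 1) for every divided voice section.

    SP3: tertiary determination results SP''_k.
    T: start times T_k as sample indices.
    end_time: sample index just past the last section (stands in for T_{K+1}).
    """
    next_starts = T[1:] + [end_time]
    return [(sp, t, nxt - 1) for sp, t, nxt in zip(SP3, T, next_starts)]


# Three 2-second sections at an assumed 16 kHz sampling frequency.
D = speaker_determination_info([1, -1, -1], [0, 32000, 64000], end_time=96000)
for record in D:
    print(record)
```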

The configuration of the speaker determination device of the second embodiment is described below with reference to FIG. 8. The speaker determination device 2 of this embodiment includes a speaker clustering unit 26 that differs from that of the first embodiment. The components other than the speaker clustering unit 26 are the same as in the first embodiment, so only the speaker clustering unit 26 is described below.

<Speaker clustering unit 26>
The speaker clustering unit 26 of this embodiment is characterized by using customer combined speaker features, each obtained by concatenating a customer speaker feature and the corresponding time information into a single vector.

As shown in FIG. 9, the speaker clustering unit 26 clusters the set of customer combined speaker features, groups together the divided voice sections considered to be spoken by the same customer, generates and assigns customer speaker IDs, and generates tertiary determination information SP''_k (S26).

The configuration and operation of the speaker clustering unit 26 are described with reference to FIGS. 10 and 11. As shown in FIG. 10, the speaker clustering unit 26 includes a combined speaker feature generation unit 261, a customer combined speaker feature set generation unit 262, a feature cluster generation unit 263, and a tertiary determination information generation unit 264.

The combined speaker feature generation unit 261 combines the speaker feature I_k with the corresponding time information T_k and generates a combined speaker feature I'_k = (I_k, T_k) (S261). Normally (I_k, T_k) is simply treated as one vector, but the time information may first be scaled by a weight coefficient α so that the combined vector is (I_k, αT_k). Clustering with (I_k, αT_k) was found to be more accurate than clustering with (I_k, T_k).
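The concatenation in step S261 can be sketched as below. The function names and the numeric values are hypothetical; the second helper only illustrates how α controls how much a time gap contributes to the squared Euclidean distance that a k-means style clustering would use.

```python
def combined_feature(I_k, T_k, alpha=1.0):
    """Return I'_k = (I_k, alpha * T_k) as one flat vector."""
    return list(I_k) + [alpha * T_k]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


a = combined_feature([0.2, -0.1], 12.0, alpha=0.5)
b = combined_feature([0.2, -0.1], 16.0, alpha=0.5)
print(a)
print(squared_distance(a, b))   # identical features, 4 s apart, alpha = 0.5
```

Because the time axis and the feature axes generally have different scales, α acts as the knob that balances "sounds alike" against "spoken close together in time".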

Hereinafter, the combined speaker feature of a divided voice section whose secondary determination information indicates a customer (SP'_k = 0) is called a customer combined speaker feature.

The customer combined speaker feature set generation unit 262 generates the set V' = {I'_k | SP'_k = 0 (k = 1, 2, 3, ..., K)} of customer combined speaker features (S262).

The feature cluster generation unit 263 clusters V' by a general clustering method such as k-means and generates the set of feature clusters V'_1, V'_2, ..., V'_H (S263).

The tertiary determination information generation unit 264 generates the tertiary determination information SP''_k (S264), for example as follows.

$$SP''_k = \begin{cases} -h & \text{if } SP'_k = 0 \text{ and } I'_k \in V'_h \\ SP'_k & \text{otherwise} \end{cases}$$

Here h (h = 1, 2, ..., H) is the index of the feature cluster V'_h to which I'_k belongs.

According to the speaker determination device 2 of this embodiment, the customer combined speaker features to be clustered include time information, so utterances spoken close together in time are more readily determined to belong to the same speaker, and speaker determination accuracy is improved.

The configuration of the speaker determination device of the third embodiment is described below with reference to FIG. 12. The speaker determination device 3 of this embodiment includes a speaker clustering unit 36 that differs from those of the first and second embodiments. The components other than the speaker clustering unit 36 are the same as in the first and second embodiments, so only the speaker clustering unit 36 is described below.

<Speaker clustering unit 36>
The configuration and operation of the speaker clustering unit 36 are described with reference to FIGS. 12 and 13. As shown in FIG. 12, the speaker clustering unit 36 includes a feature cluster updating unit 361 and a time cluster updating unit 362.

As shown in FIG. 13, the speaker clustering unit 36 clusters the set of customer speaker features to obtain the feature clusters, obtains the time clusters based on the time-direction centroids of the feature clusters, updates the feature clusters based on the feature-direction centroids of the time clusters, and updates the time clusters based on the time-direction centroids of the updated feature clusters. It repeats these updates until the time-direction centroids converge (S36). Step S36 is described more specifically below with reference to FIG. 14.

まず特徴クラスタ更新部３６１は、クラスタリング対象である顧客話者特徴量の集合V={I_k|SP'_k=0（k=1,2,3,・・・,K）}を生成する（Ｓ３６１−１）。 First, the feature cluster updating unit 361 generates a set V = {I _k | SP ′ _k = 0 (k = 1, 2, 3,..., K)} of the customer speaker feature amounts to be clustered (K = 1, 2, 3,..., K)}. S361-1).

次に時間クラスタ更新部３６２は、顧客時間情報の集合U={T_k|SP'_k=0（k=1,2,3,・・・,K）}を生成する（Ｓ３６２−１）。 Next, the time cluster updating unit 362 generates a set of customer time information U = {T _k | SP ′ _k = 0 (k = 1, 2, 3,..., K)} (S362-1).

次に特徴クラスタ更新部３６１は、実施例１と同様に、Vを特徴方向にクラスタリングし、特徴クラスタの集合V₁,V₂,…,V_Hを生成し、各特徴クラスタの特徴方向のセントロイドc₁,c₂,…,c_Hを求める（Ｓ３６１−２）。 Next feature cluster updating unit 361, in the same manner as in Example 1, clustering V to the feature direction, set V _1, V ₂ feature cluster, ..., and generates a V _H, the characteristic directions of the characteristic cluster St. Lloyd c _1, c _2, ..., determine the c _H (S361-2).

次に時間クラスタ更新部３６２は、特徴クラスタの集合V₁,V₂,…,V_Hを用いて、時間方向のセントロイドt₁,t₂,…,t_Hを求める。具体的には、下記のように求める。 Next time cluster update unit 362, the set V _1, V ₂ feature cluster, ..., using a V _H, St. time direction Lloyd t _1, t _2, ..., obtaining the t _H. Specifically, it is obtained as follows.

Further, the time cluster updating unit 362 obtains the time-direction speaker boundaries $r_{h,h+1}$ from the time-direction centroids $t_1, t_2, \ldots, t_H$ of the feature clusters and generates the set of time clusters $U_1, U_2, \ldots, U_H$ (S362-2). For example, a time-direction speaker boundary can be given as the midpoint between adjacent time-direction centroids, as described below, or can be obtained by maximizing the number of sections whose feature-cluster membership agrees with their time-cluster membership.

(1) Method based on the midpoint in the time direction
(1-1) The boundary in the time direction is obtained as follows.
$$r_{h,h+1} = \frac{t_h + t_{h+1}}{2}$$
(1-2) The time cluster is obtained as follows.
$$U_h = \{ T_k \mid T_k \ge r_{h,h+1} \land T_k < r_{h+1,h+2} \}$$
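The midpoint method of (1) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, the time-direction centroids are sorted before use, clusters are indexed from 0, and the first and last time clusters are treated as open-ended. The sketch places each time cluster between the boundary on its left and the boundary on its right, so that cluster $h$ contains its own centroid $t_h$; the indices as printed in (1-2) would shift each interval by one cluster.

```python
def midpoint_boundaries(time_centroids):
    """r_{h,h+1} = (t_h + t_{h+1}) / 2 for time-direction centroids sorted in time."""
    t = sorted(time_centroids)
    return [(a + b) / 2.0 for a, b in zip(t, t[1:])]


def assign_time_clusters(times, boundaries):
    """Return, for each T_k, the 0-based index of the time cluster containing it.

    A section lying exactly on a boundary goes to the later cluster (T_k >= r).
    """
    labels = []
    for t_k in times:
        h = sum(1 for r in boundaries if t_k >= r)  # boundaries at or before T_k
        labels.append(h)
    return labels


# Hypothetical centroids (in seconds) for three customers served one after another.
boundaries = midpoint_boundaries([10.0, 50.0, 120.0])
labels = assign_time_clusters([5.0, 29.0, 30.0, 84.9, 85.0, 200.0], boundaries)
```

Here `boundaries` is `[30.0, 85.0]` and `labels` is `[0, 0, 1, 1, 2, 2]`.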

(2) Method based on maximizing the number of matches
(2-1) The boundary in the time direction is obtained as follows.
$$r_{h,h+1} = \operatorname*{arg\,max}_{t} \{ |g_h(t)| + |g'_h(t)| \}$$
Here, $g_h(t) = \{ k \mid I_k \in V_h \land T_k \le t\ (k = 1, 2, \ldots, K) \}$ and $g'_h(t) = \{ k \mid I_k \in V_{h+1} \land T_k > t\ (k = 1, 2, \ldots, K) \}$. That is, $|g_h(t)|$ is the number of speech sections at or before the time boundary $t$ that belong to the feature cluster $V_h$, and $|g'_h(t)|$ is the number of speech sections after $t$ that belong to the feature cluster $V_{h+1}$.
(2-2) The time cluster is obtained in the same manner as in (1-2).
$$U_h = \{ T_k \mid T_k \ge r_{h,h+1} \land T_k < r_{h+1,h+2} \}$$
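The match-maximizing boundary of (2-1) can be illustrated with a direct search. This is a sketch under assumptions: the function name is hypothetical, feature clusters are indexed from 0 in time order, candidate boundaries are restricted to the observed times $T_k$ (the score only changes at those values), and ties are resolved in favor of the earliest candidate, a choice the text does not specify.

```python
def match_maximizing_boundary(times, feature_labels, h):
    """Choose the boundary t that maximizes |g_h(t)| + |g'_h(t)|.

    g_h(t):  sections in feature cluster h     with T_k <= t
    g'_h(t): sections in feature cluster h + 1 with T_k >  t
    """
    best_t, best_score = None, -1
    for t in sorted(set(times)):
        g = sum(1 for t_k, v in zip(times, feature_labels) if v == h and t_k <= t)
        g_prime = sum(1 for t_k, v in zip(times, feature_labels) if v == h + 1 and t_k > t)
        if g + g_prime > best_score:  # strict '>' keeps the earliest maximizer on ties
            best_t, best_score = t, g + g_prime
    return best_t, best_score


# Hypothetical data: the section at T = 6.0 was placed in cluster 0 by feature clustering
# although it lies among the sections of cluster 1.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
feature_labels = [0, 0, 0, 1, 1, 0, 1, 1]
boundary, score = match_maximizing_boundary(times, feature_labels, h=0)
```

With this data the boundary is 3.0 and seven of the eight sections agree with it; the isolated section at 6.0 is the one that a time cluster would reassign.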

Next, the feature cluster updating unit 361 obtains initial feature-direction centroids $c_1, c_2, \ldots, c_H$ from the time clusters, performs clustering in the feature direction, and updates the feature-direction centroids $c_1, c_2, \ldots, c_H$ and the set of feature clusters $V_1, V_2, \ldots, V_H$ (S361-3). The feature cluster updating unit 361 obtains the initial feature-direction centroids as follows.
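The expression for the initial feature-direction centroid is not reproduced in this text. A plausible form, given here only as an assumption, is the mean of the speaker features of the divided voice sections whose time information falls in the time cluster $U_h$:

```latex
c_h = \frac{1}{|U_h|} \sum_{k \,:\, T_k \in U_h} I_k , \qquad h = 1, 2, \ldots, H
```

Seeding the feature-direction clustering this way carries the time-direction grouping into the next feature-direction update.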

Next, the time cluster updating unit 362 obtains the time-direction speaker boundaries $r_{h,h+1}$ from the time-direction centroids $t_1, t_2, \ldots, t_H$ of the feature clusters and updates the set of time clusters $U_1, U_2, \ldots, U_H$ (S362-3).

Steps S361-3 and S362-3 are executed alternately and repeatedly until the time-direction centroids $t_h$ converge and no longer change.
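The alternation of steps S361-2 through S362-3 can be sketched end to end. The following is a minimal illustration under several assumptions, none of which come from the text: plain k-means stands in for the feature-direction clustering, the time-direction centroid and the initial feature-direction centroid are both taken to be means, the midpoint method supplies the boundaries, empty clusters are simply dropped, and all function names and sample values are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(X, centroids, iters=20):
    """Plain k-means in the feature direction, started from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(len(centroids)), key=lambda h: math.dist(x, centroids[h]))
                      for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        for h in range(len(centroids)):
            members = [x for x, l in zip(X, labels) if l == h]
            if members:  # an empty cluster keeps its previous centroid
                centroids[h] = mean(members)
    return labels


def time_centroids(T, labels):
    """Sorted mean time of each non-empty feature cluster (assumed form of t_h)."""
    groups = {}
    for t, l in zip(T, labels):
        groups.setdefault(l, []).append(t)
    return sorted(sum(ts) / len(ts) for ts in groups.values())


def time_labels(T, t_cent):
    """Midpoint boundaries between adjacent t_h, then interval membership of each T_k."""
    bounds = [(a + b) / 2.0 for a, b in zip(t_cent, t_cent[1:])]
    return [sum(1 for r in bounds if t >= r) for t in T]


def alternate_clustering(X, T, init_centroids, max_rounds=50):
    feat = kmeans(X, init_centroids)              # S361-2
    t_cent = time_centroids(T, feat)
    u = time_labels(T, t_cent)                    # S362-2
    for _ in range(max_rounds):
        # S361-3: seed feature clustering with the mean feature of each time cluster
        init = [mean([x for x, l in zip(X, u) if l == h]) for h in sorted(set(u))]
        feat = kmeans(X, init)
        new_t_cent = time_centroids(T, feat)      # S362-3
        u = time_labels(T, new_t_cent)
        if new_t_cent == t_cent:                  # time-direction centroids stopped changing
            break
        t_cent = new_t_cent
    return u


# Hypothetical data: two customers; the section at T = 2.0 has a feature close to the second.
X = [[0.0], [0.2], [4.0], [0.1], [0.3], [5.0], [5.2], [4.9], [5.1]]
T = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
result = alternate_clustering(X, T, init_centroids=[[0.0], [5.0]])
```

In this example feature-direction clustering alone groups the section at T = 2.0 with the second customer, whereas the returned time-cluster labels `[0, 0, 0, 0, 0, 1, 1, 1, 1]` place it with the first.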

Thereafter, the time cluster updating unit 362 reflects the set of time clusters $U_1, U_2, \ldots, U_H$ in the set of feature clusters $V_1, V_2, \ldots, V_H$ ($V_h = \{ I_k \mid T_k \in U_h\ (k = 1, 2, \ldots, K) \}$) and generates the tertiary determination information $SP''_k$ (S362-4). The time cluster updating unit 362 generates $SP''_k$, for example, as follows.

Here, the function used in generating $SP''_k$ indicates the feature cluster to which $I_k$ belongs, that is, the $V_h$ ($h = 1, 2, \ldots, H$) for which $I_k \in V_h$.

As in the other embodiments, the alternating speaker clustering of this embodiment may determine the number of speakers automatically by applying the x-means method. Embodiments 2 and 3 may also be combined so that clustering in the feature direction and clustering in the time direction are performed alternately on the set of customer combined speaker features.

According to the speaker determination device 3 of this embodiment, erroneous determinations can be reduced by alternately performing clustering in the feature direction and clustering in the time direction. The time-direction clustering rests on the assumption that customers do not change frequently during service at a counter. Because this assumption holds in most cases, performing one-dimensional clustering in the time direction improves the accuracy of speaker determination.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device that is a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity provided with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above, the data needed for the processing of this program, and the like (the program need not be stored in the external storage device; for example, it may be stored in a ROM, which is a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data needed for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as ... unit, ... means, and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already described, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Each step described in the specification and the claims corresponds to a step of a method for generating various kinds of information. Because the various kinds of information referred to here fall under "programs, etc." as defined in Article 2, Paragraph 4 of the Patent Act (programs ... and other information that is provided for processing by an electronic computer and is equivalent to a program), they fall under "products" as defined in Article 2, Paragraph 3, Item 1 of the Patent Act. Accordingly, it goes without saying that the method of generating the various kinds of information described in the specification and the claims is a method of producing a product as defined in Article 2, Paragraph 3, Item 3 of the Patent Act.

Claims

A speaker determination device that determines a speaker from an audio signal in which a conversation between a contact person and a customer is recorded, the device comprising:
a similarity calculation unit that calculates a similarity between a speaker feature of each divided voice section, obtained by dividing a voice section of the audio signal into sections of a predetermined time length, and a speaker feature generated in advance for each contact person;
a speaker primary determination unit that generates, from the similarity, primary determination information representing a speaker ID of each divided voice section;
a speaker secondary determination unit that generates secondary determination information by setting the speaker ID of a neighboring speaker as the secondary determination information of an arbitrary divided voice section when the similarity between the speaker feature of the neighboring speaker, who is the speaker identified most often in a predetermined number of divided voice sections before or after the arbitrary divided voice section, and the speaker feature of the arbitrary divided voice section satisfies a predetermined condition; and
a speaker clustering unit that clusters a set of customer speaker features, which is the set of speaker features of the divided voice sections whose secondary determination information indicates the customer, generates speaker IDs of customers, and generates tertiary determination information.

The speaker determination device according to claim 1, wherein
the speaker clustering unit
clusters a set of customer combined speaker features obtained by combining the customer speaker features with the time information corresponding to them.

The speaker determination device according to claim 1, wherein
the speaker clustering unit
clusters the set of customer speaker features to obtain feature clusters, obtains time clusters based on the time-direction centroid of each feature cluster, updates each feature cluster based on the feature-direction centroid of each time cluster, updates each time cluster based on the time-direction centroid of each updated feature cluster, and repeatedly executes the updates until the time-direction centroids converge.

The speaker determination device according to claim 2, wherein
the speaker clustering unit
clusters the set of customer combined speaker features to obtain feature clusters, obtains time clusters based on the time-direction centroid of each feature cluster, updates each feature cluster based on the feature-direction centroid of each time cluster, updates each time cluster based on the time-direction centroid of each updated feature cluster, and repeatedly executes the updates until the time-direction centroids converge.

A speaker determination information generation method for determining a speaker from a voice signal in which a conversation between a contact person and a customer is recorded, the method comprising:
a similarity calculation step of calculating a similarity between a speaker feature of each divided voice section, obtained by dividing a voice section of the voice signal into sections of a predetermined time length, and a speaker feature generated in advance for each contact person;
a speaker primary determination step of generating, from the similarity, primary determination information representing a speaker ID of each divided voice section;
a speaker secondary determination step of generating secondary determination information by setting the speaker ID of a neighboring speaker as the secondary determination information of an arbitrary divided voice section when the similarity between the speaker feature of the neighboring speaker, who is the speaker identified most often in a predetermined number of divided voice sections before or after the arbitrary divided voice section, and the speaker feature of the arbitrary divided voice section satisfies a predetermined condition; and
a speaker clustering step of clustering a set of customer speaker features, which is the set of speaker features of the divided voice sections whose secondary determination information indicates the customer, generating speaker IDs of customers, and generating tertiary determination information,
wherein the steps are executed by a speaker determination device.

The speaker determination information generation method according to claim 5, wherein
the speaker clustering step
clusters a set of customer combined speaker features obtained by combining the customer speaker features with the time information corresponding to them.

The speaker determination information generation method according to claim 5, wherein
the speaker clustering step
clusters the set of customer speaker features to obtain feature clusters, obtains time clusters based on the time-direction centroid of each feature cluster, updates each feature cluster based on the feature-direction centroid of each time cluster, updates each time cluster based on the time-direction centroid of each updated feature cluster, and repeatedly executes the updates until the time-direction centroids converge.

A program for causing a computer to function as the speaker determination device according to any one of claims 1 to 4.