JP6927308B2

JP6927308B2 - Voice control device and its control method

Info

Publication number: JP6927308B2
Application number: JP2019532562A
Authority: JP
Inventors: 宗忠安諸; 正典溝口
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2017-07-26
Filing date: 2018-07-20
Publication date: 2021-08-25
Anticipated expiration: 2038-07-20
Also published as: US11961534B2; JP2021184282A; JP7314975B2; EP3660842A4; US20200202886A1; JPWO2019021953A1; EP3660842A1; WO2019021953A1

Description

The present invention relates to a voice operation device having a speaker identification function.

In recent years, AI speakers (also called smart speakers), which have a voice assistant function that can be operated by voice, have been attracting attention. With such a voice operation device, the user can search for desired information and operate linked home appliances by voice, and can therefore perform the desired operation hands-free while doing housework, cleaning, laundry, or other tasks. In addition, because a voice operation device can be realized at low cost with a simple configuration whose minimum components are a microphone, a speaker, and a processor for voice recognition, such devices can easily be placed in multiple rooms.

The voice operation device described in Patent Document 1 has, in addition to a voice recognition function for recognizing a voice operation by a user, a speaker identification function for identifying who the speaker is from the voice quality of the voice. This makes it possible to provide services that match the user's preferences.

Japanese Unexamined Patent Publication No. 2010-286702

However, although research and development of speaker identification technology is progressing, identification accuracy is still not high. For AI speakers in particular, it is desirable that the operation voice be free-form rather than a fixed phrase in order to prevent spoofing with a recorded operation voice, but identifying the speaker is especially difficult when the operation voice is free-form and short. In addition, the voice quality of a user's voice changes greatly depending on the environment, such as the distance to the sound source, the ambient temperature, and reflections, so it is difficult to identify the speaker merely by comparing the operation voice with a single pre-registered voice quality model of the user. An object of the present invention is therefore to provide a voice operation device capable of further improving the accuracy of speaker identification, and a control method thereof.

According to one aspect of the present invention, there is provided a voice operation device comprising a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a pre-registered voice quality model of the user, and a voice operation recognition unit that performs voice recognition on the voice information to generate voice operation information, wherein the speaker identification unit identifies the speaker by using, as auxiliary information, at least one of the voice operation information, position information of the voice operation device, direction information of the speaker, distance information of the speaker, and time information.

According to another aspect of the present invention, there is provided a control method of a voice operation device comprising a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a pre-registered voice quality model of the user, and a voice operation recognition unit that performs voice recognition on the voice information to generate voice operation information, the method having a step in which the speaker identification unit identifies the speaker by using, as auxiliary information, at least one of the voice operation information, position information of the voice operation device, direction information of the speaker, distance information of the speaker, and time information.

According to the present invention, it is possible to provide a voice operation device capable of further improving the accuracy of speaker identification, and a control method thereof.

A block diagram schematically showing the configuration of the voice operation device according to the first embodiment.
A diagram showing an arrangement example of the voice operation device according to the present invention.
A flowchart showing the control method of the voice operation device according to the first embodiment.
A block diagram schematically showing the configuration of the voice operation device according to the second embodiment.
A flowchart showing the control method of the voice operation device according to the second embodiment.
A block diagram schematically showing the configuration of the voice operation device according to the third embodiment.
A block diagram schematically showing the configuration of the voice operation device according to the fourth embodiment.
A block diagram schematically showing the configuration of the voice operation device according to the fifth embodiment.
A block diagram schematically showing the configuration of the voice operation device according to the sixth embodiment.

Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings. The present invention is not limited to the following embodiments and can be modified as appropriate without departing from its gist. In the figures, elements having the same or corresponding functions are given the same reference numerals, and their description may be omitted or simplified.

(First Embodiment)
FIG. 1 is a block diagram schematically showing the configuration of the voice operation device according to the first embodiment. FIG. 1 shows an example in which the voice operation device of this embodiment is applied to an AI speaker. The figure is only an example, and the voice operation device of this embodiment can also be applied to, for example, a robot, a smartphone, or a car navigation system capable of recognizing a voice operation by a user. The voice operation device of this embodiment includes a speaker 1, a microphone 2, a voice output unit 11, a voice input unit 21, and a control calculation unit 3. In a typical AI speaker, these components are housed in the housing of the speaker 1.

The control calculation unit 3 is a processor such as a CPU that executes a program stored in a storage medium and performs control of, and calculation for, each component of the voice operation device of this embodiment. The control calculation unit 3 includes a storage medium (not shown) in which the program is stored. The control calculation unit 3 has a voice operation response unit 31, a voice operation recognition unit 32, a speaker identification unit 33, a voice quality model 331, and a wireless communication unit 35.

The microphone 2 converts the voice vibration produced by the user's voice operation into an electric signal. The voice input unit 21 converts the analog signal input from the microphone 2 into digital voice information and applies voice signal processing to that voice information. The voice input unit 21 may be included in the microphone 2.

The voice operation recognition unit 32 performs voice recognition on the voice information input from the voice input unit 21 and generates voice operation information. The control calculation unit 3 executes the operation process corresponding to the generated voice operation information. Specifically, it performs the desired information search or the operation of linked home appliances via the wireless communication unit 35.

The voice operation response unit 31 then notifies the user of the execution result of the operation process via the voice output unit 11 and the speaker 1. Specifically, the voice output unit 11 converts the digital information input from the voice operation response unit 31 into an analog signal. The speaker 1 converts the electric signal input from the voice output unit 11 into voice vibration and outputs it. The voice output unit 11 may be included in the speaker 1.

Registered user information 330 is stored in the storage medium of the control calculation unit 3. The user's voice quality model 331 is registered in advance in the registered user information 330. The speaker identification unit 33 calculates the similarity between the voice information input from the voice input unit 21 and each user's voice quality model 331, and identifies the user with the highest similarity as the speaker of the voice operation. This makes it possible to provide services according to the user's preferences and further improves convenience for the user. For example, information on each user's preferences can be registered in advance in the registered user information 330 so that the operation process corresponding to the voice operation information is executed according to the identified user's preferences, or voice operations by anyone other than the users registered in advance in the registered user information 330 can be prohibited.

However, as described above, although research and development of speaker identification technology is progressing, identification accuracy is still not high. The speaker identification unit 33 of the present invention is therefore characterized in that it identifies the speaker by using, as auxiliary information, at least one of the voice operation information, position information of the voice operation device, direction information of the speaker, distance information of the speaker, and time information. For example, in the voice operation device of this embodiment shown in FIG. 1, the voice operation recognition unit 32 is used as the auxiliary information calculation unit 34 that calculates the auxiliary information.

FIG. 2 is a diagram showing an arrangement example of the voice operation device according to the present invention. The voice operation device according to the present invention is placed at the position of the microphone 2 shown in FIG. 2. The users 61 and 62 in the room of FIG. 2 are assumed to be, for example, a mother and a father relaxing in the living room at home.

In this case, the speaker identification unit 33 of this embodiment determines, for example, that the speaker of the voice operation is likely to be the mother when the instruction given by the voice operation is to obtain weather information. Alternatively, when the instruction given by the voice operation is to obtain stock price information, it determines that the speaker is likely to be the father. The speaker identification unit 33 then corrects the similarity to each user's voice quality model 331 according to the voice operation information.

Information such as keywords characteristic of a user may be registered in advance in the registered user information 330 in association with keywords contained in the voice operation information, or it may be learned as needed from the frequency with which each user uses each keyword. When the voice operation device learns, a similarity correction value is registered in advance in the registered user information 330 for each user, and the result of identifying the speaker is reflected in that correction value. The speaker identification unit 33 thereby learns the correlation between the auxiliary information characteristic of each user and the similarity correction value, so the accuracy of speaker identification can improve each time an operation process corresponding to a voice operation is performed.
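The learning described above could be sketched as follows. This is only an illustration under stated assumptions: the class name `KeywordCorrectionLearner`, the `max_bonus` parameter, and the rule that the correction value is proportional to each user's share of a keyword's past use are all hypothetical and are not specified in the description.

```python
from collections import defaultdict


class KeywordCorrectionLearner:
    """Hypothetical learner: derives a per-user similarity correction
    value from how often each identified user has used each keyword."""

    def __init__(self, max_bonus=5.0):
        # max_bonus is an assumed upper limit for the correction value.
        self.max_bonus = max_bonus
        # counts[keyword][user] = number of past operations by that user
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, user, keywords):
        """Reflect one identification result in the stored statistics."""
        for keyword in keywords:
            self.counts[keyword][user] += 1

    def bonus(self, user, keyword):
        """Correction value for (user, keyword); 0.0 if never observed."""
        per_user = self.counts.get(keyword)
        if not per_user:
            return 0.0
        total = sum(per_user.values())
        return self.max_bonus * per_user.get(user, 0) / total
```

After three weather requests attributed to the mother and one to the father, `bonus("mother", "weather")` is 3.75 and `bonus("father", "weather")` is 1.25, so the correction gradually follows observed usage.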

FIG. 3 is a flowchart showing the control method of the voice operation device according to the first embodiment. In step S101, the microphone 2 and the voice input unit 21 convert the voice vibration produced by the user's voice operation into an electric signal. In step S102, the voice operation recognition unit 32 performs voice recognition on the voice information input from the microphone 2 and the voice input unit 21 and generates voice operation information. A well-known technique can be used as the voice recognition method performed by the voice operation recognition unit 32.

In step S103, the speaker identification unit 33 reads out the user's voice quality model 331 registered in advance in the registered user information 330, together with the registered user information 330. In step S104, the speaker identification unit 33 calculates the similarity between the voice information input from the voice input unit 21 and the user's voice quality model 331.

A well-known technique can be used as the speaker recognition method performed by the speaker identification unit 33. For example, the distance $s$ between the voice waveform $f_0$ of the user's voice quality model 331 registered in advance in the registered user information 330 and the voice waveform $f_1$ of the voice information input from the voice input unit 21 is obtained by equation (1) below, and its reciprocal $1/s$ is used as the similarity. Here, $\Sigma$ denotes a sum over a plurality of times $t$ within a predetermined period. The voice waveforms $f_0$ and $f_1$ are assumed to be appropriately normalized.

$$ s^2 = \sum_{t} \{ f_1(t) - f_0(t) \}^2 \qquad (1) $$
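Equation (1) can be illustrated with a short sketch. The description only says the waveforms are "appropriately normalized", so the unit-energy normalization below is an assumption, as are the function names and the choice to return infinity when the two waveforms coincide.

```python
import math


def normalize(wave):
    """Scale a waveform to unit energy (assumed normalization)."""
    energy = math.sqrt(sum(x * x for x in wave))
    return [x / energy for x in wave] if energy > 0 else list(wave)


def waveform_similarity(f1, f0):
    """Return 1/s, where s^2 = sum_t {f1(t) - f0(t)}^2 as in equation (1).

    f1: input voice waveform, f0: registered voice quality model waveform,
    both sampled at the same times t.
    """
    if len(f1) != len(f0):
        raise ValueError("waveforms must cover the same time samples")
    n1, n0 = normalize(f1), normalize(f0)
    s = math.sqrt(sum((a - b) ** 2 for a, b in zip(n1, n0)))
    # Identical waveforms give s = 0; treat that as unbounded similarity.
    return math.inf if s == 0 else 1.0 / s
```

A waveform close to the registered one yields a small distance and therefore a large similarity, which is what the later selection of the most similar user relies on.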

Alternatively, the distance $s$ between the frequency spectrum $F_0$ of the user's voice quality model 331 registered in advance in the registered user information 330 and the frequency spectrum $F_1$ of the voice information input from the voice input unit 21 may be obtained by equation (2) below, and its reciprocal $1/s$ used as the similarity. Here, $\Sigma$ denotes a sum over a plurality of frequencies $k$ within a predetermined range. The frequency spectra $F_0$ and $F_1$ are assumed to be appropriately normalized.

$$ s^2 = \sum_{k} \{ F_1(k) - F_0(k) \}^2 \qquad (2) $$
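A minimal sketch of equation (2) follows. The description does not say how the spectra are obtained; here a plain discrete Fourier transform magnitude is assumed, with unit-energy normalization, and the function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitude of the DFT for frequency bins k = 0 .. n/2 (assumed front end)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def normalize(spectrum):
    """Scale a spectrum to unit energy (assumed normalization)."""
    energy = math.sqrt(sum(v * v for v in spectrum))
    return [v / energy for v in spectrum] if energy > 0 else list(spectrum)


def spectral_similarity(F1, F0):
    """Return 1/s, where s^2 = sum_k {F1(k) - F0(k)}^2 as in equation (2)."""
    if len(F1) != len(F0):
        raise ValueError("spectra must cover the same frequency bins")
    s = math.sqrt(sum((a - b) ** 2
                      for a, b in zip(normalize(F1), normalize(F0))))
    return math.inf if s == 0 else 1.0 / s
```

One practical difference from the waveform comparison is that the magnitude spectrum does not change when the signal is circularly shifted in time, so this variant is less sensitive to where the utterance starts within the analysis window.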

In step S105, the speaker identification unit 33 corrects the similarity obtained by equation (1) or (2) above according to the keywords contained in the voice operation information. For example, when the instruction given by the voice operation is to obtain weather information, 5 points are added to the similarity 1/s for the mother's voice quality model 331. Alternatively, when the instruction given by the voice operation is to obtain stock price information, 5 points are added to the similarity 1/s for the father's voice quality model 331. The magnitude of the similarity correction value can be determined as appropriate for each keyword according to the application, the users, and so on.

In step S106, the speaker identification unit 33 identifies the user with the highest corrected similarity as the speaker of the voice operation. In step S107, the control calculation unit 3 executes the operation process corresponding to the voice operation information according to the preferences of the identified user. In step S108, the voice operation response unit 31 notifies the user by voice, via the speaker 1 and the voice output unit 11, of the result of the process executed in step S107.
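The order of steps S102 to S108 can be summarized in a small sketch. Every callable passed in (`recognize`, `similarity`, `correct`, `execute`, `notify`) is a hypothetical stand-in for the corresponding unit of the device; only the sequence of steps is taken from the flowchart description.

```python
def handle_voice_operation(voice_info, registered_models, recognize,
                           similarity, correct, execute, notify):
    """Run one voice operation through steps S102 to S108.

    registered_models: {user: voice quality model} (registered user information).
    The remaining arguments are injected callables (assumed interfaces).
    """
    operation = recognize(voice_info)                     # S102: voice recognition
    scores = {user: similarity(voice_info, model)         # S103-S104: read models,
              for user, model in registered_models.items()}  # compute similarity
    scores = correct(scores, operation)                   # S105: auxiliary correction
    speaker = max(scores, key=scores.get)                 # S106: most similar user
    result = execute(operation, speaker)                  # S107: run for that user
    notify(result)                                        # S108: report the result
    return speaker, result
```

Passing the units in as functions keeps the sketch independent of any particular recognizer or similarity measure.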

As described above, in the voice operation device of this embodiment, the registered user information includes keywords characteristic of each user, and the speaker identification unit corrects the similarity according to the keywords contained in the voice operation information. Such a configuration makes it possible to provide a voice operation device capable of further improving the accuracy of speaker identification, and a control method thereof.

(Second Embodiment)
FIG. 4 is a block diagram schematically showing the configuration of the voice operation device according to the second embodiment. The voice operation device of this embodiment differs from the first embodiment in that it includes a GPS device 41 and a position calculation unit 341. The other components are substantially the same as in the first embodiment. The following mainly describes the differences from the first embodiment.

The GPS device 41 can acquire GPS (Global Positioning System) information. The position calculation unit 341 uses the GPS information input from the GPS device 41 to calculate position information indicating where the voice operation device is located. The position calculation unit 341 may be included in the GPS device 41. The voice operation device of this embodiment is thus characterized in that it includes the position calculation unit 341 as the auxiliary information calculation unit 34 that calculates the auxiliary information.

As described above, the voice quality of a user's voice changes greatly depending on the environment, such as the distance to the sound source, the ambient temperature, and reflections, so it is difficult to identify the speaker merely by comparing the operation voice with a single pre-registered voice quality model 331 of the user.

The registered user information 330 of this embodiment therefore holds a plurality of voice quality models 331 for each user as a database. The speaker identification unit 33 selects a voice quality model 331 according to the position information of the voice operation device calculated by the position calculation unit 341. For example, when the voice operation device is located at home, the voice quality model 331 suited to the home environment is selected. When the voice operation device is located at the workplace, the voice quality model 331 suited to the workplace environment is selected. When the voice operation device is located in a moving vehicle, the voice quality model 331 suited to the in-vehicle environment is selected. These voice quality models 331 are registered in advance in the registered user information 330 in association with position information of the voice operation device.
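One way to picture the position-based selection is the sketch below. The description only states that models are registered in association with position information; the dictionary layout, the great-circle (haversine) distance, the 200 m threshold, and the fallback of returning `None` when no registered place is close enough are all assumptions.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def select_voice_model(models, device_lat, device_lon, max_distance_m=200.0):
    """Pick the model registered nearest to the device position.

    models: list of {"label", "lat", "lon", "model"} entries for one user
    (hypothetical layout). Returns the "model" value, or None if the nearest
    registered place is farther than max_distance_m (assumed threshold).
    """
    if not models:
        return None
    nearest = min(models, key=lambda m: haversine_m(device_lat, device_lon,
                                                    m["lat"], m["lon"]))
    distance = haversine_m(device_lat, device_lon, nearest["lat"], nearest["lon"])
    return nearest["model"] if distance <= max_distance_m else None
```

A moving vehicle has no fixed coordinates, so an in-vehicle model would need a different trigger (for example, a change in position over time); that case is left out of this sketch.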

As a result, even when the voice operation device of this embodiment is applied to a device that can move, such as a robot, a smartphone, or a car navigation system, a voice quality model 331 suited to the surrounding environment is selected, and the accuracy of speaker identification can be further improved.

FIG. 5 is a flowchart showing the control method of the voice operation device according to the second embodiment. The flowchart of this embodiment shown in FIG. 5 differs from FIG. 3 of the first embodiment in that, when the voice quality model 331 is read out in step S203, the voice quality model 331 is selected according to the position information of the voice operation device. The flowchart of FIG. 5 omits the similarity correction process shown in step S105 of FIG. 3 of the first embodiment, but step S105 may also be performed in this embodiment, as in FIG. 3.

As described above, the voice operation device of this embodiment further includes a GPS device that acquires the position of the voice operation device as position information. The registered user information holds a plurality of voice quality models for each user, and the speaker identification unit selects a voice quality model according to the position information. Such a configuration also makes it possible to provide a voice operation device capable of further improving the accuracy of speaker identification, and a control method thereof.

(Third Embodiment)
FIG. 6 is a block diagram schematically showing the configuration of the voice operation device according to the third embodiment. The voice operation device of this embodiment differs from the first embodiment in that it includes an array microphone 42 and a voice direction calculation unit 342. The other components are substantially the same as in the first embodiment. The following mainly describes the differences from the first embodiment.

The array microphone 42 has a configuration in which a plurality of microphones are arranged, and can determine the direction of the speaker of a voice operation. The voice direction calculation unit 342 calculates direction information of the speaker as seen from the voice operation device, based on the information input from the array microphone 42. The voice direction calculation unit 342 may be included in the array microphone 42. The voice operation device of this embodiment is thus characterized in that it includes the voice direction calculation unit 342 as the auxiliary information calculation unit 34 that calculates the auxiliary information.

For example, suppose that in the room shown in FIG. 2 the chair on the left as seen from the voice operation device is the mother's seat and the chair on the right is the father's seat. In this case, the speaker identification unit 33 of this embodiment determines, for example, that the speaker of the voice operation is likely to be the mother when the speaker's direction is to the left. Alternatively, when the speaker's direction is to the right, it determines that the speaker is likely to be the father. Such information characteristic of each user is registered in advance in the registered user information 330 in association with direction information.
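The use of direction information could be sketched as below. The description speaks only of "left" and "right"; representing seat directions as angles in degrees, the 30-degree tolerance, and the 5-point bonus (borrowed from the keyword example of the first embodiment) are assumptions, as are the function names.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def correct_by_direction(similarities, measured_deg, seat_directions,
                         tolerance_deg=30.0, bonus=5.0):
    """Add `bonus` to users whose registered seat direction is within
    `tolerance_deg` of the measured speaker direction.

    seat_directions: {user: bearing in degrees} (hypothetical registration).
    """
    corrected = dict(similarities)
    for user, seat_deg in seat_directions.items():
        if user in corrected and angular_difference(measured_deg, seat_deg) <= tolerance_deg:
            corrected[user] += bonus
    return corrected
```

Handling the wrap-around at 360 degrees matters here: bearings of 350 and 10 degrees are only 20 degrees apart.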

For the voice operation method in the voice operation device of this embodiment, both the flowchart of the first embodiment shown in FIG. 3 and the flowchart of the second embodiment shown in FIG. 5 are applicable, except that the auxiliary information is direction information.

As described above, the voice operation device of this embodiment further includes an array microphone that acquires the direction of the speaker of a voice operation as direction information, and the speaker identification unit identifies the speaker using the direction information. Such a configuration also makes it possible to provide a voice operation device capable of further improving the accuracy of speaker identification, and a control method thereof. Instead of additionally providing the array microphone 42, the microphone 2 may be made up of a plurality of microphones so as to determine the direction of the speaker of the voice operation.

(Fourth Embodiment)
FIG. 7 is a block diagram schematically showing the configuration of the voice operation device according to the fourth embodiment. The voice operation device of this embodiment differs from the first embodiment in that it includes a distance measuring sensor 43 and a distance calculation unit 343. The other components are substantially the same as in the first embodiment. The following mainly describes the differences from the first embodiment.

The distance measuring sensor 43 can acquire the distance to the speaker. As the distance measuring method, a well-known technique can be used, for example a method that measures distance using parallax or a method that measures distance using reflected light or radio waves. The distance calculation unit 343 calculates the distance from the voice operation device to the speaker based on the information input from the distance measuring sensor 43. The distance calculation unit 343 may be included in the distance measuring sensor 43. The voice operation device of this embodiment is thus characterized in that it includes the distance calculation unit 343 as the auxiliary information calculation unit 34 that calculates the auxiliary information.

For example, suppose that in the room shown in FIG. 2 the chair farther from the voice operation device is the mother's seat and the chair closer to the voice operation device is the father's seat. In this case, the speaker identification unit 33 of this embodiment determines, for example, that the speaker of the voice operation is likely to be the mother when the distance from the voice operation device to the speaker is long. Alternatively, when the distance from the voice operation device to the speaker is short, it determines that the speaker is likely to be the father. Such information characteristic of each user is registered in advance in the registered user information 330 in association with distance information.
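A possible reading of the distance-based judgment is sketched below. The description only distinguishes "far" and "near"; registering a typical seat distance in metres for each user and giving a 5-point bonus to the user whose registered distance is closest to the measurement are assumptions made for illustration.

```python
def correct_by_distance(similarities, measured_m, seat_distances_m, bonus=5.0):
    """Add `bonus` to the user whose registered seat distance is closest
    to the measured distance to the speaker.

    seat_distances_m: {user: typical distance in metres} (hypothetical
    registration in the registered user information).
    """
    corrected = dict(similarities)
    gaps = {user: abs(measured_m - d)
            for user, d in seat_distances_m.items() if user in corrected}
    if gaps:
        nearest_user = min(gaps, key=gaps.get)
        corrected[nearest_user] += bonus
    return corrected
```

With the mother's seat registered at 3.5 m and the father's at 1.0 m, a measured distance of 3.2 m favors the mother even when the raw similarities slightly favor the father.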

For the voice operation method of the voice operation device of the present embodiment, both the flowchart of the first embodiment shown in FIG. 3 and the flowchart of the second embodiment shown in FIG. 5 are applicable, except that the auxiliary information is distance information.

As described above, the voice operation device of the present embodiment further includes a distance measuring sensor that acquires the distance to the speaker of the voice operation as distance information, and the speaker identification unit identifies the speaker using the distance information. This configuration also provides a voice operation device, and a control method thereof, that can further improve the accuracy of speaker identification.

(Fifth Embodiment)
FIG. 8 is a block diagram schematically showing the configuration of the voice operation device according to the fifth embodiment. The voice operation device of the present embodiment differs from the first embodiment in that it includes a clock 44 and a time calculation unit 344. The other components are substantially the same as in the first embodiment. The following description mainly covers the components that differ from the first embodiment.

The clock 44 can acquire the current time. The time calculation unit 344 calculates the utterance time of the voice operation based on the information input from the clock 44. The time calculation unit 344 may instead be included in the clock 44. Thus, the voice operation device of the present embodiment is characterized by including the time calculation unit 344 as the auxiliary information calculation unit 34 that calculates the auxiliary information.

The speaker identification unit 33 of the present embodiment determines, for example, that the speaker of the voice operation is likely to be the mother when the utterance time of the voice operation is in the daytime, and likely to be the father when the utterance time is late at night. Such user-characteristic information is registered in advance in the registered user information 330 in association with the time information.
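The daytime and late-night example can be sketched as follows. The time bands in `ACTIVE_BANDS`, the 5-point bonus, and the function names are hypothetical values chosen for illustration. The one detail worth showing is that a late-night band wraps past midnight, so a plain range comparison is not enough.

```python
from datetime import time


def in_band(t, start, end):
    """True if t falls in [start, end), allowing bands that wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# Hypothetical registered information: when each user usually speaks to the device.
ACTIVE_BANDS = {
    "mother": (time(9, 0), time(17, 0)),   # daytime
    "father": (time(22, 0), time(2, 0)),   # late night, wraps past midnight
}


def correct_by_time(similarity, utterance_time, bonus=5.0):
    """Return a copy of the scores with a bonus for users active at utterance_time."""
    corrected = dict(similarity)
    for name, (start, end) in ACTIVE_BANDS.items():
        if name in corrected and in_band(utterance_time, start, end):
            corrected[name] += bonus
    return corrected


late_night = correct_by_time({"mother": 70.0, "father": 70.0}, time(23, 30))
daytime = correct_by_time({"mother": 70.0, "father": 70.0}, time(12, 0))
```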

For the voice operation method of the voice operation device of the present embodiment, both the flowchart of the first embodiment shown in FIG. 3 and the flowchart of the second embodiment shown in FIG. 5 are applicable, except that the auxiliary information is time information.

As described above, the voice operation device of the present embodiment further includes a clock that acquires the utterance time of the voice operation as time information, and the speaker identification unit identifies the speaker using the time information. This configuration also provides a voice operation device, and a control method thereof, that can further improve the accuracy of speaker identification.

(Sixth Embodiment)
FIG. 9 is a block diagram schematically showing the configuration of the voice operation device according to the sixth embodiment. The voice operation device of the present embodiment combines the configurations of the embodiments described above. The auxiliary information acquisition device 40 shown in FIG. 9 includes at least one of the GPS device 41, the array microphone 42, the distance measuring sensor 43, and the clock 44 of the above embodiments. The auxiliary information calculation unit 34 shown in FIG. 9 includes, from among the voice operation recognition unit 32, the position calculation unit 341, the voice direction calculation unit 342, the distance calculation unit 343, and the time calculation unit 344 of the above embodiments, at least those components that can process the auxiliary information output by the auxiliary information acquisition device 40. The auxiliary information calculation unit 34 may instead be included in the auxiliary information acquisition device 40.

For the voice operation method of the voice operation device of the present embodiment, both of the flowcharts shown in FIG. 3 and FIG. 5 are applicable, except that the auxiliary information is a combination of the voice operation information, position information, direction information, distance information, and time information of the above embodiments.

As described above, the voice operation device of the present embodiment includes a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a voice quality model of the user registered in advance. The speaker identification unit identifies the speaker by using, as auxiliary information, at least one of the voice operation information, the position information of the voice operation device, the direction information of the speaker, the distance information of the speaker, and the time information. This configuration provides a voice operation device, and a control method thereof, that can further improve the accuracy of speaker identification.
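One way to picture the use of several kinds of auxiliary information together is to add up per-user point offsets from whichever readings are available. The sketch below is an assumption about how such a combination might be organized. The embodiment does not prescribe additive combination, and the `CORRECTORS` rules, thresholds, and point values are hypothetical.

```python
def combine_corrections(similarity, readings, correctors):
    """Add per-user point offsets from every auxiliary reading that is present."""
    corrected = dict(similarity)
    for kind, value in readings.items():
        corrector = correctors.get(kind)
        if value is None or corrector is None:
            continue  # sensor absent, or no rule registered for this kind
        for name, offset in corrector(value).items():
            if name in corrected:
                corrected[name] += offset
    return corrected


# Hypothetical rules, loosely following the examples in the embodiments.
CORRECTORS = {
    "distance_m": lambda d: {"mother": 5.0} if d >= 2.0 else {"father": 5.0},
    "hour": lambda h: {"mother": 5.0} if 9 <= h < 17 else {},
}

voice_only = {"mother": 68.0, "father": 72.0}  # illustrative scores
readings = {"distance_m": 3.0, "hour": 13, "direction_deg": None}
combined = combine_corrections(voice_only, readings, CORRECTORS)
```

Readings that are missing (here `direction_deg`) are simply skipped, which matches the "at least one of" wording: the device works with whatever subset of sensors it has.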

(Other Embodiments)
The embodiments described above are merely examples of how the present invention may be carried out, and the technical scope of the present invention should not be construed as limited by them. That is, the present invention can be implemented in various forms without departing from its technical idea or its main features.

For example, schedule information of the users may be registered in advance in the registered user information 330, and the similarity may be corrected according to the schedule information. As one example, when the schedule information indicates that the father is away on a business trip, 5 points are subtracted from the similarity 1/s between the voice information input from the voice input unit 21 and the father's voice quality model 331.

Alternatively, the voice volume characteristic of each user may be registered in advance in the registered user information 330, and the similarity may be corrected according to the volume of the voice of the voice operation. As one example, when the voice of the voice operation is loud, 5 points are added to the similarity 1/s between the voice information input from the voice input unit 21 and the father's voice quality model 331.
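The schedule and volume corrections described above can be sketched together. The -5 and +5 point values come from the text. The function name, the decibel threshold for "loud", and the way schedule and volume information are passed in are assumptions made for this illustration.

```python
def apply_schedule_and_loudness(similarity, away_users, loud_users,
                                loudness_db, loud_threshold_db=70.0):
    """Subtract 5 points for users whose schedule says they are away, and add
    5 points for users registered as loud talkers when the utterance is loud."""
    corrected = dict(similarity)
    for name in away_users:
        if name in corrected:
            corrected[name] -= 5.0
    if loudness_db >= loud_threshold_db:  # threshold is a hypothetical value
        for name in loud_users:
            if name in corrected:
                corrected[name] += 5.0
    return corrected


scores = {"mother": 70.0, "father": 73.0}  # illustrative scores
on_trip = apply_schedule_and_loudness(scores, away_users={"father"},
                                      loud_users={"father"}, loudness_db=55.0)
loud = apply_schedule_and_loudness(scores, away_users=set(),
                                   loud_users={"father"}, loudness_db=75.0)
```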

Information such as the direction or distance at which a user is likely to be present may be registered in advance in the registered user information 330 in association with the auxiliary information, or the voice operation device may learn it as needed. When the voice operation device learns, a similarity correction value is registered in advance in the registered user information 330 for each user, and the result of identifying the speaker is reflected in that correction value. Alternatively, the voice operation device may constantly collect information such as the keywords uttered by the users and the position, direction, distance, and time, and update the user-characteristic information in the registered user information 330 accordingly. In this way, the speaker identification unit 33 learns the correlation between the user-characteristic information registered in the registered user information 330 and the auxiliary information input via the auxiliary information calculation unit 34. The speaker identification unit 33 can therefore improve the accuracy of speaker identification each time operation processing corresponding to a voice operation is performed.
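The learning of correction values can be pictured with a small table keyed by user and auxiliary context. This is only a sketch under stated assumptions: the text does not give an update rule, so the fixed step, the clamping limit, the `CorrectionTable` class, and the context labels such as `"far"` are all hypothetical.

```python
class CorrectionTable:
    """Per-user, per-context similarity correction values (in points)."""

    def __init__(self, step=1.0, limit=10.0):
        self.step = step      # hypothetical learning step
        self.limit = limit    # hypothetical bound on any single correction
        self.values = {}      # (user, context) -> points

    def get(self, user, context):
        return self.values.get((user, context), 0.0)

    def update(self, users, identified, context):
        """Reflect one identification result: raise the identified user's
        correction for this context and lower everyone else's."""
        for user in users:
            delta = self.step if user == identified else -self.step
            new_value = self.get(user, context) + delta
            self.values[(user, context)] = max(-self.limit, min(self.limit, new_value))

    def apply(self, similarity, context):
        return {user: score + self.get(user, context) for user, score in similarity.items()}


table = CorrectionTable()
for _ in range(3):
    table.update(["mother", "father"], identified="mother", context="far")
adjusted = table.apply({"mother": 70.0, "father": 72.0}, "far")
```

Because the table is updated from the device's own identification results, errors could reinforce themselves. The clamp on each value is one simple way to bound that effect in this sketch.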

When multiple types of auxiliary information are used together as in the sixth embodiment, the auxiliary information used for speaker identification, or the combination thereof, may be selected according to the position, direction, and the like of the user. In this case as well, user-characteristic information such as the position or direction at which a user is likely to be present may be registered in advance in the registered user information 330 and the auxiliary information used for speaker identification selected according to that information, or the voice operation device may learn it as needed.

In the first to sixth embodiments described above, both the flowchart of the first embodiment shown in FIG. 3 and the flowchart of the second embodiment shown in FIG. 5 are applicable. That is, either of two methods may be used: registering a plurality of voice quality models 331 for each user and selecting a voice quality model 331 according to the auxiliary information, or registering user-characteristic information in association with the auxiliary information and correcting the similarity according to the auxiliary information.
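The first of these two methods, selecting a voice quality model according to the auxiliary information, can be sketched as below. The sketch assumes that s in the similarity 1/s is a distance between feature vectors, which this section does not define. The two-dimensional features, the `MODELS` table, and the context labels are hypothetical.

```python
import math

# Hypothetical registered models: one feature vector per user per context.
MODELS = {
    "mother": {"living_room": [1.0, 2.0], "car": [1.5, 2.5]},
    "father": {"living_room": [4.0, 1.0], "car": [4.5, 1.5]},
}


def similarity(features, model):
    """Similarity 1/s, assuming s is the Euclidean distance to the model."""
    s = math.dist(features, model)
    return 1.0 / s if s > 0 else float("inf")


def identify_with_selected_models(features, context):
    """Compare the input only against each user's model for the given context."""
    scores = {
        user: similarity(features, models[context])
        for user, models in MODELS.items()
        if context in models
    }
    return max(scores, key=scores.get), scores


speaker, scores = identify_with_selected_models([1.4, 2.4], context="car")
```

The second method leaves the models alone and adjusts the scores afterwards, so the two approaches differ in where the auxiliary information enters: before the similarity is computed, or after.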

Some or all of the above embodiments may also be described as in the following appendices, but are not limited to them.

(Appendix 1)
A voice operation device comprising:
a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a voice quality model of the user registered in advance; and
a voice operation recognition unit that performs voice recognition on the voice information to generate voice operation information,
wherein the speaker identification unit identifies the speaker by using, as auxiliary information, at least one of the voice operation information, position information of the voice operation device, direction information of the speaker, distance information of the speaker, and time information.

(Appendix 2)
The voice operation device according to Appendix 1, further comprising:
a microphone that converts voice vibration into the voice information; and
registered user information in which the voice quality model of the user is registered in advance.

(Appendix 3)
The voice operation device according to Appendix 2, wherein
the registered user information has a plurality of the voice quality models for each user, and
the speaker identification unit selects the voice quality model according to the auxiliary information.

(Appendix 4)
The voice operation device according to Appendix 3, wherein the speaker identification unit selects, according to the auxiliary information, the voice quality model, which is a database for each user.

(Appendix 5)
The voice operation device according to any one of Appendices 2 to 4, wherein the speaker identification unit calculates the similarity between the voice information and the voice quality model and identifies the speaker based on the similarity.

(Appendix 6)
The voice operation device according to Appendix 5, wherein the speaker identification unit identifies the user with the highest similarity as the speaker of the voice operation.

(Appendix 7)
The voice operation device according to Appendix 5 or 6, wherein
the registered user information has information characteristic of the user in association with the auxiliary information, and
the speaker identification unit corrects the similarity according to the auxiliary information.

(Appendix 8)
The voice operation device according to any one of Appendices 5 to 7, wherein
the registered user information has a correction value of the similarity for each user, and
the speaker identification unit reflects the result of identifying the speaker in the correction value and learns the correlation between the auxiliary information and the correction value.

(Appendix 9)
The voice operation device according to any one of Appendices 5 to 8, wherein
the registered user information includes keywords characteristic of the user, and
the speaker identification unit corrects the similarity according to the keyword included in the voice operation information.

(Appendix 10)
The voice operation device according to any one of Appendices 2 to 9, wherein
the microphone is an array microphone that acquires the direction of the speaker of the voice operation as the direction information, and
the speaker identification unit identifies the speaker using the direction information.

(Appendix 11)
The voice operation device according to any one of Appendices 2 to 10, further comprising a distance measuring sensor that acquires the distance to the speaker of the voice operation as the distance information,
wherein the speaker identification unit identifies the speaker using the distance information.

(Appendix 12)
The voice operation device according to any one of Appendices 2 to 11, further comprising a clock that acquires the utterance time of the voice operation as the time information,
wherein the speaker identification unit identifies the speaker using the time information.

(Appendix 13)
The voice operation device according to any one of Appendices 2 to 12, further comprising a GPS device that acquires the position of the voice operation device as the position information,
wherein the speaker identification unit selects the voice quality model according to the position information.

(Appendix 14)
The voice operation device according to any one of Appendices 2 to 13, wherein
the registered user information further includes schedule information of the user, and
the speaker identification unit further uses the schedule information to identify the speaker.

(Appendix 15)
The voice operation device according to any one of Appendices 2 to 14, wherein
the registered user information has, for each user, information matched to the preferences of the user,
the voice operation device further comprising:
a control calculation unit that executes operation processing corresponding to the voice operation information in accordance with the preferences of the user; and
a speaker that announces the execution result by voice.

(Appendix 16)
A control method of a voice operation device comprising:
a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a voice quality model of the user registered in advance; and
a voice operation recognition unit that performs voice recognition on the voice information to generate voice operation information,
the control method comprising a step in which the speaker identification unit identifies the speaker by using, as auxiliary information, at least one of the voice operation information, position information of the voice operation device, direction information of the speaker, distance information of the speaker, and time information.

Although the present invention has been described above with reference to the embodiments, the present invention is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2017-144336, filed on July 26, 2017, the entire disclosure of which is incorporated herein.

1: Speaker
2: Microphone
3: Control calculation unit
11: Voice output unit
21: Voice input unit
31: Voice operation response unit
32: Voice operation recognition unit
33: Speaker identification unit
34: Auxiliary information calculation unit
35: Wireless communication unit
40: Auxiliary information acquisition device
41: GPS device
42: Array microphone
43: Distance measuring sensor
44: Clock
61: User
62: User
330: Registered user information
331: Voice quality model
341: Position calculation unit
342: Voice direction calculation unit
343: Distance calculation unit
344: Time calculation unit

Claims

1. A voice operation device comprising:
a speaker identification unit that identifies a user as the speaker of a voice operation based on voice information and a voice quality model of the user registered in advance;
a voice operation recognition unit that performs voice recognition on the voice information to generate voice operation information;
a GPS device;
a microphone that converts voice vibration into the voice information; and
registered user information in which the voice quality model of the user is registered in advance,
wherein the registered user information has, for each user, a correction value of the similarity between the voice information and the voice quality model, and
the speaker identification unit identifies the speaker by using the similarity and auxiliary information that is at least one of the voice operation information, position information of the voice operation device acquired by the GPS device, direction information of the speaker, distance information of the speaker, and time information, reflects the identification result in the correction value, and learns the correlation between the auxiliary information and the correction value.

2. The voice operation device according to claim 1, wherein
the auxiliary information is at least one of the position information of the voice operation device, the direction information of the speaker, the distance information of the speaker, and the time information,
the registered user information has a plurality of the voice quality models for each user, and
the speaker identification unit selects a voice quality model from the plurality of voice quality models according to the auxiliary information, and identifies the user as the speaker of the voice operation based on the voice information and the selected voice quality model.

3. The voice operation device according to claim 2, wherein the speaker identification unit selects, according to the auxiliary information, the voice quality model, which is a database for each user.

4. The voice operation device according to any one of claims 1 to 3, wherein the speaker identification unit identifies the user with the highest similarity as the speaker of the voice operation.

5. The voice operation device according to any one of claims 1 to 4, wherein
the registered user information has information characteristic of the user in association with the auxiliary information, and
the speaker identification unit corrects the similarity according to the auxiliary information.

6. The voice operation device according to any one of claims 1 to 5, wherein
the registered user information includes keywords characteristic of the user, and
the speaker identification unit corrects the similarity according to the keyword included in the voice operation information.

7. The voice operation device according to any one of claims 1 to 6, wherein
the microphone is an array microphone that acquires the direction of the speaker of the voice operation as the direction information, and
the speaker identification unit identifies the speaker using the direction information.

8. The voice operation device according to any one of claims 1 to 7, further comprising a distance measuring sensor that acquires the distance to the speaker of the voice operation as the distance information,
wherein the speaker identification unit identifies the speaker using the distance information.

9. The voice operation device according to any one of claims 1 to 8, further comprising a clock that acquires the utterance time of the voice operation as the time information,
wherein the speaker identification unit identifies the speaker using the time information.

10. The voice operation device according to any one of claims 1 to 9, wherein
the registered user information further includes schedule information of the user, and
the speaker identification unit further uses the schedule information to identify the speaker.

11. The voice operation device according to any one of claims 1 to 10, wherein
the registered user information has, for each user, information matched to the preferences of the user,
the voice operation device further comprising:
a control calculation unit that executes operation processing corresponding to the voice operation information in accordance with the preferences of the user; and
a speaker that announces the execution result by voice.