JP6927905B2

JP6927905B2 - Decision device, decision method, decision program and program

Info

Publication number: JP6927905B2
Application number: JP2018027245A
Authority: JP
Inventors: 伸裕鍜治
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-02-19
Filing date: 2018-02-19
Publication date: 2021-09-01
Anticipated expiration: 2038-02-19
Also published as: JP2021177261A; JP2019144355A; JP7278340B2

Description

The present invention relates to a determination device, a determination method, a determination program, and a program.

Techniques for recognizing a user's utterances have conventionally been provided. For example, a technique is known in which, when a voice is input by a user and the confidence of the voice recognition result is insufficient, a display prompting the user for further input is presented.

Japanese Unexamined Patent Publication No. 2000-148181

However, the above conventional technique does not necessarily improve the accuracy of voice recognition. When a voice is input by a user and the confidence of the voice recognition result is insufficient, the conventional technique merely displays a request for further input, so the recognition accuracy itself is not necessarily improved.

The present application has been made in view of the above, and its object is to provide a determination device, a determination method, a determination program, and a model capable of improving the accuracy of voice recognition.

The determination device according to the present application includes: an estimation unit that estimates the utterance content corresponding to a voice uttered by a user; and a determination unit that determines a voice recognition result corresponding to a first voice and a second voice, based on the estimation accuracy of a first estimation result estimated by the estimation unit from the first voice and the estimation accuracy of a second estimation result estimated from the second voice, the second voice being uttered repeatedly by the user following the first voice.

According to one aspect of the embodiment, the accuracy of voice recognition can be improved.

FIG. 1 is a diagram showing an example of the determination process executed by the determination device according to the embodiment.
FIG. 2 is a diagram showing an example of the voice recognition result determination process executed by the determination device according to the embodiment.
FIG. 3 is a diagram showing a configuration example of the determination device according to the embodiment.
FIG. 4 is a diagram showing an example of the voice information storage unit according to the embodiment.
FIG. 5 is a diagram showing an example of the estimation result information storage unit according to the embodiment.
FIG. 6 is a diagram showing an example of the score information storage unit according to the embodiment.
FIG. 7 is a flowchart showing an example of the flow of the determination process executed by the determination device according to the embodiment.
FIG. 8 is a diagram showing an example of determining the voice recognition result according to the embodiment.
FIG. 9 is a diagram showing an example of the determination process executed by the determination device according to a modified example.
FIG. 10 is a hardware configuration diagram showing an example of a computer that realizes the functions of the determination device.

Hereinafter, modes for carrying out the determination device, determination method, determination program, and model according to the present application (hereinafter referred to as "embodiments") will be described in detail with reference to the drawings. These embodiments do not limit the determination device, determination method, determination program, and model according to the present application. The embodiments can be combined as appropriate as long as the processing contents do not contradict each other. In the following embodiments, the same parts are given the same reference numerals, and duplicate descriptions are omitted.

[1. Example of the determination process performed by the determination device]
An example of the determination process executed by the determination device according to the embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the determination process executed by the determination device according to the embodiment. FIG. 1 shows an example in which the determination process is executed by the determination device 100.

As shown in FIG. 1, the determination system 1 includes a terminal device 10 and a determination device 100. The terminal device 10 and the determination device 100 are communicably connected, by wire or wirelessly, via a predetermined communication network (not shown). The determination system 1 shown in FIG. 1 may include a plurality of terminal devices 10 and a plurality of determination devices 100.

The terminal device 10 is an information processing device used by a user who accesses web content, such as web pages displayed in a browser and content displayed in an application. For example, the terminal device 10 is a desktop PC (Personal Computer), a notebook PC, a tablet terminal, a mobile phone, a PDA (Personal Digital Assistant), or the like. The terminal device 10 also acquires the voice uttered by the user, in response to a user operation or a function of the terminal device 10, and stores information about that voice in a predetermined storage area of the terminal device 10.

For example, assume that an application capable of acquiring voice has been installed on the terminal device 10 in advance. In this case, the terminal device 10 acquires the voice uttered by the user through its built-in microphone and stores information about that voice in a predetermined storage area of the terminal device 10. This processing by the terminal device 10 can be realized by a well-known technique such as voice search.

The determination device 100 determines a voice recognition result corresponding to the voice uttered by the user and provides that result to the terminal device 10; it is realized by, for example, a server device. The motivation is as follows. In conventional voice recognition, when recognition is difficult, the user may be given a recognition result that deviates from what was actually uttered. Alternatively, the system may ask the user to repeat the utterance, or may give no response to the utterance at all. In these cases the user may utter the same content again, which places the burden of repeating the same utterance on the user. The determination device 100 according to the embodiment therefore improves the accuracy of voice recognition for voices uttered repeatedly by the user. Specifically, the determination device 100 determines the voice recognition result based on a score for each combination of an estimation result of the first voice, estimated from the voice first uttered by the user (hereinafter sometimes referred to as the first voice), and an estimation result of the second voice, estimated from the voice subsequently uttered by the user (hereinafter sometimes referred to as the second voice). This is described in detail below. In the following description, it is assumed that the user performs a voice search on the terminal device 10 and that the voices uttered multiple times by the user have the same content.

Hereinafter, an example of the voice recognition result determination process performed by the determination device 100 will be described step by step with reference to FIG. 1.

First, as shown in FIG. 1, the user utters "associative game" twice to the terminal device 10 (step S1). For example, the user utters "associative game" to the terminal device 10 in order to search for types of games related to "associative game". Suppose that the user, worried about whether the terminal device 10 recognized the utterance, spontaneously utters "associative game" a second time. The terminal device 10 then receives the "associative game" uttered twice by the user.

Note that when voice recognition by the determination device 100 is difficult, the terminal device 10 may prompt the user to repeat the utterance through voice navigation such as "Please say that again", based on information indicating recognition failure notified by the determination device 100.

Subsequently, the determination device 100 receives the voices repeatedly uttered by the user (step S2). For example, the terminal device 10 transmits information about the voices it has received from the user to the determination device 100, and the determination device 100 receives the repeatedly uttered voices on that basis. The determination device 100 then determines whether a voice uttered by the user is a repetition. Specifically, it makes this determination based on the similarity between the waveform of the voice uttered by the user and the waveform of another voice.

For example, when the similarity between the waveform of a voice uttered by the user and the waveform of another voice is equal to or greater than a predetermined value, the determination device 100 determines that the voice was uttered repeatedly. Before making this determination, the determination device 100 may correct the waveform with respect to amplitude, phase, frequency, and the like. This allows the determination device 100 to judge the similarity between waveforms of voices that differ in timing or volume.

The above processing executed by the determination device 100 can be realized using well-known techniques for audio signal processing, by correcting the amplitude, phase, frequency, and the like of the voice waveforms and calculating the similarity between them. The determination device 100 may also use a well-known technique such as machine learning to determine whether a voice uttered by the user is a repetition.
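
As a rough illustration of the repetition check described above, the following Python sketch normalizes two waveforms (an amplitude correction), searches over time shifts (a timing correction), and compares the peak normalized cross-correlation with a threshold. The function names, the choice of cross-correlation as the similarity measure, and the threshold of 0.8 are assumptions made for illustration; the embodiment only states that a similarity at or above a predetermined value indicates a repeated utterance.

```python
import math


def normalize(samples):
    """Remove the mean and scale to unit energy (amplitude correction)."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    energy = math.sqrt(sum(c * c for c in centered))
    if energy == 0.0:
        return [0.0] * len(samples)
    return [c / energy for c in centered]


def waveform_similarity(a, b, max_lag=None):
    """Peak normalized cross-correlation over time shifts (timing correction)."""
    a, b = normalize(a), normalize(b)
    if max_lag is None:
        max_lag = max(len(a), len(b)) - 1
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        best = max(best, total)
    return best


def is_repetition(a, b, threshold=0.8):
    """Hypothetical threshold standing in for the 'predetermined value'."""
    return waveform_similarity(a, b) >= threshold


# Toy waveforms: the second is a quieter, delayed copy of the first.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0]
first = pulse + [0.0] * 4
second = [0.0] * 3 + [0.5 * s for s in pulse] + [0.0]
other = [1.0, -1.0] * 6 + [1.0]

print(is_repetition(first, second))
print(is_repetition(first, other))
```

A practical system would more likely compare acoustic features than raw samples; raw cross-correlation is used here only to keep the sketch short.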

Subsequently, the determination device 100 estimates the utterance content corresponding to the voices repeatedly uttered by the user and ranks the estimation results obtained from those voices (step S3). Specifically, among the voices determined to have been uttered repeatedly, the determination device 100 estimates the utterance content corresponding to the first voice and, together with it, calculates the estimation accuracy, which is information indicating how accurate the estimation result of the first voice is. Likewise, the determination device 100 estimates the utterance content corresponding to the second voice and calculates the estimation accuracy of the estimation result of the second voice.

For example, the determination device 100 estimates "performance game" (演奏ゲーム, ensō game), "chlorine game" (塩素ゲーム, enso game), and "associative game" (連想ゲーム, rensō game) as the estimation results for the first voice, which corresponds to the user's utterance "associative game". The determination device 100 calculates the estimation accuracies of these results as "1.0", "0.9", and "0.8", respectively. Similarly, the determination device 100 estimates "cleaning game" (清掃ゲーム, seisō game), "associative game", and "inflammation game" (炎症ゲーム, enshō game) as the estimation results for the second voice, which corresponds to the utterance "associative game" made by the user following the first voice. The determination device 100 calculates the estimation accuracies of these results as "1.0", "0.9", and "0.8", respectively.

The determination device 100 then ranks the estimation results obtained from the voices uttered by the user, based on their estimation accuracy. For example, based on the estimation accuracies of the first-voice estimation results, the determination device 100 ranks them in the order "performance game", "chlorine game", "associative game". Based on the estimation accuracies of the second-voice estimation results, it ranks them in the order "cleaning game", "associative game", "inflammation game".

The above processing executed by the determination device 100 can be realized using well-known voice recognition techniques, by estimating the utterance content corresponding to the voice uttered by the user and calculating the estimation accuracy of each estimation result obtained from that voice.
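
The ranking in step S3 amounts to sorting candidate results by estimation accuracy. The sketch below shows this with the first-voice values from the example; the `Estimate` type and the `rank_estimates` function are hypothetical names introduced here, and the candidate list is assumed to come from some external recognizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    text: str        # estimated utterance content
    accuracy: float  # estimation accuracy reported with the estimate


def rank_estimates(estimates):
    """Return (rank, estimate) pairs, highest estimation accuracy first."""
    ordered = sorted(estimates, key=lambda e: e.accuracy, reverse=True)
    return list(enumerate(ordered, start=1))


# First-voice candidates from the example, deliberately given out of order.
first_voice = [
    Estimate("associative game", 0.8),
    Estimate("performance game", 1.0),
    Estimate("chlorine game", 0.9),
]

for rank, estimate in rank_estimates(first_voice):
    print(rank, estimate.text, estimate.accuracy)
```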

Subsequently, the determination device 100 calculates a score for each combination of a first-voice estimation result and a second-voice estimation result (step S4). For example, the determination device 100 calculates the score SC1 by the following equation (1).

Score SC1 = Acc1 + Acc2 + Rep ... (1)

In equation (1), "Acc1" is the estimation accuracy of the first-voice estimation result, "Acc2" is the estimation accuracy of the second-voice estimation result, and "Rep" is information on the degree to which words contained in the first-voice estimation result and the second-voice estimation result overlap (hereinafter referred to as the degree of duplication). The determination device 100 then identifies the combination of a first-voice estimation result and a second-voice estimation result that has the largest score SC1.

Here, an example of the process of determining the voice recognition result will be described with reference to FIG. 2. FIG. 2 is a diagram showing combinations of the first-voice estimation results and their estimation accuracies with the second-voice estimation results and their estimation accuracies. For example, in FIG. 2, as shown in the first-voice estimation result WT1, "performance game" has an estimation accuracy of "1.0".

In the example shown in FIG. 2, the determination device 100 uses equation (1) to calculate a score that takes into account the estimation accuracy of the first-voice estimation result, the estimation accuracy of the second-voice estimation result, and the degree of duplication. For example, for the combination in which the first-voice estimation result is "performance game" and the second-voice estimation result is "cleaning game" (hereinafter sometimes written as "performance game x cleaning game"), morphological analysis or the like finds that only the word "game" matches, so the degree of duplication is calculated as "1.0". In this case, using the expression shown as score CT1 in FIG. 2, the determination device 100 adds the estimation accuracy "1.0" of the first-voice estimation result WT1 "performance game", the estimation accuracy "1.0" of the second-voice estimation result WT2 "cleaning game", and the degree of duplication "1.0", and calculates the score CT1 as "3.0".

Similarly, for "associative game x associative game", morphological analysis or the like finds that both "associative" and "game" match, so the degree of duplication is calculated as "2.0". In this case, using the expression shown as score CT2 in FIG. 2, the determination device 100 adds the estimation accuracy "0.8" of the first-voice estimation result WT3 "associative game", the estimation accuracy "0.9" of the second-voice estimation result WT4 "associative game", and the degree of duplication "2.0", and calculates the score CT2 as "3.7".

Likewise, in the example shown in FIG. 2, the scores CT3 to CT9 are calculated for the following combinations using the same calculation method.
Score CT3 ("performance game x associative game") = 2.9 ... (2)
Score CT4 ("performance game x inflammation game") = 2.8 ... (3)
Score CT5 ("chlorine game x cleaning game") = 2.9 ... (4)
Score CT6 ("chlorine game x associative game") = 2.8 ... (5)
Score CT7 ("chlorine game x inflammation game") = 2.7 ... (6)
Score CT8 ("associative game x cleaning game") = 2.8 ... (7)
Score CT9 ("associative game x inflammation game") = 2.6 ... (8)

The determination device 100 then compares the scores CT1 to CT9. Because the score CT2 for the first-voice estimation result WT3 "associative game" and the second-voice estimation result WT4 "associative game" is the largest, the determination device 100 selects the voice recognition result from the first-voice estimation result WT3 and the second-voice estimation result WT4 and determines "associative game" as the voice recognition result.
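
The score calculation and comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the embodiment's implementation: the candidates are hard-coded as pre-split word tuples in place of a real morphological analyzer, the degree of duplication is assumed to be the number of words the two candidates share (which matches the two worked examples), and all function names are hypothetical.

```python
from itertools import product

# (words, estimation accuracy) pairs taken from the example of FIG. 2.
FIRST_VOICE = [
    (("performance", "game"), 1.0),
    (("chlorine", "game"), 0.9),
    (("associative", "game"), 0.8),
]
SECOND_VOICE = [
    (("cleaning", "game"), 1.0),
    (("associative", "game"), 0.9),
    (("inflammation", "game"), 0.8),
]


def degree_of_duplication(words_a, words_b):
    """Assumed definition of Rep: the number of shared words."""
    return float(len(set(words_a) & set(words_b)))


def score(acc1, acc2, rep):
    """Equation (1): SC1 = Acc1 + Acc2 + Rep."""
    return acc1 + acc2 + rep


def score_all(first, second):
    """Score every combination of a first-voice and a second-voice candidate."""
    return [
        (score(a1, a2, degree_of_duplication(w1, w2)), w1, w2)
        for (w1, a1), (w2, a2) in product(first, second)
    ]


def best_combination(first, second):
    """Return the (score, first words, second words) with the largest score."""
    return max(score_all(first, second), key=lambda item: item[0])


best_score, first_words, second_words = best_combination(FIRST_VOICE, SECOND_VOICE)
print(" ".join(first_words), "x", " ".join(second_words), round(best_score, 1))
```

With these inputs, "associative game x associative game" has the largest score because its shared-word term outweighs the lower individual estimation accuracies.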

Returning to FIG. 1, the description of the determination process according to the embodiment continues. Based on the score SC1, the determination device 100 determines that the voice recognition result C1 is "associative game" (step S5). For example, the score SC1 for "associative game x associative game" is "3.7" and the score SC1 for "performance game x cleaning game" is "3.0". Because the score SC1 for "associative game x associative game" is the largest, the determination device 100 determines "associative game", selected from the first-voice and second-voice estimation results, as the voice recognition result. The determination device 100 then provides the voice recognition result C1 "associative game" to the terminal device 10 (step S6). For example, the determination device 100 converts the voice recognition result C1 "associative game" into text and provides it to the terminal device 10, and the terminal device 10 displays the text "associative game" to the user. The terminal device 10 may also read "associative game" aloud to the user using a text-to-speech function.

As described above, the determination device 100 according to the embodiment determines the voice recognition result based on the score for each combination of an estimation result obtained from the first voice uttered by the user and an estimation result obtained from the second voice uttered by the user following the first voice. This allows the determination device 100 to improve the accuracy of voice recognition, for the following reason. In the example of FIG. 1, the determination device 100 selects first-voice estimation results with high estimation accuracy from among the plurality of first-voice estimation results, and selects second-voice estimation results with high estimation accuracy from among the plurality of second-voice estimation results. The determination device 100 then calculates a score, based on the degree of duplication, for each combination of these high-accuracy first-voice and second-voice estimation results. Because the determination device 100 determines the voice recognition result from the first-voice and second-voice estimation results in the combination with the highest score, it can determine the voice recognition result with higher accuracy. In short, the determination device 100 determines the voice recognition result based on scores calculated for each combination of estimation results for the voices repeatedly uttered by the user, and can thereby improve the accuracy of voice recognition.

[2. Configuration of the determination device]
Next, the configuration of the determination device 100 according to the embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram showing a configuration example of the determination device 100 according to the embodiment. As shown in FIG. 3, the determination device 100 includes a communication unit 110, a storage unit 120, and a control unit 130.

(About communication unit 110)
The communication unit 110 is realized by, for example, a NIC (Network Interface Card) or the like. The communication unit 110 is connected to the network by wire or wirelessly and transmits and receives information to and from the terminal device 10.

(About storage unit 120)
The storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 120 includes a voice information storage unit 121, an estimation result information storage unit 122, and a score information storage unit 123.

(About voice information storage unit 121)
The voice information storage unit 121 according to the embodiment stores information about voices uttered by the user. FIG. 4 shows an example of the voice information storage unit 121 according to the embodiment. In the example shown in FIG. 4, the voice information storage unit 121 has items such as "voice ID (Identifier)", "voice", and "similarity with the first voice".

The "voice ID" is an identifier that identifies a voice uttered by the user. The "voice" is the waveform data of the voice associated with the "voice ID". The "similarity with the first voice" is information on the similarity between the voice associated with the "voice ID" and the voice received immediately before it. For example, in FIG. 4, the voice identified by the voice ID "V2" has the waveform data "WV2", and its similarity with the first voice, that is, with the voice "WV1" identified by the voice ID "V1", is "0.9".

(About estimation result information storage unit 122)
The estimation result information storage unit 122 according to the embodiment stores various kinds of information about the estimation results corresponding to voices uttered by the user. FIG. 5 shows an example of the estimation result information storage unit 122 according to the embodiment. In the example shown in FIG. 5, the estimation result information storage unit 122 has items such as "voice ID", "estimation result ID", "ranking of estimation result", "estimation result", and "estimation accuracy".

The "voice ID" is an identifier that identifies a voice uttered by the user. The "estimation result ID" is identification information for identifying an estimation result obtained from a voice. The "ranking of estimation result" is the rank assigned according to the magnitude of the estimation accuracy calculated together with the estimation result for the voice. The "estimation result" is the estimation result corresponding to the voice uttered by the user. The "estimation accuracy" is the estimation accuracy calculated together with the "estimation result". For example, in FIG. 5, the estimation result that corresponds to the voice ID "V1" and the estimation result ID "VC1" is ranked "1st"; the estimation result is "performance game" and its estimation accuracy is "1.0".

（スコア情報記憶部１２３について）
実施形態に係るスコア情報記憶部１２３は、第１音声の推定結果と第２音声の推定結果との組み合わせ毎のスコアに関する情報を記憶する。ここで、図６に、実施形態に係るスコア情報記憶部１２３の一例を示す。図６に示した例では、スコア情報記憶部１２３は、「スコアＩＤ」、「推定結果の組み合わせ」、「スコア」といった項目を有する。 (About score information storage unit 123)
The score information storage unit 123 according to the embodiment stores information regarding the score for each combination of the estimation result of the first voice and the estimation result of the second voice. Here, FIG. 6 shows an example of the score information storage unit 123 according to the embodiment. In the example shown in FIG. 6, the score information storage unit 123 has items such as “score ID”, “combination of estimation results”, and “score”.

「スコアＩＤ」は、第１音声の推定結果と第２音声の推定結果との組み合わせに対応するスコアを識別するための識別情報を示す。「推定結果の組み合わせ」は、第１音声の推定結果と第２音声の推定結果との組み合わせに対応するスコアを示す。例えば、図６では、スコアＩＤによって識別される「ＳＣＣ１」に対応する推定結果の組み合わせは「連想ゲーム×連想ゲーム」であり、かかる推定結果の組み合わせにおけるスコアは「３．７」であることを示す。 The "score ID" is identification information for identifying the score corresponding to a combination of an estimation result of the first voice and an estimation result of the second voice. The "combination of estimation results" is a combination of an estimation result of the first voice and an estimation result of the second voice, and the "score" is the score corresponding to that combination. For example, in FIG. 6, the combination of estimation results corresponding to the score ID "SCC1" is "associative game x associative game", and the score for that combination is "3.7".
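
The three storage units described above can be pictured as simple record types. The following is a minimal sketch, not the patented data layout: the class and field names are hypothetical, and the waveform is represented by a placeholder label rather than actual sample data. Only the example rows quoted in the descriptions of FIGS. 4 to 6 are reproduced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceRecord:  # voice information storage unit 121 (FIG. 4)
    voice_id: str
    waveform: str  # placeholder label such as "WV2"; real data would be samples
    similarity_to_first: Optional[float]


@dataclass(frozen=True)
class EstimationRecord:  # estimation result information storage unit 122 (FIG. 5)
    voice_id: str
    result_id: str
    rank: int  # 1 = highest estimation accuracy
    text: str
    accuracy: float


@dataclass(frozen=True)
class ScoreRecord:  # score information storage unit 123 (FIG. 6)
    score_id: str
    combination: tuple  # (first-voice result, second-voice result)
    score: float


# Rows quoted in the descriptions of FIGS. 4 to 6.
voice_row = VoiceRecord("V2", "WV2", 0.9)
estimation_row = EstimationRecord("V1", "VC1", 1, "演奏ゲーム", 1.0)
score_row = ScoreRecord("SCC1", ("連想ゲーム", "連想ゲーム"), 3.7)
```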

（制御部１３０について）
制御部１３０は、コントローラ（Controller）であり、例えば、ＣＰＵ（Central Processing Unit）やＭＰＵ（Micro Processing Unit）等によって、決定装置１００内部の記憶装置に記憶されている各種プログラム（決定プログラムの一例に相当）がＲＡＭを作業領域として実行されることにより実現される。また、制御部１３０は、コントローラであり、例えば、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）等の集積回路により実現される。 (About control unit 130)
The control unit 130 is a controller and is realized, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing various programs (corresponding to an example of the determination program) stored in a storage device inside the determination device 100, using RAM as a work area. The control unit 130 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

図３に示すように、制御部１３０は、受付部１３１と、判定部１３２と、推定部１３３と、算出部１３４と、決定部１３５と、提供部１３６とを有し、以下に説明する情報処理の機能や作用を実現または実行する。なお、制御部１３０の内部構成は、図３に示した構成に限られず、後述する情報処理を行う構成であれば他の構成であってもよい。また、制御部１３０が有する各処理部の接続関係は、図３に示した接続関係に限られず、他の接続関係であってもよい。 As shown in FIG. 3, the control unit 130 includes a reception unit 131, a determination unit 132, an estimation unit 133, a calculation unit 134, a determination unit 135, and a providing unit 136, and realizes or executes the functions and actions of the information processing described below. The internal configuration of the control unit 130 is not limited to the configuration shown in FIG. 3 and may be any other configuration that performs the information processing described later. Likewise, the connection relationship between the processing units of the control unit 130 is not limited to that shown in FIG. 3 and may be another connection relationship.

（受付部１３１について）
受付部１３１は、ユーザによって発話された音声を受け付ける。例えば、受付部１３１は、端末装置１０がユーザによって発話された音声を受け付けたことに関する情報を受付部１３１へ送信することに基づいて、ユーザによって発話された音声を受け付ける。そして、受付部１３１は、かかる音声の波形データを音声情報記憶部１２１に格納する。 (About reception unit 131)
The reception unit 131 receives the voice uttered by the user. For example, when the terminal device 10 receives a voice uttered by the user, it transmits information about that voice to the reception unit 131, and the reception unit 131 thereby receives the voice. The reception unit 131 then stores the waveform data of the voice in the voice information storage unit 121.

（判定部１３２について）
判定部１３２は、受付部１３１によって受け付けられた第１音声と、受付部１３１によって第１音声の後に受け付けられた第２音声との類似性に基づいて、第２音声が第１音声に続いて繰り返し発話された音声であるかを判定する。具体的には、判定部１３２は、音声情報記憶部１２１を参照し、第１音声と第２音声との類似度を音声情報記憶部１２１に格納し、第２音声が第１音声に続いて繰り返し発話された音声であると判定する。 (About judgment unit 132)
The determination unit 132 determines whether the second voice is a voice uttered repeatedly following the first voice, based on the similarity between the first voice received by the reception unit 131 and the second voice received by the reception unit 131 after the first voice. Specifically, the determination unit 132 refers to the voice information storage unit 121, stores the similarity between the first voice and the second voice in the voice information storage unit 121, and, when that similarity is equal to or higher than a predetermined value, determines that the second voice is a voice uttered repeatedly following the first voice.

例えば、音声ＩＤにより識別される第１音声「ＷＶ１」と、第２音声「ＷＶ２」とであるとする。また、類似度の所定値が０．８であるとする。この場合、判定部１３２は、第１音声「ＷＶ１」の音声波形と第２音声「ＷＶ２」の音声波形との類似度「０．９」を音声情報記憶部１２１に格納する。そして、判定部１３２は、かかる類似度が所定値以上であるため、第１音声「ＷＶ１」と第２音声「ＷＶ２」とが繰り返し発話されたと判定する。 For example, it is assumed that the first voice "WV1" and the second voice "WV2" are identified by the voice ID. Further, it is assumed that the predetermined value of the similarity is 0.8. In this case, the determination unit 132 stores the similarity “0.9” between the voice waveform of the first voice “WV1” and the voice waveform of the second voice “WV2” in the voice information storage unit 121. Then, the determination unit 132 determines that the first voice "WV1" and the second voice "WV2" have been repeatedly uttered because the similarity is equal to or higher than a predetermined value.
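
The threshold check in this example can be sketched as follows. The description does not say how the similarity between two waveforms is computed, so the cosine similarity used here is purely an assumption for illustration, and the function names are hypothetical.

```python
import math

REPEAT_THRESHOLD = 0.8  # the "predetermined value" used in the example above


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sample sequences (an assumed measure)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_repeated(similarity, threshold=REPEAT_THRESHOLD):
    """True when the second voice counts as a repetition of the first."""
    return similarity >= threshold
```

With the values from the example, `is_repeated(0.9)` is true because 0.9 is at or above the threshold of 0.8.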

（推定部１３３について）
推定部１３３は、ユーザによって発話された音声に対応する発話内容を推定する。具体的には、推定部１３３は、音声情報記憶部１２１を参照して、ユーザによって最初に発話された音声である第１音声に対応する発話内容を推定する。そして、推定部１３３は、第１音声の発話内容を推定すると共に、第１音声の推定結果の推定精度を算出する。そして、推定部１３３は、かかる第１音声の推定結果と推定精度とを推定結果情報記憶部１２２に格納する。なお、推定部１３３は、例えば、音声の波形や振幅等のパラメータと文言との対応関係に関する情報に基づいて、発話内容を推定する。そして、推定部１３３は、例えば、音声の波形と推定結果の波形との一致度に基づいて、推定精度を算出する。 (About estimation unit 133)
The estimation unit 133 estimates the utterance content corresponding to the voice uttered by the user. Specifically, the estimation unit 133 refers to the voice information storage unit 121 and estimates the utterance content corresponding to the first voice, which is the voice first spoken by the user. Then, the estimation unit 133 estimates the utterance content of the first voice and calculates the estimation accuracy of the estimation result of the first voice. Then, the estimation unit 133 stores the estimation result and the estimation accuracy of the first voice in the estimation result information storage unit 122. The estimation unit 133 estimates the utterance content based on, for example, information on the correspondence between the wording and parameters such as the waveform and amplitude of the voice. Then, the estimation unit 133 calculates the estimation accuracy based on, for example, the degree of coincidence between the waveform of the voice and the waveform of the estimation result.

また、推定部１３３は、音声情報記憶部１２１を参照して、ユーザによって第１音声に続いて発話された音声である第２音声に対応する発話内容を推定する。そして、推定部１３３は、第２音声の発話内容を推定すると共に、第２音声の推定結果の推定精度を算出する。そして、推定部１３３は、かかる第２音声の推定結果と推定精度とを推定結果情報記憶部１２２に格納する。 In addition, the estimation unit 133 refers to the voice information storage unit 121 to estimate the utterance content corresponding to the second voice, which is the voice uttered by the user following the first voice. Then, the estimation unit 133 estimates the utterance content of the second voice and calculates the estimation accuracy of the estimation result of the second voice. Then, the estimation unit 133 stores the estimation result and the estimation accuracy of the second voice in the estimation result information storage unit 122.

例えば、音声ＩＤにより識別される第１音声「ＷＶ１」と、第２音声「ＷＶ２」とであるとする。この場合、推定部１３３は、音声情報記憶部１２１を参照し、第１音声「ＷＶ１」に対応する第１音声の推定結果として、「演奏ゲーム」、「塩素ゲーム」、「連想ゲーム」を推定する。そして、推定部１３３は、第１音声の推定結果である「演奏ゲーム」、「塩素ゲーム」、「連想ゲーム」に対応する推定精度として、「１．０」、「０．９」、「０．８」と算出する。そして、推定部１３３は、かかる第１音声の推定結果と推定精度とを推定結果情報記憶部１２２に格納する。また、推定部１３３は、音声情報記憶部１２１を参照し、第２音声「ＷＶ２」に対応する第２音声の推定結果として、「清掃ゲーム」、「連想ゲーム」、「炎症ゲーム」を推定する。そして、推定部１３３は、第２音声の推定結果である「清掃ゲーム」、「連想ゲーム」、「炎症ゲーム」に対応する推定精度として、「１．０」、「０．９」、「０．８」と算出する。そして、推定部１３３は、かかる第２音声の推定結果と推定精度とを推定結果情報記憶部１２２に格納する。 For example, assume a first voice "WV1" and a second voice "WV2", each identified by a voice ID. In this case, the estimation unit 133 refers to the voice information storage unit 121 and estimates "performance game", "chlorine game", and "associative game" as the estimation results for the first voice "WV1". The estimation unit 133 then calculates "1.0", "0.9", and "0.8" as the estimation accuracies of "performance game", "chlorine game", and "associative game", respectively, and stores these estimation results and estimation accuracies in the estimation result information storage unit 122. Similarly, the estimation unit 133 refers to the voice information storage unit 121 and estimates "cleaning game", "associative game", and "inflammation game" as the estimation results for the second voice "WV2". It calculates "1.0", "0.9", and "0.8" as the estimation accuracies of "cleaning game", "associative game", and "inflammation game", respectively, and stores these estimation results and estimation accuracies in the estimation result information storage unit 122.

また、推定部１３３は、ユーザによって発話された音声から推定された推定結果を、推定結果の推定精度に基づいて、ランク付けを行う。そして、推定部１３３は、かかる推定結果のランキング情報を推定結果情報記憶部１２２に格納する。例えば、推定部１３３は、第１音声の推定結果に対応する推定精度に基づいて、「演奏ゲーム」、「塩素ゲーム」、「連想ゲーム」の順でランク付けを行う。そして、推定部１３３は、かかる第１音声の推定結果のランキング情報を推定結果情報記憶部１２２に格納する。また、推定部１３３は、第２音声の推定結果に対応する推定精度に基づいて、「清掃ゲーム」、「連想ゲーム」、「炎症ゲーム」の順でランク付けを行う。そして、推定部１３３は、かかる第２音声の推定結果のランキング情報を推定結果情報記憶部１２２に格納する。 In addition, the estimation unit 133 ranks the estimation results estimated from the voice uttered by the user based on the estimation accuracy of the estimation results. Then, the estimation unit 133 stores the ranking information of the estimation result in the estimation result information storage unit 122. For example, the estimation unit 133 ranks the "performance game", the "chlorine game", and the "associative game" in this order based on the estimation accuracy corresponding to the estimation result of the first voice. Then, the estimation unit 133 stores the ranking information of the estimation result of the first voice in the estimation result information storage unit 122. Further, the estimation unit 133 ranks the "cleaning game", the "associative game", and the "inflammation game" in this order based on the estimation accuracy corresponding to the estimation result of the second voice. Then, the estimation unit 133 stores the ranking information of the estimation result of the second voice in the estimation result information storage unit 122.
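
The ranking step amounts to sorting the candidates by estimation accuracy in descending order. A minimal sketch, assuming each candidate is a hypothetical `(text, accuracy)` pair; the input order is deliberately shuffled to show the effect of the sort.

```python
def rank_candidates(candidates):
    """Return (rank, text, accuracy) tuples, where rank 1 has the highest accuracy."""
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [(i, text, acc) for i, (text, acc) in enumerate(ordered, start=1)]


# Candidates for the first voice, taken from the example above.
first_voice = [("連想ゲーム", 0.8), ("演奏ゲーム", 1.0), ("塩素ゲーム", 0.9)]
ranked = rank_candidates(first_voice)
```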

（算出部１３４について）
算出部１３４は、推定部１３３によって推定された第１の推定結果（第１音声の推定結果に相当）と第２の推定結果（第２音声の推定結果に相当）との組み合わせ毎に、第１音声の推定結果の推定精度と第２音声の推定結果の推定精度とに基づいてスコアを算出する。例えば、算出部１３４は、推定結果情報記憶部１２２から、「連想ゲーム×連想ゲーム」と「演奏ゲーム×清掃ゲーム」とを読み出す。例えば、「連想ゲーム×連想ゲーム」において、形態素解析等により、「連想」と「ゲーム」とが一致していることから、重複度が「２．０」と算出されるものとする。この場合、算出部１３４は、「連想ゲーム×連想ゲーム」において、第１音声の推定結果「連想ゲーム」の推定精度が「０．８」であり、第２音声の推定結果「連想ゲーム」の推定精度が「０．９」であり、重複度が「２．０」であることから、スコアＳＣ１は「３．７」と算出する。また、例えば、「演奏ゲーム×清掃ゲーム」において、形態素解析等により、「ゲーム」が一致していることから、重複度が「１．０」と算出されるものとする。この場合、算出部１３４は、「演奏ゲーム×清掃ゲーム」において、第１音声の推定結果「演奏ゲーム」の推定精度が「１．０」であり、第２音声の推定結果「清掃ゲーム」の推定精度が「１．０」であり、重複度が「１．０」であることから、スコアＳＣ１は「３．０」と算出する。そして、算出部１３４は、各組み合わせに対応するスコアＳＣ１をスコア情報記憶部１２３に格納する。 (About calculation unit 134)
For each combination of a first estimation result (an estimation result of the first voice) and a second estimation result (an estimation result of the second voice) estimated by the estimation unit 133, the calculation unit 134 calculates a score based on the estimation accuracy of the first voice's estimation result and the estimation accuracy of the second voice's estimation result. For example, the calculation unit 134 reads "associative game x associative game" and "performance game x cleaning game" from the estimation result information storage unit 122. Suppose that, for "associative game x associative game", morphological analysis or the like finds that both "associative" and "game" match, so the degree of overlap is calculated as "2.0". In this case, since the estimation accuracy of the first voice's estimation result "associative game" is "0.8", the estimation accuracy of the second voice's estimation result "associative game" is "0.9", and the degree of overlap is "2.0", the calculation unit 134 calculates the score SC1 as "3.7". Likewise, suppose that, for "performance game x cleaning game", morphological analysis or the like finds that only "game" matches, so the degree of overlap is calculated as "1.0". In this case, since the estimation accuracy of the first voice's estimation result "performance game" is "1.0", the estimation accuracy of the second voice's estimation result "cleaning game" is "1.0", and the degree of overlap is "1.0", the calculation unit 134 calculates the score SC1 as "3.0". The calculation unit 134 then stores the score SC1 for each combination in the score information storage unit 123.
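
The two worked values are consistent with a plain sum of the two estimation accuracies and the degree of overlap. The sketch below reproduces them under that reading. It assumes the candidates have already been split into morphemes (the morphological analyzer itself is outside the sketch), and the function names are hypothetical.

```python
def overlap_degree(morphemes_a, morphemes_b):
    """Count morphemes shared by both candidates (1.0 per shared morpheme)."""
    return float(len(set(morphemes_a) & set(morphemes_b)))


def score_sc1(acc1, acc2, morphemes_a, morphemes_b):
    """Score SC1 = accuracy of first result + accuracy of second result + overlap."""
    return acc1 + acc2 + overlap_degree(morphemes_a, morphemes_b)


# "associative game x associative game": both morphemes match.
s_same = score_sc1(0.8, 0.9, ["連想", "ゲーム"], ["連想", "ゲーム"])
# "performance game x cleaning game": only "game" matches.
s_diff = score_sc1(1.0, 1.0, ["演奏", "ゲーム"], ["清掃", "ゲーム"])
```

Up to floating-point rounding, `s_same` is 3.7 and `s_diff` is 3.0, matching the values in the text.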

（決定部１３５について）
決定部１３５は、推定部１３３によって第１音声から推定された第１推定結果（第１音声の推定結果に相当）における推定精度と、第１音声に続いてユーザによって繰り返し発話された第２音声から推定された第２推定結果（第２音声の推定結果に相当）における推定精度とに基づいて、第１音声及び第２音声に対応する音声認識結果を決定する。具体的には、決定部１３５は、スコア情報記憶部１２３に記憶されるスコアＳＣ１が最も大きい第１音声の推定結果と第２音声の推定結果との組み合わせから、第１音声及び第２音声に対応する音声認識結果を決定する。例えば、決定部１３５は、スコア情報記憶部１２３を参照して、「連想ゲーム×連想ゲーム」におけるスコアＳＣ１が「３．７」であり、「演奏ゲーム×清掃ゲーム」におけるスコアＳＣ１が「３．０」であることから、「連想ゲーム×連想ゲーム」のスコアＳＣ１が最も大きいため、第１音声の推定結果及び第２音声の推定結果から選択された「連想ゲーム」を音声認識結果として決定する。 (About decision unit 135)
The determination unit 135 determines the voice recognition result corresponding to the first voice and the second voice based on the estimation accuracy of the first estimation result (an estimation result of the first voice) estimated from the first voice by the estimation unit 133 and the estimation accuracy of the second estimation result (an estimation result of the second voice) estimated from the second voice, which the user uttered repeatedly following the first voice. Specifically, the determination unit 135 determines the voice recognition result corresponding to the first voice and the second voice from the combination of a first-voice estimation result and a second-voice estimation result that has the largest score SC1 stored in the score information storage unit 123. For example, referring to the score information storage unit 123, the determination unit 135 finds that the score SC1 of "associative game x associative game" is "3.7" and that the score SC1 of "performance game x cleaning game" is "3.0". Because the score SC1 of "associative game x associative game" is the largest, the determination unit 135 determines "associative game", selected from the estimation results of the first voice and the second voice, as the voice recognition result.
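
Selecting the result from the stored scores is an arg-max over the score table. A minimal sketch, assuming the table is a hypothetical dictionary keyed by `(first-voice result, second-voice result)`; collapsing the pair to a single text when both candidates agree is an illustrative choice, not something the description prescribes.

```python
def decide(score_table):
    """Pick the highest-scoring pair; collapse it when both candidates agree."""
    (first, second), score = max(score_table.items(), key=lambda kv: kv[1])
    result = (first,) if first == second else (first, second)
    return result, score


# Scores from the example above.
table = {
    ("連想ゲーム", "連想ゲーム"): 3.7,
    ("演奏ゲーム", "清掃ゲーム"): 3.0,
}
result, best_score = decide(table)
```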

（提供部１３６について）
提供部１３６は、決定部１３５によって決定された音声認識結果を端末装置１０に提供する。例えば、音声認識結果が「連想ゲーム」であるとする。この場合、提供部１３６は、「連想ゲーム」をテキスト化して端末装置１０に提供する。そして、端末装置１０は、ユーザに対して「連想ゲーム」というテキストを表示する。 (About providing unit 136)
The providing unit 136 provides the terminal device 10 with the voice recognition result determined by the determining unit 135. For example, assume that the voice recognition result is an "associative game". In this case, the providing unit 136 converts the "associative game" into text and provides it to the terminal device 10. Then, the terminal device 10 displays the text "associative game" to the user.

〔３．決定処理のフローチャート〕
次に、図７を用いて、実施形態に係る決定装置１００が実行する決定処理の手順について説明する。図７は、実施形態に係る決定装置が実行する決定処理の流れの一例を示すフローチャートである。 [3. Flowchart of decision processing]
Next, the procedure of the determination process executed by the determination device 100 according to the embodiment will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of the flow of the determination process executed by the determination device according to the embodiment.

図７に示すように、受付部１３１は、ユーザによって発話された音声を受け付ける（ステップＳ１０１）。そして、判定部１３２は、第１音声と第２音声との類似度が所定値以上である場合、第１音声と第２音声とが繰り返し発話されたと判定する（ステップＳ１０２）。判定部１３２は、ユーザによって繰り返し発話された音声であると判定しない場合（ステップＳ１０２；Ｎｏ）、ユーザによって発話された音声を受け付けるまで待機する。 As shown in FIG. 7, the reception unit 131 receives the voice uttered by the user (step S101). Then, when the similarity between the first voice and the second voice is equal to or higher than a predetermined value, the determination unit 132 determines that the first voice and the second voice have been repeatedly uttered (step S102). When the determination unit 132 does not determine that the voice is repeatedly spoken by the user (step S102; No), the determination unit 132 waits until the voice spoken by the user is received.

一方、判定部１３２がユーザによって繰り返し発話された音声と判定した場合（ステップＳ１０２；Ｙｅｓ）、推定部１３３は、判定部１３２によって判定された第１音声と第２音声とに対応する発話内容を推定し、推定結果のランク付けを行う（ステップＳ１０３）。 On the other hand, when the determination unit 132 determines that the voice is repeatedly spoken by the user (step S102; Yes), the estimation unit 133 determines the utterance content corresponding to the first voice and the second voice determined by the determination unit 132. Estimate and rank the estimation results (step S103).

決定部１３５は、算出部１３４が推定部１３３によって推定された推定結果の組み合わせにおいて算出したスコアが最も大きい第１音声の推定結果及び第２音声の推定結果を音声認識結果として決定する（ステップＳ１０４）。そして、提供部１３６は、決定部１３５によって決定された音声認識結果を端末装置１０に提供する（ステップＳ１０５）。 The determination unit 135 determines the estimation result of the first voice and the estimation result of the second voice having the highest score calculated by the calculation unit 134 in the combination of the estimation results estimated by the estimation unit 133 as the voice recognition result (step S104). ). Then, the providing unit 136 provides the terminal device 10 with the voice recognition result determined by the determining unit 135 (step S105).
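
Steps S102 to S105 of the flowchart can be strung together as one function. This is a sketch under stated assumptions, not the device's actual implementation: the similarity value and the candidate lists are taken as inputs (the recognizer and the similarity measure are outside the sketch), candidates are hypothetical `(morphemes, accuracy)` pairs, and the score is the sum of the two accuracies and the number of shared morphemes.

```python
from itertools import product


def run_decision(similarity, first, second, threshold=0.8):
    """Sketch of steps S102 to S105 for one pair of utterances."""
    # S102: stop unless the second voice is judged a repetition of the first.
    if similarity < threshold:
        return None
    # S103: rank each candidate list by estimation accuracy, highest first.
    first = sorted(first, key=lambda c: c[1], reverse=True)
    second = sorted(second, key=lambda c: c[1], reverse=True)

    # S104: score every combination and keep the best one.
    def score(pair):
        (m1, a1), (m2, a2) = pair
        return a1 + a2 + len(set(m1) & set(m2))

    (m1, _), (m2, _) = max(product(first, second), key=score)
    # S105: this value is what would be provided to the terminal device.
    return "".join(m1), "".join(m2)


first_voice = [(("演奏", "ゲーム"), 1.0), (("塩素", "ゲーム"), 0.9), (("連想", "ゲーム"), 0.8)]
second_voice = [(("清掃", "ゲーム"), 1.0), (("連想", "ゲーム"), 0.9), (("炎症", "ゲーム"), 0.8)]
recognized = run_decision(0.9, first_voice, second_voice)
```

With the candidates from the embodiment, the only combination whose overlap is 2 is "連想ゲーム" with "連想ゲーム", so it wins even though neither candidate is ranked first on its own.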

〔４．音声認識結果の決定処理〕
次に、図８を用いて、実施形態に係る決定システム１における音声認識結果の決定について説明する。図８は、実施形態に係る音声認識結果の決定の一例を示す図である。 [4. Speech recognition result determination process]
Next, the determination of the voice recognition result in the determination system 1 according to the embodiment will be described with reference to FIG. FIG. 8 is a diagram showing an example of determining the voice recognition result according to the embodiment.

図８を用いて第１音声の推定結果と第２音声の推定結果とが異なる場合について説明する。図８は、第１音声の推定結果とかかる推定結果の推定精度と、第２音声の推定結果とかかる推定結果の推定精度とにおける組み合わせを示す図である。図８に示す例においては、第１音声の推定結果ＷＴ５に示すように「まつしま」は、推定精度「１．０」である。 A case where the estimation result of the first voice and the estimation result of the second voice are different from each other will be described with reference to FIG. FIG. 8 is a diagram showing a combination of the estimation result of the first voice and the estimation accuracy of the estimation result, and the estimation result of the second voice and the estimation accuracy of the estimation result. In the example shown in FIG. 8, “Matsushima” has an estimation accuracy of “1.0” as shown in the estimation result WT5 of the first voice.

図８に示す例では、決定装置１００は、第１音声の推定結果の推定精度と第２音声の推定結果の推定精度と重複度とを加味したスコアを算出する。例えば、「まつしま×やつしま」において、文字の重複度を解析することにより、「つ」と「し」と「ま」が一致していることから、重複度が「３．０」と算出されるものとする。この場合、決定装置１００は、図８中の算出式スコアＣＴ２１に示す式により、第１音声の推定結果ＷＴ５「まつしま」の推定精度「１．０」と、第２音声の推定結果ＷＴ６「やつしま」の推定精度「０．９」と、重複度「３．０」であるから、スコアＣＴ２１「４．９」と算出する。 In the example shown in FIG. 8, the determination device 100 calculates a score that takes into account the estimation accuracy of the first voice's estimation result, the estimation accuracy of the second voice's estimation result, and the degree of overlap. For example, suppose that, for "Matsushima x Yatsushima", analyzing the overlap of characters finds that "tsu", "shi", and "ma" match, so the degree of overlap is calculated as "3.0". In this case, using the formula shown as calculation formula score CT21 in FIG. 8, the determination device 100 calculates the score CT21 as "4.9" from the estimation accuracy "1.0" of the first voice's estimation result WT5 "Matsushima", the estimation accuracy "0.9" of the second voice's estimation result WT6 "Yatsushima", and the degree of overlap "3.0".
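
The character-level overlap in this example can be sketched as a position-by-position comparison. The description does not state how matching characters are counted, so positional matching is an assumption here; it yields the stated overlap of 3.0 for this pair.

```python
def char_overlap(a, b):
    """Count positions at which both strings have the same character."""
    return float(sum(1 for x, y in zip(a, b) if x == y))


# "まつしま" (accuracy 1.0) and "やつしま" (accuracy 0.9) from the example.
overlap = char_overlap("まつしま", "やつしま")
score_ct21 = 1.0 + 0.9 + overlap
```

Up to floating-point rounding, `score_ct21` equals the 4.9 given in the text.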

また、例えば、図８に示す例において、以下の組み合わせにおいて上記算出方法に基づいてスコアＣＴ２２〜ＣＴ２４を算出する。
スコアＣＴ２２（「まつしま×はつしま」）＝４．８・・・（９）
スコアＣＴ２３（「たつしま×やつしま」）＝４．８・・・（１０）
スコアＣＴ２４（「たつしま×はつしま」）＝４．６・・・（１１） Further, for example, in the example shown in FIG. 8, the scores CT22 to CT24 are calculated based on the above calculation method in the following combinations.
Score CT22 ("Matsushima x Hatsushima") = 4.8 ... (9)
Score CT23 ("Tatsushima x Yatsushima") = 4.8 ... (10)
Score CT24 ("Tatsushima x Hatsushima") = 4.6 ... (11)

そして、決定装置１００は、各スコアＣＴ２１〜２４を比較する。例えば、決定装置１００は、第１音声の推定結果ＷＴ５「まつしま」と、第２音声の推定結果ＷＴ６「やつしま」とのスコアＣＴ２１が最も大きいため、第１音声の推定結果ＷＴ５及び第２音声の推定結果ＷＴ６を音声認識結果として決定する。 Then, the determination device 100 compares each score CT21 to 24. For example, the determination device 100 has the largest score CT21 between the first voice estimation result WT5 "Matsushima" and the second voice estimation result WT6 "Yatsushima", so that the first voice estimation result WT5 and the second voice The voice estimation result WT6 is determined as the voice recognition result.

〔５．変形例〕
上述した決定装置１００は、上記実施形態以外にも種々の異なる形態にて実施されてよい。そこで、以下では、決定装置１００の他の実施形態について説明する。 [5. Modification example]
The determination device 100 described above may be implemented in various different forms other than the above embodiment. Therefore, another embodiment of the determination device 100 will be described below.

〔５−１．決定装置（１）〕
図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。例えば、決定装置１００は、受付部１３１と判定部１３２とで構成される受付装置と、推定部１３３と算出部１３４と決定部１３５と提供部１３６とで構成される決定装置とに分散させてもよい。 [5-1. Determining device (1)]
Each component of each illustrated device is functionally conceptual and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the determination device 100 may be divided into a reception device composed of the reception unit 131 and the determination unit 132, and a determination device composed of the estimation unit 133, the calculation unit 134, the determination unit 135, and the providing unit 136.

〔５−２．決定装置（２）〕
上記実施形態では、決定装置１００がユーザによって２回発話された音声に対応する推定結果の組み合わせにおけるスコアに基づいて、音声認識結果を決定する決定処理の一例を説明したが、決定装置１００は、これに限定されるものではない。例えば、決定装置１００が実行する決定処理は、端末装置１０がスタンドアローンで実行してもよい。 [5-2. Determining device (2)]
In the above embodiment, an example of the determination process was described in which the determination device 100 determines the voice recognition result based on the scores of combinations of estimation results corresponding to the voices uttered twice by the user; however, the determination device 100 is not limited to this. For example, the determination process executed by the determination device 100 may instead be executed by the terminal device 10 on a standalone basis.

〔５−３．発話回数〕
上記実施形態では、決定装置１００の決定部１３５がユーザによって２回発話された音声に対応する推定結果の組み合わせ毎のスコアに基づいて、音声認識結果を決定する決定処理の一例を説明したが、発話回数は、これに限定されるものではない。例えば、決定部１３５は、ユーザによって３回以上発話された音声に対応する推定結果の組み合わせにおけるスコアに基づいて、音声認識結果を決定してもよい。 [5-3. Number of utterances]
In the above embodiment, an example of the determination process in which the determination unit 135 of the determination device 100 determines the voice recognition result based on the score for each combination of the estimation results corresponding to the voice uttered twice by the user has been described. The number of utterances is not limited to this. For example, the determination unit 135 may determine the voice recognition result based on the score in the combination of the estimation results corresponding to the voice uttered by the user three or more times.

〔５−４．判定処理〕
上記実施形態では、決定装置１００の判定部１３２が第１音声と、第１音声の後に受け付けられた第２音声との類似性に基づいて、第２音声が第１音声に続いて繰り返し発話された音声であるかを判定する判定処理の一例を説明したが、判定部１３２は、判定処理を行う前に、ユーザによって発話された音声のうち、感嘆詞等と推定される音声波形を除去した音声に基づいて、第２音声が第１音声に続いて繰り返し発話された音声であるかを判定してもよい。例えば、第１音声「ＷＶ３」には「あっ」と推定される音声波形が含まれているとする。また、第１音声「ＷＶ４」には「えー」と推定される音声波形が含まれているとする。この場合、判定部１３２は、第１音声に含まれる「あっ」に対応する音声波形を削除し、第２音声に含まれる「えー」に対応する音声波形を削除する。そして、判定部１３２は、第１音声「ＷＶ３」と第２音声「ＷＶ４」との類似性に基づいて、第２音声が第１音声に続いて繰り返し発話された音声であるかを判定してもよい。 [5-4. Determination process〕
In the above embodiment, the determination unit 132 of the determination device 100 repeatedly utters the second voice following the first voice based on the similarity between the first voice and the second voice received after the first voice. Although an example of the determination process for determining whether or not the voice is a voice has been described, the determination unit 132 removes a voice waveform presumed to be an exclamation word or the like from the voice uttered by the user before performing the determination process. Based on the voice, it may be determined whether the second voice is a voice repeatedly spoken following the first voice. For example, it is assumed that the first voice "WV3" includes a voice waveform presumed to be "A". Further, it is assumed that the first voice "WV4" includes a voice waveform presumed to be "er". In this case, the determination unit 132 deletes the voice waveform corresponding to "A" included in the first voice and deletes the voice waveform corresponding to "Eh" included in the second voice. Then, the determination unit 132 determines whether the second voice is a voice repeatedly uttered following the first voice based on the similarity between the first voice "WV3" and the second voice "WV4". May be good.

〔５−５．算出処理〕
上記実施形態では、決定装置１００の算出部１３４が推定結果の組み合わせ毎のスコアを算出する算出処理の一例を説明したが、算出部１３４は、重複度に限らず、単語の内容に基づいて、スコアを算出してもよい。例えば、算出部１３４は、以下のような式（１２）によりスコアＳＣ２を算出する。 [5-5. Calculation process]
In the above embodiment, an example of the calculation process in which the calculation unit 134 of the determination device 100 calculates the score for each combination of the estimation results has been described, but the calculation unit 134 is not limited to the degree of duplication, but is based on the content of the word. The score may be calculated. For example, the calculation unit 134 calculates the score SC2 by the following formula (12).

スコアＳＣ２＝Ａｃｃ１＋Ａｃｃ２＋Ｃｏｎ・・・（１２） Score SC2 = Acc1 + Acc2 + Con ... (12)

上記式（１２）では、「Ａｃｃ１」は、第１音声の推定結果の推定精度を示し、「Ａｃｃ２」は、第２音声の推定結果の推定精度を示し、「Ｃｏｎ」は、第１音声の推定結果と第２音声の推定結果とに含まれる単語の意味に関する情報（以下、意味重複度と記載する）を示す。例えば、「写真アプリケーション×画像アプリケーション」と「演奏アプリケーション×清掃アプリケーション」とにおけるスコアＳＣ２を算出するとする。また、「写真アプリケーション×画像アプリケーション」において、意味解析等により、「写真」と「画像」の意味が一致していることと、「アプリケーション」が一致していることとから、意味重複度が「２．０」と算出されるものとする。この場合、「写真アプリケーション×画像アプリケーション」において、第１音声の推定結果「写真アプリケーション」の推定精度が「０．８」であり、第２音声の推定結果「画像アプリケーション」の推定精度が「０．９」であり、意味重複度が「２．０」であることから、スコアＳＣ２を「３．７」と算出する。また、例えば、「演奏アプリケーション×清掃アプリケーション」において、意味解析等により、「アプリケーション」が一致しているため、意味重複度が「１．０」と算出されるものとする。この場合、算出部１３４は、「演奏アプリケーション×清掃アプリケーション」において、第１音声の推定結果「演奏アプリケーション」の推定精度が「１．０」であり、第２音声の推定結果「清掃アプリケーション」の推定精度が「１．０」であり、意味重複度が「１．０」であることから、スコアＳＣ２は「３．０」と算出する。 In formula (12) above, "Acc1" is the estimation accuracy of the first voice's estimation result, "Acc2" is the estimation accuracy of the second voice's estimation result, and "Con" is information about the meanings of the words contained in the first voice's estimation result and the second voice's estimation result (hereinafter, the degree of semantic overlap). For example, suppose that the score SC2 is calculated for "photo application x image application" and for "performance application x cleaning application". Suppose also that, for "photo application x image application", semantic analysis or the like finds that "photo" and "image" have the same meaning and that "application" matches, so the degree of semantic overlap is calculated as "2.0". In this case, since the estimation accuracy of the first voice's estimation result "photo application" is "0.8", the estimation accuracy of the second voice's estimation result "image application" is "0.9", and the degree of semantic overlap is "2.0", the score SC2 is calculated as "3.7".
Likewise, suppose that, for "performance application x cleaning application", semantic analysis or the like finds that only "application" matches, so the degree of semantic overlap is calculated as "1.0". In this case, since the estimation accuracy of the first voice's estimation result "performance application" is "1.0", the estimation accuracy of the second voice's estimation result "cleaning application" is "1.0", and the degree of semantic overlap is "1.0", the calculation unit 134 calculates the score SC2 as "3.0".
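
Formula (12) can be reproduced with a small sketch in which the semantic analysis is replaced by a lookup in a synonym table. The table, the pre-split word lists, and the function names are all hypothetical stand-ins; a real system would use an actual semantic analyzer.

```python
# Hypothetical synonym groups standing in for semantic analysis.
SYNONYM_GROUPS = [{"写真", "画像"}]


def same_meaning(w1, w2):
    """True when the words are identical or belong to one synonym group."""
    if w1 == w2:
        return True
    return any(w1 in group and w2 in group for group in SYNONYM_GROUPS)


def semantic_overlap(words_a, words_b):
    """Count words in words_a that have a same-meaning partner in words_b."""
    return float(sum(1 for w in words_a if any(same_meaning(w, v) for v in words_b)))


def score_sc2(acc1, acc2, words_a, words_b):
    """Score SC2 = Acc1 + Acc2 + Con, as in formula (12)."""
    return acc1 + acc2 + semantic_overlap(words_a, words_b)


sc2_photo = score_sc2(0.8, 0.9, ["写真", "アプリケーション"], ["画像", "アプリケーション"])
sc2_music = score_sc2(1.0, 1.0, ["演奏", "アプリケーション"], ["清掃", "アプリケーション"])
```

Up to floating-point rounding, `sc2_photo` is 3.7 and `sc2_music` is 3.0, matching the worked values.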

〔５−６．決定処理〕
上記実施形態では、決定装置１００の決定部１３５がユーザによって２回発話された音声から推定された推定結果の組み合わせ毎のスコアに基づいて、音声認識結果を決定する決定処理の一例を説明したが、決定部１３５は、ユーザによって繰り返し発話された音声から推定された正しい推定結果とそれ以外の推定結果とに基づいて生成された学習モデルに基づいて、音声認識結果を決定してもよい。例えば、決定部１３５は、ユーザによって繰り返し発話された音声から推定された正しい推定結果を正例として学習し、それ以外の推定結果を負例として学習する。そして、決定部１３５は、ユーザによって繰り返し発話された音声が予測対象の音声として入力された場合に、推定結果に対応するスコアに基づいて、音声認識結果を決定する。 [5-6. Decision processing]
In the above embodiment, an example of the determination process in which the determination unit 135 of the determination device 100 determines the voice recognition result based on the score for each combination of the estimation results estimated from the voice uttered twice by the user has been described. , The determination unit 135 may determine the voice recognition result based on the learning model generated based on the correct estimation result estimated from the voice repeatedly spoken by the user and the other estimation results. For example, the determination unit 135 learns the correct estimation result estimated from the voice repeatedly spoken by the user as a positive example, and learns the other estimation results as a negative example. Then, the determination unit 135 determines the voice recognition result based on the score corresponding to the estimation result when the voice repeatedly uttered by the user is input as the voice to be predicted.

図９を用いて、変形例に係る決定装置１００が実行する決定処理の一例について説明する。図９は、変形例に係る決定装置１００が実行する決定処理の一例を示す図である。以下、図９を用いて、決定装置１００が実行する決定処理の一例を流れに沿って説明する。 An example of the determination process executed by the determination device 100 according to the modified example will be described with reference to FIG. FIG. 9 is a diagram showing an example of a determination process executed by the determination device 100 according to the modified example. Hereinafter, an example of the determination process executed by the determination device 100 will be described along the flow with reference to FIG.

First, as shown in FIG. 9, the determination device 100 generates the learning model M1 based on the correct estimation results estimated from voices repeatedly uttered by the user and on the other estimation results (step S21). For example, the determination device 100 learns the correct estimation results estimated from the voices repeatedly uttered by the user as positive examples, and learns the other estimation results as negative examples. The determination device 100 thereby generates a learning model M1 that, when a voice to be predicted is input, calculates a score for that voice. Next, the determination device 100 inputs the voice to be predicted into the learning model M1 (step S22). For example, the determination device 100 inputs a voice repeatedly uttered by the user into the learning model M1 as the voice to be predicted. Subsequently, the determination device 100 calculates a score corresponding to the estimation result based on the learning model M1 (step S23). For example, with the score of a positive example set to 1 and the score of a negative example set to 0, the determination device 100 calculates the score corresponding to the estimation result for the voice to be predicted as a value from 0 to 1. Then, the determination device 100 determines the voice recognition result based on the score (step S24).
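The flow of steps S21 to S24 can be sketched as follows. This is only a minimal illustration under stated assumptions: the description does not fix the learning algorithm or the input features, so the sketch uses a small logistic-regression model, and the two features (estimation accuracy and degree of overlap), the training data, and the function names `train_model`, `score`, and `decide` are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_model(examples, epochs=500, lr=0.5):
    """Step S21: fit a logistic-regression model on (features, label) pairs.

    label is 1 for a correct estimation result (positive example)
    and 0 for any other estimation result (negative example).
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            pred = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            err = pred - label
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def score(model, features):
    """Steps S22 and S23: input a candidate and obtain a score between 0 and 1."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def decide(model, candidates):
    """Step S24: return the candidate text with the highest score."""
    return max(candidates, key=lambda c: score(model, c[1]))[0]


# Hypothetical features: (estimation accuracy, overlap with the other utterance).
training = [
    ((0.9, 1.0), 1), ((0.8, 0.9), 1), ((0.7, 1.0), 1),
    ((0.6, 0.1), 0), ((0.4, 0.0), 0), ((0.5, 0.2), 0),
]
model = train_model(training)
candidates = [("weather in Tokyo", (0.75, 0.9)), ("whether in Tokyo", (0.55, 0.1))]
result = decide(model, candidates)
```

Here the positive examples pull the score toward 1 and the negative examples toward 0, so the candidate whose features resemble the positive examples is selected in step S24.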

The determination device 100 may generate the learning model M1 using any learning algorithm. For example, the determination device 100 generates the learning model M1 using a learning algorithm such as a neural network, a support vector machine, clustering, or reinforcement learning. As an example, when the determination device 100 generates the learning model M1 using a neural network, the learning model M1 has an input layer including one or more neurons, an intermediate layer including one or more neurons, and an output layer including one or more neurons.
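As a rough illustration of the neural-network case, the following sketch shows a forward pass through an input layer, one intermediate layer, and an output layer. The layer sizes, the sigmoid activation, the three input features, and the randomly initialized (untrained) weights are assumptions made only for this example and are not taken from the description.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: each neuron applies a sigmoid to a weighted sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, params):
    """Propagate the input layer values through the intermediate and output layers."""
    hidden = dense(features, params["w_hidden"], params["b_hidden"])
    output = dense(hidden, params["w_out"], params["b_out"])
    return output[0]


rng = random.Random(0)
params = {
    # 3 input neurons -> 4 intermediate neurons -> 1 output neuron (sizes are arbitrary).
    "w_hidden": [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)],
    "b_hidden": [0.0] * 4,
    "w_out": [[rng.uniform(-1, 1) for _ in range(4)]],
    "b_out": [0.0],
}

# Hypothetical input: two estimation accuracies and a degree of overlap.
nn_score = forward([0.8, 0.7, 1.0], params)
```

In practice the weights would be learned from the positive and negative examples; the placeholder weights here only demonstrate how a score between 0 and 1 emerges from the output layer.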

As a result, the determination unit 135 of the determination device 100 according to the embodiment can determine, with high accuracy, the voice recognition result corresponding to repeated voices input as the prediction target, by using a learning model that has learned the tendencies of the voices uttered by the user.

[6. Hardware configuration]
The terminal device 10 and the determination device 100 according to the embodiment described above are realized by, for example, a computer 1000 configured as shown in FIG. 10. The determination device 100 is used as an example below. FIG. 10 is a hardware configuration diagram showing an example of the computer 1000 that realizes the functions of the determination device 100. The computer 1000 has a CPU 1100, a RAM 1200, a ROM (Read Only Memory) 1300, an HDD (Hard Disk Drive) 1400, a communication interface (I/F) 1500, an input/output interface (I/F) 1600, and a media interface (I/F) 1700.

The CPU 1100 operates based on a program stored in the ROM 1300 or the HDD 1400, and controls each unit. The ROM 1300 stores a boot program executed by the CPU 1100 when the computer 1000 starts up, programs that depend on the hardware of the computer 1000, and the like.

The HDD 1400 stores programs executed by the CPU 1100, data used by those programs, and the like. The communication interface 1500 receives data from other devices via the network N and sends it to the CPU 1100, and transmits data generated by the CPU 1100 to other devices via the network N.

The CPU 1100 controls output devices such as a display and a printer, and input devices such as a keyboard and a mouse, via the input/output interface 1600. The CPU 1100 acquires data from the input devices via the input/output interface 1600. The CPU 1100 also outputs generated data to the output devices via the input/output interface 1600.

The media interface 1700 reads a program or data stored in the recording medium 1800 and provides it to the CPU 1100 via the RAM 1200. The CPU 1100 loads the program from the recording medium 1800 onto the RAM 1200 via the media interface 1700, and executes the loaded program. The recording medium 1800 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

For example, when the computer 1000 functions as the determination device 100 according to the embodiment, the CPU 1100 of the computer 1000 realizes the functions of the control unit 130 by executing the program loaded on the RAM 1200. The HDD 1400 stores the data held in the storage unit 120. The CPU 1100 of the computer 1000 reads these programs from the recording medium 1800 and executes them; as another example, it may acquire these programs from another device via the network N.

[7. Others]
Among the processes described in the above embodiment and modified example, all or part of a process described as being performed automatically can also be performed manually, and all or part of a process described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified. For example, the various information shown in each figure is not limited to the illustrated information.

Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the one illustrated, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

The embodiment and modified example described above can be combined as appropriate to the extent that the processing contents do not contradict each other.

The term "unit (section, module, unit)" used above can be read as "means", "circuit", or the like. For example, the determination unit can be read as a determination means or a determination circuit.

[8. Effects]
As described above, the determination device 100 according to the embodiment includes an estimation unit 133 and a determination unit 135. The estimation unit 133 estimates the utterance content corresponding to a voice uttered by the user. The determination unit 135 determines the voice recognition result corresponding to a first voice and a second voice based on the estimation accuracy of a first estimation result (corresponding to the estimation result of the first voice) estimated from the first voice by the estimation unit 133, and the estimation accuracy of a second estimation result (corresponding to the estimation result of the second voice) estimated from the second voice, which is uttered repeatedly by the user following the first voice.

As a result, the determination device 100 according to the embodiment can determine the voice recognition result based on a score calculated for each combination of the voices repeatedly uttered by the user, and can therefore improve the accuracy of voice recognition.

In the determination device 100 according to the embodiment, the determination unit 135 determines either the first estimation result or the second estimation result as the voice recognition result corresponding to the first voice and the second voice, based on the estimation accuracy of the first estimation result and the estimation accuracy of the second estimation result.

As a result, when the estimation result of the first voice and the estimation result of the second voice differ within a combination of voices repeatedly uttered by the user, the determination device 100 according to the embodiment can determine the voice recognition result based on the estimation accuracy of each estimation result, and can therefore improve the accuracy of voice recognition.
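The selection between two differing estimation results can be illustrated with a minimal sketch. The `Estimate` type, the function name `choose_result`, and the tie-breaking rule (the first estimation result wins a tie) are assumptions for illustration only.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    text: str        # estimated utterance content
    accuracy: float  # estimation accuracy reported for this estimation result


def choose_result(first: Estimate, second: Estimate) -> str:
    """Return one of the two estimation results as the voice recognition result."""
    if first.text == second.text:
        return first.text
    # The results differ, so keep the one with the higher estimation accuracy.
    return max(first, second, key=lambda e: e.accuracy).text


chosen = choose_result(Estimate("play jass music", 0.55), Estimate("play jazz music", 0.80))
```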

In the determination device 100 according to the embodiment, the estimation unit 133 estimates a plurality of utterance contents corresponding to one voice uttered by the user. The determination unit 135 determines one of the plurality of first estimation results and the plurality of second estimation results as the voice recognition result corresponding to the first voice and the second voice, based on the estimation accuracy of each of the plurality of first estimation results estimated from the first voice and the estimation accuracy of each of the plurality of second estimation results estimated from the second voice.

As a result, when the estimation result of the first voice and the estimation result of the second voice differ within the combination selected based on the score calculated for each combination of the voices repeatedly uttered by the user, the determination device 100 according to the embodiment can determine the voice recognition result based on the estimation accuracy of each estimation result, and can therefore improve the accuracy of voice recognition.

The determination device 100 according to the embodiment further includes a calculation unit 134 that calculates, for each combination of a first estimation result and a second estimation result, a score based on the estimation accuracy of the first estimation result and the estimation accuracy of the second estimation result. The determination unit 135 determines, as the voice recognition result corresponding to the first voice and the second voice, either the first estimation result or the second estimation result included in the combination of the first estimation result and the second estimation result that is selected based on the score calculated by the calculation unit 134.

In the determination device 100 according to the embodiment, the calculation unit 134 calculates the score for each combination of a first estimation result and a second estimation result based on the degree to which the first estimation result and the second estimation result overlap.

As a result, the determination device 100 according to the embodiment can determine the voice recognition result for voices of the same content repeatedly uttered by the user based on the degree of overlap between the voices, and can therefore improve the accuracy of voice recognition.
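The per-combination scoring can be sketched as follows. This section does not give the scoring formula, so the formula used here (product of the two estimation accuracies plus a weighted word-level Jaccard overlap) is purely an assumption, as are the function names and the sample candidates.

```python
from itertools import product


def overlap_degree(a, b):
    """Jaccard overlap of the word sets of two estimation results (an assumed measure)."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def combination_score(first, second, overlap_weight=1.0):
    """Score one (first estimation result, second estimation result) combination."""
    (text1, acc1), (text2, acc2) = first, second
    return acc1 * acc2 + overlap_weight * overlap_degree(text1, text2)


def decide(first_results, second_results):
    """Select the best-scoring combination, then its more accurate member."""
    best_first, best_second = max(
        product(first_results, second_results),
        key=lambda pair: combination_score(*pair),
    )
    return max(best_first, best_second, key=lambda r: r[1])[0]


# Each candidate is (estimated text, estimation accuracy); the values are made up.
first_results = [("play jass music", 0.50), ("play jazz music", 0.45)]
second_results = [("play jazz music", 0.60), ("pray jazz music", 0.30)]
decided = decide(first_results, second_results)
```

In this made-up example the top candidate for the first voice alone is "play jass music", but the combination in which both voices yield "play jazz music" scores highest because of its overlap, so that text is selected.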

In the determination device 100 according to the embodiment, the calculation unit 134 calculates the score for each combination of a first estimation result and a second estimation result based on the degree of overlap of words having similar meanings that are included in the first estimation result and the second estimation result.

As a result, the determination device 100 according to the embodiment can determine the voice recognition result for voices of the same content uttered multiple times by the user based on the semantic similarity of the words used in the voices, and can therefore improve the accuracy of voice recognition.
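An overlap measure that treats words of similar meaning as matching could look like the sketch below. The synonym table is hypothetical and hard-coded only for illustration; how similar meanings are actually identified (a thesaurus, word embeddings, or another resource) is not specified in this section.

```python
# Hypothetical synonym groups; a real system would need a proper lexical resource.
SYNONYM_GROUPS = [
    {"movie", "film"},
    {"show", "display"},
]


def canonical(word):
    """Map a word to a representative of its synonym group, if it has one."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return min(group)
    return word


def semantic_overlap(a, b):
    """Jaccard overlap computed after replacing words with their group representative."""
    ca = {canonical(w) for w in a.split()}
    cb = {canonical(w) for w in b.split()}
    if not ca and not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)


same_meaning = semantic_overlap("show me a movie", "display me a film")
different_meaning = semantic_overlap("show me a movie", "show me a map")
```

With this table, the two differently worded requests for a film overlap completely, while the request for a map overlaps only partially.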

The determination device 100 according to the embodiment further includes a reception unit 131 that receives voices uttered by the user, and a judgment unit 132 that judges, based on the similarity between a first voice received by the reception unit 131 and a second voice received by the reception unit 131 after the first voice, whether the second voice is a voice uttered repeatedly following the first voice. When the judgment unit 132 judges that the second voice is a voice uttered repeatedly following the first voice, the determination unit 135 determines the voice recognition result corresponding to the first voice and the second voice.

As a result, the determination device 100 according to the embodiment can accurately judge whether a plurality of voices are repeated utterances by the user, and can therefore perform voice recognition while keeping the burden on the user to a minimum.
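A repetition check of this kind might be sketched as follows. The section says only that the judgment is based on the similarity between the two voices; comparing the recognized text with `difflib.SequenceMatcher`, the time-gap condition, and both threshold values are assumptions introduced for this example.

```python
from difflib import SequenceMatcher


def is_repetition(first_text, second_text, gap_seconds,
                  min_similarity=0.6, max_gap_seconds=10.0):
    """Judge whether the second utterance repeats the first one.

    The thresholds are illustrative placeholders, not values from the description.
    """
    if gap_seconds > max_gap_seconds:
        return False
    similarity = SequenceMatcher(None, first_text, second_text).ratio()
    return similarity >= min_similarity


repeated = is_repetition("play jass music", "play jazz music", gap_seconds=3.0)
unrelated = is_repetition("play jazz music", "what time is it", gap_seconds=3.0)
```

Only when such a check succeeds would the two voices be passed on together for determination of a single voice recognition result.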

Some embodiments of the present application have been described above in detail with reference to the drawings, but these are examples. The present invention can be implemented in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention section.

1 Determination system
10 Terminal device
100 Determination device
110 Communication unit
120 Storage unit
121 Voice information storage unit
122 Estimation result information storage unit
123 Score information storage unit
130 Control unit
131 Reception unit
132 Judgment unit
133 Estimation unit
134 Calculation unit
135 Determination unit
136 Providing unit

Claims

A determination device comprising:
an estimation unit that estimates utterance content corresponding to a voice uttered by a user;
a calculation unit that calculates a score for each combination of a first estimation result and a second estimation result, based on the estimation accuracy of the first estimation result estimated from a first voice by the estimation unit, the estimation accuracy of the second estimation result estimated from a second voice repeatedly uttered by the user following the first voice, and a degree of overlap indicating the extent to which the first estimation result and the second estimation result overlap; and
a determination unit that determines a voice recognition result corresponding to the first voice and the second voice based on the score calculated by the calculation unit.

The determination device according to claim 1, wherein
the determination unit
determines either the first estimation result or the second estimation result as the voice recognition result corresponding to the first voice and the second voice, based on the estimation accuracy of the first estimation result and the estimation accuracy of the second estimation result.

The determination device according to claim 1 or 2, wherein
the estimation unit
estimates a plurality of utterance contents corresponding to one voice uttered by the user, and
the determination unit
determines one of a plurality of first estimation results and a plurality of second estimation results as the voice recognition result corresponding to the first voice and the second voice, based on the estimation accuracy of each of the plurality of first estimation results estimated from the first voice and the estimation accuracy of each of the plurality of second estimation results estimated from the second voice.

The determination device according to any one of claims 1 to 3, wherein
the determination unit
determines, as the voice recognition result corresponding to the first voice and the second voice, either the first estimation result or the second estimation result included in the combination of the first estimation result and the second estimation result that is selected based on the score.

The determination device according to claim 4, wherein
the calculation unit
calculates the score for each combination of the first estimation result and the second estimation result based on the degree to which the first estimation result and the second estimation result overlap.

The determination device according to claim 4, wherein
the calculation unit
calculates the score for each combination of the first estimation result and the second estimation result based on the degree of overlap of words having similar meanings that are included in the first estimation result and the second estimation result.

The determination device according to any one of claims 1 to 6, further comprising:
a reception unit that receives a voice uttered by the user; and
a judgment unit that judges, based on the similarity between the first voice received by the reception unit and the second voice received by the reception unit after the first voice, whether the second voice is a voice uttered repeatedly following the first voice, wherein
the determination unit
determines the voice recognition result corresponding to the first voice and the second voice when the judgment unit judges that the second voice is a voice uttered repeatedly following the first voice.

A determination method executed by a computer, the method comprising:
an estimation step of estimating utterance content corresponding to a voice uttered by a user;
a calculation step of calculating a score for each combination of a first estimation result and a second estimation result, based on the estimation accuracy of the first estimation result estimated from a first voice in the estimation step, the estimation accuracy of the second estimation result estimated from a second voice repeatedly uttered by the user following the first voice, and a degree of overlap indicating the extent to which the first estimation result and the second estimation result overlap; and
a determination step of determining a voice recognition result corresponding to the first voice and the second voice based on the score calculated in the calculation step.

A determination program that causes a computer to execute:
an estimation procedure of estimating utterance content corresponding to a voice uttered by a user;
a calculation procedure of calculating a score for each combination of a first estimation result and a second estimation result, based on the estimation accuracy of the first estimation result estimated from a first voice in the estimation procedure, the estimation accuracy of the second estimation result estimated from a second voice repeatedly uttered by the user following the first voice, and a degree of overlap indicating the extent to which the first estimation result and the second estimation result overlap; and
a determination procedure of determining a voice recognition result corresponding to the first voice and the second voice based on the score calculated in the calculation procedure.

A program for causing a computer to function, the computer having a model that includes:
an input layer to which are input a first estimation result estimated from a first voice uttered by a predetermined user, a second estimation result estimated from a second voice repeatedly uttered by the user following the first voice, and a degree of overlap indicating the extent to which the first estimation result and the second estimation result overlap;
an output layer;
a first element that belongs to any layer from the input layer to the output layer other than the output layer; and
a second element whose value is calculated based on the first element and a weight of the first element,
wherein the program causes the computer to output, from the output layer, a voice recognition result corresponding to the first voice and the second voice by performing, on the first estimation result, the second estimation result, and the degree of overlap input to the input layer, a calculation based on the first element and the weight of the first element, with each element belonging to each layer other than the output layer taken as the first element.