JP4845955B2

JP4845955B2 - Speech recognition result correction apparatus and speech recognition result correction method

Info

Publication number: JP4845955B2
Application number: JP2008315766A
Authority: JP
Inventors: 志鵬張; 信彦仲
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2008-12-11
Filing date: 2008-12-11
Publication date: 2011-12-28
Anticipated expiration: 2028-12-11
Also published as: JP2010139744A

Description

The present invention relates to a speech recognition result correction apparatus and a speech recognition result correction method.

Patent Document 1 below describes a technology in which speech input at a mobile terminal is output to a server, the server recognizes the speech and transmits the recognition result to the mobile terminal, and the mobile terminal thereby obtains the speech recognition result.
Patent Document 1: JP 2003-295893 A

However, because only one recognition result is obtained, an incorrect recognition result is obtained whenever recognition accuracy at the server is poor. Even if correction processing is then attempted on the basis of this incorrect recognition result, an appropriate correction result cannot be obtained.

An object of the present invention is therefore to provide a speech recognition result correction apparatus and a speech recognition result correction method that can perform appropriate correction processing even when recognition accuracy at the server is poor.

To solve the above problem, the speech recognition result correction apparatus of the present invention comprises: input means for inputting speech; transmission means for transmitting, to a speech recognition server, information for causing the speech recognition server to recognize the speech input by the input means; acquisition means for acquiring a plurality of recognition results for the speech processed on the basis of the information transmitted by the transmission means for recognition by the speech recognition server; acquisition means for acquiring a plurality of recognition results for the speech input by the input means; distribution means for distributing the plurality of recognition results acquired by the acquisition means according to an identifier that is attached to each recognition result and indicates the type of recognition method, namely recognition for unknown words or for known words; correction means for executing correction processing on the recognition results distributed by the distribution means, using recognition dictionary data associated with each identifier; and presentation means for rearranging the correction results produced by the correction means into a predetermined order and presenting them to the user.

The speech recognition result correction method of the present invention comprises: an input step of inputting speech; a transmission step of transmitting, to a speech recognition server, information for causing the speech recognition server to recognize the speech input in the input step; an acquisition step of acquiring a plurality of recognition results for the speech processed on the basis of the information transmitted in the transmission step for recognition by the speech recognition server; an acquisition step of acquiring a plurality of recognition results for the speech input in the input step; a distribution step of distributing the plurality of recognition results acquired in the acquisition step according to an identifier that is attached to each recognition result and indicates the type of recognition method, namely recognition for unknown words or for known words; a correction step of executing correction processing on the recognition results distributed in the distribution step, using recognition dictionary data associated with each identifier; and a presentation step of rearranging the correction results produced in the correction step into a predetermined order and presenting them to the user.

According to this invention, a plurality of recognition results for the input speech are acquired, the recognition results are distributed according to the identifiers attached to them, and correction processing is executed on them using the recognition dictionary data associated with each identifier. This enables correction processing suited to the way recognition was performed on the server device side. For example, when the server device performs recognition processing that treats a word as an unknown word, it attaches an identifier indicating this, so that the speech recognition result correction apparatus can perform correction processing for unknown words according to that identifier. More appropriate correction processing thus becomes possible.

In the speech recognition result correction apparatus of the present invention, the correction means preferably executes correction processing on the recognition result corresponding to a recognition section contained in the recognition results acquired by the acquisition means.

According to this invention, correction processing can be executed on the recognition result corresponding to a recognition section contained in the recognition results, so that correction processing can be performed appropriately.

The speech recognition result correction apparatus of the present invention preferably further comprises storage means for storing feature data of the speech input by the input means, and the correction means preferably performs correction processing using the feature data that is stored in the storage means and is determined on the basis of section information indicated by the recognition result.

According to this invention, correction processing can be performed using the feature data selected on the basis of the section information indicated by the recognition result, so that correction processing can be realized.

In the speech recognition result correction apparatus of the present invention, the presentation means preferably presents the correction results in an order that follows the similarity calculated when the correction means performs correction processing.

According to this invention, presenting the correction results in an order that follows the similarity calculated during correction processing makes it easier for the user to select a result.

According to the present invention, appropriate correction processing can be performed on a speech recognition result even when the accuracy of speech recognition on the server device side is poor.

Embodiments of the present invention will be described with reference to the accompanying drawings. Where possible, identical parts are given identical reference numerals and redundant description is omitted.

FIG. 1 is a system configuration diagram of a communication system including a client device 110, which is the speech recognition correction apparatus of this embodiment. The client device 110 in this embodiment is a mobile terminal such as a mobile phone. It receives speech uttered by the user and transmits the input speech to a server device 120 via a network 100 using wireless communication. In reply to the speech received from the client device 110, the server device 120 can transmit to the client device 110 a plurality of recognition results obtained by its own speech recognition. That is, the server device 120 includes a speech recognition unit, performs speech recognition on the input speech using databases such as an acoustic model and a language model, and can return the recognition results to the client device 110. Although the server device 120 is used as the example here, the invention is not limited to a server device; any speech recognition device may be used.

Speech recognition at the server device 120 is briefly described here. The server device 120 sometimes has to recognize speech containing a word that is not registered in its recognition dictionary data (an unknown word). In such cases, a recognition result is conventionally generated by forcing the unknown word onto a word that is in the dictionary, but because the word is not in the recognition dictionary data, the speech cannot be recognized correctly. Unknown words are therefore handled by speech recognition intended for unknown words, for example speech recognition processing using a subword model. In practice, however, a known word may be recognized as an unknown word, or conversely an unknown word may be recognized as a known word. This happens when the pronunciation is unclear or when a different word has a similar pronunciation. In this embodiment, a section that may contain an unknown word is recognized both as an unknown word and as a known word, and both recognition results are generated, with the aim of reducing misrecognition.

FIG. 2 shows an example of recognition results at the server device 120. FIG. 2 shows the input speech in association with candidate 1 and candidate 2, which are the recognition results. The input speech is as follows.
「お久しぶり！野比駅の近くでお酒でも一緒に飲みませんか？」
"Long time no see! Would you like to have a drink together near Nobi Station?"

Here, the word "Nobi Station" (野比駅) is a proper noun and is assumed not to be in the recognition dictionary data of the server device 120. In this case, two results are generated: "Nobieki" (ノビエキ, candidate 2), the result of processing the section as an unknown word, and "nonbiri" (のんびり, "leisurely", candidate 1), the result of processing it as a known word. The section of the input speech "osake demo" (お酒でも, "a drink or something"), on the other hand, is registered in the recognition dictionary data of the server device 120, but because the recognition process judged that it might be an unknown word, two results are likewise generated: "Okamoto" (オカモト, candidate 2), the result of processing it as an unknown word, and "osake demo" (お酒でも, candidate 1), the result of processing it as a known word.

The server device 120 transmits these recognition results to the client device 110. The server device 120 attaches an identifier to each recognition result. The identifier is information indicating the type of recognition method, for example whether the result was produced by recognition processing for unknown words or by recognition processing for known words. The client device 110 is configured to perform distribution processing based on this identifier and to execute the corresponding correction processing.

The server device 120 also transmits section number 2 and section number 4, which identify these two sections, to the client device 110. This allows the client device 110 to perform correction processing using those sections (see the modification described later). Instead of a section number, any other section index may be used, such as the word itself or time information indicating the time of the section, as long as it lets the client device 110 identify which feature data to extract when it has stored the feature data (see the modification).

Next, the configuration of the client device 110 will be described. FIG. 3 is a block diagram showing the functions of the client device 110. The client device 110 includes a feature calculation unit 210, a feature compression unit 215, a transmission unit 216, a reception unit 217, a multiple-candidate distribution unit 220, correction processing units 230 (230_1 to 230_n), a comparison unit 240, and a result presentation unit 250. Each functional block shown in FIG. 3 is described below.

The feature calculation unit 210 receives the user's voice input from a microphone (not shown) and calculates, from the input voice, feature data that is a spectrum for speech recognition and represents acoustic features. For example, the feature calculation unit 210 calculates feature data representing acoustic features expressed in terms of frequency, such as MFCC (Mel Frequency Cepstrum Coefficient).

The feature compression unit 215 compresses the feature data calculated by the feature calculation unit 210.

The transmission unit 216 transmits the compressed feature data produced by the feature compression unit 215 to the server device 120. The transmission unit 216 performs transmission using HTTP (Hyper Text Transfer Protocol), MRCP (Media Resource Control Protocol), SIP (Session Initiation Protocol), or the like. The server device 120 uses these protocols for reception and for its reply. The server device 120 can also decompress the compressed feature data and perform speech recognition processing using the feature data. Because the feature compression unit 215 compresses data only to reduce communication traffic, the transmission unit 216 may also transmit the feature data as is, without compression.
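As a rough illustration of the compress-then-transmit step, the sketch below packs feature values as 32-bit floats and compresses them with zlib, then reverses the process as the server side would. The patent does not name a compression scheme or wire format, so the use of zlib, the little-endian float32 packing, and the function names `compress_features` and `decompress_features` are all assumptions made for illustration.

```python
import struct
import zlib


def compress_features(features):
    """Pack a flat list of floats as little-endian float32 and deflate it.

    Assumption: zlib and float32 packing are illustrative choices only;
    the source does not say how unit 215 compresses feature data.
    """
    raw = struct.pack(f"<{len(features)}f", *features)
    return zlib.compress(raw)


def decompress_features(payload):
    """Inverse of compress_features, as the receiving side would apply it."""
    raw = zlib.decompress(payload)
    count = len(raw) // 4  # 4 bytes per float32 value
    return list(struct.unpack(f"<{count}f", raw))


# Sample values chosen to be exactly representable in float32.
frame = [0.5, -1.25, 2.0, 0.0]
payload = compress_features(frame)
restored = decompress_features(payload)
```

Since compression here only serves to reduce traffic, a client could equally send the packed bytes uncompressed, which matches the option the description leaves open.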

The reception unit 217 receives speech recognition result information, containing a plurality of recognition results, returned from the server device 120. Each speech recognition result contains text data, time information, an identifier indicating the recognition method used at the server device 120 (for known words or for unknown words), and reliability information. The time information indicates the elapsed time for each recognition unit of the text data, and the reliability information indicates how likely the recognition result is to be correct.
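One possible in-memory shape for a received recognition result is sketched below. The field names, the `"known"`/`"unknown"` identifier strings, and the timing and confidence numbers are hypothetical placeholders; only the kinds of fields (text, time information, identifier, reliability, section number) and the four candidate texts come from the description and the FIG. 2 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionResult:
    text: str          # recognized text for one recognition unit
    method_id: str     # recognition method: "known" or "unknown" (assumed encoding)
    section: int       # section number sent by the server
    start_ms: int      # time information (placeholder representation)
    end_ms: int
    confidence: float  # reliability information (placeholder scale 0..1)


# Candidates for sections 2 and 4 of the FIG. 2 example.
# Times and confidences are made-up placeholder values.
results = [
    RecognitionResult("のんびり", "known", 2, 900, 1500, 0.40),
    RecognitionResult("ノビエキ", "unknown", 2, 900, 1500, 0.55),
    RecognitionResult("お酒でも", "known", 4, 2100, 2800, 0.70),
    RecognitionResult("オカモト", "unknown", 4, 2100, 2800, 0.30),
]
```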

The multiple-candidate distribution unit 220 distributes each of the recognition results received by the reception unit 217 to one of the correction processing units 230_1 to 230_n according to the identifier contained in that result. That is, the multiple-candidate distribution unit 220 outputs each recognition result to the correction processing unit 230 that is associated in advance with the identifier contained in the recognition result.
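The distribution step amounts to grouping results by identifier and handing each group to the handler registered for it. A minimal sketch follows, assuming results are plain dictionaries with hypothetical `"id"` and `"text"` keys and that the correction units are stand-in callables; none of these names appear in the source.

```python
from collections import defaultdict


def distribute(results):
    """Group recognition results by their recognition-method identifier."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["id"]].append(result)
    return dict(buckets)


def dispatch(results, correctors):
    """Send each group to the corrector registered for its identifier."""
    corrected = []
    for identifier, group in distribute(results).items():
        corrector = correctors[identifier]
        corrected.extend(corrector(item["text"]) for item in group)
    return corrected


# Stand-in correctors that only tag the text with the unit that handled it.
correctors = {
    "known": lambda text: ("known", text),
    "unknown": lambda text: ("unknown", text),
}

received = [
    {"id": "known", "text": "のんびり"},
    {"id": "unknown", "text": "ノビエキ"},
    {"id": "known", "text": "お酒でも"},
    {"id": "unknown", "text": "オカモト"},
]
groups = distribute(received)
handled = dispatch(received, correctors)
```

Keeping the identifier-to-handler mapping in a dictionary means a further recognition method only requires registering one more entry, which mirrors the 230_1 to 230_n arrangement.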

The correction processing unit 230 corrects the recognition results distributed by the multiple-candidate distribution unit 220 according to the dictionary databases DB-1 to DB-N (recognition dictionary data) attached to the respective correction processing units. It consists of a plurality of correction processing units 230_1 to 230_n, each associated with an identifier contained in the recognition results. For example, the correction processing unit 230_1 has a dictionary database DB-1 for known words, and the correction processing unit 230_2 has a dictionary database DB-2 for unknown words. The dictionary database DB-2 for unknown words may be a dictionary database entered freely by the user, or a telephone book or address book. The recognition results distributed according to their identifiers are corrected by the correction processing unit 230_1 and the correction processing unit 230_2. The dictionary databases DB-1 to DB-N preferably differ from the dictionary database provided in the server device 120. This is because performing recognition with the same dictionary database would only yield the same result and would not amount to correction.
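A text-level version of per-identifier correction might look like the sketch below: each identifier has its own client-side dictionary, and a recognized reading is replaced by the closest dictionary entry. The dictionary contents, the `DICTIONARIES` layout, the `correct` function, and the use of `difflib.get_close_matches` with a 0.6 cutoff are all assumptions; the patent does not specify a matching algorithm.

```python
import difflib

# Hypothetical client-side dictionaries, one per identifier,
# mapping a reading to the surface form to present.
DICTIONARIES = {
    "known": {"ノンビリ": "のんびり", "オサケデモ": "お酒でも"},
    "unknown": {"ノビエキ": "野比駅"},  # e.g. entries registered by the user
}


def correct(reading, identifier, cutoff=0.6):
    """Return the surface form of the closest entry, or None if none is close."""
    entries = DICTIONARIES.get(identifier, {})
    matches = difflib.get_close_matches(reading, list(entries), n=1, cutoff=cutoff)
    return entries[matches[0]] if matches else None


fixed = correct("ノヒエキ", "unknown")
unmatched = correct("ノヒエキ", "known")
```

Because the unknown-word dictionary is local to the client, it can hold entries the server never saw, which is the reason the description prefers dictionaries that differ from the server's.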

The comparison unit 240 calculates similarities based on the correction results obtained by the correction processing units 230, and determines the presentation order of the correction results by comparing those similarities with one another. As an example of the similarity, suppose that the server device 120 has recognized "Nobi Station" (野比駅) as the unknown word "Nohieki" (ノヒエキ), while the recognition dictionary data created by the user contains "Nobi Station (Nobieki, ノビエキ)". Because the two readings differ only in "hi" (ヒ) versus "bi" (ビ), the correction processing unit 230 can recognize "Nohieki" as "Nobi Station" and correct it. Since only one of the four characters differs and the other three match, the similarity is calculated as 75%. The similarity calculation is not limited to this; other well-known techniques, such as judging from context, may also be used.
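The 75% figure can be reproduced with a position-by-position character comparison, sketched below. The function names and the choice to divide by the longer string's length are assumptions; the description only gives the four-character example and notes that other similarity measures may be used.

```python
import unicodedata


def char_similarity(recognized, entry):
    """Fraction of character positions that agree between two readings."""
    # Normalize so that voiced kana such as ビ are single code points.
    a = unicodedata.normalize("NFC", recognized)
    b = unicodedata.normalize("NFC", entry)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def presentation_order(recognized, candidates):
    """Sort candidate corrections by similarity, most similar first."""
    scored = [(char_similarity(recognized, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


similarity = char_similarity("ノヒエキ", "ノビエキ")  # 3 of 4 positions agree
ranking = presentation_order("ノヒエキ", ["オカモト", "ノビエキ"])
```

A positional comparison is sensitive to insertions and deletions, so an edit-distance-based measure may be a better fit when readings differ in length.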

The result presentation unit 250 presents the correction results to the user in the presentation order determined by the comparison unit 240. For example, it can present the correction results to the user by displaying them.

Next, the processing of the client device 110 configured as described above will be described. FIG. 4 is a flowchart showing the correction processing in the client device 110.

The feature calculation unit 210 calculates the feature data of the speech, after which the feature compression unit 215 compresses it as necessary (S310), and the transmission unit 216 transmits the feature data to the server device 120 (S320).

As described above, the server device 120 performs multiple types of speech recognition processing and transmits the speech recognition results to the client device 110. The client device 110 receives the plurality of recognition results (S330), and the multiple-candidate distribution unit 220 performs distribution processing (S340).

Each correction processing unit 230 corrects the recognition results distributed to it and generates correction results (S350). A score is then calculated from the similarity computed from the results of the correction processing in the correction processing units 230 (S360), and the comparison unit 240 determines the presentation order of the correction results according to that score (S370). The result presentation unit 250 then displays the correction results in the determined presentation order (S380).

In this way, the client device 110 can obtain more accurate recognition results by correcting the plurality of recognition results as appropriate.

Next, a modification of this embodiment will be described. FIG. 5 is a block diagram showing the functions of a client device 110a according to the modification. In this modification, the client device 110a includes a feature calculation unit 510 (corresponding to the feature calculation unit 210), a feature compression unit 511 (corresponding to the feature compression unit 215), a transmission unit 512 (corresponding to the transmission unit 216), a feature storage unit 520, a reception unit 513 (corresponding to the reception unit 217), a multiple-candidate distribution unit 530 (corresponding to the multiple-candidate distribution unit 220), correction processing units 540 (540_1 to 540_n), a comparison unit 550 (corresponding to the comparison unit 240), and a result presentation unit 560 (corresponding to the result presentation unit 250). The modification differs from the embodiment described above in that the client device 110a has the feature storage unit 520, which stores the feature data calculated by the feature calculation unit 510, and in that the correction processing units 540 perform correction processing using the feature data stored in the feature storage unit 520. The components are described below with a focus on these differences.

The feature storage unit 520 temporarily stores the feature data calculated by the feature calculation unit 510.

Each correction processing unit 540 performs correction processing by running re-recognition processing on the portion of the feature data stored in the feature storage unit 520 that corresponds to the section information contained in the recognition results distributed by the multiple-candidate distribution unit 530.
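Selecting the stored feature data for a section can be sketched as a slice over stored frames, assuming the section information is given as start and end times. The fixed 10 ms frame shift, the time-based form of the section information, and the name `frames_for_section` are assumptions for illustration; the re-recognition itself is not shown.

```python
import math


def frames_for_section(frames, start_ms, end_ms, shift_ms=10):
    """Return the stored frames that overlap the interval [start_ms, end_ms).

    Assumption: frame i covers the time starting at i * shift_ms.
    """
    first = max(0, start_ms // shift_ms)
    last = min(len(frames), math.ceil(end_ms / shift_ms))
    return frames[first:last]


# 100 dummy one-dimensional frames standing in for one second of features.
stored = [[float(i)] for i in range(100)]
segment = frames_for_section(stored, 200, 450)
```

The extracted segment would then be passed to a recognizer that uses the client-side dictionary associated with the result's identifier.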

The processing of the client device 110a configured in this way will now be described. FIG. 6 is a flowchart showing the correction processing of the client device 110a. The feature calculation unit 510 calculates the feature data of the speech (S610), and the feature data is stored in the feature storage unit 520 (S620). Meanwhile, the feature compression unit 511 compresses the feature data as necessary, and the transmission unit 512 transmits the feature data to the server device 120 (S630).

As described above, the server device 120 performs multiple types of speech recognition processing and transmits the speech recognition results to the client device 110a. The client device 110a receives the plurality of recognition results (S640), and the multiple-candidate distribution unit 530 performs distribution processing (S650).

Each correction processing unit 540 corrects the recognition results distributed to it and generates correction results (S660). A score is then calculated from the similarity computed from the results of the correction processing in the correction processing units 540, and the comparison unit 550 determines the presentation order of the correction results according to that score (S670). The result presentation unit 560 then displays the correction results in the determined presentation order (S680).

Next, the operation and effects of the client device 110 of this embodiment will be described. In the client device 110 of this embodiment, the feature calculation unit 210 calculates feature data of the speech input via a microphone or the like, and the feature data is transmitted to the server device 120. The server device 120 performs multiple types of speech recognition processing and transmits the resulting plurality of recognition results to the client device 110. In doing so, it attaches to each recognition result an identifier indicating the type of speech recognition (the recognition dictionary data used). In the client device 110, the multiple-candidate distribution unit 220 sorts the recognition results by identifier and outputs each to the correction processing unit 230 corresponding to its identifier. Each correction processing unit 230 performs correction processing by running re-recognition processing with its own dictionary database among DB-1 to DB-N. The comparison unit 240 then determines the presentation order of the results based on their similarities, and the result presentation unit 250 presents the results in that order.

This enables correction processing suited to the way recognition was performed on the server device 120 side. For example, when the server device performs recognition processing that treats a word as an unknown word, it attaches an identifier indicating this, so that the speech recognition result correction apparatus can perform correction processing for unknown words according to that identifier. More appropriate correction processing thus becomes possible.

In the client device 110, the correction processing unit 230 can execute correction processing on the recognition result corresponding to a recognition section contained in the recognition results, so that correction processing can be performed appropriately. That is, each correction processing unit 230 can perform correction processing based on the recognition section (indicated by an index, time information, or the like), which achieves appropriate correction processing.

In the client device 110, the correction processing unit 230 can perform correction processing on the text indicated by the recognition result. In the client device 110a, the correction processing unit 540 extracts the feature data stored in the feature storage unit 520 on the basis of the section information indicated by the recognition result, and can perform correction processing on the feature data corresponding to that section information. Correction processing can therefore be carried out section by section, which achieves appropriate correction processing.

In the client device 110, the comparison unit 240 determines a presentation order according to the similarity calculated during correction processing, and the result presentation unit 250 presents the correction results in that order, which makes selection easier for the user.
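A minimal sketch of this ordering, assuming each correction result carries the similarity computed during correction under a hypothetical `similarity` key and that a higher value means a better match:

```python
def presentation_order(corrections):
    """Sort correction results so the most similar candidate comes first."""
    # sorted() is stable, so candidates with equal similarity keep their order.
    return sorted(corrections, key=lambda c: c["similarity"], reverse=True)
```

The result presentation step would then show the candidates in the returned order.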

System configuration diagram of a communication system including the client device 110, which is the speech recognition result correction device of this embodiment.
Diagram showing an example of a recognition result in the server device 120.
Block diagram showing the functions of the client device 110.
Flowchart showing correction processing in the client device 110.
Block diagram showing the functions of the client device 110a in a modification.
Flowchart showing correction processing in the client device 110a.

Explanation of symbols

110 ... client device, 110a ... client device, 120 ... server device, 210 ... feature amount calculation unit, 215 ... feature amount compression unit, 216 ... transmission unit, 217 ... reception unit, 220 ... multiple candidate sorting unit, 230 ... correction processing unit, 240 ... comparison unit, 250 ... result presentation unit, 510 ... feature amount calculation unit, 511 ... feature amount compression unit, 512 ... transmission unit, 513 ... reception unit, 520 ... feature amount storage unit, 530 ... multiple candidate sorting unit, 540 ... correction processing unit, 550 ... comparison unit, 560 ... result presentation unit.

Claims

A speech recognition result correction apparatus comprising:
input means for inputting speech;
transmission means for transmitting, to a speech recognition server, information for causing the speech recognition server to recognize the speech input by the input means;
acquisition means for acquiring a plurality of recognition results for the speech processed based on the information transmitted by the transmission means;
sorting means for sorting the plurality of recognition results acquired by the acquisition means according to an identifier, attached to each recognition result, that indicates the type of recognition method for unknown words or known words;
correction means for executing correction processing on the recognition results sorted by the sorting means, using recognition dictionary data associated with each identifier; and
presentation means for rearranging the correction results produced by the correction means in a predetermined order and presenting them to the user.

The speech recognition result correction apparatus according to claim 1, wherein the correction means performs correction processing on the recognition result corresponding to a recognition section included in the recognition result acquired by the acquisition means.

The speech recognition result correction apparatus according to claim 1, further comprising storage means for storing feature amount data of the speech input by the input means,
wherein the correction means performs correction processing using the feature amount data that is determined based on section information indicated by the recognition result and that is stored in the storage means.

The speech recognition result correction apparatus according to any one of claims 1 to 3, wherein the presentation means presents the correction results in an order according to a similarity calculated when the correction means performs correction processing.

A speech recognition result correction method comprising:
an input step of inputting speech;
a transmission step of transmitting, to a speech recognition server, information for causing the speech recognition server to recognize the speech input in the input step;
an acquisition step of acquiring a plurality of recognition results for the speech processed based on the information transmitted in the transmission step;
a sorting step of sorting the plurality of recognition results acquired in the acquisition step according to an identifier, attached to each recognition result, that indicates the type of recognition method for unknown words or known words;
a correction step of executing correction processing on the recognition results sorted in the sorting step, using recognition dictionary data associated with each identifier; and
a presentation step of rearranging the correction results produced in the correction step in a predetermined order and presenting them to the user.