JP5244663B2

JP5244663B2 - Speech recognition processing method and system for inputting text by speech

Info

Publication number: JP5244663B2
Application number: JP2009065542A
Authority: JP
Inventors: 利明内部; 安孝新堂; 朋広小原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-03-18
Filing date: 2009-03-18
Publication date: 2013-07-24
Anticipated expiration: 2029-03-18
Also published as: JP2010217628A

Description

The present invention relates to a speech recognition processing method and system for inputting text by speech.

Software exists that converts speech uttered by a user into text data on a terminal with relatively high processing capability, such as a personal computer. The terminal captures the user's speech through a microphone. The speech is encoded into voice data, and the voice data is then converted into text data by speech recognition processing.

A terminal with relatively low processing capability, such as a mobile terminal, can recognize keywords from a vocabulary of a few thousand words. Because of its limited processing capability, however, it cannot recognize sentences drawn from a large vocabulary of tens of thousands of words or more, as dictation requires.

For this reason, there is a technique in which a mobile terminal performs speech recognition by connecting to a speech recognition server over a network. With this technique, the mobile terminal sends the encoded voice data to the speech recognition server in a single batch over HTTP (HyperText Transfer Protocol). The speech recognition server converts the voice data into text data by speech recognition processing, and the converted text data is returned to the mobile terminal. The heavily loaded part of the speech recognition processing can thus be executed on the server. In other words, even a mobile terminal with low processing capability can perform accurate large-vocabulary speech recognition.

JP 2005-283972 A

In the prior art, the mobile terminal sends the voice data to the speech recognition server in a single batch using an HTTP request. The speech recognition server likewise converts the voice data into text data in a single batch, and returns all of the text data at once in an HTTP response. Using the HTTP request and response sequence also allows a plurality of mobile terminals to access a single speech recognition server.

However, the user cannot see the text data on the display while speaking into the microphone. In particular, the longer the input sentence, the longer the delay before the text data is displayed, which makes the method inconvenient. Moreover, because HTTP uses TCP (Transmission Control Protocol) as its lower-layer protocol, it is error-free but incurs a large overhead and places a heavy load on the network.

Accordingly, an object of the present invention is to provide a speech recognition method and system that recognize voice data in real time while keeping the load on the network as small as possible.

According to the present invention, there is provided a speech recognition processing method in a system having
a session control server,
a terminal that activates call connection means for the session control server, a text processing application, and voice input interface means for inputting voice data from a user, and
a speech recognition server having call connection means for the session control server and speech recognition processing means for converting voice data into text data,
the method comprising:
a first step in which, when the terminal activates the voice input interface means for the text processing application, the call connection means of the terminal transmits a call connection request, which includes speech recognition parameters including codec information and a speech recognition type, to the speech recognition server via the session control server; the speech recognition server executes decoding processing based on the codec information and switches dictionaries using the speech recognition type; and the terminal, after receiving a call connection acceptance response from the speech recognition server, establishes with the speech recognition server a first session for voice data and a second session for text data;
a second step in which the terminal transmits a predetermined unit of voice data uttered by the user to the speech recognition server via the first session;
a third step in which the speech recognition server transmits primary candidate text data, converted using the speech recognition processing means, to the terminal via the second session;
a fourth step in which the terminal and the speech recognition server continuously repeat the second step and the third step until the voice input by the user ends; and
a fifth step in which, when the voice input by the user has ended and other candidate text data exists besides the primary candidate text data already transmitted, the speech recognition server transmits to the terminal one or more other candidate text data associated with the primary candidate text data,
wherein the terminal has the user select, for the portion corresponding to the primary candidate text data, either the primary candidate text data or one of the other candidate text data, and confirms the selection.

According to another embodiment of the speech recognition processing method of the present invention, it is also preferable that, in the first step, the first session for voice data is established by RTP (Real-time Transport Protocol) and the second session for text data is established by TCP (Transmission Control Protocol).

According to the present invention, there is provided a system in which a terminal and a speech recognition server are call-connected by a session control server, wherein
the terminal has:
a text processing application;
voice input interface means for inputting voice data from a user;
call connection means for transmitting, when the voice input interface means for the text processing application is activated, a call connection request, which includes speech recognition parameters including codec information and a speech recognition type, to the speech recognition server via the session control server, and for establishing with the speech recognition server, after receiving a call connection acceptance response from the speech recognition server, a first session for voice data and a second session for text data; and
voice data transmission means for transmitting a predetermined unit of voice data acquired by the voice input interface means to the speech recognition server via the first session,
and the speech recognition server has:
call connection means for the session control server;
speech recognition processing means for executing decoding processing based on the codec information, switching dictionaries using the speech recognition type, and converting voice data into text data;
text data transmission means for transmitting primary candidate text data to the terminal via the second session;
speech recognition control means for continuously repeating the operation of the speech recognition processing means and the text data transmission means until the voice input by the user ends; and
other candidate storage means for transmitting to the terminal, when the voice input by the user has ended and other candidate text data exists besides the primary candidate text data already transmitted, one or more other candidate text data associated with the primary candidate text data,
wherein the terminal has the user select, for the portion corresponding to the primary candidate text data, either the primary candidate text data or one of the other candidate text data, and confirms the selection.

According to another embodiment of the system of the present invention, it is also preferable that the first session for voice data is established by RTP and the second session for text data is established by TCP.

According to the speech recognition method and system of the present invention, the mobile terminal transmits predetermined units of voice data over an RTP data stream and receives the text data produced by speech recognition over a TCP data stream. Compared with HTTP, in which voice data and text data are each sent and received in a single batch, the load on the network can thereby be kept as small as possible.

In addition, during voice input the mobile terminal displays the primary candidate text data obtained by converting the voice data incrementally, and after voice input ends it displays the other candidate text data. From the user's point of view, the voice data is recognized in real time during voice input, and the most suitable text data can be selected once voice input has ended.

A first system configuration diagram of the present invention.
A functional configuration diagram of the terminal and the speech recognition server of the present invention.
A flowchart of the present invention.
A first example of display screens of the terminal of the present invention.
A second example of display screens of the terminal of the present invention.
A description example of the SDP of an INVITE request.
A description example of the SDP of an INVITE response.
A second system configuration diagram of the present invention.
A third system configuration diagram of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a first system configuration diagram of the present invention.

According to FIG. 1, a terminal 1 establishes a call connection with a speech recognition server 2 via a session control server 3. The terminal 1 is a mobile terminal with relatively low processing capability, such as a mobile phone. The session control server 3 is a SIP (Session Initiation Protocol) server, SIP being the call control protocol, and is connected to, for example, the control network of an IMS/MMD (IP Multimedia Subsystem / Multimedia Domain) network. The mobile terminal 1 connects to the IMS/MMD network via an access network such as a mobile phone network.

According to FIG. 1, the mobile terminal 1 starts, for example, mail software as the text processing application. The user can then enter text into the editor of the mail software by speaking into the microphone of the mobile terminal 1.

The speech recognition server 2 receives a call connection from the mobile terminal 1 via the SIP server 3. The speech recognition server 2 converts the voice data received from the mobile terminal 1 into text data by speech recognition processing. The converted text data is returned to the mobile terminal 1.

Between the mobile terminal 1 and the speech recognition server 2, an RTP session for voice data and a TCP session for the text data of recognition candidates are established. RTP is a protocol for streaming data such as audio or video. TCP is a protocol for transmitting data such as files without errors.

RTP uses UDP (User Datagram Protocol) as its lower-layer protocol. Because UDP does not retransmit lost packets, it is also preferable to add an error correction code based on FEC (Forward Error Correction) or MFT (Missing Feature Theory) to the RTP packets. This reduces the effect of packet loss on recognition performance.
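As an illustration of how forward error correction can mask packet loss, the following minimal Python sketch uses a single XOR parity packet per group of equal-length payloads. The description does not specify which FEC scheme is used, so the scheme, the function names, and the group size are all assumptions made for this example.

```python
def xor_parity(packets):
    """Return the byte-wise XOR of equal-length packets (the parity packet)."""
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("all packets in a group must have the same length")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Restore the single lost packet (marked as None) in a group.

    XOR-ing the parity packet with every packet that did arrive yields
    the one packet that is missing.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("exactly one lost packet can be recovered")
    present = [p for p in received if p is not None]
    restored = xor_parity(present + [parity])
    result = list(received)
    result[missing[0]] = restored
    return result
```

With one parity packet per group, only a single loss per group can be repaired; a real deployment would likely choose the group size and code strength according to the observed loss rate.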

FIG. 2 is a functional configuration diagram of the terminal and the speech recognition server of the present invention.

As hardware, the mobile terminal 1 has a communication interface unit 101, a microphone 102 that captures speech uttered by the user, a display 103 that displays text data, and a key operation unit 104 used to operate the terminal and select text.

As software, the mobile terminal 1 has a call connection unit 111, a transport interface unit 112, a text processing application 113, a voice input interface unit 114, a voice data transmission unit 121, a text data reception unit 122, and an other candidate selection unit 123. These functional units are realized by executing a program that causes the computer mounted on the mobile terminal to function.

The text processing application 113 has a text editor function and may be, for example, a mail application. The text processing application 113 activates the voice input interface unit 114 with the speech recognition parameters as arguments. The speech recognition parameters include at least codec information for decoding processing and a speech recognition type for dictionary switching. After voice input ends, the text processing application 113 obtains the text data from the voice input interface unit 114.

The voice input interface unit 114 functions as a user interface. It captures the speech uttered by the user from the microphone 102 and encodes it into voice data using AMR (Adaptive Multi-Rate), EVRC (Enhanced Variable Rate Codec), or the like. Alternatively, it may convert the speech into voice data from which features have been extracted by signal processing. When voice input ends, the voice input interface unit 114 notifies the call connection unit 111 that voice input has ended.

The voice input interface unit 114 functions as a layer superimposed on the text processing application 113. That is, it is implemented as middleware that can be shared by various applications. Consequently, speech recognition processing need not be considered when designing the text processing application 113.

The call connection unit 111 functions as a client of the SIP server 3. When the voice input interface unit 114 is activated, the call connection unit 111 transmits an INVITE message (call connection request) containing the speech recognition parameters to the speech recognition server 2 via the SIP server 3. In response to instructions from the voice input interface unit 114, the call connection unit 111 also transmits INFO messages containing control information for the start or end of voice input to the speech recognition server 2 via the SIP server 3.

The transport interface unit 112 establishes an RTP data stream for voice data and a TCP data stream for text data with the speech recognition server 2.

The voice data transmission unit 121 transmits the predetermined units of voice data acquired by the voice input interface unit 114 to the speech recognition server 2 via the RTP data stream.

The text data reception unit 122 receives the text data obtained by speech recognition from the speech recognition server 2. During voice input, it receives primary candidate text data incrementally. After voice input ends, it receives combinations of primary candidate text data and one or more other candidate text data. The received text data is output to the text processing application 113.

The other candidate selection unit 123 lets the user select other candidate text data. After voice input ends, the text processing application searches the text data already shown on the display for primary candidate text data that corresponds to the other candidate text data. For each matching piece of primary candidate text data, the other candidate text data is shown on the display and the user chooses among them.
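The search-and-replace behaviour of the other candidate selection unit can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary that maps each primary candidate to its alternatives, and the rule of matching only the first occurrence are assumptions made for the example.

```python
def find_correctable_spans(displayed_text, alternatives):
    """Locate each primary candidate in the displayed text.

    `alternatives` maps a primary candidate string to its list of other
    candidates. Returns (start, end, primary, others) tuples ordered by
    position; only the first occurrence of each primary is considered.
    """
    spans = []
    for primary, others in alternatives.items():
        start = displayed_text.find(primary)
        if start != -1:
            spans.append((start, start + len(primary), primary, others))
    return sorted(spans, key=lambda span: span[0])


def apply_choice(displayed_text, span, chosen):
    """Replace one span with the candidate the user picked."""
    start, end, primary, others = span
    if chosen != primary and chosen not in others:
        raise ValueError("chosen text is not one of the offered candidates")
    return displayed_text[:start] + chosen + displayed_text[end:]
```

When several spans are corrected in one pass, applying the choices from the rightmost span to the leftmost keeps the earlier offsets valid even if a replacement changes the text length.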

The speech recognition server 2 has a communication interface unit 201, a call connection unit 211, a transport interface unit 212, a speech recognition processing unit 221, a text data transmission unit 222, an other candidate storage unit 223, and a speech recognition control unit 224. These functional units are realized by executing a program that causes the computer mounted on the server to function.

The call connection unit 211 functions as a client of the SIP server 3. The call connection unit 211 obtains the speech recognition parameters from the received INVITE message and outputs them to the speech recognition processing unit 221.

The transport interface unit 212 establishes an RTP data stream for voice data and a TCP data stream for text data with the mobile terminal 1.

The speech recognition processing unit 221 receives voice data via the RTP data stream and converts it into text data by speech recognition. Here, the speech recognition processing unit 221 outputs the provisional primary candidate text data produced mid-utterance to the text data transmission unit 222, and outputs the other candidate text data to the other candidate storage unit 223.

The speech recognition processing unit 221 refers to a dictionary and a language model to convert voice data into text data. For example, the speech recognition processing unit 221 has an N-gram model that recognizes sentences including their particles (the Japanese "te-ni-o-ha"). An N-gram model is a language model whose probabilities are computed statistically from sample data. With N = 3 (trigram), the occurrence probability $P(w_1 w_2 \cdots w_n)$ of a given word sequence $w_1 w_2 \cdots w_n$ is approximated as

$$P(w_1 w_2 \cdots w_n) = \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \times P(w_1 w_2)$$

The term $P(w_i \mid w_{i-2}, w_{i-1})$ on the right-hand side is the conditional probability that $w_i$ comes next after the words $w_{i-2}$ and $w_{i-1}$. The product of all the $P(w_i \mid w_{i-2}, w_{i-1})$ terms is computed, and the word sequence for which $P(w_1 w_2 \cdots w_n)$ is largest is taken as the recognition result.
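The trigram approximation above can be made concrete with a small Python sketch that counts trigrams in sample data and picks the highest-scoring hypothesis. The function names, the add-alpha smoothing, and the decision to drop the $P(w_1 w_2)$ factor (valid only when all compared hypotheses share the same first two words) are assumptions for illustration; the description does not state how the probabilities are estimated.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories in tokenized sentences."""
    trigrams, histories = Counter(), Counter()
    for words in sentences:
        for i in range(2, len(words)):
            trigrams[(words[i - 2], words[i - 1], words[i])] += 1
            histories[(words[i - 2], words[i - 1])] += 1
    return trigrams, histories


def cond_prob(trigrams, histories, w2, w1, w, vocab_size, alpha=1.0):
    """Estimate P(w | w2, w1); add-alpha smoothing is an assumption here."""
    return (trigrams[(w2, w1, w)] + alpha) / (histories[(w2, w1)] + alpha * vocab_size)


def log_score(trigrams, histories, words, vocab_size):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) over the sequence.

    The P(w_1 w_2) factor is omitted, so scores are comparable only
    between hypotheses that begin with the same two words.
    """
    return sum(
        math.log(cond_prob(trigrams, histories, words[i - 2], words[i - 1], words[i], vocab_size))
        for i in range(2, len(words))
    )


def best_hypothesis(trigrams, histories, hypotheses, vocab_size):
    """Return the word sequence with the largest trigram score."""
    return max(hypotheses, key=lambda h: log_score(trigrams, histories, h, vocab_size))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and it leaves the ranking of hypotheses unchanged.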

In the N-gram model, recognition of a given part of an utterance uses its correlation with the preceding and following words. To obtain the recognition result for a given part, the speech that follows it is therefore needed: the result is finalized only after speech several words beyond that part has been received. In other words, the recognition result is finalized several words after the part was uttered.

The speech recognition processing unit 221 therefore outputs the primary candidate text data from the N-gram model to the text data transmission unit 222. It also outputs, to the other candidate storage unit 223, the combination of the primary candidate text data and the other candidate text data finalized several words after that primary candidate text data.

The text data transmission unit 222 transmits the provisional primary candidate text data produced mid-utterance to the mobile terminal 1 via the TCP data stream.

The speech recognition control unit 224 causes the speech recognition processing unit 221 and the text data transmission unit 222 to operate repeatedly until the voice input by the user ends.

When the voice input by the user ends, the other candidate storage unit 223 transmits the combinations of primary candidate text data and one or more other candidate text data to the mobile terminal 1.
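The division of work between the text data transmission unit, the other candidate storage unit, and the speech recognition control unit can be sketched as a simple loop. Everything concrete here is hypothetical: the class and function names, the message dictionaries, and the assumption that the recognizer yields a ranked candidate list per unit of speech are not taken from the description.

```python
class CandidateStore:
    """Holds (primary, other candidates) pairs until voice input ends."""

    def __init__(self):
        self._pairs = []

    def add(self, primary, others):
        # Only parts that actually have alternatives need to be kept.
        if others:
            self._pairs.append((primary, list(others)))

    def flush(self):
        pairs, self._pairs = self._pairs, []
        return pairs


def run_session(ranked_results, send_text):
    """Send each primary candidate immediately, alternatives at the end.

    `ranked_results` yields one ranked candidate list per recognized part;
    `send_text` stands in for writing to the TCP data stream.
    """
    store = CandidateStore()
    for ranked in ranked_results:
        primary, others = ranked[0], ranked[1:]
        send_text({"type": "primary", "text": primary})
        store.add(primary, others)
    pairs = store.flush()
    if pairs:
        send_text({"type": "alternatives", "pairs": pairs})
```

For example, passing `sent.append` as `send_text` collects the messages in a list, which shows the primaries arriving first and a single alternatives message arriving once input has ended.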

FIG. 3 is a flowchart of the present invention. FIG. 4 shows a first example of display screens of the terminal, corresponding to the sequence of FIG. 3, and FIG. 5 shows a second example of display screens of the terminal.

(S301) The speech recognition server 2 registers its location information (AOR (Address-Of-Record) and contact address) with the SIP server 3 using the REGISTER method.

(S302) According to FIG. 4(a), the text processing application is mail software, and the user is about to enter text into the "body" of a mail.

(S303) According to FIG. 4(b), the editor of the mail software is running on the mobile terminal 1. The user then selects the item "start recognition".

(S304) According to FIG. 4(c), the mobile terminal 1 prompts the user to speak into the microphone. At this point, the text processing application passes the speech recognition parameters to the voice input interface unit, which activates the voice input interface unit.

(S305) The mobile terminal 1 registers its location information (AOR and contact address) with the SIP server 3 using the REGISTER method. The AOR is a logical address representing the location of a terminal in SIP. Here, the AOR represents the address of the speech recognition server. The contact address is the real address of the mobile terminal and is bound to the AOR. This allows the contact address to be looked up from the AOR. The AOR and the contact address are not necessarily in a one-to-one relationship: by assigning a plurality of contact addresses to one AOR, a call can be placed to a plurality of speech recognition servers simultaneously.
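The binding between an AOR and its contact addresses amounts to a one-to-many lookup table. The sketch below models only that table; the class name and the example addresses are hypothetical, and no actual SIP message handling is shown.

```python
from collections import defaultdict


class Registrar:
    """Toy location table: one AOR may be bound to several contacts."""

    def __init__(self):
        self._bindings = defaultdict(list)

    def register(self, aor, contact):
        # Registering the same contact twice keeps a single binding.
        if contact not in self._bindings[aor]:
            self._bindings[aor].append(contact)

    def lookup(self, aor):
        # Return a copy so callers cannot modify the table by accident.
        return list(self._bindings.get(aor, []))
```

A lookup that returns more than one contact corresponds to the case described above, where a single AOR fans a call out to several speech recognition servers.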

(S311) The mobile terminal 1 transmits a call connection request (INVITE) to the speech recognition server 2 via the SIP server 3. The SDP (Session Description Protocol) body of the INVITE request is written so as to establish an RTP data stream for voice data and a TCP data stream for text data between the mobile terminal 1 and the speech recognition server 2.

FIG. 6 is a description example of the SDP of the INVITE request.

"m" denotes a data stream type, and "a" denotes a parameter for that data stream. The SDP of the present invention describes an audio data stream (m=audio) and a text data stream (m=message). The payload type is also mapped to the codec/format. In addition, the speech recognition parameters are set. The speech recognition parameters set here include, for example, the speech recognition type, the packet size, the transfer interval, and the number of text data candidates to output.

According to FIG. 6, various other parameters are also set. For example, "gps" allows the speech recognition dictionary to be switched based on the location information of the mobile terminal. Likewise, "user" allows the speech recognition dictionary to be switched based on personal identification information or personal history information.
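A rough idea of such an SDP body can be given with a small builder function. FIG. 6 itself is not reproduced here, so the attribute names (`a=x-asr-...`), the format token on the `m=message` line, the port numbers, and the function name are placeholders invented for this sketch rather than the syntax used in the figure.

```python
def build_sdp_offer(ip, audio_port, text_port, asr_params,
                    payload_type=96, codec="AMR/8000"):
    """Build an SDP offer with one audio stream and one text stream.

    Speech recognition parameters are written as hypothetical
    "a=x-asr-<name>:<value>" attributes of the audio stream.
    """
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {ip}",
        "s=-",
        f"c=IN IP4 {ip}",
        "t=0 0",
        # Voice data stream carried over RTP.
        f"m=audio {audio_port} RTP/AVP {payload_type}",
        # Map the dynamic payload type to the codec.
        f"a=rtpmap:{payload_type} {codec}",
    ]
    lines += [f"a=x-asr-{name}:{value}" for name, value in asr_params.items()]
    # Text data stream carried over TCP; "x-asr-text" is a placeholder format.
    lines.append(f"m=message {text_port} TCP x-asr-text")
    return "\r\n".join(lines) + "\r\n"
```

For instance, `build_sdp_offer("192.0.2.1", 49170, 50000, {"type": "mail", "candidates": 3})` yields a body with two `m=` lines, one per data stream.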

音声認識サーバ２は、INVITEリクエストを受信した際に、音声認識パラメータを判定する。音声認識サーバ２は、その音声認識パラメータを許容できる場合、INVITEレスポンスを返信する。 The voice recognition server 2 determines the voice recognition parameters when receiving the INVITE request. If the voice recognition server 2 can accept the voice recognition parameter, the voice recognition server 2 returns an INVITE response.

図７は、INVITEレスポンスのＳＤＰの記述例である。 FIG. 7 is a description example of the SDP of the INVITE response.

INVITEレスポンスには、データストリーム毎に、音声認識サーバ側のポート番号が記述される。 In the INVITE response, the port number on the voice recognition server side is described for each data stream.

（Ｓ３１２）携帯端末１と音声認識サーバ２との間で、音声データ用のＲＴＰのデータストリームと、テキストデータ用のＴＣＰのデータストリームとが確立される。 (S312) An RTP data stream for voice data and a TCP data stream for text data are established between the mobile terminal 1 and the voice recognition server 2.

音声入力開始時に、音声認識サーバ２とのセッションが既に確立されている場合、REGISTERメソッド（Ｓ３０５）及びINVITEメソッド（Ｓ３１１）は省略する。 If a session with the speech recognition server 2 has already been established at the start of speech input, the REGISTER method (S305) and the INVITE method (S311) are omitted.

（Ｓ３２１）携帯端末１は、音声入力開始の制御情報を含むINFOメッセージを、ＳＩＰサーバ３を介して音声認識サーバ２へ送信する。INFOメソッドは、音声認識パラメータの設定変更、及び音声認識処理の制御情報（開始・終了・中止、エラー等）の通知に用いられる。 (S321) The portable terminal 1 transmits an INFO message including control information for starting voice input to the voice recognition server 2 via the SIP server 3. The INFO method is used to change the setting of voice recognition parameters and notify control information (start / end / stop, error, etc.) of voice recognition processing.

(S322) The mobile terminal 1 transmits the voice data uttered by the user, in predetermined units, to the voice recognition server 2 over the RTP data stream. The voice recognition server 2 converts the voice data into text data by voice recognition processing and, while the utterance is still in progress, returns provisional primary candidate text data to the mobile terminal 1 over the TCP data stream. The text data returned incrementally in this way is the primary (first-ranked) candidate produced by the voice recognition processing. As soon as the mobile terminal 1 receives text data over the TCP data stream, it shows that text on the display so that the user can see it.

The predetermined unit of voice data is a transfer size specified by a parameter, and it may be controlled variably, for example according to a buffering size that depends on the network state.
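One way to picture a variably sized transfer unit is a generator that asks for the current unit size before cutting each chunk. The sketch below is only an illustration: `chunk_voice_data` and the `unit_size` callback are hypothetical names, the byte sizes are arbitrary, and the policy for deriving a size from the network state is left to the caller.

```python
from typing import Callable, Iterator


def chunk_voice_data(pcm: bytes, unit_size: Callable[[], int]) -> Iterator[bytes]:
    """Yield consecutive chunks of pcm; the size is re-read before each chunk."""
    offset = 0
    while offset < len(pcm):
        size = max(1, unit_size())  # guard against a zero or negative size
        yield pcm[offset:offset + size]
        offset += size


# Illustrative sizes: start with 320-byte units, then fall back to 160 bytes.
sizes = iter([320, 320, 160, 160, 160])
chunks = list(chunk_voice_data(bytes(1000), lambda: next(sizes, 160)))
print([len(c) for c in chunks])
```

Each yielded chunk would then be handed to whatever sends data on the RTP stream; the final chunk is simply shorter when the data does not divide evenly.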

In FIG. 4(d), 「おはようございます」 ("Good morning"), recognized from the user's utterance, is displayed.
In FIG. 4(e), 「今日の」 ("today's"), recognized from the user's utterance, is displayed.
In FIG. 4(f), 「回避は」 (kaihi wa, "avoidance"), recognized from the user's utterance, is displayed. The user actually meant 「会議は」 (kaigi wa, "the meeting"), but the voice recognition processing returned 「回避は」 as the primary candidate.
In FIG. 5(a), 「午後３次より」 (gogo sanji yori, "from the third order in the afternoon"), recognized from the user's utterance, is displayed. The user actually meant 「午後３時より」 ("from 3 p.m."), which is pronounced the same way, but the voice recognition processing returned 「午後３次より」 as the primary candidate.
In FIG. 5(b), 「いつもの場所ではじめます」 ("will start at the usual place"), recognized from the user's utterance, is displayed.

(S323) When the user finishes voice input, the mobile terminal 1 transmits an INFO message containing control information for ending voice input to the voice recognition server 2 via the SIP server 3.

On receiving the INFO message for the end of voice input, the voice recognition server 2 transmits other candidate text data to the mobile terminal 1 if any candidates exist other than the primary candidates it has already transmitted.
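The server-side behavior can be sketched as a small store that releases only the top hypothesis while the user is speaking and holds back the alternatives until input ends. The class and method names below are hypothetical, and the recognizer's ranked output is simulated with literal lists rather than produced by a real recognition engine.

```python
class OtherCandidateStore:
    """Holds alternatives for primary candidates that were already sent."""

    def __init__(self) -> None:
        self._others: dict[str, list[str]] = {}

    def primary(self, n_best: list[str]) -> str:
        """Return the top hypothesis to send now; keep the rest for later."""
        first, *rest = n_best
        if rest:
            self._others[first] = rest
        return first

    def flush(self) -> dict[str, list[str]]:
        """Return the stored alternatives once, when voice input has ended."""
        others, self._others = self._others, {}
        return others


store = OtherCandidateStore()
sent = [store.primary(n) for n in (["今日の"], ["回避は", "会費は", "会議は"])]
print(sent)           # primary candidates streamed during the utterance
print(store.flush())  # alternatives sent after the end-of-input INFO
```

Keying the alternatives by their primary candidate lets the terminal match each set of alternatives to text it has already displayed.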

In FIG. 5(c), the mobile terminal 1 receives the other candidate text data 「会費は」 (kaihi wa, "the membership fee") and 「会議は」 (kaigi wa, "the meeting") for the primary candidate text data 「回避は」. The terminal searches the text already shown on the display for the primary candidate 「回避は」 and displays an anchor at that position. It then has the user select which of 「回避は」, 「会費は」, and 「会議は」 is the correct text. Here, 「会議は」 is selected.

In FIG. 5(d), the mobile terminal 1 receives the other candidate text data 「賛辞」 (sanji, "praise") and 「３時」 (sanji, "3 o'clock") for the primary candidate text data 「３次」. The terminal searches the text already shown on the display for the primary candidate 「３次」 and displays an anchor at that position. It then has the user select which of 「３次」, 「賛辞」, and 「３時」 is the correct text. Here, 「３時」 is selected.
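The selection step amounts to locating the primary candidate in the text already displayed and substituting whichever candidate the user picks. The following Python sketch assumes the first occurrence is the one to correct and uses a hypothetical function name, `apply_selection`; a real terminal would more likely track the position of each candidate than search by string.

```python
def apply_selection(displayed: str, primary: str, chosen: str) -> str:
    """Replace the first occurrence of the primary candidate with the choice."""
    pos = displayed.find(primary)  # anchor position of the primary candidate
    if pos < 0:
        return displayed           # primary candidate not shown; nothing to do
    return displayed[:pos] + chosen + displayed[pos + len(primary):]


text = "おはようございます今日の回避は午後３次よりいつもの場所ではじめます"
text = apply_selection(text, "回避は", "会議は")
text = apply_selection(text, "３次", "３時")
print(text)
```

If the user keeps the primary candidate, `chosen` equals `primary` and the text is left unchanged.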

(S324) When the user has finished selecting text data, the mobile terminal 1 passes the text data to the text processing application. This completes the input of text data to the text processing application.

(S325) The mobile terminal 1 disconnects from the voice recognition server 2 using the BYE method and ends the session.

(S326) Finally, the mobile terminal 1 uses the REGISTER method to delete its own location registration.

FIG. 8 is a second system configuration diagram according to the present invention.

The system of FIG. 8 is an example of application to an IP telephone or teleconference system. For example, the utterances of the user or of the other party are recognized to obtain text data. That text data can be saved as a memo or as minutes, or forwarded to a third party by e-mail.

FIG. 9 is a third system configuration diagram according to the present invention.

The system of FIG. 9 is an example of application to a television caption system. For example, when a broadcast program or a video streaming program is viewed on an IP television receiver, the audio of the program is recognized and the resulting text data is displayed as captions on the mobile terminal.

As described in detail above, according to the speech recognition method and system of the present invention, the mobile terminal transmits voice data in predetermined units over an RTP data stream and receives the text data produced by voice recognition over a TCP data stream. This keeps the network load as small as possible compared with the HTTP approach, in which voice data and text data are each sent and received in a single batch.

Those skilled in the art can readily make various changes, modifications, and omissions to the embodiments described above within the scope of the technical idea and viewpoint of the present invention. The foregoing description is only an example and is not intended to be limiting. The present invention is limited only by the claims and their equivalents.

Description of symbols
1 Mobile terminal, terminal, mobile phone
101 Communication interface unit
102 Microphone
103 Display
104 Key operation unit
111 Call connection unit
112 Transport interface unit
113 Text processing application
114 Voice input interface unit
121 Voice data transmission unit
122 Text data reception unit
123 Other candidate selection unit
2 Voice recognition server
201 Communication interface unit
211 Call connection unit
212 Transport interface unit
221 Voice recognition processing unit
222 Text data transmission unit
223 Other candidate storage unit
224 Voice recognition control unit
3 SIP server, session control server

Claims

1. A speech recognition processing method in a system comprising:
a session control server;
a terminal that activates call connection means for the session control server, a text processing application, and voice input interface means for inputting voice data from a user; and
a voice recognition server having call connection means for the session control server and voice recognition processing means for converting the voice data into text data,
the method comprising:
a first step in which, when the terminal activates the voice input interface means for the text processing application, the call connection means of the terminal transmits a call connection request, containing voice recognition parameters that include codec information and a voice recognition type, to the voice recognition server via the session control server, the voice recognition server performs decoding processing based on the codec information and switches its dictionary using the voice recognition type, and the terminal, after receiving a call connection acceptance response from the voice recognition server, establishes a first session for voice data and a second session for text data with the voice recognition server;
a second step in which the terminal transmits voice data of a predetermined unit uttered by the user to the voice recognition server via the first session;
a third step in which the voice recognition server transmits primary candidate text data, converted using the voice recognition processing means, to the terminal via the second session;
a fourth step in which the terminal and the voice recognition server continuously repeat the second step and the third step until voice input by the user is completed; and
a fifth step in which, when voice input by the user is completed and there is other candidate text data besides the primary candidate text data that the voice recognition server has already transmitted, the voice recognition server transmits to the terminal one or more items of other candidate text data corresponding to that primary candidate text data,
wherein the terminal has the user select and confirm, for the portion of the primary candidate text data, either the primary candidate text data or one of the other candidate text data.

2. The speech recognition processing method according to claim 1, wherein, in the first step, the first session for voice data is established by RTP (Real-time Transport Protocol) and the second session for text data is established by TCP (Transmission Control Protocol).

3. A system in which a terminal and a voice recognition server are call-connected by a session control server, wherein
the terminal comprises:
a text processing application;
voice input interface means for inputting voice data from a user;
call connection means for, when the voice input interface means is activated for the text processing application, transmitting a call connection request, containing voice recognition parameters that include codec information and a voice recognition type, to the voice recognition server via the session control server and, after receiving a call connection acceptance response from the voice recognition server, establishing a first session for voice data and a second session for text data with the voice recognition server; and
voice data transmitting means for transmitting voice data of a predetermined unit acquired by the voice input interface means to the voice recognition server via the first session,
the voice recognition server comprises:
call connection means for the session control server;
voice recognition processing means for performing decoding processing based on the codec information, switching a dictionary using the voice recognition type, and converting the voice data into text data;
text data transmitting means for transmitting primary candidate text data to the terminal via the second session;
voice recognition control means for causing the voice recognition processing means and the text data transmitting means to operate repeatedly and continuously until voice input by the user is completed; and
other candidate storing means for, when voice input by the user is completed and there is other candidate text data besides the primary candidate text data that the voice recognition server has already transmitted, transmitting to the terminal one or more items of other candidate text data corresponding to that primary candidate text data,
and the terminal has the user select and confirm, for the portion of the primary candidate text data, either the primary candidate text data or one of the other candidate text data.

4. The system according to claim 3, wherein the first session for voice data is established by RTP and the second session for text data is established by TCP.