JP5224966B2 - Voice transcription server

Info

Publication number: JP5224966B2
Application number: JP2008200756A
Authority: JP
Inventors: 弘貴吉岡; 隆行湊
Original assignee: Fujitsu Ltd; Fujitsu Semiconductor Ltd
Current assignee: Fujitsu Ltd; Fujitsu Semiconductor Ltd
Priority date: 2008-08-04
Filing date: 2008-08-04
Publication date: 2013-07-03
Anticipated expiration: 2028-08-04
Also published as: JP2010041301A

Description

The present invention relates to a voice transcription server and a voice transcription method that convert call audio into text and distribute the resulting data, concurrently with the call, to a plurality of parties engaged in a voice call.

For example, devices equipped with an automatic recognition function convert speech collected by a built-in microphone into text. A communication host device has also been proposed that converts received speech obtained from a communication means into text and outputs both the user's own speech and the received speech as text, and that lets the receiving party's mobile phone choose whether the sending party's speech is forwarded as is or converted into text before forwarding.

It has also been proposed to improve speech recognition accuracy by managing the voice data for each single sound uttered by a user, as transmitted from a user terminal, in per-user single-sound dictionary data, and by separating voice data that is input and transmitted at the user terminal into single sounds. In addition, to improve data retrieval, it has been proposed to perform a conventional filing-key-code search when a filing key code used as a search key matches the checksum of an index code.
JP 2000-244683 A; JP 2006-39805 A; JP 2005-49713 A; JP-A-6-162096

In the conventional techniques for transcribing speech during a call, every communication host device must have its own speech recognition function and must run speech recognition on each utterance, which increases the load on the communication host device. Moreover, because there is no mechanism that improves accuracy through accumulated voice data or the similarity of conversations, speech recognition accuracy cannot be improved.

The conventional method of registering speech sound by sound improves the transcription accuracy of the collected speech through HMM (Hidden Markov Model) speech recognition. However, the conversation content must be sent to a specific registered device, so voice data cannot be collected from natural, everyday calls.

Furthermore, the conventional search technique is designed for text searches over files in which character strings are managed in advance according to predetermined rules, and it is not suited to searching voice data, which is much larger than character data.

Speech characteristics such as pronunciation and accent differ from person to person, so fully accurate speech recognition is often difficult, and it becomes harder still in noisy environments. Converting recognized speech into character data also requires language analysis, and even with Japanese-specific syntactic analysis and kanji conversion it is difficult to achieve a complete and accurate conversion.

Raising the analysis rate for speech and language requires accumulating voice data and character data, but collection is usually restricted to particular environments and limited periods, so ordinary everyday conversation is often hard to accumulate. As a result, even when converted character data is to be put to use, the environments in which it can be retrieved and provided are limited.

An object of the present invention is therefore to transcribe call audio and distribute the resulting data, concurrently with the call, to a plurality of speakers engaged in a voice call.

To solve the above problems, the voice transcription server comprises: voice data receiving means for receiving, over a communication line, voice data of a speaker who is on a call; voice registration determining means for determining, by a checksum match, whether the speaker's voice data received by the voice data receiving means is registered in a voice database that associates checksums of voice data with character data converted from that voice data; distribution means for distributing to the speaker the character data corresponding to voice data whose checksum matches; voice-to-text conversion means for converting the voice data into character data when the voice database contains no voice data with a matching checksum; and voice registration means for registering the voice data by storing, in the voice database, the checksum of the voice data in association with the character data that the voice-to-text conversion means converted from it.

A conversation held over mobile phones can be transcribed, and the converted character data can be delivered, concurrently with the call, to a destination specified by a speaker on the call.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram outlining the overall processing. The voice transcription system 100 shown in FIG. 1 comprises two or more exchanges 11, 12, ... connected by a telephone network, and a voice transcription server 5 that converts into text the call audio transmitted from each of the exchanges 11, 12, .... Mobile terminals 31, 32, ... and a terminal 50 capable of receiving calls, installed at a service provider that requires speech recognition, use the voice transcription system 100 via radio towers 21, 22, .... A PC (Personal Computer) 6 connects to the voice transcription server 5 over a network and receives the services it provides. The voice transcription server 5 is a computer controlled by a CPU and has storage areas such as a main storage device and an auxiliary storage device, a communication device for connecting to the network, an input/output device, and a display device.

Referring to FIG. 1, the overall processing is outlined for the case where speaker A and speaker B talk to each other using their own mobile phones 31 and 32. In this embodiment, "speaker" means a "caller" when the person talks with another party using a mobile phone or the like, and a "user" when the person uses a service provided by the voice transcription server 5 over the Internet or the like.

For example, when speaker A dials speaker B from mobile phone 31, the call is received by exchange 11 via radio tower 21, whose coverage area includes mobile phone 31, and is then delivered over the telephone network, through a predetermined communication procedure, to exchange 12 on speaker B's side. Exchange 12 delivers the call to mobile phone 32 via radio tower 22, whose coverage area includes speaker B's mobile phone 32. The voice call starts when speaker B's mobile phone 32 accepts the call dialed by speaker A.

Exchange 11 on speaker A's side maintains the connection until the line is disconnected and, for the call in progress, sends speaker A's speech over a communication line to the voice transcription server 5, which collects voice data and converts it into character data. Likewise, exchange 12 on speaker B's side sends speaker B's speech to the voice transcription server 5. The processing is described below for speaker A's voice data; the same processing applies to speaker B's voice data.

Using the speaker DB (database) 41, the voice transcription server 5 authenticates speaker A, including checking whether speaker A has consented to the storage, as speaker information for voice processing, of voice data and additional information such as the speaker name, date and time, location, and surrounding context (step S1).

The following processing continues only when speaker A has been authenticated and the consent has been confirmed. If authentication fails, or succeeds but consent has not been given, processing of speaker A's voice data ends without executing the steps below. The voice transcription server 5 notifies exchange 11 of the authentication result, including whether consent was given. Only when authentication and consent are both confirmed is a session for speaker A established between exchange 11 and the voice transcription server 5; exchange 11 then sends the digitized voice data of speaker A to the server, and voice processing is executed. This authentication allows voice processing to be carried out with privacy taken into account. To handle cases such as another person using speaker A's mobile phone, or a call whose audio the speaker does not want collected, a setting that disables voice collection may be made available before a call. When that setting is made, no voice is collected even if the consent described above was confirmed at authentication; in other words, an authentication result indicating non-consent is sent to exchange 11.
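The authentication and consent gate of step S1 can be sketched as follows. This is a minimal illustration only: the record layout, the function names, and the use of a telephone number as the lookup key are assumptions, not details given in the source.

```python
from dataclasses import dataclass


@dataclass
class SpeakerRecord:
    """Hypothetical row of the speaker DB 41."""
    speaker_id: str
    consented: bool  # agreed to storage of voice data and additional information


def authorize(speaker_db, phone_number, collection_disabled_for_call=False):
    """Return (authenticated, consent) as it would be reported to the exchange."""
    record = speaker_db.get(phone_number)
    if record is None:
        return (False, False)  # authentication failed
    # A per-call opt-out is reported as non-consent even if consent is on file.
    consent = record.consented and not collection_disabled_for_call
    return (True, consent)


def session_established(result):
    """A session is set up only when both authentication and consent hold."""
    authenticated, consent = result
    return authenticated and consent
```

With this shape, the per-call opt-out needs no separate code path on the exchange side: it simply receives a non-consent result.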

On receiving voice data from exchange 11, the voice transcription server 5 converts the amplitude of the analog voice signal into a digital signal expressed as binary numeric data, divides the voice data into frames, one per uttered segment, using portions without speech as delimiters, and stores each frame in a voice data file 42. A portion without speech is defined as a portion where the digitized sound intensity is at or below a predetermined level. For each voice data file 42, the server also calculates, as a checksum, the sum of all digital signal values stored in the file and attaches it to the voice data file 42 (step S2).

The voice transcription server 5 then searches the voice DB 43 to check whether voice data matching both the calculated checksum and additional information such as the speaker name is registered there (step S3). Searching by checksum is faster than searching for a match on the digital voice signal itself.
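The checksum lookup of step S3 can be sketched as a keyed table. The class and method names are hypothetical, and an in-memory dictionary stands in for the voice DB 43; the only behavior taken from the source is that the checksum is the sum of the digital values and that the match also uses speaker information.

```python
def checksum(samples):
    """Additive checksum: the sum of all quantized sample values."""
    return sum(samples)


class VoiceDB:
    """Hypothetical in-memory stand-in for the voice DB 43."""

    def __init__(self):
        self._rows = {}  # (speaker_id, checksum) -> character data

    def register(self, speaker_id, samples, text):
        self._rows[(speaker_id, checksum(samples))] = text

    def lookup(self, speaker_id, samples):
        """Return the registered text, or None if the frame is unregistered."""
        return self._rows.get((speaker_id, checksum(samples)))
```

The lookup compares one integer per frame instead of the whole sample sequence, which is where the speed advantage comes from. Because addition ignores order, different frames can share a checksum, so keying on the speaker as well narrows the match.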

If the voice data is not registered in the voice DB 43 in step S3, the voice data is divided by a predetermined segmentation method described later, a checksum is calculated for each divided piece, and the voice DB 43 is searched in the same way to check whether voice data matching the calculated checksum and additional information such as the speaker name is registered.

The voice transcription server 5 then determines from the search whether the voice data was registered (step S4). When the voice data has been divided, this determination is made for each divided piece. For registered voice data, the server stores the character data read from the voice DB 43 in the transmission standby DB 47 together with speaker A's speaker ID (step S5).

The voice transcription server 5 keeps the character data in the transmission standby DB 47 until the conversion-result transmission timing condition stored in speaker A's speaker information in the speaker DB 41 is satisfied. Once the condition is satisfied, the server transmits the character data according to the delivery method stored in speaker A's speaker information in the speaker DB 41 (step S6).

If, on the other hand, no corresponding character data can be found in the voice DB 43 even after the voice data is divided, that is, if the voice data is unregistered, the voice transcription server 5 converts the voice data into character data using a speech recognition function and a language analysis function, and registers the voice data, the character data, and additional information in the voice DB 43 in association with speaker A's speaker ID (step S7). Of the voice data stored in the voice data files 42, only the voice data that could not be found in the voice DB 43 is converted into character data. The additional information includes items such as the checksum. The data structure of the voice DB 43 is described later.

The voice transcription server 5 then decides whether an operator should proofread the result, based on the speech recognition accuracy and on the state of the operator proofreading flag in the speaker DB 41, which reflects speaker A's contract terms (steps S8 and S9). When the operator proofreading flag is ON (value "1"), the server stores proofreading request information, including speaker A's speaker ID, the voice data, the converted character data, and additional information, in the proofreading request queue 49 (step S10). One or more operators proofread the text as needed and register it in the voice DB 43, and the registered proofreading request information is deleted from the proofreading request queue 49 (step S11).

When speaker A's contract terms specify proofreading of character data based on voice data already registered in the voice DB 43, the operator may also proofread the character data stored in the voice DB 43 at a predetermined time. When character data stored in the voice DB 43 is corrected by proofreading, the operator's correction patterns may be accumulated. Old data may also be deleted, for example to improve the search efficiency of the voice DB 43. The proofreading work may also allow the operator to customize options of the speech recognition and language analysis functions in order to improve conversion accuracy.

Because the speech recognition and language analysis functions are not perfect, voice data is converted by having the two functions complement each other, and the results of human proofreading by an operator, together with the correction patterns, are added and fed back into the next round of voice processing to raise the speech recognition and language analysis rates. As a result, for example, whether the speech recognition function recognizes the voice data as "すいぞっかんいゆく" or as "すいぞくかんにいく", it can be converted to "水族館に行く" ("go to the aquarium") in both cases.
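One way to picture the feedback of correction patterns is a lookup from raw recognizer output to the operator-approved text. The sketch below is an assumption about how such patterns could be stored; the source does not specify the data structure, and the function names are hypothetical.

```python
# Hypothetical store of accumulated correction patterns:
# raw recognizer output -> text approved by an operator.
correction_patterns = {}


def learn_correction(recognized, proofread):
    """Record an operator's correction so later conversions can reuse it."""
    correction_patterns[recognized] = proofread


def apply_corrections(recognized):
    """Return the approved text if a pattern exists, else the raw output."""
    return correction_patterns.get(recognized, recognized)


# Both recognition variants from the example map to the same text.
learn_correction("すいぞっかんいゆく", "水族館に行く")
learn_correction("すいぞくかんにいく", "水族館に行く")
```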

An interface may also be provided that lets speaker A, after authentication from mobile phone 31, PC 6, or the like, review the character data converted from speaker A's own voice data stored in the voice DB 43 and proofread character data that has already been converted.

The character data proofread by the operator in step S11 is stored in the transmission standby DB 47 together with speaker A's speaker ID. When the transmission timing condition stored in speaker A's speaker information in the speaker DB 41 is satisfied, the voice transcription server 5 transmits the character data according to the delivery method stored in that speaker information (step S12).

The character data accumulated in the transmission standby DB 47 in association with speaker A's speaker ID is provided when speaker A requests the character data service and is authenticated, according to the transmission timing and delivery method stored in speaker A's speaker information. The transmission timing can be specified as real time, every n characters, after the call ends, and so on. The delivery method can be specified as provision via the Web, delivery by e-mail, and so on. After speaker A's call ends, the character data converted from the call's voice data is sent to speaker A by e-mail. Speaker A can also specify speaker B's e-mail address so that the character data is sent to speaker B as well.
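The three transmission timings can be sketched as a small buffer that decides when pending text is released. The class name, the timing labels, and the choice to release the whole buffer once n characters have accumulated are assumptions made for illustration.

```python
class OutgoingText:
    """Hypothetical per-speaker buffer standing in for the transmission standby DB 47."""

    def __init__(self, timing, n=0):
        self.timing = timing  # "realtime", "n_chars", or "after_call"
        self.n = n            # threshold used only for "n_chars"
        self.pending = ""

    def add(self, text):
        """Queue converted text and return whatever is ready to send now."""
        self.pending += text
        if self.timing == "realtime":
            return self._flush()
        if self.timing == "n_chars" and len(self.pending) >= self.n:
            return self._flush()
        return ""

    def end_call(self):
        """Release any remaining text when the call ends."""
        return self._flush()

    def _flush(self):
        out, self.pending = self.pending, ""
        return out
```

The delivery method (Web or e-mail) would be applied to whatever `add` or `end_call` returns, so timing and delivery stay independent settings.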

The above describes the processing of speaker A's voice data; the same applies to the voice data of speaker B, the other party. The same processing is also performed when speaker A or B dials the device 33, equipped with a call-receiving function, of a service provider that requires speech recognition.

Although the example above collects voice data using mobile phones 31 and 32, the approach is also applicable to fixed telephones installed at a predetermined location without going through a network, such as public telephones and home telephones, and to IP telephones that use a network.

FIG. 2 is a flowchart of the voice registration process, which converts voice data into character data and registers it. In FIG. 2, the voice transcription server 5 acquires the telephone numbers or user IDs of the calling and called parties (step S51). For example, when speaker A and speaker B are on a call, the server acquires speaker A's telephone number and speaker B's telephone number from exchanges 11 and 12 on their respective sides. When a user already registered as a speaker accesses the voice transcription server 5 to receive converted character data, the server acquires the speaker ID from the user.

The voice transcription server 5 searches the speaker DB 41 using the telephone number or speaker ID acquired in step S51 and retrieves the speaker information (step S52). The server then determines whether the speaker information was retrieved and whether the speaker wishes to use the voice-to-text conversion service (step S53). If the speaker information could not be retrieved, or was retrieved but the speaker does not wish to use the conversion service, the server treats the speaker (or user) as unregistered and ends the voice registration process.

If the speaker information was retrieved, the voice transcription server 5 treats the speaker (or user) as registered and determines whether the call is in progress (step S54). For example, when the information acquired in step S51 is a telephone number and a call-end notification has been received from the corresponding exchange 11 or 12, the process proceeds to step S54-2.

If no call-end notification has been received, the server determines that the call is in progress, takes in the voice data (step S55), and converts the analog voice data into digital voice data (A/D conversion) (step S56). Step S56 is skipped when the voice data received from exchange 11 or 12 is already digital.

The voice transcription server 5 analyzes the voice data, divides it into frames using portions without speech as delimiters, and creates a voice data file 42 for each frame (step S57). For each voice data file 42, the server sums the digital values of the voice data to calculate a checksum and sets it in the file (step S58). Then, for each voice data file 42, the server searches the voice DB 43, using the speaker ID from the speaker information acquired in step S52 and the checksum calculated in step S58, to check whether the same voice data is already registered (steps S59 and S60).

When the voice data of every voice data file 42 is confirmed to be registered in the voice DB 43, the process proceeds to step S60-2, performs the character data queuing process shown in FIG. 5, and then returns to step S54 to repeat the processing described above.

If registration could not be confirmed for one of the voice data files 42, the voice transcription server 5 determines whether the voice data in that file has already been subdivided (step S61). If it has, the process proceeds to step S62. If the voice data file 42 holds voice data that has not been subdivided, the server subdivides the stored voice data, which corresponds to one frame, by a predetermined segmentation method and creates voice data files 42 for the pieces (step S61-2). In this case, a flag indicating subdivision may be set in a predetermined storage area of each voice data file 42. Examples of the segmentation method include dividing at a predetermined interval on the time axis, by a predetermined number of bytes, or character by character. The server then returns to step S58 and repeats the same processing, such as calculating checksums for the subdivided voice data files 42 by the method described above.

If registration still cannot be confirmed after subdivision, the voice transcription server 5 converts the voice data into text with an existing speech recognition function to create character data (step S62) and registers it in the voice DB 43 in association with the speaker ID (step S63).
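The lookup, subdivision, and recognition fallback of steps S59 to S63 can be sketched as below. This is a simplified illustration: the database is a plain dictionary keyed by speaker ID and checksum, subdivision uses a fixed number of samples per piece, and `recognize` is a hypothetical callable standing in for the speech recognition function.

```python
def subdivide(samples, size):
    """Split one frame into fixed-size pieces (one assumed segmentation method)."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def transcribe_frame(db, speaker_id, frame, recognize, piece_size=4):
    """Return text for a frame, recognizing only the pieces not yet registered.

    db maps (speaker_id, checksum) -> text; recognize(samples) -> text.
    """
    key = (speaker_id, sum(frame))
    if key in db:                      # S59/S60: whole frame already registered
        return db[key]
    parts = []
    for piece in subdivide(frame, piece_size):   # S61-2: subdivide and retry
        piece_key = (speaker_id, sum(piece))
        if piece_key not in db:
            db[piece_key] = recognize(piece)     # S62/S63: recognize and register
        parts.append(db[piece_key])
    return "".join(parts)
```

Registered pieces are served from the table, so the recognizer runs only on audio the server has not seen before for that speaker.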

The voice transcription server 5 then analyzes the character data to determine whether speech recognition was incomplete, and checks the operator proofreading flag in the speaker information acquired in step S52 to determine whether the proofreading service is desired (step S64). If recognition was incomplete and the proofreading service is desired, the server stores proofreading request information, including the speaker ID, the voice data, the converted character data, and additional information, in the proofreading request queue 49 (step S65), executes the character data queuing process (step S65-2), and returns to step S54 to repeat the same processing. If recognition was complete, or was incomplete but the proofreading service is not desired, the server skips step S65, executes the character data queuing process (step S65-2), and returns to step S54 to repeat the same processing.
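The decision in steps S64 and S65 reduces to a two-condition check before enqueueing. In this sketch the speaker information is a plain dictionary and the field names are hypothetical; how recognition completeness is judged is left to the caller because the source does not specify it.

```python
from collections import deque


def request_proofreading_if_needed(queue, speaker, voice, text, recognition_complete):
    """Enqueue a proofreading request only when recognition was incomplete
    and the speaker's operator proofreading flag is ON. Returns True if enqueued."""
    if recognition_complete or not speaker["operator_proofreading"]:
        return False  # step S65 is skipped
    queue.append({
        "speaker_id": speaker["speaker_id"],
        "voice": voice,
        "text": text,
    })
    return True


# Hypothetical stand-in for the proofreading request queue 49.
proofreading_queue = deque()
```

Either way the caller goes on to the character data queuing process, so the unproofread text is still available for delivery.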

FIG. 3 shows an example of a method for quantizing analog voice data. In FIG. 3(A), the analog voice waveform 2p is treated as a function of time F(t); a sequence of time points T0, T1, T2, ..., Tn is taken along the time axis, and the peak value F(Tk) at each point Tk is read. This is sampling, and the values obtained are called sample values. In FIG. 3(B), because each sample value is a continuous (analog) quantity, it generally has a fractional part; it is approximated by the nearest integer, which is then taken as the peak value. This is called quantization.

For example, the analog voice waveform 2p is digitized into the wave heights 1, 9, 13, 13, 10, 6, 6, 6, 7, 5, 1 at the times T0, T1, T2, ..., Tn, which are spaced at the time interval Wi.

Through the sampling and quantization described above, the original analog voice waveform 2p can be expressed as a set of integer values. Replacing these integer values with an electrical pulse train (A/D conversion) allows the original analog waveform to be handled as digital voice data made up of the corresponding electrical pulses.
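The sampling and quantization of FIG. 3 can be illustrated with a short sketch. The function name, the example waveform, and the round-half-up rule are assumptions made for illustration; the description only states that each sample is approximated by the nearest integer.

```python
import math


def sample_and_quantize(f, t0, interval, count):
    """Read f at t0, t0 + interval, ... and round each value to the nearest integer."""
    samples = [f(t0 + k * interval) for k in range(count)]   # sampling
    return [int(math.floor(v + 0.5)) for v in samples]       # quantization


def wave(t):
    # Hypothetical waveform standing in for the analog voice waveform 2p.
    return 6.5 + 6.0 * math.sin(t)


codes = sample_and_quantize(wave, 0.0, 0.5, 8)
```

The resulting list of integers plays the role of the wave heights listed for T0 to Tn.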

FIG. 4 illustrates the process of dividing voice data. The voice data shown in FIG. 4(A) is divided into frames at the portions that contain no voice, as shown in FIG. 4(B). The resulting frames A1 to An are each stored in a voice data file 42. For example, "Good morning", "This is Yoshioka", and so on are each stored in a separate voice data file 42. A portion may be judged to contain no voice, and the data divided there, when the quantized digital value shown in FIG. 3(B) is at or below a predetermined value.

For each of the frames A1 to An, a checksum obtained by summing the frame's quantized digital values is set in the corresponding voice data file 42.
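Frame division and the per-frame checksum can be sketched as below. The function names and the threshold value are hypothetical, and cutting on a single low sample is a simplification; a practical system would likely require a run of quiet samples before splitting.

```python
def split_frames(samples, silence_threshold=1):
    """Split quantized samples into frames at values at or below the threshold."""
    frames, current = [], []
    for value in samples:
        if value <= silence_threshold:
            if current:                  # close the frame in progress
                frames.append(current)
                current = []
        else:
            current.append(value)
    if current:
        frames.append(current)
    return frames


def checksum(frame):
    """Checksum as described: the sum of the frame's quantized values."""
    return sum(frame)


frames = split_frames([0, 5, 7, 0, 0, 3, 4, 0])
sums = [checksum(f) for f in frames]
```

Because the checksum is a plain sum, different frames can share the same value; in the description the lookup is also keyed by the speaker ID.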

Next, the process of transmitting character data is described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart of the character data queuing process called from the voice registration process shown in FIG. 2. In FIG. 5, the voice transcription server 5 uses the speaker ID to acquire speaker information from the speaker DB 41 (step S71), and then determines whether transmission of character data has been requested (step S72). If not, the voice transcription server 5 ends the character data queuing process and returns to the voice registration process (FIG. 2). If so, it analyzes the transmission timing specified in the speaker information (step S73).

If the transmission timing is "(a) send a specified time after the utterance", the voice transcription server 5 adds the configured time to the current time to determine the transmission time (step S74). If the transmission timing is "(b) send immediately", it sets the transmission time to the current time (step S75). If the transmission timing is "(c) after the call ends", it sets the transmission time to the current time if the call has ended, and to NULL if it has not (step S76). If the transmission timing is "(d) when the specified number of characters is reached", it sets the transmission time to the current time if the sum of the number of characters to be sent and the number of characters already held in the transmission standby DB 47 has reached the number of characters set in advance by the user (step S77).
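The four timing cases of steps S74 to S77 can be sketched as a single function. The flag values 1 to 4 follow the transmission timing flag of the speaker DB described for FIG. 9; everything else is an assumption: the function and parameter names are hypothetical, the specified time is taken to be in seconds, and returning None when the character limit has not been reached is a guess, since the description states only what happens when the limit is reached.

```python
from datetime import datetime, timedelta


def decide_send_time(timing_flag, now, *, delay_seconds=0, call_ended=False,
                     queued_chars=0, new_chars=0, char_limit=0):
    """Return the transmission time, or None (NULL) when sending must wait."""
    if timing_flag == 1:                       # (a) specified time after utterance
        return now + timedelta(seconds=delay_seconds)
    if timing_flag == 2:                       # (b) send immediately
        return now
    if timing_flag == 3:                       # (c) after the call ends
        return now if call_ended else None
    if timing_flag == 4:                       # (d) specified character count reached
        reached = queued_chars + new_chars >= char_limit
        return now if reached else None        # None here is an assumption
    raise ValueError(f"unknown timing flag: {timing_flag}")
```

A None result means the item stays queued without a transmission time until a later pass assigns one.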

After setting the transmission time, the voice transcription server 5 refers to the speaker information to determine whether characters that could not be fully recognized are to be fed back (step S78). If feedback is not specified in the speaker information, the voice transcription server 5 proceeds to step S80. If feedback is specified, it changes the characters that could not be recognized to a typeface or color different from that of the recognized characters so that they stand out as highlighted (step S89). If necessary, the voice transcription server 5 also transmits the voice data for that portion.

The voice transcription server 5 further refers to the speaker information to determine whether encryption is specified (step S80). If encryption is not specified, the voice transcription server 5 proceeds to step S82. If encryption is specified, it encrypts the character data by a predetermined method (step S81).

The voice transcription server 5 then queues the data by storing transmission information, which includes the character data, the transmission time, and the speaker information, in the transmission standby DB 47 (step S82). It ends the character data queuing process, which was invoked from the voice registration process, and returns to that calling process.

FIG. 6 is a flowchart of the character data transmission process. The process shown in FIG. 6 is repeated at predetermined intervals until the voice transcription server 5 is stopped. The voice transcription server 5 searches the transmission information for the character data queued in the transmission standby DB 47 for entries whose transmission time is equal to or earlier than the current time (step S91). For each entry found, it identifies the transmission method specified in the speaker information (step S92).

If the transmission method is "(a) save character data to a file", the voice transcription server 5 saves the character data to a file and sends it to the destination specified in the speaker information (step S93). If the transmission method is "(b) e-mail", it sends the content of the call as text by e-mail. If the transmission method is "(c) RSS (Rich Site Summary)", it sends the text by RSS (step S95); in this case, the voice transcription server 5 may also configure screen display settings. If the transmission method is "(d) electronic bulletin board, chat, etc.", it sends the character data to a specific server (step S96); in this case, authorized persons with access to that server can view the text in real time.

After providing the character data by the identified transmission method, the voice transcription server 5 returns to step S91 and repeats the processing described above until the server is stopped.
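One pass of the loop in FIG. 6 can be sketched as follows. The queue is modeled as a list of dictionaries and the transmission methods as a table of handler functions; these names and structures are hypothetical, and removing an item from the queue once it has been sent is an assumption that the description does not state.

```python
from datetime import datetime


def send_due(queue, now, handlers):
    """Send every queued item whose transmission time has arrived; return the rest."""
    remaining = []
    for item in queue:
        send_time = item["send_time"]
        if send_time is not None and send_time <= now:   # step S91: due items
            handlers[item["method"]](item)                # step S92: choose method
        else:
            remaining.append(item)                        # not due, or time is NULL
    return remaining
```

A caller would invoke send_due at a fixed interval, passing handlers keyed by the transmission method values "FILE", "MAIL", "RSS", and "SERVER".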

Next, proofreading of character data by a user registered as a speaker is described. In this case, the voice transcription server 5 operates as a Web server and provides a service through which the user updates character data. FIG. 7 is a flowchart of the process by which a user proofreads character data stored in the voice DB. In FIG. 7, when a user registered as a speaker accesses the character data proofreading service provided by the voice transcription server 5 from the browser of the user's PC (step S101), the voice transcription server 5 requests the speaker ID and password from the user and authenticates the user against the speaker DB 41 using the speaker ID and password obtained (step S102).

Next, the voice transcription server 5 uses the speaker ID to search the voice DB 43 for previously registered character data (step S103) and displays a list 8a in the browser in the order specified by the user (step S104). The list 8a may be ordered, for example, by most recent registration in the voice DB 43, by most recently used character data, or by character data for which speech recognition was incomplete. Special characters such as "**" in the character data indicate that speech recognition was incomplete. The order may be selectable from the screen on which the list is displayed, or the desired order may be obtained from the user in advance after user authentication.

On the screen displaying the list 8a in the browser, the user clicks the playback icon paired with the characters converted by speech recognition to play back the voice, compares it with the recognized characters, and, if necessary, enters the corrected characters in the corrected-characters field (step S105). If the user is satisfied with the correction, the user clicks the update button 8b (step S106).

In response to these operations, the browser sends the proofread character data to the voice transcription server 5, which updates the voice DB 43 with the character string entered by the user.

FIG. 8 is a flowchart of the process by which an operator proofreads automatically recognized character data registered in the proofreading request queue. In FIG. 8, when character data to be proofread is registered in the proofreading request queue 49, the voice transcription server 5 searches the operator terminals for one in the standby state (step S201) and determines whether such a terminal was found (step S202). If no operator terminal is in the standby state, the voice transcription server 5 returns to step S201 and waits for new character data to be registered in the proofreading request queue 49.

If one or more operator terminals are in the standby state, the voice transcription server 5 selects one of them and marks it as in use (step S203). The voice transcription server 5 then outputs an alarm to alert the operator terminal and, once the voice to be proofread has been played back on the operator's headset, displays one or more character conversion candidates for that voice on the operator terminal (step S204).

The voice transcription server 5 determines whether the user has selected one of the conversion candidates (step S205). If a candidate has been selected, the voice transcription server 5 proceeds to step S207 to update the voice DB 43 with the selected candidate. If no candidate has been selected, the voice transcription server 5 instead accepts character input from the keyboard of the operator terminal, or uses the speech recognition function and the language analysis function to convert speech re-uttered by the operator into a character string (step S206).

The voice transcription server 5 updates the voice DB 43 with the character string obtained from the operator terminal, proofread by selection or by input, as the character data (step S207), marks the operator terminal as being in the standby state (step S208), and returns to step S201 to repeat the processing described above.
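The terminal handling in steps S201 to S203 and S208 amounts to a small state table. The sketch below assumes a dictionary that maps hypothetical terminal names to the states "standby" and "in_use"; the names and the first-found selection rule are illustrative only.

```python
def assign_terminal(terminals):
    """Steps S201-S203: pick a standby terminal and mark it as in use."""
    for name, state in terminals.items():
        if state == "standby":
            terminals[name] = "in_use"
            return name
    return None                       # no terminal free: the request keeps waiting


def release_terminal(terminals, name):
    """Step S208: return the terminal to the standby state."""
    terminals[name] = "standby"
```

The request stays in the proofreading request queue whenever assign_terminal returns None.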

FIG. 9 shows an example table layout of the speaker DB in which users are registered. In FIG. 9, the speaker DB 41 has fields such as speaker ID, speaker password, speaker registration date and time, address, telephone number, mobile phone number, PC identification code, voice-to-text conversion service use flag, operator proofreading flag, encryption flag, transmission timing flag, transmission timing value, transmission method, destination, and feedback flag.

The speaker ID identifies the speaker as a user of the service provided by the voice transcription server 5; for example, the speaker's mobile phone number, such as "0800010001", is set. The speaker password is an authentication string set by the user when registering with the voice transcription server 5. The speaker registration date and time indicates when the user registered, for example "0803281030". The speaker name is the name of the speaker as a user, for example "Fujitsu Taro", and is set at registration. The address, for example "on the shore of Lake Yamanaka", is set by the speaker at registration. The telephone number and the mobile phone number, for example "042xxxxxxx" and "0800010001", are set at registration. The PC identification code is set to an IP address, such as "01.109.xx.xx", or to a MAC address.

The voice-to-text conversion service use flag is set to "1" when the service for converting voice data to character data is used, and to a value other than "1", such as "0", when it is not. The operator proofreading flag is set to "1" when the operator proofreading service is used, and to a value other than "1", such as "0", when it is not. The encryption flag is set to "1" when the character data is to be encrypted, and to a value other than "1", such as "0", when it is not.

The transmission timing flag is set to "1" to send a specified time after the utterance, "2" to send immediately, "3" to send after the call ends, and "4" to send when a predetermined number of characters is reached. The transmission timing value indicates the specified time when the transmission timing flag is "1", and the number of characters when the flag is "4".

The transmission method is specified as one of "FILE", "MAIL", "RSS", and "SERVER". Other means of transmission, such as fax, may also be set. The destination is set according to the transmission method; for example, when the transmission method is "MAIL", one or more e-mail addresses are specified. Multiple destinations may be set, such as the e-mail addresses of the user and of the other party to the call. The feedback flag is set to "1" when feedback is to be given if speech recognition could not be completed, and to a value other than "1", such as "0", when it is not.

The telephone number, the mobile phone number, and the PC identification code are device-identifying information that identifies the device used by the speaker.

FIG. 10 shows an example table layout of the voice DB, which stores the character data converted from voice data. In FIG. 10, the voice DB 43 has fields such as speaker ID, checksum, voice data information, recognized characters, corrected characters, latest reference date and time, reference count, voice data registration date and time, and proofreading date and time.

The speaker ID is the speaker ID registered for the user in the speaker DB 41. If a mobile phone number was registered as the speaker ID in the speaker DB 41, that mobile phone number is set. The checksum is set to the sum of the digital values of the voice data. The voice data information is set to information indicating where the digitized voice data is stored; if the data is stored in a file, the file name is set.

The recognized characters field holds, as character data, the character string automatically recognized from the voice data. The corrected characters field holds, as character data, the character string proofread by an operator or by a user registered as a speaker; it is blank if no proofreading has been done. When a registered entry for voice data is found by speaker ID and checksum, the character data in the corrected characters field is used in preference to the character data in the recognized characters field, and the recognized characters are used only when the corrected characters field is blank.
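The lookup rule described above, keyed by speaker ID and checksum with corrected characters taking precedence, can be sketched as follows. The in-memory dictionary standing in for the voice DB 43 and the field names are hypothetical.

```python
def lookup_text(voice_db, speaker_id, checksum):
    """Return stored text for (speaker ID, checksum), preferring corrected text."""
    record = voice_db.get((speaker_id, checksum))
    if record is None:
        return None                       # not registered: recognition is needed
    # A blank corrected field falls back to the automatically recognized text.
    return record["corrected"] or record["recognized"]


voice_db = {
    ("0800010001", 75): {"recognized": "good mourning", "corrected": "good morning"},
    ("0800010001", 12): {"recognized": "this is Yoshioka", "corrected": ""},
}
```

A None result corresponds to the unregistered case, in which the voice data would go through speech recognition and then be registered.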

The latest reference date and time indicates when the character data recognized or proofread from the voice data stored at the location given by the voice data information was last used. The reference count indicates the number of times this character data has been used. The voice data registration date and time indicates when the voice data was converted into character data and registered. The proofreading date and time indicates when an operator or user proofread the character data stored in the recognized characters field.

Next, usage scenarios to which the voice transcription server 5 described above is applied are described. FIG. 11 shows an example of application to calls made with mobile phones. In FIG. 11, the voice signals of a call between speakers A and B using mobile phones 2a and 2b are received by the respective mobile base stations 3a and 3b and are forwarded over the telephone network 7a by the respective relay exchanges 4a and 4b to the two mobile phones 2a and 2b. The relay exchanges 4a and 4b are connected to the voice transcription server 5 by a data network. After the voice transcription server 5 has authenticated the caller by mobile phone number or the like and confirmed use of the service from the voice-to-text conversion service use flag, the voice data and the character data converted from it by automatic recognition are stored in the voice transcription server 5. The voice transcription server 5 sends the character data to the specified destination at the predetermined transmission timing.

FIG. 12 shows an example of application to calls made with IP phones. In FIG. 12, the voice signals of a call between speakers A and B using IP phones 2c and 2d are sent and received over the IP network 7c via wireless antennas 3c and 3e, which are connected to optical fiber, dedicated digital lines, or the like. The voice transcription server 5, which is connected to the IP network 7c, authenticates the callers by the IP addresses that identify the IP phones 2c and 2d, forwarded from the routers forming the IP network 7c. After further confirming use of the service from the voice-to-text conversion service use flag, it stores the voice data from each of the IP phones 2c and 2d, together with the character data converted from it by automatic recognition, in the voice transcription server 5.

The same applies when speaker A or B calls from an IP phone installed in a public telephone booth, a convenience store, or a similar location. In such a scenario, authentication and the like may be performed using the PC identification code in the speaker DB 41 shown in FIG. 9.

FIG. 13 shows an example of application to IP phone calls made over a P2P network. In FIG. 13, a PC terminal 62 acting as an ordinary node is authenticated by the voice transcription server 5 via the Internet 68, and is then connected to a P2P network 67 made up of multiple super nodes 61. It is connected through a super node 61 to the other party's PC terminal 62, and the IP phone call begins.

After the call starts, each PC terminal 62 forwards the voice data of the call in progress to the voice transcription server 5, so that the voice data and the character data converted from it by automatic recognition are stored in the voice transcription server 5.

FIG. 14 shows an example functional configuration of the voice transcription server. In FIG. 14, the voice transcription server 5 is a computer that includes a CPU, memory, a storage device, a display unit, an output unit, an input unit, a communication unit, an external storage device I/F, and so on. It has the following units, each realized by the CPU executing a program: a voice data reception processing unit 501, a user authentication and service use confirmation unit 502, a voice data A/D conversion unit 503, a frame division unit 504, a voice data registration confirmation unit 505, a character data conversion unit 506, an operator proofreading unit 509, a user proofreading unit 510, a character data distribution unit 511, a display processing unit 521, an input/output processing unit 522, a communication control unit 523, and an installer 524. The speaker DB 41, the voice data file 42, the voice DB 43, and the transmission standby DB 47 are held in the storage device. The voice data file 42 is a data file holding the voice data that is registered in the voice DB 43.

The display processing unit 521 controls the display of data on the display unit. The input/output processing unit 522 controls data input from the input unit and data output to the output unit. The communication control unit 523 controls data communication over the network. The installer 524 installs the program according to the present invention from the recording medium 520 on which it is recorded, via the external storage device I/F. The recording medium 520 may be any computer-readable medium.

The voice data reception processing unit 501 corresponds to steps S51 and S52 in FIG. 2. When the communication control unit 523 receives voice data, the unit searches the speaker DB 41 using the telephone number sent with the voice data, acquires the speaker information, and stores the voice data and the speaker information in a working storage area. After authentication, the stored voice data is read, A/D-converted if necessary, and converted into character data. When the communication control unit 523 reports a speaker ID, the voice data reception processing unit 501 searches the speaker DB 41 using that speaker ID, acquires the speaker information, and stores it in the working storage area.

The user authentication and service use confirmation unit 502 corresponds to steps S53 to S54 in FIG. 2. Using the speaker information acquired by the voice data reception processing unit 501, it authenticates the speaker on the call or the user accessing via the Internet, and confirms use of the voice-to-text conversion service by referring to the voice-to-text conversion service use flag in the speaker information.

The voice data A/D conversion unit 503 corresponds to step S56 in FIG. 2 and converts analog voice data into digital voice data according to the predetermined algorithm shown in FIG. 3. The frame division unit 504 corresponds to step S57 in FIG. 2; it divides the digitized voice data into frames and calculates a checksum for each frame by summing its digital values.

The voice data registration confirmation unit 505 corresponds to steps S58 to S60 in FIG. 2 and uses the checksum calculated by the frame division unit 504 to check whether the voice data is registered. If the voice data is registered in the voice DB 43, the character data converted from it is also registered in the voice DB 43.

The character data conversion unit 506 corresponds to steps S62 and S63 in FIG. 2. It converts the voice data into character data using the speech recognition function 507 and the language analysis function 508, and then registers the result in the voice DB 43.

The operator proofreading unit 509 corresponds to steps S201 to S208 in FIG. 8 and is executed when the operator proofreading flag in the speaker information indicates that proofreading by an operator is requested.

The user proofreading unit 510 corresponds to steps S101 to S107 in FIG. 7. When a user accessing via the network has been authenticated by the user authentication and service use confirmation unit 502, it makes the voice data registered in the voice DB 43 available for playback, permits proofreading of the converted character data, and updates the voice DB 43 with the character data confirmed by the user.

The character data distribution unit 511 corresponds to steps S71 to S82 in FIG. 5 and steps S91 to S96 in FIG. 6. It transmits the converted character data by referring to the speaker information, including the transmission timing flag, the prescribed number for the transmission timing, the transmission method, and the transmission destination.
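One way to picture distribution driven by a prescribed number is a small queue that sends a batch once enough converted text has accumulated. The sketch below is an assumption-heavy illustration: this section does not define how the transmission timing flag or the prescribed number are interpreted, so the count-based trigger, the `SpeakerInfo` and `DeliveryQueue` classes, and the `send` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SpeakerInfo:
    destination: str        # where converted text is sent (hypothetical format)
    prescribed_count: int   # assumed: number of queued items that triggers a send


@dataclass
class DeliveryQueue:
    info: SpeakerInfo
    send: Callable[[str, str], None]          # send(destination, text), supplied by the caller
    pending: List[str] = field(default_factory=list)

    def enqueue(self, text: str) -> bool:
        """Queue one piece of text; send the batch once the prescribed count is reached."""
        self.pending.append(text)
        if len(self.pending) >= self.info.prescribed_count:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Send everything queued so far as one message, then empty the queue."""
        if self.pending:
            self.send(self.info.destination, " ".join(self.pending))
            self.pending.clear()


sent = []
queue = DeliveryQueue(SpeakerInfo("user@example.com", 2),
                      lambda dest, text: sent.append((dest, text)))
queue.enqueue("hello")   # below the prescribed count, nothing is sent yet
queue.enqueue("world")   # reaches the prescribed count, the batch is sent
```

Passing `send` as a callback keeps the transmission method (for example e-mail or another channel named in the speaker information) outside the queueing logic.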

As described above, using the voice transcription server 5 makes a call easier to understand by providing the converted text, even when the call is hard to hear because of noise or hearing loss, for example.

Before converting received voice data into character data, the voice transcription server 5 searches the voice DB 43 using the checksum of the voice data to check whether the voice data is already registered. If it is registered, the server can provide the character data already stored for that voice data, so conversion from voice data to character data is faster. In addition, because the search uses checksum values associated with the speaker ID, there is no need to analyze in detail the characteristics of each speaker's voice in order to convert it into character data.
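The lookup-before-recognition flow can be sketched as a cache keyed by speaker ID and checksum. This is a minimal illustration, assuming an in-memory dictionary in place of the voice DB 43 and a caller-supplied `recognize` function in place of the speech recognition and language analysis functions; `transcribe_frame` and `fake_recognizer` are hypothetical names.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical in-memory stand-in for the voice DB: (speaker_id, checksum) -> text
VoiceDB = Dict[Tuple[str, int], str]


def transcribe_frame(db: VoiceDB, speaker_id: str, frame: Sequence[int],
                     recognize: Callable[[Sequence[int]], str]) -> Tuple[str, bool]:
    """Return (text, was_registered) for one frame of quantized voice data."""
    key = (speaker_id, sum(frame))       # checksum = sum of the digital values
    if key in db:
        return db[key], True             # registered: reuse the stored text
    text = recognize(frame)              # not registered: run recognition
    db[key] = text                       # register checksum and text together
    return text, False


calls = []

def fake_recognizer(frame: Sequence[int]) -> str:
    """Stand-in recognizer that records each call and returns fixed text."""
    calls.append(list(frame))
    return "konnichiwa"


db: VoiceDB = {}
first = transcribe_frame(db, "speaker-001", [3, 5, 2], fake_recognizer)
second = transcribe_frame(db, "speaker-001", [3, 5, 2], fake_recognizer)
```

In this sketch the second call returns the stored text without invoking the recognizer, which is the speed-up the paragraph above describes.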

Allowing an operator to proofread the character data for voice data during a call makes it possible to provide more accurate character data in a timely manner. Allowing the speaker to proofread the character data directly after the call makes it possible to provide even more accurate character data.

Furthermore, because the voice transcription server 5 can convert an ongoing telephone call into character data, the speaker only needs an ordinary telephone or his or her own mobile phone. There is no need to separately provide a special device equipped with a speech recognition function, a language analysis function, and the like, and the speaker is not restricted to places where such a special device is installed.

Although the embodiment has been described using conversion into Japanese as an example, the invention is not limited to Japanese and is also applicable to English and other languages.

Regarding the above description, the following items are further disclosed.
(Appendix 1)
A voice transcription server comprising:
voice data receiving means for receiving, via a communication line, voice data of a speaker during a call;
voice character conversion means for converting the voice data into character data; and
distribution means for distributing the character data to the speaker.
(Appendix 2)
The voice transcription server according to Appendix 1, further comprising:
voice registration means for registering voice data by storing a checksum of the voice data and character data converted from the voice data in a voice database in association with each other; and
voice registration determination means for determining, by a checksum match, whether the speaker's voice data received by the voice data receiving means is registered in the voice database,
wherein the distribution means distributes, to the speaker, the character data corresponding to the voice data whose checksum matched.
(Appendix 3)
The voice transcription server according to Appendix 2, further comprising:
frame unit dividing means for dividing the voice data received by the voice data receiving means into frames, one for each uttered syllable; and
checksum calculation means for calculating a checksum for each frame,
wherein the voice registration determination means uses the checksum calculated for each frame to determine whether the voice data is registered in the voice database.
(Appendix 4)
The voice transcription server according to Appendix 3, wherein, when the voice registration determination means determines that the voice data is not registered, the voice data of the frame is subdivided by a predetermined dividing method, the checksum calculation means calculates a checksum for the subdivided voice data, and the voice registration determination means determines whether voice data with a matching checksum is registered.
(Appendix 5)
The voice transcription server according to any one of Appendices 2 to 4, wherein the voice character conversion means converts the voice data into character data using a speech recognition function when the voice data is not registered in the voice database, the server further comprising:
proofreading means for allowing an operator or the speaker to correct conversion errors made by the speech recognition function; and
updating means for updating the voice database with the character data in which the conversion errors have been corrected.
(Appendix 6)
The voice transcription server according to Appendix 5, wherein, when transmitting the character data converted from the speaker's voice data at a predetermined transmission timing, the distribution means provides feedback in which character data that the speech recognition function could recognize is distinguishable from character data that it could not recognize.
(Appendix 7)
The voice transcription server according to any one of Appendices 1 to 6, further comprising speaker authentication means for authenticating the speaker using device-specifying information that is received together with the voice data and that identifies the calling device,
wherein the voice character conversion means and the distribution means are enabled when authentication by the speaker authentication means succeeds.
(Appendix 8)
The voice transcription server according to Appendix 7, further comprising speaker information management means for storing and managing, in a speaker database, speaker information that includes the device-specifying information and a transmission method for the character data, in association with a speaker ID identifying the speaker,
wherein the distribution means distributes the character data by the means specified by the transmission method.
(Appendix 9)
A voice transcription method in which a computer functioning as a voice transcription server executes:
a voice data receiving procedure of receiving, via a communication line, voice data of a speaker during a call;
a frame unit dividing procedure of dividing the voice data received in the voice data receiving procedure into frames, one for each uttered syllable;
a voice character conversion procedure of converting the voice data into character data frame by frame; and
a distribution procedure of distributing the character data to the speaker.
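The subdivision fallback of Appendix 4 can be sketched as follows. This is a hedged illustration only: the "predetermined dividing method" is not specified in this section, so splitting the frame into equal-sized pieces is an assumption, and the dictionary `db` (checksum to text) and the function `lookup_with_subdivision` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def lookup_with_subdivision(db: Dict[int, str], frame: Sequence[int],
                            parts: int = 2) -> List[Optional[str]]:
    """Look up a frame by checksum; if it is not registered, subdivide and retry.

    Returns one entry per matched unit: a single text when the whole frame is
    registered, otherwise one text (or None if unregistered) per sub-frame.
    """
    if not frame:
        return []
    if parts < 1:
        raise ValueError("parts must be at least 1")
    whole = db.get(sum(frame))
    if whole is not None:
        return [whole]                       # whole frame already registered
    size = -(-len(frame) // parts)           # ceiling division: samples per piece
    pieces = [frame[i:i + size] for i in range(0, len(frame), size)]
    return [db.get(sum(piece)) for piece in pieces]


db = {10: "ka", 3: "ki"}                     # made-up checksum -> text entries
result = lookup_with_subdivision(db, [4, 6, 1, 2])
```

Pieces that come back as `None` would still need speech recognition, so the fallback only reduces the amount of voice data that must be recognized from scratch.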

The present invention is not limited to the specifically disclosed embodiments, and various modifications and changes can be made without departing from the scope of the claims.

FIG. 1 is a diagram for explaining an outline of the overall processing.
FIG. 2 is a flowchart for explaining the voice registration process, which converts voice data into character data and registers it.
FIG. 3 is a diagram showing an example of a method of quantizing analog voice data.
FIG. 4 is a diagram for explaining the process of dividing voice data.
FIG. 5 is a flowchart for explaining the character data queuing process called from the voice registration process shown in FIG. 2.
FIG. 6 is a flowchart for explaining the character data transmission process.
FIG. 7 is a flowchart for explaining the process by which a user proofreads character data stored in the voice DB.
FIG. 8 is a flowchart for explaining the process by which an operator proofreads automatically recognized character data registered in the proofreading request queue.
FIG. 9 is a diagram showing a table configuration example of the speaker DB in which users are registered.
FIG. 10 is a diagram showing a table configuration example of the voice DB that stores character data converted from voice data.
FIG. 11 is a diagram showing an application example in which calls are made with mobile phones.
FIG. 12 is a diagram showing an application example in which calls are made with IP telephones.
FIG. 13 is a diagram showing an application example in which calls are made by IP telephone via a P2P network.
FIG. 14 is a diagram showing a functional configuration example of the voice transcription server.

Explanation of symbols

5 Voice transcription server
6 PC
11, 12 Office switch
21, 22 Radio tower
31, 32 Mobile phone
41 Speaker DB
42 Voice data file
43 Voice DB
47 Transmission standby DB
100 Voice transcription system

Claims

A voice transcription server comprising:
voice data receiving means for receiving, via a communication line, voice data of a speaker during a call;
voice registration determination means for determining, by a checksum match, whether the speaker's voice data received by the voice data receiving means is registered in a voice database in which a checksum of voice data is associated with character data converted from that voice data;
distribution means for distributing, to the speaker, the character data corresponding to the voice data whose checksum matched;
voice character conversion means for converting the voice data into character data when no voice data with a matching checksum exists in the voice database; and
voice registration means for registering the voice data by storing, in the voice database, the checksum of the voice data in association with the character data converted from the voice data by the voice character conversion means.

The voice transcription server according to claim 1, further comprising:
frame unit dividing means for dividing the voice data received by the voice data receiving means into frames, one for each uttered syllable; and
checksum calculation means for calculating a checksum for each frame,
wherein the voice registration determination means uses the checksum calculated for each frame to determine whether the voice data is registered in the voice database.

The voice transcription server according to claim 2, wherein, when the voice registration determination means determines that the voice data is not registered, the voice data of the frame is subdivided by a predetermined dividing method, the checksum calculation means calculates a checksum for the subdivided voice data, and the voice registration determination means determines whether voice data with a matching checksum is registered.