JP4595415B2 - Voice search system, method and program

Info

Publication number: JP4595415B2
Application number: JP2004207650A
Authority: JP
Inventors: 真寺尾; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-07-14
Filing date: 2004-07-14
Publication date: 2010-12-08
Anticipated expiration: 2024-07-14
Also published as: JP2006031278A

Description

The present invention relates to a voice search system, method, and program, and more particularly to a system, method, and program for searching for words and phrases in speech data that has undergone speech recognition.

Conventionally, voice search systems of this type have been used to search for keywords in speech data such as news broadcasts and lectures in order to access desired content. Speech recognition is used for this purpose, but unknown words that are not registered in the dictionary cannot be recognized, and misrecognition is unavoidable even for known words. Non-Patent Document 1 describes an example of a voice search system that searches for words and phrases in speech recognition results containing such misrecognitions.

FIG. 12 is a block diagram showing the configuration of the voice search system described in Non-Patent Document 1. In FIG. 12, the voice search system consists of speech data storage means 101, continuous phoneme recognition means 102, recognition result phoneme string storage means 103, search character string input means 104, phoneme conversion means 105, matching means 106, and search result output means 107. A conventional voice search system with this configuration operates as follows.

The speech data storage means 101 stores the speech data to be searched. The continuous phoneme recognition means 102 performs continuous phoneme recognition on the speech data and stores the resulting phoneme string in the recognition result phoneme string storage means 103. Meanwhile, the search character string entered through the search character string input means 104 is converted into a phoneme string by the phoneme conversion means 105. The matching means 106 compares the phoneme-converted search character string against the entire recognition result phoneme string by phoneme-level DP (Dynamic Programming) matching, and determines that the search character string is present in any section where the cumulative distance is smaller than a predetermined threshold. The search result output means 107 outputs the sections of speech data in which the search character string was determined to be present. Because the DP matching is performed phoneme by phoneme, the search still succeeds to some extent even when the character string being sought has been misrecognized.
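The phoneme-level DP matching described above can be sketched as approximate substring matching with a free starting point. This is only an illustration under stated assumptions: unit insertion, deletion, and substitution costs are assumed, the function name `spot_phonemes` is hypothetical, and Non-Patent Document 1 may define its distances and paths differently.

```python
def spot_phonemes(query, recognized, threshold):
    """Return (end_index, distance) pairs where the query matches approximately.

    query and recognized are lists of phoneme symbols. A match may start
    anywhere in recognized, so the DP row for an empty query is all zeros.
    Unit edit costs are an assumption made for this sketch.
    """
    m = len(query)
    prev = list(range(m + 1))  # column for 0 recognized phonemes
    hits = []
    for j, r in enumerate(recognized, start=1):
        cur = [0]  # free start: matching may begin at any position
        for i in range(1, m + 1):
            sub = prev[i - 1] + (0 if query[i - 1] == r else 1)
            cur.append(min(sub, prev[i] + 1, cur[i - 1] + 1))
        if cur[m] <= threshold:
            hits.append((j, cur[m]))
        prev = cur
    return hits
```

Every phoneme of the recognition result triggers a pass over every phoneme of the query, which is the per-phoneme distance computation that the following paragraphs identify as the cost of the conventional approach.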

Frame-synchronous beam search, which is used as one of the continuous word speech recognition algorithms, is described in Patent Document 1. Various search techniques for speech recognition are described in Non-Patent Document 2.

Patent Document 1: Japanese Patent No. 3346285
Non-Patent Document 1: Ryuichi Oka et al., "Speech and text retrieval using phoneme sequence representation", IEICE Technical Report, 2001, SP2001-29, pp. 29-35
Non-Patent Document 2: Lawrence Rabiner et al., "Fundamentals of Speech Recognition (Vol. 2)", Japanese translation supervised by Sadaoki Furui, NTT Advanced Technology Corporation, 1995, pp. 194-229

The problem with the prior art is that the search requires a large amount of computation. As a result, when the volume of speech data to be searched grows, the user has to wait a long time between entering the search character string and actually obtaining the search results, which reduces the practicality of the voice search system. The prior art requires so much computation because the DP matching between the phoneme-converted search character string and the recognition result phoneme string computes distances along multiple paths, phoneme by phoneme.

An object of the present invention is to provide a voice search system, method, and program that can cope with cases where the character string being sought has been misrecognized or is an unknown word, and that can search at high speed.

The principle by which the present invention achieves this object is to expand the input search character string into similar words or similar word strings that can appear in the speech recognition results of the speech data to be searched, and only then perform the search.

A voice search system according to one aspect of the present invention comprises: recognition result word string storage means that stores the word-level speech recognition results of the speech data to be searched as a recognition result word string; word candidate storage means that stores, as candidate words, words that can appear in the speech recognition results; search character string expansion means that converts the input search character string into a phoneme string, converts the candidate words stored in the word candidate storage means, or candidate word strings formed by combining candidate words, into phoneme strings, and expands the search character string into candidate words or candidate word strings based on the degree of matching between the phonemes contained in the respective phoneme strings; and search means that searches the recognition result word string storage means for the candidate words or candidate word strings produced by the search character string expansion means.

In the voice search system of a first developed form, the word candidate storage means is preferably configured to store the recognition vocabulary used when obtaining the recognition result word string from the speech data.

The voice search system of a second developed form preferably comprises word extraction means that extracts a list of the words appearing in the recognition results stored in the recognition result word string storage means, with the word candidate storage means configured to store that list.

In the voice search system of a third developed form, the following configuration is preferable. When extracting the list, the word extraction means examines which words appear immediately before and after each word in the recognition results and creates an inter-word connection table that permits connections only to those neighboring words. The word candidate storage means stores the connection table together with the list. The search character string expansion means includes a function that refers to the list and the connection table stored in the word candidate storage means and expands the search character string only into candidate words or candidate word strings whose connections are permitted.
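A minimal sketch of how the word list and connection table might be built from recognition results. The data layout (a set of words plus a mapping from each word to the set of words observed directly after it) and the function names are assumptions made for illustration; the patent does not specify a representation.

```python
from collections import defaultdict


def build_connection_table(recognized_utterances):
    """Collect the word list and the observed word-to-next-word connections.

    recognized_utterances is a list of word lists, one per recognized
    utterance (a hypothetical input format).
    """
    vocab = set()
    followers = defaultdict(set)
    for words in recognized_utterances:
        vocab.update(words)
        for left, right in zip(words, words[1:]):
            followers[left].add(right)
    return vocab, dict(followers)


def is_connectable(word_seq, followers):
    """True if every adjacent pair in word_seq was observed in the results."""
    return all(b in followers.get(a, ()) for a, b in zip(word_seq, word_seq[1:]))
```

With such a table, an expansion routine can discard any candidate word string for which `is_connectable` returns False, since that string cannot occur in the stored recognition results.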

In the voice search system of a fourth developed form, the search character string expansion means preferably includes a function that performs the word expansion with a continuous word speech recognition algorithm, treating the phoneme string of the search character string as the input feature vector sequence and the candidate words as the recognition vocabulary.

In the voice search system of a fifth developed form, the search character string expansion means preferably includes a function that, when expanding the search character string using the candidate words stored in the word candidate storage means, obtains the distance between the search character string and each candidate word or candidate word string based on the degree of matching between phonemes, and performs the expansion so that this distance stays within a predetermined threshold.

The voice search system of a sixth developed form preferably comprises confusion matrix storage means that stores a confusion matrix representing the tendency of phoneme recognition errors, and the search character string expansion means preferably includes a function that obtains the degree of matching between phonemes based on the confusion matrix.
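One way a confusion matrix could be turned into a phoneme-to-phoneme cost is sketched below. The formula (one minus the row-normalized confusion frequency) and the fallback cost for unseen pairs are assumptions chosen for simplicity; the patent states only that the degree of matching is derived from the confusion matrix.

```python
def phoneme_costs(confusion_counts):
    """Map (spoken, recognized) phoneme pairs to a cost in [0, 1].

    confusion_counts[spoken][recognized] is how often `spoken` was
    recognized as `recognized` (a hypothetical count table).
    """
    costs = {}
    for spoken, row in confusion_counts.items():
        total = sum(row.values())
        for recognized, count in row.items():
            # Frequent confusions get a low cost, rare ones a high cost.
            costs[(spoken, recognized)] = 1.0 - count / total
    return costs


def cost(costs, spoken, recognized, default=1.0):
    """Look up a pair; unseen confusions fall back to the maximum cost."""
    return costs.get((spoken, recognized), default)
```

These costs could replace the 0/1 substitution cost in a phoneme-level DP match, so that phoneme pairs the recognizer often confuses count as nearly matching.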

In the voice search system of a seventh developed form, the search character string expansion means preferably includes a function that obtains the degree of matching between phonemes based on the inter-model distances in the acoustic model used when obtaining the recognition result word string from the speech data.

In the voice search system of an eighth developed form, the search character string expansion means preferably includes a distance addition function that, in addition to the above distance, refers to the language model used when obtaining the recognition result word string from the speech data and increases the distance for words and word strings that are linguistically unlikely to occur.
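The distance addition could, for instance, add a weighted negative log probability from a bigram language model to the acoustic distance. The sketch below assumes exactly that; the bigram table, the start symbol `<s>`, the probability floor, and the weight are all hypothetical choices, not values from the patent.

```python
import math


def lm_penalty(word_seq, bigram_prob, floor=1e-6, weight=1.0):
    """Weighted sum of -log P(word | previous word) over the sequence.

    bigram_prob maps (previous, word) pairs to probabilities; pairs that
    are missing are treated as having the small probability `floor`.
    """
    penalty = 0.0
    for prev, word in zip(["<s>"] + list(word_seq), word_seq):
        penalty += -math.log(bigram_prob.get((prev, word), floor))
    return weight * penalty


def total_distance(acoustic_distance, word_seq, bigram_prob, weight=0.1):
    """Acoustic distance plus the language-model penalty."""
    return acoustic_distance + lm_penalty(word_seq, bigram_prob, weight=weight)
```

Two candidate word strings with the same acoustic distance are then ranked by how plausible they are under the language model, and implausible ones are more likely to exceed the expansion threshold.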

A voice search method according to a second aspect of the present invention is a method by which a voice search system searches for a word string, the system comprising input means, expansion means, search means, and storage means that stores, as candidate words, words that can appear in the word-level speech recognition results of the speech data to be searched. The method includes: a step in which the input means inputs a search character string; a step in which the expansion means converts the candidate words in the storage means, or candidate word strings formed by combining candidate words, into phoneme strings, converts the search character string into a phoneme string, and expands the search character string into candidate words or candidate word strings based on the degree of matching between the phonemes contained in the respective phoneme strings; and a step in which the search means represents the speech recognition results as a recognition result word string and searches that word string for the expanded candidate words or candidate word strings.

In the voice search method of a first developed form, the voice search system may further comprise speech recognition means, and the method may include a step in which, prior to the step of inputting the search character string, the speech recognition means performs speech recognition on the speech data and obtains the candidate words as the recognition result.

In the voice search method of a second developed form, the voice search system may further comprise speech recognition means, and the method may include a step in which, prior to the step of inputting the search character string, the speech recognition means performs speech recognition on the speech data and creates a list of the words extracted from the recognition result. The step of expanding the search character string may then perform the expansion with reference to the created list.

A voice search method according to a third aspect of the present invention is a method by which a voice search system searches for a word string, the system comprising expansion means, search means, speech recognition means, recognition result word string storage means, and word candidate storage means. The method includes: a step in which the speech recognition means stores the word-level speech recognition results of the speech data to be searched in the recognition result word string storage means as a recognition result word string; a step in which, with the word candidate storage means holding as candidate words the words that can appear in the recognition results, the expansion means converts the input search character string into a phoneme string, converts the candidate words stored in the word candidate storage means, or candidate word strings formed by combining candidate words, into phoneme strings, and expands the search character string into candidate words or candidate word strings based on the degree of matching between the phonemes contained in the respective phoneme strings; and a step in which the search means searches the recognition result word string storage means for the expanded candidate words or candidate word strings.

In the voice search method of a third developed form, the word candidate storage means may store the recognition vocabulary used when obtaining the recognition result word string from the speech data.

In the voice search method of a fourth developed form, the voice search system may further comprise word extraction means, and the method may further include a step in which the word extraction means extracts a list of the words appearing in the recognition results stored in the recognition result word string storage means, and a step in which the word extraction means stores the extracted list of words in the word candidate storage means. The step of expanding into candidate word strings may then perform the expansion with reference to the stored list.

A program according to a fourth aspect of the present invention causes a computer constituting a voice search system that comprises recognition result word string storage means and word candidate storage means to execute: a process of storing the word-level speech recognition results of the speech data to be searched in the recognition result word string storage means as a recognition result word string; a process of storing words that can appear in the recognition results in the word candidate storage means as candidate words; a search character string expansion process of converting the input search character string into a phoneme string, converting the candidate words stored in the word candidate storage means, or candidate word strings formed by combining candidate words, into phoneme strings, and expanding the search character string into candidate words or candidate word strings based on the degree of matching between the phonemes contained in the respective phoneme strings; and a search process of searching the recognition result word string storage means for the candidate words or candidate word strings produced by the search character string expansion process.

The program of a first developed form may further cause the computer to execute a process of storing, in the word candidate storage means, the recognition vocabulary used when obtaining the recognition result word string from the speech data.

The program of a second developed form may further cause the computer to execute a word extraction process of extracting a list of the words appearing in the recognition results stored in the recognition result word string storage means, and a process of storing the list of words extracted by the word extraction process in the word candidate storage means. The search character string expansion process may then be executed so that the expansion refers to the stored list.

A program according to a fifth aspect of the present invention causes a computer constituting a voice search system that comprises recognition result word string storage means and word candidate storage means to execute: a search character string expansion process of converting into phoneme strings the candidate words, or candidate word strings formed by combining candidate words, held in the word candidate storage means, in which words that can appear in the word-level speech recognition results of the speech data to be searched are stored in advance, converting the input search character string into a phoneme string, and expanding the search character string into candidate words or candidate word strings based on the degree of matching between the phonemes contained in the respective phoneme strings; and a search process of searching the recognition result word string storage means, in which the speech recognition results are stored in advance as a recognition result word string, for the candidate words or candidate word strings produced by the search character string expansion process.

The program of a third developed form may further cause the computer to execute a process of performing speech recognition on the speech data and obtaining the candidate words as the recognition result.

According to the present invention, it is possible to cope with cases where the character string being sought has been misrecognized or is an unknown word, and to search at high speed. The first reason is that the search runs over word-level recognition results, a larger unit than phoneme-level recognition results. This shrinks the search space and improves search speed. It is possible because the input search character string is expanded into words or word strings before the word-level recognition results are searched.

A second reason is that a misrecognized character string can still be found to some extent without computing the distance between the search character string and the recognition results by DP matching or the like at search time. This is because the input search character string is expanded in advance into the words or word strings that are likely to result from recognizing it, and the search is performed afterward.

A third reason is that words or word strings that cannot possibly appear in the recognition results are never searched for. Because the input search character string is expanded into words or word strings that can appear in the recognition results, no search effort is wasted even when an unknown word is entered as the search character string.

Next, the best mode for carrying out the present invention is described in detail with reference to the drawings. A voice search system according to an embodiment of the present invention has recognition result word string storage means (13 in FIG. 1) that stores the word-level speech recognition results of the speech data to be searched, word candidate storage means (18 in FIG. 1) that stores candidates for the words that can appear in the recognition result word string, search character string expansion means (15 in FIG. 1) that uses the words stored in the word candidate storage means to expand the search character string into words or word strings that are acoustically close to it, and search means (16 in FIG. 1) that searches the recognition result word string for the expanded words or word strings.

A voice search system configured in this way searches the recognition result word string for the words or word strings that are likely to result from recognizing the input search character string. It can therefore cope with cases where the character string being sought has been misrecognized or is an unknown word, and it can search at high speed.

The voice search system is described below in more detail with reference to working examples.

FIG. 1 is a block diagram showing the configuration of a voice search system according to a first example of the present invention. In FIG. 1, the voice search system consists of speech data storage means 11 that stores the speech data to be searched, continuous word recognition means 12 that performs continuous word recognition on the speech data, recognition result word string storage means 13 that stores the results of continuous word recognition, word candidate storage means 18 that stores candidates for the words that can appear in the recognition result word string, search character string input means 14 for entering the character string to be searched for, search character string expansion means 15 that expands the input search character string into words or word strings, search means 16 that searches the recognition result word string for the expanded words or word strings, and search result output means 17 that outputs the search results.

The word candidate storage means 18 stores the recognition dictionary that defines the recognition vocabulary of the continuous word recognition means 12. As a result, the word candidates stored in the word candidate storage means 18 are words that can appear in the recognition results. The word candidate storage means 18 does not necessarily have to store the recognition dictionary itself. For example, it may store a word list obtained by removing short words, such as the particles "は" and "が", from the recognition dictionary. Removing short words prevents the number of expanded word strings from becoming enormous.

Next, the operation of the voice search system according to the first example of the present invention is described in detail with reference to FIG. 1 and FIG. 2. FIG. 2 is a flowchart showing the operation of the voice search system according to the first example.

First, in advance, the continuous word recognition means 12 reads the speech data to be searched from the speech data storage means 11, performs continuous word recognition, and stores the recognition results in the recognition result word string storage means 13. Continuous word recognition is realized by a common speech recognition technique: a search that uses an acoustic model based on phoneme-unit HMMs (Hidden Markov Models) and an n-gram language model based on the statistical probabilities of sequences of n words. Syllables or comparable subwords may also be used as the unit of the acoustic model, and a context-free grammar or the like may be used as the language model. This example describes the case where speech data stored beforehand in the speech data storage means 11 is recognized to generate the recognition result word string to be searched, but this does not limit the state of the speech data in the present invention. For example, voice search can also be performed in real time by having the continuous word recognition means 12 sequentially run recognition on speech input from a microphone or the like as the speech data is created.

In this example, the continuous word recognition means 12 outputs the result of continuous word recognition as it is to the recognition result word string storage means 13, but it does not have to do so. For example, each word in the word string obtained by continuous word recognition may be divided further into shorter units, or adjacent words may be joined into longer units. As an example, the long-unit word "非科学的" ("unscientific") may be split into the short-unit words "非" ("un-"), "科学" ("science"), and "的" ("-ic"). In that case, the word candidates stored in the word candidate storage means 18 should preferably be converted to short or long units in the same way, so that they match the kinds of words that appear in the recognition result word string storage means 13. This example also covers using the output of morphological analysis applied to the continuous word recognition result as the recognition result word string. In that case, the word candidates stored in the word candidate storage means 18 should preferably be the words contained in the dictionary that defines the vocabulary of the morphological analyzer.

When a user searches the speech data for a word or word string, the user first enters the character string to be searched for through the search character string input means 14, such as a keyboard (step A1). Input of the search character string need not be limited to typing on a keyboard or the like; speech input through a microphone or the like may be recognized instead. For example, this example also covers using the result of continuous phoneme recognition or isolated word recognition of microphone input as the search character string.

Next, the search character string expansion means 15 expands the search character string into words or word strings using the word candidates stored in the word candidate storage means 18 (step A2). The expansion is performed so that the acoustic distance between the search character string and each resulting word or word string is small. For example, suppose that "Harry Potter" (ハリーポッター) is entered as the search character string, and that the word candidate storage means 18 holds word candidates such as "Harry" (ハリー), "poster" (ポスター), "ha" (は), and "reporter" (リポーター). The search character string expansion means 15 then arranges word candidates from the word candidate storage means 18 to form word strings that are acoustically close to the search character string "Harry Potter", such as "Harry" + "poster" and "ha" + "reporter". Of course, if "Harry Potter" itself exists in the word candidate storage means 18, the search character string is also expanded into the single word "Harry Potter". Because the expansion keeps the acoustic distance to the search character string small, it yields the word strings that are likely to result from recognizing the search character string, misrecognitions included. In other words, when "Harry Potter" is spoken and recognized, word strings such as "Harry" + "poster" or "ha" + "reporter" are likely to be obtained as the recognition result. Since the word candidate storage means 18 holds the recognition vocabulary of the continuous word recognition means 12, the search character string is never expanded into unrelated word strings that cannot appear in the recognition results. The following explains how the search character string expansion means 15 finds the words or word strings that are acoustically close to the input search character string.

First, the search character string expansion means 15 converts the input search character string into a phoneme string. This conversion requires the reading of the search character string, which may be assigned automatically from a search character string entered in mixed kana and kanji, or may be entered by the user through the search character string input means 14. When the search character string is obtained by recognizing speech input from a microphone or the like, however, its phoneme string is obtained by the recognition itself, so the search character string expansion means 15 need not perform the phoneme conversion described here. The word candidate storage means 18, for its part, also stores the phoneme string of each word candidate.

The search character string expansion means 15 computes the acoustic distance between the search character string and various sequences of the word candidates stored in the word candidate storage means 18 by DP matching in units of phonemes, and expands the search character string into those word strings whose distance is within a predetermined threshold. Although this embodiment computes the distance in units of phonemes, it may instead be computed in units of syllables or comparable subword units. In that case, the search character string and the word candidates must be converted into syllable strings or subword strings instead of phoneme strings.

FIG. 3 shows how DP matching computes the distance between the phoneme string of the search character string, "hariipoQtaa" (Harry Potter; Q denotes the geminate consonant), and the phoneme string of the expanded word string "harii" (Harry) + "bokusaa" (boxer), using a phoneme distance measure in which the distance between identical phonemes is 0, the distance between different phonemes is 1, and the distance for an inserted or dropped phoneme is 1. In this example, the accumulated distance between "hariipoQtaa" (Harry Potter) and "harii" (Harry) + "bokusaa" (boxer) is 4. FIG. 4, on the other hand, shows how DP matching computes the distance between the phoneme string of the search character string, "hariipoQtaa" (Harry Potter), and the phoneme string of the expanded word string "harii" (Harry) + "posutaa" (poster); in this case the accumulated distance is 2. If the distance threshold that decides whether to expand is set to 3, the search character string expansion means 15 expands the search character string "Harry Potter" into "Harry" + "poster" but not into "Harry" + "boxer". The threshold need not always be a constant; for example, it may be normalized according to the length of the search character string.
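The DP matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: each phoneme is written as one character (so "Q" is a single symbol), and the function name `dp_distance` and its parameters are hypothetical. With the unit costs given in the text it reproduces the accumulated distances 4 and 2.

```python
def dp_distance(query, candidate, sub_cost=1, indel_cost=1):
    """Accumulated DP-matching distance between two phoneme sequences.

    Identical phonemes cost 0, differing phonemes cost `sub_cost`,
    and an inserted or dropped phoneme costs `indel_cost`.
    """
    # prev[j]: distance between the query prefix seen so far and candidate[:j]
    prev = [j * indel_cost for j in range(len(candidate) + 1)]
    for i, q in enumerate(query, start=1):
        curr = [i * indel_cost]
        for j, c in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + indel_cost,                       # query phoneme has no counterpart
                curr[j - 1] + indel_cost,                   # candidate phoneme has no counterpart
                prev[j - 1] + (0 if q == c else sub_cost),  # match or substitution
            ))
        prev = curr
    return prev[-1]


QUERY = "hariipoQtaa"   # Harry Potter, one character per phoneme (assumption)
THRESHOLD = 3           # expansion threshold used in the text's example

for words in (("harii", "bokusaa"), ("harii", "posutaa")):
    d = dp_distance(QUERY, "".join(words))
    print(words, d, "expand" if d <= THRESHOLD else "skip")
```

Keeping only one previous row is enough here because only the final accumulated distance is needed, not the alignment path itself.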

このように、検索文字列との距離が近くなるような単語列を効率的に求めることは、連続単語音声認識で用いられているサーチアルゴリズムによって高速に実現できる。連続単語音声認識とは、入力特徴ベクトルの時系列に近い単語列を求める問題である。連続単語音声認識アルゴリズムによって、入力特徴ベクトルを単語辞書中の単語の様々な組み合わせと照合することで、認識結果となる単語列を求めることができる。ここで、連続単語音声認識における特徴ベクトルの時系列とは、入力音声波形をフレームと呼ばれる時間単位ごとに分析したものである。 Thus, efficiently obtaining a word string that is close to the search character string can be realized at high speed by a search algorithm used in continuous word speech recognition. Continuous word speech recognition is a problem of obtaining a word string close to a time series of input feature vectors. By comparing the input feature vector with various combinations of words in the word dictionary by the continuous word speech recognition algorithm, a word string that becomes a recognition result can be obtained. Here, the time series of feature vectors in continuous word speech recognition is obtained by analyzing an input speech waveform for each time unit called a frame.

一方、本発明における検索文字列展開は、入力検索文字列に近い単語列を求める問題である。これは、上記の連続単語音声認識アルゴリズムにおける入力特徴ベクトルとして、検索文字列の音素列を入力することで実現できる。入力検索文字列と単語の様々な組み合わせとの照合時のスコア計算には、音素間の距離を用いればよい。なお、本実施例では音素列を入力としているが、音節列またはそれに順ずるサブワード列を入力としても良い。 On the other hand, search character string expansion in the present invention is a problem of obtaining a word string close to the input search character string. This can be realized by inputting a phoneme string of a search character string as an input feature vector in the continuous word speech recognition algorithm. The distance between phonemes may be used to calculate the score when collating the input search character string with various combinations of words. In this embodiment, a phoneme string is used as an input, but a syllable string or a subword string corresponding to the syllable string may be used as an input.

The following describes frame-synchronous beam search, as described in Patent Document 1, as one commonly used continuous word speech recognition algorithm. Frame-synchronous beam search expands, frame by frame, the word strings that are candidates for the recognition result as hypotheses, while deleting hypotheses whose score is at or below a threshold. In this way it efficiently matches the time series of input feature vectors against the word string hypotheses. Specifically, the following steps 1 to 3 are repeated.

ステップ１：Ｉ番目のフレームの仮説をＩ＋１番目のフレームに展開する。すなわち、Ｉ番目のフレームの仮説が単語終端状態にあれば、辞書中の単語を接続して仮説を展開する。Ｉ番目のフレームの仮説は消去され、Ｉ＋１番目のフレームの仮説だけが記憶される。 Step 1: The hypothesis of the I-th frame is expanded to the I + 1-th frame. That is, if the hypothesis of the I-th frame is in the word end state, the word in the dictionary is connected and the hypothesis is developed. The hypothesis for the I-th frame is erased and only the hypothesis for the I + 1-th frame is stored.

ステップ２：Ｉ＋１番目のフレームに展開された仮説のうち、スコアが一定の閾値より良い仮説のみを記憶し、それ以外の仮説を消去する。これは枝狩り(beam pruning)と呼ばれる。 Step 2: Among hypotheses developed in the (I + 1) th frame, only hypotheses whose scores are better than a certain threshold value are stored, and other hypotheses are deleted. This is called beam pruning.

ステップ３：処理すべきフレーム番号Ｉに１を加える。 Step 3: Add 1 to the frame number I to be processed.

The expansion of the search character string in the present invention can be realized by replacing the input feature vectors of the frame-synchronous beam search above with the input search phoneme string. Processing per frame becomes processing per phoneme, and the score is the accumulated phoneme distance between the input search phoneme string and the expanded word string hypothesis. The pruning in step 2 may, for example, delete hypotheses whose accumulated distance has reached or exceeded a fixed value. Alternatively, taking the hypothesis with the smallest accumulated distance among the expanded hypotheses as the reference, hypotheses whose accumulated distance exceeds that reference by a fixed threshold or more may be deleted. In this way, word strings that are close to the input search character string can be obtained efficiently. The number of word strings obtained can also be controlled by adjusting the pruning threshold.
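The idea of expanding hypotheses while pruning on accumulated distance can be sketched as below. This is a simplified variant that grows hypotheses one word at a time rather than one frame at a time, so it is not the frame-synchronous search of Patent Document 1. The names `expand` and `_advance`, the one-character-per-phoneme encoding, and the toy vocabulary are assumptions for illustration. Each hypothesis carries one DP row (its distance to every prefix of the query); because the minimum of that row can never decrease as more phonemes are appended, a hypothesis whose row minimum exceeds the threshold can be deleted safely.

```python
def _advance(row, phoneme, query):
    """Update a DP row after appending one phoneme to the hypothesis."""
    new = [row[0] + 1]
    for j, q in enumerate(query, start=1):
        new.append(min(row[j] + 1,                              # extra hypothesis phoneme
                       new[j - 1] + 1,                          # query phoneme left unmatched
                       row[j - 1] + (0 if q == phoneme else 1)))  # match or substitution
    return new


def expand(query, vocab, threshold):
    """Return {word tuple: distance} for word strings within `threshold` of `query`."""
    vocab = [w for w in vocab if w]          # empty entries would never terminate
    frontier = [((), list(range(len(query) + 1)))]
    results = {}
    while frontier:
        next_frontier = []
        for words, row in frontier:
            for word in vocab:
                new_row = row
                for phoneme in word:
                    new_row = _advance(new_row, phoneme, query)
                if min(new_row) > threshold:
                    continue                 # pruning: no extension can recover
                hypothesis = words + (word,)
                if new_row[-1] <= threshold:
                    results[hypothesis] = new_row[-1]
                next_frontier.append((hypothesis, new_row))
        frontier = next_frontier
    return results


vocab = ["harii", "posutaa", "bokusaa", "wa", "ripootaa"]
for words, dist in sorted(expand("hariipoQtaa", vocab, 3).items(), key=lambda kv: kv[1]):
    print(" + ".join(words), dist)
```

Lowering `threshold` shrinks the set of returned word strings, which mirrors the remark that the pruning threshold controls how many word strings are obtained.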

なお、検索文字列の展開アルゴリズムはフレーム同期ビームサーチに限る必要はなく、連続単語音声認識アルゴリズムとして一般的に用いられている手法を適用することもできる。例えば、２段ＤＰマッチング、レベルビルディング法、或いはワンステージ法などによって展開を行うことも可能である。これらのアルゴリズムの詳細は、非特許文献２に記載されている。 The search character string expansion algorithm need not be limited to the frame-synchronized beam search, and a technique generally used as a continuous word speech recognition algorithm can also be applied. For example, it is possible to perform development by a two-stage DP matching, a level building method, a one-stage method, or the like. Details of these algorithms are described in Non-Patent Document 2.

また、本実施例では、音素間の距離として同じ音素間のときの距離を０、違う音素間の距離を１としたが、別の距離尺度を使っても良い。例えば、音素間混同行列に基づいて音素間距離を計算しても良い。音素間混同行列とは、音声認識において各音素がどのような音素に認識されやすいかを予め認識実験などにより求め、行列の要素を確率で表したものである。この音素間混同行列の例を図５に示す。図５は、入力音素ｋ、ｇ、ｓ、ｚ、ａ、・・がそれぞれｋ、ｇ、ｓ、ｚ、ａ、・・と認識される確率を行列で表したものである。例えば、音素ｋがｋと認識される確率は０．６、ｇと認識される確率は０．３、ｚと認識される確率は０．１、であることなどが示される。このとき、例えば、音素混同行列中の確率の逆数を音素間距離として定義することができる。このように距離を定義することで、音声認識における誤り傾向を考慮した距離を計算することが可能となる。なお、この場合、確率が０である音素間の距離は、十分に大きな値とする。 In this embodiment, the distance between the same phonemes is 0 and the distance between different phonemes is 1 as the distance between phonemes, but another distance scale may be used. For example, the distance between phonemes may be calculated based on the interphoneme confusion matrix. The inter-phoneme confusion matrix is obtained by obtaining in advance a recognition experiment or the like what kind of phoneme each phoneme is likely to be recognized in speech recognition, and expressing the elements of the matrix by probability. An example of this interphoneme confusion matrix is shown in FIG. FIG. 5 is a matrix showing the probabilities that input phonemes k, g, s, z, a,... Are recognized as k, g, s, z, a,. For example, the probability that the phoneme k is recognized as k is 0.6, the probability that it is recognized as g is 0.3, and the probability that it is recognized as z is 0.1. At this time, for example, the reciprocal of the probability in the phoneme confusion matrix can be defined as the interphoneme distance. By defining the distance in this way, it is possible to calculate the distance in consideration of the error tendency in speech recognition. In this case, the distance between phonemes having a probability of 0 is set to a sufficiently large value.
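A confusion-matrix distance of this kind might look like the sketch below. Only the row for the phoneme k (0.6, 0.3, 0.1) comes from the text; the dictionary layout, the function name `confusion_distance`, and the value chosen for the "sufficiently large" distance are assumptions.

```python
LARGE_DISTANCE = 1e6   # stands in for the "sufficiently large value" (assumed magnitude)

# confusion[spoken][recognized] = probability; only the row for "k" is given in the text
CONFUSION = {
    "k": {"k": 0.6, "g": 0.3, "z": 0.1},
}


def confusion_distance(spoken, recognized, confusion=CONFUSION, large=LARGE_DISTANCE):
    """Phoneme distance defined as the reciprocal of the confusion probability."""
    p = confusion.get(spoken, {}).get(recognized, 0.0)
    return 1.0 / p if p > 0 else large


print(confusion_distance("k", "k"))   # smallest distance in the row
print(confusion_distance("k", "g"))
print(confusion_distance("k", "s"))   # probability 0, so the large value is used
```

Under this definition even a correctly recognized phoneme has a nonzero distance (1/0.6 for k), so accumulated distances are comparable only with other distances computed from the same matrix.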

As another distance measure, the inter-model distance of the acoustic models that the continuous word recognition means 12 used during recognition may be used. For example, the distance between phonemes can be defined by the KL (Kullback-Leibler) distance between the probability distributions of the acoustic models representing each phoneme. When the acoustic model of each phoneme is modeled with one state and a single Gaussian distribution, the KL distance between two models is expressed by equation (1).

Here, f (x | u, Σ) is the K-dimensional Gaussian distribution with mean vector u and variance-covariance matrix Σ, and is expressed by equation (2).
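As a reference sketch, the standard textbook forms of a K-dimensional Gaussian density and of the KL divergence between two such densities are given below. These are assumed forms and are not copied from equations (1) and (2), which may use a different notation or a symmetrized distance; the subscripts 1 and 2 labeling the two phoneme models are introduced here for illustration.

```latex
\begin{align}
f(x \mid u, \Sigma)
  &= \frac{1}{(2\pi)^{K/2}\,|\Sigma|^{1/2}}
     \exp\!\left( -\tfrac{1}{2}\,(x-u)^{\top}\Sigma^{-1}(x-u) \right) \\
D_{\mathrm{KL}}(f_1 \,\|\, f_2)
  &= \int f(x \mid u_1,\Sigma_1)\,
     \log\frac{f(x \mid u_1,\Sigma_1)}{f(x \mid u_2,\Sigma_2)}\,dx \\
  &= \tfrac{1}{2}\left[ \log\frac{|\Sigma_2|}{|\Sigma_1|} - K
     + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
     + (u_2-u_1)^{\top}\Sigma_2^{-1}(u_2-u_1) \right]
\end{align}
```

The divergence is zero when the two models coincide and is not symmetric in its arguments, so a symmetric phoneme distance would need, for example, the sum of the two directions.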

さらに、音響モデルが複数の状態で表されるときや、状態が混合ガウス分布で表されるときには、例えば、最も距離の近いガウス分布間距離を音素間距離とすればよい。 Furthermore, when the acoustic model is represented by a plurality of states, or when the state is represented by a mixed Gaussian distribution, for example, the distance between the Gaussian distributions with the closest distance may be set as the interphoneme distance.

When the search character string expansion means 15 expands the search character string into words or word strings, linguistic constraints can be applied in addition to acoustic closeness. For example, when computing the distance between the search character string and the word candidates to be expanded, the language model that the continuous word recognition means 12 used during recognition may be consulted and a penalty given to words with a low unigram probability. A penalty may also be given when expanding into a word chain with a low bigram probability.

For example, penalties between words such as those shown in FIG. 6 are obtained in advance, for instance by multiplying the reciprocal of the bigram probability of the language model by a constant. The distance obtained in FIG. 4 between the phoneme string of the search character string, "hariipoQtaa" (Harry Potter), and the phoneme string of the expanded word string "harii" (Harry) + "posutaa" (poster) is then corrected to 4 by adding the penalty of 2 to the phoneme string distance of 2. On the other hand, as shown in FIG. 7, the phoneme string distance between "hariipoQtaa" (Harry Potter) and the phoneme string of the expanded word string "wa" (the particle は) + "ripootaa" (reporter) is 3, but the penalty is 0.5, so the final distance is corrected to 3.5. As a result, "wa" + "reporter", which is linguistically more likely to appear as a recognition result, is judged to be the closer of the two.
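The arithmetic of this correction can be sketched as follows. The two penalty values are the ones quoted in the text for FIG. 6; the function names, the dictionary layout, the default penalty for unlisted word pairs, and the `scale` constant are assumptions.

```python
# Word-pair penalties quoted in the text (FIG. 6); words are keyed by phoneme string
BIGRAM_PENALTY = {
    ("harii", "posutaa"): 2.0,
    ("wa", "ripootaa"): 0.5,
}


def penalty_from_bigram(probability, scale=1.0):
    """One way to derive a penalty: a constant times the reciprocal bigram probability."""
    return scale / probability


def penalized_distance(phoneme_distance, words, penalties=BIGRAM_PENALTY, default=0.0):
    """Add the penalty of every adjacent word pair to the phoneme-string distance."""
    pairs = zip(words, words[1:])
    return phoneme_distance + sum(penalties.get(pair, default) for pair in pairs)


print(penalized_distance(2, ("harii", "posutaa")))   # 2 + 2.0
print(penalized_distance(3, ("wa", "ripootaa")))     # 3 + 0.5
```

With these numbers the ranking of the two expansions is reversed relative to the purely acoustic distances, which is the effect the paragraph describes.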

また、より高次のn-gramに対しても同様である。距離計算のときにこのようなペナルティを加えることで、検索文字列を認識結果に出現しやすい単語または単語列のみに展開することが可能となる。 The same applies to higher-order n-grams. By adding such a penalty when calculating the distance, it is possible to expand the search character string only to words or word strings that tend to appear in the recognition result.

以上で説明したように、検索文字列展開手段１５によって、入力された検索文字列は、単語候補記憶手段１８が記憶する単語または単語列に展開される。このとき、展開された単語または単語列は、検索文字列を認識した結果得られる可能性の高い単語または単語列である。 As described above, the input search character string is expanded into a word or a word string stored in the word candidate storage unit 18 by the search character string expansion unit 15. At this time, the expanded word or word string is a word or word string that is highly likely to be obtained as a result of recognizing the search character string.

Finally, the search means 16 checks whether a word or word string expanded by the search character string expansion means 15 exists in the recognition result word string storage means 13 (steps A3 and A4). If an expanded word or word string exists in the recognition result word string, the search is judged to have succeeded, and the search result output means 17 outputs the section corresponding to that recognition result word string as the search result (step A5). If no expanded word or word string exists in the recognition result word string, the search is judged to have failed, and the search result output means 17 outputs a message indicating that the search could not be completed (step A6). Taking the earlier example in which the search character string "Harry Potter" is expanded into "Harry" + "poster", if "Harry" + "poster" exists in the recognition result, that section is output as the search result for "Harry Potter".

This embodiment also covers creating, in advance, a search index for the recognition result word strings stored in the recognition result word string storage means 13, so that the search means 16 searches for the expanded words or word strings by referring to the index. FIG. 8 shows an example of an index for the recognition results of the voice data to be searched. The occurrence position of each word is stored as a combination of the number of the document in which it appears and its position within that document. Consider, for example, searching for the word string "Harry" + "poster" obtained by expanding "Harry Potter". Referring to the index of FIG. 8 shows immediately that "Harry" appears as the 10th word and the 20th word of document 1, and that "poster" appears as the 11th word of document 1 and the 10th word of document 2. Checking whether "Harry" and "poster" are adjacent then shows that "Harry" + "poster" occupies the 10th to 11th words of document 1. Using an index such as that of FIG. 8 to search for the expanded word strings removes the need to scan the entire documents, so the search speed can be improved greatly.
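The index lookup walked through above can be sketched as follows. The postings are the ones quoted in the text for FIG. 8; the dictionary layout, the English word keys, and the function name `find_word_string` are assumptions for illustration.

```python
# word -> list of (document number, word position), as quoted in the text for FIG. 8
INDEX = {
    "Harry": [(1, 10), (1, 20)],
    "poster": [(1, 11), (2, 10)],
}


def find_word_string(words, index):
    """Return (document, first position, last position) for each consecutive occurrence."""
    if not words:
        return []
    following = [set(index.get(w, ())) for w in words[1:]]
    hits = []
    for doc, pos in index.get(words[0], ()):
        # the k-th following word must sit k positions later in the same document
        if all((doc, pos + k) in postings
               for k, postings in enumerate(following, start=1)):
            hits.append((doc, pos, pos + len(words) - 1))
    return hits


print(find_word_string(["Harry", "poster"], INDEX))
```

Only the postings of the queried words are touched, which is why the lookup avoids scanning whole documents.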

次に、本実施例の効果について説明する。本実施例では、単語単位の認識結果を単語単位で検索するために、音素列の認識結果を音素単位で検索するのに比べて検索を行う空間が小さくなる。 Next, the effect of the present embodiment will be described. In this embodiment, since the recognition result in units of words is searched for in units of words, the space for performing the search is smaller than in the case of searching for the recognition results of phoneme strings in units of phonemes.

また、予め検索文字列を誤認識の可能性を考慮した単語または単語列に展開してから検索するために、検索時には検索文字列と認識結果との距離をＤＰマッチングによって計算する必要がない。本実施例では、検索文字列を単語列に展開するときに検索文字列と展開する単語候補との距離計算を行う必要があるが、これは検索対象全体と検索文字列とをＤＰマッチングする従来の方式に比べれば大した計算量ではない。また、展開された単語または単語列を認識結果から検索するときには、インデックスを用いた検索手法を利用できる。 In addition, since the search character string is expanded into words or word strings in consideration of the possibility of erroneous recognition in advance, it is not necessary to calculate the distance between the search character string and the recognition result by DP matching during the search. In this embodiment, when the search character string is expanded into a word string, it is necessary to calculate the distance between the search character string and the word candidate to be expanded. This is a conventional technique for DP matching between the entire search target and the search character string. Compared to this method, it is not a large amount of calculation. In addition, when searching for a developed word or word string from the recognition result, a search method using an index can be used.

Furthermore, in this embodiment the search character string is expanded into words or word strings using the recognition vocabulary that was used to obtain the recognition result word strings, so no search is made for a word or word string that could never appear in the recognition result.

これらの結果、本実施例によって、音声データに対する検索速度が大幅に向上する。 As a result, according to the present embodiment, the search speed for the voice data is greatly improved.

次に、本発明の第２の実施例について図面を参照して詳細に説明する。 Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 9 is a block diagram showing the configuration of the voice search system according to the second embodiment of the present invention. The voice search system shown in FIG. 9 differs from the voice search system shown in FIG. 1 in that it further includes word extraction means 29, which extracts a list of the words appearing in the recognition results stored in the recognition result word string storage means 23, and in that the word candidate storage means 28 stores the list of words extracted by the word extraction means 29. In FIG. 9, the voice data storage means 21, continuous word recognition means 22, search character string input means 24, search character string expansion means 25, search means 26, and search result output means 27 correspond respectively to the voice data storage means 11, continuous word recognition means 12, search character string input means 14, search character string expansion means 15, search means 16, and search result output means 17 in FIG. 1, and their description is omitted unless otherwise noted.

第１の実施例では、単語候補記憶手段１８には検索文字列展開手段の展開する単語候補として、連続単語認識手段１２の認識語彙が記憶されていた。本実施例では、単語候補記憶手段２８は、単語抽出手段２９の抽出した単語のリストを記憶する。単語抽出手段２９は、認識結果単語列記憶手段２３に記憶されている認識結果を調べて、認識結果に出現する単語のリストを抽出する。このとき、必ずしも認識結果に現れる単語の全てを抽出しなくても良い。例えば、単語が認識結果に現れる頻度を調べて、頻度が極端に少ない単語を抽出しないようにしてもよい。また、長さが短い単語を抽出しないようにしてもよい。このようにして認識結果単語列から抽出された単語のリストが単語候補記憶手段２８に記憶され、検索文字列展開手段２５の展開する単語候補となる。 In the first embodiment, the word candidate storage means 18 stores the recognition vocabulary of the continuous word recognition means 12 as word candidates developed by the search character string expansion means. In this embodiment, the word candidate storage unit 28 stores a list of words extracted by the word extraction unit 29. The word extraction unit 29 examines the recognition result stored in the recognition result word string storage unit 23 and extracts a list of words that appear in the recognition result. At this time, it is not always necessary to extract all the words appearing in the recognition result. For example, the frequency with which words appear in the recognition result may be checked so that words with extremely low frequency are not extracted. Moreover, you may make it not extract the word with short length. In this way, a list of words extracted from the recognition result word string is stored in the word candidate storage means 28 and becomes a word candidate developed by the search character string expanding means 25.
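The word extraction with its optional filters might be sketched as below. The function name `extract_words`, the parameters `min_count` and `min_length`, measuring word length in characters, and the toy recognition results are all assumptions; the text does not specify concrete cut-off values.

```python
from collections import Counter


def extract_words(recognized_docs, min_count=1, min_length=1):
    """List the words appearing in the recognition results, dropping rare or short ones.

    `recognized_docs` is a list of documents, each a list of recognized words.
    """
    counts = Counter(word for doc in recognized_docs for word in doc)
    return sorted(word for word, count in counts.items()
                  if count >= min_count and len(word) >= min_length)


docs = [
    ["harii", "posutaa", "wa"],
    ["harii", "wa", "wa"],
]
print(extract_words(docs))                             # every word that appears
print(extract_words(docs, min_count=2, min_length=3))  # drops rare and short words
```

Tightening either filter shrinks the candidate list and therefore the expansion work, at the cost of never matching the words that were dropped.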

また、単語抽出手段２９が認識結果に出現する単語を調べるときに、各単語の前後に現れる単語についても調べ、各単語の前後に現れる単語にのみ接続を許した単語間の接続テーブルを作成しても良い。この場合、単語候補記憶手段２８は、単語抽出手段２９が作成する単語のリストと接続テーブルの両方を記憶する。検索文字列展開手段２５は、第１の実施例の検索文字列展開手段１５とほぼ同様の動作を行うが、単語候補記憶手段２８が記憶している接続テーブルも参照し、接続不可能となっている単語列には展開しない点が異なる。 Further, when the word extracting means 29 examines words appearing in the recognition result, it also examines words appearing before and after each word, and creates a connection table between words that allows connection only to words appearing before and after each word. May be. In this case, the word candidate storage unit 28 stores both the list of words and the connection table created by the word extraction unit 29. The search character string expansion means 25 performs substantially the same operation as the search character string expansion means 15 of the first embodiment, but also refers to the connection table stored in the word candidate storage means 28 and connection is impossible. The difference is that it does not expand to the word string.

次に、接続テーブルについて説明する。図１０は、認識結果から作成した接続テーブルの例を示す図であって、先行単語に対し後続単語が接続可能（「○」で表わす）か接続不可能（「×」で表わす）かを表している。図１０の接続テーブルを参照すると、認識結果中に「ハリー」＋「ポスター」の並びは存在するが、「ハリー」＋「ボクサー」の並びは存在しないことが分かるため、検索文字列展開手段は「ハリー」＋「ボクサー」への展開を行わない。この結果、展開速度が向上し、また、無駄な検索を行わなくなるため検索速度も向上する。 Next, the connection table will be described. FIG. 10 is a diagram showing an example of a connection table created from the recognition result, and shows whether the subsequent word can be connected to the preceding word (represented by “◯”) or not connectable (represented by “x”). ing. Referring to the connection table of FIG. 10, it can be seen that the sequence of “Harry” + “Poster” exists in the recognition result, but the sequence of “Harry” + “Boxer” does not exist. No expansion to "Harry" + "Boxer". As a result, the development speed is improved and the search speed is also improved because unnecessary search is not performed.
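A connection table of this kind can be sketched as a mapping from each preceding word to the set of words observed directly after it. The function names and the toy recognition results are assumptions for illustration; FIG. 10 itself is not reproduced here.

```python
def build_connection_table(recognized_docs):
    """Map each preceding word to the set of words that directly follow it."""
    table = {}
    for doc in recognized_docs:
        for preceding, following in zip(doc, doc[1:]):
            table.setdefault(preceding, set()).add(following)
    return table


def can_connect(table, preceding, following):
    """True if `following` was observed directly after `preceding`."""
    return following in table.get(preceding, ())


docs = [
    ["Harry", "poster", "wa"],
    ["wa", "reporter"],
]
table = build_connection_table(docs)
print(can_connect(table, "Harry", "poster"))   # observed pair
print(can_connect(table, "Harry", "boxer"))    # never observed, so not expanded
```

During expansion, a hypothesis would be extended by a word only when `can_connect` holds for the last word of the hypothesis and the new word.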

次に、本実施例の効果について説明する。本実施例では、単語抽出手段２９によって認識結果に現れる単語のリストを抽出して単語候補記憶手段２８に記憶するため、検索文字列展開手段２５は、認識結果に必ず現れる単語のみを使用して検索文字列を単語または単語列に展開できる。このため、単語の展開および検索の両方の効率が良くなり、検索速度がより向上する。 Next, the effect of the present embodiment will be described. In this embodiment, the word extraction unit 29 extracts a list of words appearing in the recognition result and stores it in the word candidate storage unit 28. Therefore, the search character string expansion unit 25 uses only words that always appear in the recognition result. Search strings can be expanded into words or word strings. For this reason, the efficiency of both word expansion and search is improved, and the search speed is further improved.

次に、以上で説明した第１および第２の実施例に係る音声検索システムおよび音声検索用プログラムの実装について図面を参照して説明する。 Next, implementation of the voice search system and the voice search program according to the first and second embodiments described above will be described with reference to the drawings.

図１１は、本発明の実施例に係る音声検索システムの構成を示すブロック図である。図１１において音声検索システムは、入出力部５１、データ処理部５２、記憶部５３を備える。記憶部５３には、プログラム記憶部５４、単語候補記憶部５５、音声データ記憶部５６、認識結果単語列記憶部５７が備えられる。 FIG. 11 is a block diagram showing the configuration of the voice search system according to the embodiment of the present invention. In FIG. 11, the voice search system includes an input / output unit 51, a data processing unit 52, and a storage unit 53. The storage unit 53 includes a program storage unit 54, a word candidate storage unit 55, a voice data storage unit 56, and a recognition result word string storage unit 57.

The input/output unit 51 comprises a keyboard, a voice input device, a display device and the like, and handles the input and output of various data in the voice search system. It corresponds to the search character string input means 14 of FIG. 1 or the search character string input means 24 of FIG. 9. The input/output unit 51 also corresponds to the search result output means 17 of FIG. 1 or the search result output means 27 of FIG. 9.

記憶部５３は、音声検索用プログラムをプログラム記憶部５４に記憶しておく。また、図１または図９にそれぞれ示した音声データ記憶手段１１または２１、認識結果単語列記憶手段１３または２３、単語候補記憶手段１８または２８は、それぞれ記憶部５３内の音声データ記憶部５６、認識結果単語列記憶部５７、単語候補記憶部５５に相当し、データ処理部５２によって読み書きされる。 The storage unit 53 stores the voice search program in the program storage unit 54. Also, the speech data storage means 11 or 21, the recognition result word string storage means 13 or 23, and the word candidate storage means 18 or 28 shown in FIG. 1 or FIG. 9, respectively, are the speech data storage section 56 in the storage section 53, It corresponds to the recognition result word string storage unit 57 and the word candidate storage unit 55, and is read and written by the data processing unit 52.

Under the control of the voice search program, the data processing unit 52 executes the processing of the continuous word recognition means 12, search character string input means 14, search character string expansion means 15, search means 16, and search result output means 17 shown in FIG. 1. Alternatively, under the control of the voice search program, the data processing unit 52 executes the processing of the continuous word recognition means 22, search character string input means 24, search character string expansion means 25, search means 26, search result output means 27, and word extraction means 29 shown in FIG. 9. The voice search program operates so as to search the voice data to be searched for the input search character string by referring to the voice data storage unit 56, the recognition result word string storage unit 57, and the word candidate storage unit 55.

本発明は、放送音声や講演音声などの音声データベースから所望のコンテンツを検索する用途に適用できる。 The present invention can be applied to a purpose of retrieving desired content from an audio database such as broadcast audio and lecture audio.

FIG. 1 is a block diagram showing the configuration of the voice search system according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the voice search system according to the first embodiment of the present invention.
FIG. 3 is an explanatory diagram of computing the distance between the search character string and one expanded word string.
FIG. 4 is an explanatory diagram of computing the distance between the search character string and another expanded word string.
FIG. 5 is a diagram showing an example of a confusion matrix between phonemes.
FIG. 6 is a diagram showing an example of penalties between words.
FIG. 7 is an explanatory diagram of computing the distance between the search character string and yet another expanded word string.
FIG. 8 is a diagram showing an example of an index for the recognition results of the voice data to be searched.
FIG. 9 is a block diagram showing the configuration of the voice search system according to the second embodiment of the present invention.
FIG. 10 is a diagram showing an example of a connection table created from the recognition results.
FIG. 11 is a block diagram showing the configuration of the voice search system according to an embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of a conventional voice search system.

Explanation of symbols

11, 21 Voice data storage means
12, 22 Continuous word recognition means
13, 23 Recognition result word string storage means
14, 24 Search character string input means
15, 25 Search character string expansion means
16, 26 Search means
17, 27 Search result output means
18, 28 Word candidate storage means
29 Word extraction means
51 Input/output unit
52 Data processing unit
53 Storage unit
54 Program storage unit
55 Word candidate storage unit
56 Voice data storage unit
57 Recognition result word string storage unit

Claims

A voice search system comprising:
recognition result word string storage means for storing, as a recognition result word string, a word-unit speech recognition result of the voice data to be searched;
word candidate storage means for storing, as candidate words, words that can appear in the speech recognition result;
search character string expansion means for converting an input search character string into a phoneme string, converting the candidate word stored in the word candidate storage means, or a candidate word string composed of a combination of the candidate words, into a phoneme string, and expanding the search character string into the candidate word or the candidate word string based on the degree of matching between the phonemes included in the respective phoneme strings; and
search means for searching the recognition result word string storage means for the candidate word or the candidate word string expanded by the search character string expansion means.
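As an illustration only, the flow recited in this claim can be sketched as follows. The lexicon, the phoneme symbols, the unit-cost edit distance used as the "degree of matching", the limit of two words per candidate word string, and the threshold are all assumptions made for the sketch and are not taken from the patent.

```python
from itertools import product

# Hypothetical pronunciation lexicon: candidate word -> phoneme list.
LEXICON = {
    "nago": ["n", "a", "g", "o"],
    "ya": ["y", "a"],
    "nagano": ["n", "a", "g", "a", "n", "o"],
    "ni": ["n", "i"],
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # match or substitution
        prev = cur
    return prev[-1]

def expand(query_phonemes, lexicon, max_words=2, threshold=1):
    """Return sorted (distance, word tuple) pairs close to the query phonemes."""
    hits = []
    for n in range(1, max_words + 1):
        for words in product(sorted(lexicon), repeat=n):
            phonemes = [p for w in words for p in lexicon[w]]
            d = edit_distance(query_phonemes, phonemes)
            if d <= threshold:
                hits.append((d, words))
    return sorted(hits)

def search(recognized, words):
    """Return start indices where the word tuple occurs in the recognition result."""
    n = len(words)
    return [i for i in range(len(recognized) - n + 1)
            if tuple(recognized[i:i + n]) == tuple(words)]

# Phonemes of a search string that is not itself in the lexicon (assumed input).
query = ["n", "a", "g", "o", "y", "a"]
# Hypothetical word-unit recognition result of the voice data.
recognized = ["kyou", "wa", "nago", "ya", "ni", "iku"]

expansions = expand(query, LEXICON)
matches = {words: search(recognized, words) for _, words in expansions}
```

Here the search string is expanded into the candidate word string ("nago", "ya"), which is then located in the recognition result by plain word matching. Enumerating every word combination is only practical for a toy lexicon; a real system would need a more efficient expansion.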

The voice search system according to claim 1, wherein the word candidate storage means stores the recognition vocabulary used when obtaining the recognition result word string from the voice data.

The voice search system according to claim 1, further comprising word extraction means for extracting a list of the words appearing in the recognition result stored in the recognition result word string storage means, wherein the word candidate storage means stores the list.

The voice search system according to claim 3, wherein the word extraction means, when extracting the list, examines the words appearing before and after each word in the recognition result and creates an inter-word connection table that permits each word to connect only to the words appearing before and after it; the word candidate storage means stores the connection table together with the list; and the search character string expansion means includes a function of referring to the list and the connection table stored in the word candidate storage means and expanding the search character string only into connectable candidate words or candidate word strings.
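One possible reading of the connection table is sketched here, under the assumption that it records only which word directly follows which in the recognition result. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

def build_connection_table(recognized):
    """Map each word to the set of words that directly follow it."""
    table = defaultdict(set)
    for left, right in zip(recognized, recognized[1:]):
        table[left].add(right)
    return table

def is_connectable(words, table):
    """True if every adjacent pair in the candidate word string was observed."""
    return all(right in table.get(left, set())
               for left, right in zip(words, words[1:]))

# Hypothetical word-unit recognition result.
recognized = ["kyou", "wa", "nago", "ya", "ni", "iku"]
word_list = sorted(set(recognized))          # the extracted word list
table = build_connection_table(recognized)

allowed = is_connectable(("nago", "ya"), table)    # pair observed in the result
rejected = is_connectable(("ya", "nago"), table)   # pair never observed
```

Restricting expansion to connectable strings prunes candidate word strings that cannot occur in the stored recognition result, so they never need to be scored or searched.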

The search character string expansion means includes a function of expanding a word by a continuous word speech recognition algorithm using a phoneme string of the search character string as an input feature vector series and the candidate word as a recognition vocabulary. The voice search system described.
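As a rough illustration, the idea of decoding the query phoneme string against the candidate vocabulary can be approximated with dynamic programming over segment boundaries. This is a simplified stand-in, not the continuous word speech recognition algorithm of the patent: it assumes unit-cost edit distance as the local score, and the lexicon is hypothetical.

```python
# Hypothetical pronunciation lexicon: candidate word -> phoneme list.
LEXICON = {
    "nago": ["n", "a", "g", "o"],
    "ya": ["y", "a"],
    "nagano": ["n", "a", "g", "a", "n", "o"],
    "ni": ["n", "i"],
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (pa != pb)))
        prev = cur
    return prev[-1]

def decode(query, lexicon):
    """Segment the query phonemes into the word string with the lowest total cost.

    best[i] holds (cost, words) for the best word string covering query[:i].
    """
    best = [(0, [])]
    for i in range(1, len(query) + 1):
        options = []
        for j in range(i):
            cost_j, words_j = best[j]
            for word in sorted(lexicon):
                d = edit_distance(query[j:i], lexicon[word])
                options.append((cost_j + d, words_j + [word]))
        best.append(min(options))
    return best[-1]

cost, words = decode(["n", "a", "g", "o", "y", "a"], LEXICON)
```

Unlike exhaustive enumeration of word combinations, this decoding does not fix the number of words in advance; the segmentation is chosen by the minimization itself.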

The voice search system according to claim 1, wherein the search character string expansion means includes a function of, when expanding the search character string using the candidate words stored in the word candidate storage means, obtaining the distance between the search character string and the candidate word or the candidate word string based on the degree of matching of the phonemes, and expanding the search character string so that the distance is within a predetermined threshold.

The voice search system according to claim 6, further comprising confusion matrix storage means for storing a confusion matrix representing the recognition error tendency of the phonemes, wherein the search character string expansion means includes a function of obtaining the degree of matching between the phonemes based on the confusion matrix.
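A minimal sketch of how a confusion matrix might feed the phoneme distance is given here. The counts, the cost formula (one minus the row-normalized confusion rate), and the fixed insertion and deletion cost are assumptions; the claim does not specify them.

```python
# Hypothetical confusion counts: CONFUSION[spoken][recognized] = occurrences.
CONFUSION = {
    "b": {"b": 80, "p": 15, "d": 5},
    "p": {"p": 90, "b": 10},
}

def substitution_cost(spoken, recognized, confusion):
    """Cost is 0 for identical phonemes, lower for frequently confused pairs."""
    if spoken == recognized:
        return 0.0
    row = confusion.get(spoken, {})
    total = sum(row.values())
    if total == 0:
        return 1.0  # no statistics: treat as an ordinary substitution
    return 1.0 - row.get(recognized, 0) / total

def weighted_distance(query, candidate, confusion, indel=1.0):
    """Edit distance whose substitution cost comes from the confusion matrix."""
    prev = [j * indel for j in range(len(candidate) + 1)]
    for i, q in enumerate(query, start=1):
        cur = [i * indel]
        for j, c in enumerate(candidate, start=1):
            cur.append(min(prev[j] + indel,
                           cur[j - 1] + indel,
                           prev[j - 1] + substitution_cost(q, c, confusion)))
        prev = cur
    return prev[-1]

# "b" is often misrecognized as "p", so "pan" is closer to "ban" than "tan" is.
d_pan = weighted_distance(["b", "a", "n"], ["p", "a", "n"], CONFUSION)
d_tan = weighted_distance(["b", "a", "n"], ["t", "a", "n"], CONFUSION)
```

With such a cost, candidate words that differ from the search string only by phonemes the recognizer tends to confuse fall within the threshold more easily than candidates that differ by rarely confused phonemes.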

The voice search system according to claim 6, wherein the search character string expansion means includes a function of obtaining the degree of matching of the phonemes based on the inter-model distance in the acoustic model used when obtaining the recognition result word string from the voice data.

The voice search system according to claim 6, wherein the search character string expansion means further includes a distance addition function that refers to the language model used when obtaining the recognition result word string from the voice data and adds a further distance to the distance for words or word strings that are linguistically unlikely to appear.
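The distance addition can be pictured as a language-model penalty added to the phoneme distance. The sketch below assumes a bigram model, a negative log-probability penalty, and a weight and floor value chosen arbitrarily; none of these details come from the claim.

```python
import math

# Hypothetical bigram probabilities P(next word | previous word).
BIGRAM_PROB = {
    ("nago", "ya"): 0.2,
    ("nago", "ni"): 0.001,
}

def lm_penalty(words, bigram_prob, floor=1e-6, weight=0.1):
    """Extra distance that grows as the word string becomes less likely."""
    return weight * sum(-math.log(bigram_prob.get(pair, floor))
                        for pair in zip(words, words[1:]))

def total_distance(phoneme_distance, words, bigram_prob):
    """Phoneme distance plus the language-model penalty."""
    return phoneme_distance + lm_penalty(words, bigram_prob)

likely = total_distance(0.0, ("nago", "ya"), BIGRAM_PROB)
unlikely = total_distance(0.0, ("nago", "ni"), BIGRAM_PROB)
```

Because the penalty is added before the threshold test, a linguistically implausible word string needs a closer phoneme match than a plausible one to be accepted as an expansion.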

A voice search method by which a voice search system comprising input means, expansion means, search means, and storage means that stores, as candidate words, words that can appear in a word-unit speech recognition result of the voice data to be searched searches for a word string, the method comprising:
a step in which the input means inputs a search character string;
a step in which the expansion means converts the candidate word in the storage means, or a candidate word string composed of a combination of the candidate words, into a phoneme string, converts the search character string into a phoneme string, and expands the search character string into the candidate word or the candidate word string based on the degree of matching between the phonemes included in the respective phoneme strings; and
a step in which the search means, taking the speech recognition result as a recognition result word string, searches the recognition result word string for the expanded candidate word or candidate word string.

The voice search method, wherein the voice search system further includes voice recognition means, and
the method includes, prior to the step of inputting the search character string, a step in which the voice recognition means performs speech recognition on the voice data and obtains the candidate words as a recognition result.

The voice search method according to claim 10, wherein the voice search system further includes voice recognition means,
the method includes, prior to the step of inputting the search character string, a step in which the voice recognition means performs speech recognition on the voice data and creates a list of the words extracted from the recognition result, and the step of expanding the search character string performs the expansion with reference to the created list.

A voice search method by which a voice search system comprising expansion means, search means, voice recognition means, recognition result word string storage means, and word candidate storage means searches for a word string, the method comprising:
a step in which the voice recognition means stores, in the recognition result word string storage means, a word-unit speech recognition result of the voice data to be searched as a recognition result word string;
a step in which, with words that can appear in the recognition result stored in advance in the word candidate storage means as candidate words, the expansion means converts an input search character string into a phoneme string, converts the candidate word stored in the word candidate storage means, or a candidate word string composed of a combination of the candidate words, into a phoneme string, and expands the search character string into the candidate word or the candidate word string based on the degree of matching between the phonemes included in the respective phoneme strings; and
a step in which the search means searches the recognition result word string storage means for the expanded candidate word or candidate word string.

The voice search method according to claim 13, wherein the word candidate storage means stores the recognition vocabulary used when obtaining the recognition result word string from the voice data.

The voice search method according to claim 13, wherein the voice search system further includes word extraction means,
the method further comprises a step in which the word extraction means extracts a list of the words appearing in the recognition result stored in the recognition result word string storage means, and a step in which the word extraction means stores the extracted word list in the word candidate storage means, and the step of expanding into the candidate word string performs the expansion with reference to the stored list.

A program that causes a computer constituting a voice search system comprising recognition result word string storage means and word candidate storage means to execute:
a process of storing, in the recognition result word string storage means, a word-unit speech recognition result of the voice data to be searched as a recognition result word string;
a process of storing, in the word candidate storage means, words that can appear in the recognition result as candidate words;
a search character string expansion process of converting an input search character string into a phoneme string, converting the candidate word stored in the word candidate storage means, or a candidate word string composed of a combination of the candidate words, into a phoneme string, and expanding the search character string into the candidate word or the candidate word string based on the degree of matching between the phonemes included in the respective phoneme strings; and
a search process of searching the recognition result word string storage means for the candidate word or the candidate word string expanded by the search character string expansion process.

The program according to claim 16, further causing the computer to execute a process of storing, in the word candidate storage means, the recognition vocabulary used when obtaining the recognition result word string from the voice data.

The program according to claim 16, further causing the computer to execute a word extraction process of extracting a list of the words appearing in the recognition result stored in the recognition result word string storage means, and a process of storing the word list extracted by the word extraction process in the word candidate storage means, wherein the search character string expansion process performs the expansion with reference to the stored list.

A program that causes a computer constituting a voice search system comprising recognition result word string storage means and word candidate storage means to execute:
a search character string expansion process of converting into a phoneme string the candidate word, or a candidate word string composed of a combination of the candidate words, in the word candidate storage means, in which words that can appear in a word-unit speech recognition result of the voice data to be searched are stored in advance, converting an input search character string into a phoneme string, and expanding the search character string into the candidate word or the candidate word string based on the degree of matching between the phonemes included in the respective phoneme strings; and
a search process of searching the recognition result word string storage means, in which the speech recognition result is stored in advance as a recognition result word string, for the candidate word or the candidate word string expanded by the search character string expansion process.

The program according to claim 19, further causing the computer to execute a process of performing speech recognition on the voice data and obtaining the candidate words as a recognition result.