JP5146979B2

JP5146979B2 - Ambiguity resolution device and computer program in natural language

Info

Publication number: JP5146979B2
Application number: JP2006154497A
Authority: JP
Inventors: 英一郎隅田; 史昭菅谷
Original assignee: ATR Advanced Telecommunications Research Institute International; KDDI Corp
Current assignee: ATR Advanced Telecommunications Research Institute International; KDDI Corp
Priority date: 2006-06-02
Filing date: 2006-06-02
Publication date: 2013-02-20
Anticipated expiration: 2026-06-02
Also published as: JP2007323475A

Description

The present invention relates to natural language processing, and more particularly to natural language processing for resolving the ambiguity found in word readings (kana notation in Japanese), the full spellings of acronyms, the correspondence of translated words between two languages, and the like.

Natural language is always accompanied by ambiguity. One example is the problem of heteronyms, that is, words that are written the same way but have more than one reading. For example, the English word “bow” has two readings: “bow” (a ribbon tied in a butterfly shape) and “bow” (the front of a ship). Japanese has many such examples. For instance, the word “大平” can be read as “Ohira”, “Taihei”, or “Odaira”.

Such ambiguity also exists in acronyms. For example, the acronym “ACL” can be interpreted as “The Association for Computational Linguistics”, “Anterior Cruciate Ligament”, or “Access Control List”. Similar ambiguity is found in other tasks as well, such as choosing a target word during translation.

Humans resolve such ambiguity by judging appropriately according to the situation in which the word occurs, or by looking up candidates by some means and selecting the one that seems most suitable for the situation. It is difficult, however, to realize such processing in natural language processing.

Such ambiguity can be a serious problem in natural language processing. For example, when a heteronym is encountered while reading a Japanese text aloud, its pronunciation (kana notation) must be determined in order to read it with the proper pronunciation. Otherwise the text will be read aloud incorrectly.

A proposal for solving this problem is made in Non-Patent Document 1. In Non-Patent Document 1, learning data describing a word W and its corresponding meanings Si are prepared manually in advance, and the learning data are used to create a classifier that, given the word W, selects the appropriate one among the meanings Si.
Yoshiyuki Umemura and Tsukasa Shimizu, 「音声合成システムのための同形異音語の読み分け」 (“Distinguishing the readings of heteronyms for a speech synthesis system”), Toyota Central R&D Labs. R&D Review, 2000, Vol. 35, No. 1, pp. 67-74

However, the method disclosed in Non-Patent Document 1 requires the learning data to be prepared manually, which is costly in both time and money. In addition, because the learning data are created by a limited number of people, they may be biased, which lowers reliability.

It is therefore an object of the present invention to provide an ambiguity resolution device for natural language that can resolve the ambiguity inherent in natural language easily and reliably.

According to a first aspect of the present invention, an ambiguity resolution device for natural language is a device that, given a certain word in an input sentence consisting of a natural language sentence, the context in which the word is placed in the input sentence, and a set of meaning candidates including a plurality of meaning candidates that may represent the meaning of the word, selects from the set the meaning candidate most appropriate as the meaning of the word in that context. The device includes: document collection means for collecting, for each combination of the word and a meaning candidate in the meaning candidate set, a set of documents from a predetermined corpus in which the words constituting the combination co-occur; classifier creation means for using the sets of documents collected for the respective combinations by the document collection means as learning data to automatically create a classifier that, given the word and the context of the word in a document, selects from the meaning candidate set the meaning candidate best suited as the meaning of the word in that context; and classification execution means for using the classifier to select, from the set of meaning candidates, the one best suited as the meaning of the word on the basis of the context in which the word is placed in the input sentence.

Given a word in an input sentence, the context in which the word is placed, and a set of meaning candidates including a plurality of meaning candidates that may represent the meaning of the word, the document collection means collects, for each combination of the word and a meaning candidate, a set of documents from the predetermined corpus in which the words constituting the combination co-occur. The classifier creation means creates a classifier using the sets of documents collected for the respective combinations as learning data. Given a word and the context of the word in a document, this classifier selects from the meaning candidate set the meaning candidate best suited as the meaning of the word in that context. The classification execution means gives the word in the input sentence and the context in which it is placed to the classifier created in this way and, on the basis of the result, selects from the set of meaning candidates the one best suited as the meaning of the word in the input sentence.

That is, given a word, its context, and a plurality of meaning candidates that may correspond to the word, this device can automatically select the meaning candidate that appears appropriate from the context. No manual work is needed for this. An ambiguity resolution device can therefore be provided that easily selects an appropriate meaning candidate and resolves the ambiguity of an input word.

Preferably, the classifier creation means includes: meaning candidate selection means for selecting, according to a predetermined criterion, those of the document sets collected for the respective combinations by the document collection means that contain a large number of documents, and for keeping only the meaning candidates corresponding to those document sets as elements of the set of meaning candidates; and machine learning means for using the document sets selected by the meaning candidate selection means as learning data to automatically create, by machine learning, a classifier that, given the word and the context of the word in a document, selects from the meaning candidate set the meaning candidate best suited as the meaning of the word in that context.

Collected document sets that contain few documents are rejected by the meaning candidate selection means. A small document set means that the word and the meaning candidate corresponding to that set are less likely to co-occur than the others. Candidates that are inappropriate as the meaning of the word in a given context can therefore be excluded, which improves the reliability of classification.

More preferably, the machine learning means includes: feature vector calculation means for calculating, for each document included in the document sets selected by the document set selection means, a learning feature vector of a predetermined configuration that represents the features of the context of the word in the document, from the word string lying within a predetermined range before and after the position of the word in the document; and means for creating learning data by pairing the learning feature vector calculated by the feature vector calculation means for each document included in the selected document sets with the meaning candidate used when the document was retrieved, and for automatically creating, by machine learning using the learning data, a predetermined classifier that, given a classification feature vector having the same configuration as the learning feature vectors, selects from the meaning candidate set the meaning candidate best suited as the meaning of the word in the context corresponding to the classification feature vector.

The context of a word is represented by a learning feature vector created from the word string lying within a predetermined range before and after the word. A classifier can be created automatically by machine learning using such learning feature vectors. As a result, an ambiguity resolution device can be provided that automatically selects an appropriate meaning candidate without manual intervention and resolves the ambiguity of an input word.

More preferably, the meaning candidate selection means includes means for selecting, from the document sets collected for the respective combinations by the document collection means, a predetermined number of sets in descending order of the number of documents they contain, and for keeping only the meaning candidates corresponding to those document sets as meaning candidates.

A large number of documents in the set collected for a combination of the input word and a meaning candidate means that the words constituting the combination are likely to co-occur. Such a meaning candidate is therefore likely to be an appropriate meaning candidate for the input word. In addition, because an upper limit on the number of meaning candidates is fixed at this point, the subsequent processing can be completed in a stable amount of time. As a result, an ambiguity resolution device can be provided that selects an appropriate meaning candidate automatically, reliably, and in a stable amount of time without manual intervention, and resolves the ambiguity of an input word.

The meaning candidate selection means may include means for selecting, from the document sets collected for the respective combinations by the document collection means, the sets whose number of documents is larger than a predetermined threshold, and for keeping only the meaning candidates corresponding to those document sets as meaning candidates.

If the number of documents in the set collected for a combination of the input word and a meaning candidate exceeds a certain threshold, the words constituting the combination are likely to co-occur. Such a meaning candidate is therefore likely to be an appropriate meaning candidate for the input word. As a result, an ambiguity resolution device can be provided that selects an appropriate meaning candidate automatically and reliably without manual intervention, and resolves the ambiguity of an input word.
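The two selection criteria described above (keeping a fixed number of the largest document sets, or keeping the sets larger than a threshold) might be sketched as follows. This is only an illustration: the function names, the dictionary layout, and the sample counts are assumptions made here, not part of the specification.

```python
from typing import Dict, List


def select_top_n(doc_sets: Dict[str, List[str]], n: int) -> Dict[str, List[str]]:
    """Keep the n meaning candidates whose collected document sets are largest."""
    ranked = sorted(doc_sets.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:n])


def select_above_threshold(doc_sets: Dict[str, List[str]], threshold: int) -> Dict[str, List[str]]:
    """Keep the meaning candidates whose document sets hold more than `threshold` documents."""
    return {cand: docs for cand, docs in doc_sets.items() if len(docs) > threshold}


# Hypothetical collection result: meaning candidate -> documents retrieved for (word, candidate).
doc_sets = {
    "Ohira": ["s1", "s2", "s3", "s4"],
    "Taihei": ["s5", "s6"],
    "Odaira": ["s7"],
}
print(list(select_top_n(doc_sets, 2)))            # ['Ohira', 'Taihei']
print(list(select_above_threshold(doc_sets, 1)))  # ['Ohira', 'Taihei']
```

With the top-n variant the number of surviving candidates is bounded in advance, which is what keeps the later processing time stable; the threshold variant gives no such bound.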

The document collection means includes search means for searching for and collecting, for each combination of the word and a meaning candidate in the meaning candidate set, a set of web pages in which the words constituting the combination co-occur, from a virtual corpus consisting of the web pages existing on the Internet.

Web pages on the Internet are created and maintained by a very large number of people. The word usages found there therefore cover a very large number of examples. Creating a classifier from such documents can thus eliminate bias in the classification results and improve reliability.

Preferably, the means for collecting includes means for searching, for each combination of the word and a meaning candidate in the meaning candidate set, a virtual corpus consisting of the web pages existing on the Internet for the set of web pages in which the words constituting the combination co-occur, and for collecting them as a set whose number of elements is capped at a predetermined constant.

An upper limit is placed on the number of web pages collected for any one set. Training the classifier is therefore unlikely to become an excessive load. As a result, an ambiguity resolution device can be provided that selects an appropriate meaning candidate automatically, reliably, and stably without manual intervention, and resolves the ambiguity of an input word.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to function as any of the ambiguity resolution devices for natural language described above.

Embodiments of the present invention are described below with reference to the drawings. There are three embodiments. The first embodiment relates to a device that, in speech synthesis for Japanese input sentences, determines the kana notation of a word that has a plurality of kana notations (readings). The second embodiment relates to a device that gives the English definition (full spelling) of an English acronym. The third embodiment relates to a device that, in translation from Japanese to English, selects one of a plurality of English target words that exist for a Japanese word. That is, in the present invention, the “meaning” of a word refers not only to a “meaning” of the kind listed in a Japanese-language dictionary, but also to any word, set of words, or character string that can be regarded as equivalent to the word by some criterion.

In the drawings used in the following description of the embodiments, identical components are given identical reference numerals. Their names and functions are also identical, so detailed descriptions of them are not repeated. As described later, each embodiment can be realized by computer hardware and a computer program executed on it. Accordingly, for some of the functional blocks in the block diagrams below, the function and configuration are shown in the form of a flowchart of the computer program that realizes them.

<First Embodiment>
[Configuration]
FIG. 1 is a block diagram of a speech synthesis system 30 according to the first embodiment of the present invention. Referring to FIG. 1, the speech synthesis system 30 includes: an input sentence storage unit 40 for storing Japanese input sentences to be synthesized into speech; an input sentence buffer 42 for sequentially taking out and storing portions of a predetermined length from the input sentence storage unit 40; a dictionary group 46 consisting of a plurality of dictionaries that store Japanese words in association with their kana notations; and a kana conversion unit 44 for performing morphological analysis on the sentence held in the input sentence buffer 42, looking up in the dictionary group 46 the kana notation of any word that contains kanji, and outputting a morpheme sequence annotated with information such as the kana notation.

As already mentioned, some words containing kanji can have a plurality of kana notations. For speech synthesis, the appropriate one must be selected from among them. To this end, the speech synthesis system 30 includes a heteronym disambiguation processing unit 50 connected to the kana conversion unit 44 and to the Internet 52. In response to the kana conversion unit 44 detecting that a plurality of kana notation candidates Rk (k = 1 to K, where K is the number of kana notation candidates) exist for a word W, the heteronym disambiguation processing unit 50 searches the Internet 52, for each combination of the word W and a kana notation candidate Rk, for web pages in which the word W and the candidate Rk co-occur, determines the kana notation appropriate for the word W by classification based on machine learning that uses the text of the retrieved web pages as learning data, and gives the result to the kana conversion unit 44. In other words, this system collects example documents by regarding the set of web pages on the Internet 52 as a single virtual corpus.
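The condition that triggers disambiguation, namely that more than one kana notation candidate is found for a word in the dictionary group, might be checked along the following lines. The text does not say how the dictionaries are consulted or merged, so the function names, the dictionary layout, and the toy entries are all assumptions made for illustration.

```python
from typing import Dict, List


def kana_candidates(word: str, dictionaries: List[Dict[str, List[str]]]) -> List[str]:
    """Merge the readings listed for `word` in every dictionary, keeping first-seen order."""
    merged: List[str] = []
    for dictionary in dictionaries:
        for reading in dictionary.get(word, []):
            if reading not in merged:
                merged.append(reading)
    return merged


def needs_disambiguation(word: str, dictionaries: List[Dict[str, List[str]]]) -> bool:
    """True when the dictionary group offers more than one kana notation for `word`."""
    return len(kana_candidates(word, dictionaries)) > 1


# Toy dictionary group (hypothetical entries).
dict_a = {"大平": ["オオヒラ", "タイヘイ"]}
dict_b = {"大平": ["オオヒラ", "オオダイラ"], "音声": ["オンセイ"]}
group = [dict_a, dict_b]

print(kana_candidates("大平", group))       # ['オオヒラ', 'タイヘイ', 'オオダイラ']
print(needs_disambiguation("音声", group))  # False
```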

The speech synthesis system 30 further includes: a kana-annotated input sentence storage unit 54 for storing the kana-annotated input sentences output by the kana conversion unit 44; a speech database 48 storing, for speech synthesis, the speech corresponding to kana notations; a speech synthesis unit 56 for reading a kana-annotated input sentence from the kana-annotated input sentence storage unit 54, synthesizing speech with reference to the speech database 48, and outputting an analog speech signal; and a speaker 58 for converting the analog speech signal output from the speech synthesis unit 56 into sound.

In this embodiment, of the text of the web pages that the heteronym disambiguation processing unit 50 retrieves from the Internet 52, the part called a “snippet” is used for machine learning. A “snippet” is the short passage that an Internet search engine shows in its results to describe the content of a retrieved web page. In most cases a snippet consists of the portion of the page text that contains the words used as search keywords. The heteronym disambiguation processing unit 50 could use its own search program to search web pages, but in this embodiment it uses an existing search service site: it issues to the site a query that performs an AND search on the word W and the kana notation candidate Rk, and obtains the results. To keep the processing time stable, this embodiment sets an upper limit of 1000 results per search.

The speech synthesis performed by the speech synthesis unit 56 is not directly related to the present invention, so its details are not described here.

FIG. 2 is a detailed block diagram of the heteronym disambiguation processing unit 50. Referring to FIG. 2, the heteronym disambiguation processing unit 50 includes: a decision tree 82 that can be trained, by learning based on the words contained in a window of predetermined length centered on a word W, so that when it is given a predetermined feature vector for the word W as it appears in the input sentence stored in the input sentence buffer 42, it outputs the appropriate kana notation for the word W; a decision tree creation unit 80 for receiving from the kana conversion unit 44 the combinations (W, Rk) of the word W and its kana notation candidates Rk, collecting from the Internet 52 the snippets of web pages in which they co-occur, and training the decision tree 82 with the results; a classification feature vector creation unit 84, connected to the kana conversion unit 44, for creating and outputting a classification feature vector suited to classification by the decision tree 82 when given, by the kana conversion unit 44, the word W of the combinations (W, Rk) and a word string 85 covering a predetermined range centered on the word W in the input sentence; and a classification execution unit 86 for giving the classification feature vector output from the classification feature vector creation unit 84 to the decision tree 82 and passing the kana notation obtained from the decision tree 82 as the classification result to the kana conversion unit 44.

The decision tree creation unit 80 includes: a search unit 100 for searching the Internet 52, when given a combination (W, Rk) of the word W and a kana notation candidate Rk, for web pages in which they co-occur; a search result storage unit 102 for storing, for each combination (W, Rk), the set of snippets of the web pages retrieved by the search unit 100 (hereinafter simply called a “set of web pages”); and a sort and selection unit 104 for creating the learning data for the decision tree 82 by sorting the sets of web pages for the combinations (W, Rk) in descending order of the number of web pages obtained and selecting only the top N sets (N is a natural number). In this embodiment, the N kana notation candidates Rk contained in the combinations (W, Rk) selected by the sort and selection unit 104 remain as kana notation candidates and are used in the subsequent training of the decision tree.

In this processing, the other document sets are rejected, and the kana notation candidates that were used to retrieve them are rejected as well. This is because a kana notation candidate that co-occurs with the word W only infrequently is generally considered unsuitable as a candidate. Depending on the application, however, it may be better not to reject even such low-frequency kana notation candidates.

The decision tree creation unit 80 further includes: a learning data storage unit 106 for storing the learning data created by the sort and selection unit 104; a learning feature vector creation unit 108 for creating a predetermined learning feature vector from each of the web pages stored in the learning data storage unit 106 that were retrieved for the kana notation candidates Rk of the word W being processed; a feature vector storage unit 110 for storing the feature vectors created by the learning feature vector creation unit 108; and a decision tree learning unit 112 for training the decision tree 82 using the feature vectors stored in the feature vector storage unit 110.
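The text above does not name a particular decision tree learning algorithm. As one possible illustration, the sketch below trains a tree on binary feature vectors with an ID3-style information-gain criterion and then classifies a new vector. ID3 is swapped in here as an assumption, and the function names and toy data are likewise hypothetical.

```python
import math
from collections import Counter
from typing import List, Union

Tree = Union[str, tuple]  # a leaf label, or (feature index, subtree for 0, subtree for 1)


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def learn_tree(vectors: List[List[int]], labels: List[str]) -> Tree:
    """Recursively split on the binary feature with the highest information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    best_gain, best_feature = 0.0, None
    for f in range(len(vectors[0])):
        split = {0: [], 1: []}
        for vec, lab in zip(vectors, labels):
            split[vec[f]].append(lab)
        if not split[0] or not split[1]:
            continue  # this feature does not separate the examples
        remainder = sum(len(s) / len(labels) * entropy(s) for s in split.values())
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best_gain, best_feature = gain, f
    if best_feature is None:
        return majority  # no useful split left: fall back to the majority label
    zero = [(v, l) for v, l in zip(vectors, labels) if v[best_feature] == 0]
    one = [(v, l) for v, l in zip(vectors, labels) if v[best_feature] == 1]
    return (best_feature,
            learn_tree([v for v, _ in zero], [l for _, l in zero]),
            learn_tree([v for v, _ in one], [l for _, l in one]))


def classify(tree: Tree, vector: List[int]) -> str:
    """Walk the tree from the root to a leaf and return the leaf's label."""
    while not isinstance(tree, str):
        feature, if_zero, if_one = tree
        tree = if_one if vector[feature] else if_zero
    return tree


# Toy learning data: each label is the kana candidate used to retrieve the snippet.
vectors = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = ["Ohira", "Ohira", "Taihei", "Taihei"]
tree = learn_tree(vectors, labels)
print(classify(tree, [1, 0]))  # Ohira
```

In this toy data the first feature alone separates the two readings, so the learned tree is a single split on feature 0.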

FIG. 3 is a flowchart of a computer program that realizes the search unit 100 shown in FIG. 2. Referring to FIG. 3, the program includes three steps, steps 130 to 134, that are repeated for each of the kana notation candidates Rk (k = 1 to K) for a word W.

In step 130, a request to search for web pages with the query “word W AND word Rk”, up to a maximum of MAX = 1000 results, is sent to a search engine on the Internet.

In step 132, the set of snippets {Sn(W, Rk)} (n = 1 to Lk, k = 1 to K) containing the word W and the kana notation candidate Rk (k = 1 to K) is obtained as the search result. Here Lk is the number of search results obtained for the combination of the word W and the kana notation candidate Rk.

In step 134, the kana notation candidate Rk is deleted from each snippet Sn, producing the set of retrieved snippets {(Tn(W), Rk) | n = 1 to Lk}.

The above three steps are repeated for every kana notation candidate Rk for the word W.

The function of the search unit 100 shown in FIG. 2 is realized by such a program.
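Steps 130 to 134 might be sketched as the loop below. The search engine is represented by a caller-supplied function `search(query, limit)`, which is a hypothetical stand-in and not the interface of any real search service. The query string format is also only illustrative.

```python
from typing import Callable, List, Tuple

MAX_RESULTS = 1000  # upper limit per query used in this embodiment


def collect_labeled_snippets(
    word: str,
    candidates: List[str],
    search: Callable[[str, int], List[str]],
) -> List[Tuple[str, str]]:
    """Return (snippet with the candidate removed, candidate) pairs for every candidate."""
    labeled: List[Tuple[str, str]] = []
    for candidate in candidates:                             # repeated for k = 1..K
        query = f"{word} AND {candidate}"                    # step 130: AND query
        snippets = search(query, MAX_RESULTS)[:MAX_RESULTS]  # step 132: S_n(W, R_k)
        for snippet in snippets:                             # step 134: delete R_k
            labeled.append((snippet.replace(candidate, ""), candidate))
    return labeled
```

Each returned pair corresponds to one element (Tn(W), Rk): the text is the snippet without the candidate, and the label is the candidate that was used to retrieve it, so no manual labeling is involved.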

The sort and selection unit 104 then extracts the N sets of snippets {(Tn(W), Rk) | n = 1 to Lk} with the largest numbers of search results, and these are stored in the learning data storage unit 106 as learning data.

FIG. 4 shows the configuration of the learning feature vector creation unit 108 shown in FIG. 2 in block diagram form. Referring to FIG. 4, the learning feature vector creation unit 108 includes: an extraction unit 150 for extracting, from each snippet in the set of learning data snippets {(Tn(W), Rk) | n = 1 to Lk} stored in the learning data storage unit 106, the M words before and the M words after the word W in that snippet (these 2M words in total are called the “window”); a vector configuration determination unit 152 for determining, from the vocabulary other than the word W that appears in the learning data stored in the learning data storage unit 106, the configuration of the feature vectors used to train the decision tree 82 (see FIG. 2); and an element calculation unit 154 for calculating, in accordance with the feature vector configuration determined by the vector configuration determination unit 152, the elements of the feature vector of each snippet from the words extracted for that snippet by the extraction unit 150, thereby creating the feature vector of each snippet and storing it in the feature vector storage unit 110.

FIG. 5 schematically shows the structure of the “window” centered on the word W. Referring to FIG. 5, the words in a learning snippet 170 that surround the word W can be written, together with W itself, as $W_{-m}, W_{-(m-1)}, W_{-(m-2)}, \ldots, W_{-2}, W_{-1}, W, W_{1}, W_{2}, \ldots, W_{m-2}, W_{m-1}, W_{m}$. A window 172 of window length 2m consists of a word string 174 made up of the m words preceding W and a word string 176 made up of the m words following W. In this embodiment, the window length is 2M.
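The window can be illustrated with a minimal Python sketch. It assumes the snippet has already been split into a list of words (the description does not say how), and the function name `extract_window` and the sample sentence are hypothetical. Near the start or end of a snippet the sketch simply returns fewer than 2M words, which is an assumption rather than something the description specifies.

```python
from typing import List


def extract_window(tokens: List[str], target: str, m: int) -> List[str]:
    """Return up to m words before and up to m words after the target word.

    The target word itself is excluded, so the result has at most 2*m words.
    Only the first occurrence of the target is used (an assumption).
    """
    i = tokens.index(target)  # raises ValueError if the target is absent
    before = tokens[max(0, i - m):i]
    after = tokens[i + 1:i + 1 + m]
    return before + after


# Hypothetical pre-tokenized snippet containing the ambiguous word.
tokens = ["千葉県", "の", "佐原", "は", "水郷", "の", "町"]
window = extract_window(tokens, "佐原", 2)
```

With M = 2 the sketch keeps the two words on each side of the target and drops the rest of the snippet.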

The vector configuration determination unit 152 determines the configuration of the feature vector as follows. First, it calculates the frequency of each word that appears in the learning data held in the learning data storage unit 106. It then selects only the H most frequent words. Finally, it sets the dimension of the feature vector to H and associates the 1st to H-th elements with the words ranked 1st to H-th in frequency. This fixes the configuration of the feature vector, which has H elements, each taking the value 0 or 1. An element is 1 if its corresponding word appears in the window of length 2M centered on the word W in the snippet, and 0 otherwise.
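A minimal sketch of this vocabulary selection, assuming tokenized snippets and using Python's standard `collections.Counter`. The function name `choose_vocabulary` and the sample data are hypothetical, and excluding the target word W follows the description of the vector configuration determination unit 152. How ties in frequency are broken is not stated in the description; here they fall back to first-seen order.

```python
from collections import Counter
from typing import Iterable, List


def choose_vocabulary(snippets: Iterable[List[str]], target: str, h: int) -> List[str]:
    """Return the h most frequent words in the learning data, excluding the target word."""
    counts = Counter(tok for snippet in snippets for tok in snippet if tok != target)
    # Position i of the returned list corresponds to element i of the feature vector.
    return [word for word, _ in counts.most_common(h)]


# Hypothetical tokenized learning snippets for the word 佐原.
snippets = [
    ["千葉県", "佐原", "成田"],
    ["千葉県", "佐原", "水郷"],
    ["千葉県", "成田", "佐原"],
]
vocabulary = choose_vocabulary(snippets, "佐原", 2)
```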

Accordingly, the element calculation unit 154 processes a given learning snippet Ti as follows. For each element of the H-dimensional feature vector corresponding to Ti, it checks whether the associated word appears in the window of length 2M centered on the word W in Ti. The element is set to 1 if the word appears and to 0 if it does not. Performing this for all H elements yields the feature vector Vi of the snippet Ti. The feature vector Vi is then associated with the kana notation candidate Rk of the combination (W, Rk) from which it was obtained (that is, Rk is treated as the correct answer for Vi) and used for learning the decision tree 82.
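The element calculation can be sketched in a few lines of Python. The function name `feature_vector`, the four-word vocabulary, and the sample window are hypothetical; a real vocabulary would hold the H most frequent words of the learning data.

```python
from typing import List, Sequence


def feature_vector(window: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Build the binary vector: element i is 1 if vocabulary[i] occurs in the window."""
    present = set(window)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary (H = 4) and window taken from one learning snippet.
vocabulary = ["千葉県", "神奈川県", "成田", "横須賀"]
window = ["千葉県", "の", "は", "水郷"]

# Pair the vector with the kana notation candidate used to retrieve the snippet.
training_example = (feature_vector(window, vocabulary), "さわら")
```

The pair at the end mirrors how each vector Vi is stored together with its candidate Rk as the correct answer.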

The classification feature vector creation unit 84 shown in FIG. 2 creates feature vectors for classification in basically the same way. It receives information on the word corresponding to each element of the feature vector from the vector configuration determination unit 152 (see FIG. 4) of the learning feature vector creation unit 108, and creates a classification feature vector for the word W being processed according to whether each of those words appears in the window of length 2M centered on W. This feature vector therefore has exactly the same configuration as the feature vectors created by the learning feature vector creation unit 108.

The decision tree learning unit 112 trains the decision tree 82 by machine learning. Because the learning method is common practice in the field of machine learning, it is not described in detail here.

FIG. 6 shows a decision tree 200 for the word “佐原”, as an example of a decision tree created by the element calculation unit 154 according to this embodiment. Referring to FIG. 6, the decision tree includes four intermediate nodes 210, 212, 214 and 216 and five terminal nodes 230, 232, 234, 236 and 238. Each of the nodes 210, 212, 214 and 216 asks whether the words in the window satisfy a particular condition.

Node 210 asks whether the keyword “千葉県” (Chiba Prefecture) appears in the window of length 2M centered on the word “佐原”. If it does, processing proceeds to node 230 and “さわら” (sawara) is selected as the kana notation for “佐原”. If not, processing proceeds to node 212. Although FIG. 6 shows the question as whether a specific word such as “千葉県” is in the window, in actual processing this test is made by checking whether the element (bit) corresponding to the word “千葉県” in the feature vector of the word “佐原” is 1 or 0.

Node 212 asks whether the keyword “神奈川県” (Kanagawa Prefecture) is present. If it is, processing proceeds to node 232 and “さはら” (sahara) is selected as the kana notation for “佐原”. If not, processing proceeds to node 214.

Node 214 asks whether the keyword “成田” (Narita) is present. If it is, processing proceeds to node 234 and “さわら” (sawara) is selected as the reading for “佐原”. If not, processing proceeds to node 216.

Node 216 asks whether the keyword “横須賀” (Yokosuka) is present. If it is, processing proceeds to node 236 and “さはら” (sahara) is selected as the kana notation for “佐原”. If not, processing proceeds to node 238 and “さわら” (sawara) is selected as the kana notation for “佐原”.

In this embodiment, a decision tree such as the decision tree 200 is basically created for each word. Given the feature vector for a word, the kana notation of that word is selected by traversing the decision tree for that word according to the values of the feature vector's elements.
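The traversal of the FIG. 6 tree can be sketched as follows. The nested-tuple encoding `(element index, subtree if 1, subtree if 0)`, the names `TREE_200`, `VOCABULARY` and `classify`, and the ordering of the four keywords within the feature vector are all assumptions made for illustration; the description does not prescribe a data structure.

```python
# Assumed order of the feature-vector elements for the word 佐原.
VOCABULARY = ["千葉県", "神奈川県", "成田", "横須賀"]

# Nodes 210, 212, 214 and 216 of FIG. 6; strings are the terminal nodes 230 to 238.
TREE_200 = (0, "さわら",            # node 210: 千葉県 present -> node 230
            (1, "さはら",           # node 212: 神奈川県 present -> node 232
             (2, "さわら",          # node 214: 成田 present -> node 234
              (3, "さはら",         # node 216: 横須賀 present -> node 236
               "さわら"))))         # otherwise -> node 238


def classify(tree, vector):
    """Follow the tree by testing one bit of the feature vector per intermediate node."""
    while not isinstance(tree, str):
        index, if_present, if_absent = tree
        tree = if_present if vector[index] == 1 else if_absent
    return tree


reading = classify(TREE_200, [0, 1, 0, 0])  # window contains 神奈川県 only
```

Each intermediate node tests a single bit, which matches the remark that the keyword questions in FIG. 6 are evaluated on the feature vector rather than on the words themselves.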

[Operation]
Referring to FIGS. 1 to 6, the speech synthesis system 30 described above operates as follows. Japanese sentences to be synthesized are stored in advance in the input sentence storage unit 40 shown in FIG. 1. A portion of predetermined length is read out and stored in the input sentence buffer 42.

The kana conversion unit 44 performs morphological analysis on the sentence stored in the input sentence buffer 42 with reference to the dictionary group 46. As a result, the part of speech, kana notation (for kanji), conjugation type, conjugation form and so on are determined for each word. If multiple kana notations are obtained for a single word (that is, if the word is a homograph), the kana conversion unit 44 passes each combination of that word (call it the word W) and a kana notation to the homograph resolution processing unit 50. The following description uses the notation introduced in the description of the configuration: the K kana notation candidates obtained for a word W are denoted $R_1$ to $R_K$.

Referring to FIG. 2, when given a combination (W, Rk) of the word W and a kana notation candidate Rk (k = 1 to K), the search unit 100 sends a search request with the query (word W AND word Rk) to a search engine on the Internet 52, with the number of results capped at 1000 (step 130 in FIG. 3). It then acquires the set of web page snippets {Sn(W, Rk)} (n = 1 to Lk) returned by the search engine in response (step 132 in FIG. 3). Here Lk is the number of search results (web pages) obtained for the query (word W AND word Rk). The word Rk is deleted from each snippet in this set, and the resulting set of snippets is stored in the search result storage unit 102 (step 134 in FIG. 3). One such set of snippets is obtained for each combination (word W, kana notation candidate Rk). As noted above, the upper limit MAX on the number of elements Lk in each set is 1000 in this embodiment.
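Steps 130 to 134 can be sketched in Python as below. No real search engine API is used: `search` is a hypothetical callable supplied by the caller that takes a query string and a result limit and returns a list of snippet strings. The `AND` query syntax and the function name `collect_snippets` are likewise assumptions.

```python
from typing import Callable, List

MAX_RESULTS = 1000  # upper limit on Lk stated for this embodiment


def collect_snippets(word: str, reading: str,
                     search: Callable[[str, int], List[str]],
                     max_results: int = MAX_RESULTS) -> List[str]:
    """Query for pages containing both W and Rk, then delete Rk from each snippet."""
    query = f"{word} AND {reading}"                    # step 130 (syntax assumed)
    snippets = search(query, max_results)[:max_results]  # step 132
    # Step 134: remove the reading so that it cannot leak into the features.
    return [snippet.replace(reading, "") for snippet in snippets]
```

Deleting Rk matters because the reading is the label to be predicted; leaving it in the snippet would let the learner key on the answer itself, whereas an input sentence at classification time contains no reading.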

The search unit 100 executes the above processing, that is, steps 130 to 134 in FIG. 3, for each combination of the word W and a kana notation candidate Rk. As a result, the search result storage unit 102 stores a set of search result snippets {Sn(W, Rk)} for each combination.

The sort and selection unit 104 sorts the snippet sets {Sn(W, Rk)} stored in the search result storage unit 102 in descending order, using the number of elements Lk as the key. It then selects the top N snippet sets {(Tn(W), Rk) | n = 1 to Lk} from the sorted result and stores them in the learning data storage unit 106 as learning data, each associated with the kana notation candidate Rk for which its snippets were obtained. In other words, the learning data storage unit 106 stores N of the snippet sets, in order starting from the one with the most search results.
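A minimal sketch of the sort and selection step follows. It reads the description as ranking whole snippet sets by their size Lk and keeping the N largest sets; that reading, the function name `select_training_sets`, and the dictionary layout (reading mapped to its list of snippets) are assumptions.

```python
from typing import Dict, List, Tuple


def select_training_sets(results: Dict[str, List[str]], n: int) -> List[Tuple[str, List[str]]]:
    """Sort snippet sets by number of snippets (Lk), descending, and keep the top n."""
    ranked = sorted(results.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:n]


# Hypothetical search results for three candidate readings of one word.
results = {
    "さはら": ["snippet a"],
    "さわら": ["snippet b", "snippet c", "snippet d"],
    "さばら": [],
}
learning_data = select_training_sets(results, 2)
```

Readings that return few or no hits drop out, which acts as a simple filter against candidates that are rare on the web.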

Referring to FIG. 4, once learning data has been stored in the learning data storage unit 106, the vector configuration determination unit 152 of the learning feature vector creation unit 108 calculates the frequency of each word appearing in the learning data and selects the H most frequent words. The configuration of the feature vector is determined by associating its 1st to H-th elements with the words ranked 1st to H-th in frequency. The vector configuration determination unit 152 supplies this configuration (that is, information on the word corresponding to each element of the feature vector) to the classification feature vector creation unit 84 shown in FIG. 2 and to the element calculation unit 154 shown in FIG. 4.

Meanwhile, for each snippet stored in the learning data storage unit 106, the extraction unit 150 extracts the window of length 2M centered on the word W and passes it to the element calculation unit 154.

The element calculation unit 154 calculates the value of each element of each snippet's feature vector from the words in the window supplied by the extraction unit 150, in accordance with the vector configuration supplied by the vector configuration determination unit 152. This yields a feature vector for each snippet. The element calculation unit 154 stores each snippet's feature vector in the feature vector storage unit 110 as learning data, associated with the kana notation candidate Rk that was used when the snippet was retrieved.

Referring to FIG. 2, the decision tree learning unit 112 trains the decision tree 82 by machine learning, using the feature vectors stored in the feature vector storage unit 110 and the kana notation candidates associated with them.
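The description leaves the learning algorithm open, so the following is only one possible sketch: a small ID3-style learner that picks, at each node, the bit whose split leaves the lowest weighted entropy. The function names, the nested-tuple tree encoding `(element index, subtree if 1, subtree if 0)`, and the toy data are all assumptions and are not taken from the patent.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def learn_tree(examples, features):
    """examples: list of (binary vector, label); features: element indices still unused."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a single kana notation

    def remaining_entropy(f):
        parts = [[label for vec, label in examples if vec[f] == bit] for bit in (0, 1)]
        return sum(len(p) / len(labels) * entropy(p) for p in parts if p)

    best = min(features, key=remaining_entropy)
    present = [(vec, label) for vec, label in examples if vec[best] == 1]
    absent = [(vec, label) for vec, label in examples if vec[best] == 0]
    if not present or not absent:
        return majority  # the best bit does not split the data
    rest = [f for f in features if f != best]
    return (best, learn_tree(present, rest), learn_tree(absent, rest))


def classify(tree, vector):
    """Walk the learned tree using the bits of a feature vector."""
    while not isinstance(tree, str):
        index, if_present, if_absent = tree
        tree = if_present if vector[index] == 1 else if_absent
    return tree


# Toy learning data: (feature vector, kana notation candidate used in the search).
examples = [([1, 0, 0], "さわら"), ([0, 0, 1], "さわら"),
            ([0, 1, 0], "さはら"), ([0, 1, 1], "さはら")]
tree = learn_tree(examples, [0, 1, 2])
```

In practice an off-the-shelf decision tree package could be substituted, in line with the remark that widely available programs can be used for this step.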

Through the above processing, the decision tree 82 comes to function such that, when given a feature vector representing the words in the window of length 2M centered on a word W (that is, the context of W), it outputs the kana notation of W best suited to that context.

Meanwhile, the kana conversion unit 44 gives the classification feature vector creation unit 84 the word W whose homograph ambiguity is to be resolved, together with the word string 85 contained in the window of length 2M centered on W in the input sentence. From this word string 85 and the vector configuration supplied by the vector configuration determination unit 152 shown in FIG. 4, the classification feature vector creation unit 84 creates a feature vector for the word W by the same processing as the element calculation unit 154, and passes it to the classification execution unit 86.

The classification execution unit 86 feeds this feature vector to the decision tree 82. The decision tree 82 has been trained so that, when given a feature vector created by the above method from the window of length 2M centered on the word W, it outputs an appropriate kana notation for W. The classification execution unit 86 obtains this kana notation from the decision tree 82 and passes it to the kana conversion unit 44.

The kana conversion unit 44 attaches the kana notation obtained in this way from the homograph resolution processing unit 50 to the word W in question, in the same manner as a result of morphological analysis. When morphological analysis is complete, it passes the morpheme string, annotated with information such as part of speech, kana notation (for kanji), conjugation type and conjugation form, to the speech synthesis unit 56. Because homographs have already been resolved by the homograph resolution processing unit 50, each word carries only one kana notation.

Based on the morpheme string it receives, the speech synthesis unit 56 extracts appropriate speech waveforms from the speech database 48 using the kana notation and other information attached to the morphemes, creates synthesized speech waveform data by waveform concatenation, converts this data to an analog signal, and supplies it to the speaker 58. The speaker 58 converts this signal into sound.

As described above, with the speech synthesis system 30, even if an input sentence stored in the input sentence storage unit 40 contains a homograph, the homograph resolution processing unit 50 resolves it so that only one kana notation is assigned to the word. Web pages on the Internet 52 serve as a kind of virtual corpus from which the decision tree for resolving the homograph is learned automatically. No learning data needs to be created by hand, so far less effort is required to resolve homographs than before. Moreover, because the web pages retrieved from the Internet 52 are written by a great many people, the learning data is less biased and covers a wider range than data prepared by a small number of people. Homographs can therefore be resolved more reliably than with conventional approaches.

[Implementation by computer]
As already mentioned, the speech synthesis system 30 according to the first embodiment is realized by computer hardware and computer software executed on that hardware. FIG. 7 shows the external appearance of a general computer system 250 for realizing the speech synthesis system 30, and FIG. 8 shows the internal configuration of the computer system 250 in block diagram form.

Referring to FIG. 7, the computer system 250 includes a computer 260 and, each connected to the computer 260, a monitor 262, a keyboard 266, a mouse 268, a microphone 290 and a pair of speakers 58. The computer 260 has a DVD drive 270 capable of playing and recording DVDs (Digital Versatile Discs) and a memory port 272 that accepts a semiconductor memory storage device conforming to a predetermined standard. The internal configuration of the computer 260 is described below with reference to FIG. 8.

Referring to FIG. 8, in addition to the DVD drive 270 and the memory port 272 shown in FIG. 7, the computer 260 includes a CPU (central processing unit) 276, a bus 286 connected to the CPU 276, and, each connected to the bus 286, a ROM (read-only memory) 278, a RAM (random access memory) 280, a hard disk 274, a network interface 296 and a sound board 288.

A DVD 282 is loaded into the DVD drive 270, and a semiconductor memory storage device 284 is inserted into the memory port 272. The CPU 276 can access the DVD 282 via the bus 286 and the DVD drive 270, and the semiconductor memory storage device 284 via the bus 286 and the memory port 272.

The keyboard 266, the mouse 268 and the monitor 262 are each connected to the bus 286 of the computer 260 via interfaces (not shown). The speakers 58 and the microphone 290 are connected to the sound board 288. In the computer system 250, the speech synthesis program executed by the CPU 276 ultimately generates speech waveform data in digital form. On receiving this data from the CPU 276, the sound board 288 converts it to an analog signal and outputs sound through the speakers 58.

The input sentence storage unit 40, dictionary group 46, kana-annotated input sentence storage unit 54, speech database 48, search result storage unit 102, learning data storage unit 106, feature vector storage unit 110 and so on in the above embodiment can each be realized by any of the RAM 280, the hard disk 274, the DVD 282 and the semiconductor memory storage device 284. In practice, the most efficient storage device for each storage unit is chosen according to factors such as the volume of data to be stored and the read and write speeds required.

The computer program for realizing the speech synthesis system 30 according to the first embodiment may be a single program or a combination of several programs. In particular, widely available existing programs can be used as they are for functions such as the morphological analysis performed by the kana conversion unit 44 shown in FIG. 1, the speech synthesis performed by the speech synthesis unit 56, the snippet search performed by the search unit 100 shown in FIG. 2, the sorting and selection performed by the sort and selection unit 104, and the learning of the decision tree 82 performed by the decision tree learning unit 112. Because these programs are written for general use, some adaptation is of course required, but such adaptation is well within what a person of ordinary skill in this field can readily carry out for the purpose at hand.

Likewise, a person of ordinary skill in the art can implement the processing in the learning feature vector creation unit 108 and the classification feature vector creation unit 84 to suit the specification, based on the description above.

These programs are stored on a storage medium such as the DVD 282 or distributed over a network such as the Internet 52, and are normally stored in a non-volatile external storage device such as the hard disk 274. At run time they are copied from the hard disk 274 into the RAM 280, and the CPU 276 executes the instruction read from the address indicated by a program counter (not shown) in the CPU 276, thereby realizing the functions described above. Because the operation of computer hardware itself is well known, it is not described further here.

<Second Embodiment>
FIG. 9 shows, in block diagram form, the configuration of an ambiguous acronym resolution system 330 according to a second embodiment of the present invention, which assigns an appropriate definition to an English acronym that has multiple definitions. The system exploits two observations: many documents give the definition of an acronym close to the acronym itself, and the words that appear near an acronym show tendencies that depend on the field of the document. On this basis it assigns an appropriate definition to an acronym by the same principle as the homograph resolution of the first embodiment.

Referring to FIG. 9, the ambiguous acronym resolution system 330 includes: an input sentence storage unit 340 for storing input sentences that may contain acronyms; an input sentence buffer 342 for reading in a predetermined portion of the input sentences stored in the input sentence storage unit 340; a dictionary group 346 consisting of data that lists acronyms and their definitions; and an acronym interpretation unit 344 for performing morphological analysis on the input sentence held in the input sentence buffer 342 and, on finding an acronym without a definition, determining its definition using the dictionary group 346, attaching that definition to the acronym in the input sentence, and outputting the input sentence.

The ambiguous acronym resolution system 330 further includes a definition-annotated input sentence storage unit 354 for storing the input sentences output by the acronym interpretation unit 344, in which definitions have been attached to acronyms, and a text understanding device 356 for understanding the meaning of the input sentences stored in the definition-annotated input sentence storage unit 354.

As already noted, some acronyms have more than one definition. In such cases the acronym interpretation unit 344 cannot output the acronym with multiple definitions attached, because doing so would hinder text understanding in the text understanding device 356. Therefore, when multiple definitions are found in the dictionary group 346 for an acronym that is not defined in the input sentence, some means is needed to select the appropriate one automatically.

To solve this problem, the ambiguous acronym resolution system 330 according to this embodiment includes an ambiguous acronym resolution processing unit 350 connected to the acronym interpretation unit 344 and the Internet 52. When the acronym interpretation unit 344 supplies an acronym, the multiple definition candidates obtained for it, and the word string found in a predetermined window around the acronym, the ambiguous acronym resolution processing unit 350 selects, through a learning process that uses the Internet 52 as a corpus, the candidate that appears most appropriate for the given word string, and returns it to the acronym interpretation unit 344.

The configuration of the ambiguous acronym resolution processing unit 350 is not described in detail here; its configuration and operation are the same as those of the homograph resolution processing unit 50 in the first embodiment. Specifically, the ambiguous acronym resolution processing unit 350 determines the appropriate definition for an acronym by the following procedure.

(1) Given an acronym A and definition candidates Dk (k = 1 to K, where K is the number of definition candidates), for each definition candidate Dk a search request for snippets of web pages in which the acronym A and the definition candidate Dk co-occur is sent to a search engine on the Internet 52.

(2) As the search result, a set of snippets {Sn(A, Dk)} (n = 1 to Lk) containing the acronym A and the definition candidate Dk is acquired, where Lk (k = 1 to K) is the number of snippets retrieved for the combination of the acronym A and the definition candidate Dk.

(3) The definition candidate Dk is deleted from each snippet in the set {Sn(A, Dk)}, producing the set of search result snippets {(Tn(A), Dk) | n = 1 to Lk}.

(4) The three steps above are repeated for every definition candidate Dk.

(5) The retrieved snippet sets are sorted in descending order of the number of web pages they contain (the number of search results), and only the top N are selected. The N snippet sets for learning, {(Tn(A), Dk) | n = 1 to Lk}, extracted in this way are stored as learning data.

(6) From this learning data, a plurality of learning feature vectors are created in exactly the same way as by the learning feature vector creation unit 108 shown in FIG. 4. The way each feature vector is created is also exactly the same as in the first embodiment. As in the first embodiment, the window length used when creating the feature vectors is denoted 2M.

(7) Each of these feature vectors is associated with the definition candidate that was used in the search that retrieved the snippet from which the feature vector was created, and the resulting pairs serve as the data for learning.

(8) The decision tree is trained using this data for learning. As a result of this training, when the decision tree is given a feature vector created from the words of the input sentence that fall within a window of length 2M centered on the acronym A to be disambiguated, it outputs the appropriate definition for that acronym.

(9) A feature vector for the decision tree is created from a window of length 2M centered on the acronym A to be disambiguated in the input sentence.

(10) Giving this feature vector to the decision tree yields an output that selects exactly one definition of acronym A. This output is passed from the ambiguous acronym resolution processing unit 350 to the acronym interpretation unit 344, which attaches that single definition to the acronym and outputs the result to the acronym-definition-annotated input sentence storage unit 354.
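Steps (2) to (5), together with the window used in steps (6) and (9), can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the search engine is replaced by an in-memory dictionary of snippets, tokens are split on whitespace, and the binary bag-of-words encoding as well as the names build_training_data, window_tokens and feature_vector are hypothetical. The window takes M words on each side of the target, which gives the window length 2M.

```python
def build_training_data(snippets_by_definition, top_n):
    """Steps (2)-(5): keep the top-N snippet sets and strip the definition."""
    # Rank definition candidates by the number of snippets the (mock) search
    # returned for them, largest first, and keep only the top N sets.
    ranked = sorted(snippets_by_definition.items(),
                    key=lambda item: len(item[1]), reverse=True)[:top_n]
    data = []
    for definition, snippets in ranked:
        for snippet in snippets:
            # Step (3): delete the definition so that only context remains.
            data.append((snippet.replace(definition, " "), definition))
    return data


def window_tokens(text, target, m):
    """Words within M positions on either side of the target word."""
    tokens = text.split()  # assumption: whitespace tokenisation
    if target not in tokens:
        return []
    i = tokens.index(target)
    return tokens[max(0, i - m):i] + tokens[i + 1:i + 1 + m]


def feature_vector(text, target, m, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary (assumed encoding)."""
    present = set(window_tokens(text, target, m))
    return [1 if word in present else 0 for word in vocabulary]


# Mock search results standing in for the Internet 52 (illustrative data only).
snippets = {
    "Access Control List": [
        "an ACL Access Control List restricts file access",
        "ACL Access Control List on the router",
    ],
    "Anterior Cruciate Ligament": [
        "torn ACL Anterior Cruciate Ligament in the knee",
    ],
    "Association for Computational Linguistics": [],
}
training = build_training_data(snippets, top_n=2)
vectors = [(feature_vector(text, "ACL", 2, ["torn", "router", "restricts"]), label)
           for text, label in training]
```

Each pair in vectors corresponds to the association made in step (7); any classifier, including the decision tree of step (8), could be trained on such pairs.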

<Third Embodiment>
FIG. 10 shows a block diagram of a Japanese-English automatic translation system 430 according to the third embodiment. Referring to FIG. 10, the automatic translation system 430 includes: a Japanese sentence storage unit 440 for storing Japanese input sentences; an input sentence buffer 442 for storing a predetermined amount of the Japanese sentences stored in the Japanese sentence storage unit 440; a dictionary group 446 consisting of one or more Japanese-to-English dictionaries; a translation determination unit 444 which, as preprocessing for automatic translation, performs morphological analysis on the Japanese sentences stored in the input sentence buffer 442, assigns an English translation to each word by referring to the dictionary group 446, and outputs the result; a translation-annotated Japanese sentence storage unit 454 for storing the Japanese sentences preprocessed in this way; and an automatic translation device 456 that translates the translation-annotated Japanese sentences stored in the storage unit 454 into English using those translations.

However, as already noted, a single input Japanese word may have a plurality of English translation candidates. If the translation determination unit 444 attached all of those candidates to the Japanese word as they are and output them, translation by the automatic translation device 456 would be impaired. Some means is therefore needed to select an appropriate one from among the translation candidates.

To that end, the automatic translation system 430 according to the present embodiment includes an ambiguous translation resolution processing unit 450 connected to the translation determination unit 444 and to the Internet 52. When the translation determination unit 444 gives it a Japanese word, the plurality of translation candidates obtained for that word, and the word string present in a predetermined window before and after the Japanese word in the input sentence, the unit 450 selects, through a learning process that uses the Internet 52 as a corpus, the candidate considered most appropriate for the given word string and returns it to the translation determination unit 444.

The configuration of the ambiguous translation resolution processing unit 450 is not described in detail here. Its processing is the same as that of the homograph resolution processing unit 50 of the first embodiment and of the ambiguous acronym resolution processing unit 350 of the second embodiment, and it will therefore be understood that its configuration is also the same as that of the homograph resolution processing unit 50.

The translation determination unit 444 reads a sentence from the input sentence buffer 442, performs morphological analysis, assigns an English translation to each word by referring to the dictionary group 446, and outputs the result to the translation-annotated Japanese sentence storage unit 454. When a plurality of translation candidates appear for one Japanese word, the translation determination unit 444 passes that Japanese word and its translation candidates to the ambiguous translation resolution processing unit 450 and requests that the ambiguity be resolved. The unit 450 creates a decision tree by exactly the same operation as the homograph resolution processing unit 50 of the first embodiment, creates a feature vector from the word string within the window before and after the given Japanese word in the input sentence, gives it to the decision tree to obtain the appropriate translation candidate, and returns that candidate to the translation determination unit 444. The translation determination unit 444 attaches the single translation received from the unit 450 to the Japanese word in question and outputs it to the translation-annotated Japanese sentence storage unit 454. The automatic translation processing in the automatic translation device 456 is therefore not impaired.
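The division of labour between the translation determination unit 444 and the ambiguous translation resolution processing unit 450 might be sketched as below. This is a hypothetical Python outline only: the input is assumed to be already segmented by morphological analysis, the dictionary group 446 is reduced to one in-memory dictionary, and assign_translations and toy_disambiguate are illustrative names. The toy disambiguator is a hand-written stand-in for the trained classifier, not the learning process of the embodiment.

```python
def assign_translations(words, dictionary, disambiguate, m=2):
    """Attach exactly one translation (or None) to every segmented word."""
    annotated = []
    for i, word in enumerate(words):
        candidates = dictionary.get(word, [])
        if len(candidates) > 1:
            # Several candidates: hand the word, its candidates and the words
            # in the surrounding window to the disambiguation component.
            context = words[max(0, i - m):i] + words[i + 1:i + 1 + m]
            choice = disambiguate(word, candidates, context)
        elif candidates:
            choice = candidates[0]  # unambiguous entry
        else:
            choice = None           # no dictionary entry for this word
        annotated.append((word, choice))
    return annotated


def toy_disambiguate(word, candidates, context):
    """Stand-in for unit 450: a fixed rule instead of a learned classifier."""
    if "渡る" in context and "bridge" in candidates:
        return "bridge"
    return candidates[0]


dictionary = {"はし": ["chopsticks", "bridge"], "渡る": ["cross"]}
result = assign_translations(["はし", "を", "渡る"], dictionary, toy_disambiguate)
```

Because the disambiguator is passed in as a callable, the same outline would serve for kana notations or acronym definitions by swapping the dictionary and the classifier.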

As is clear from the description of the first to third embodiments above, the resolution of polysemy or ambiguity according to the present invention can easily be applied across a wide range of natural language processing. Moreover, the mechanism of the part that resolves the ambiguity can be basically the same. There may of course be various design choices in the details of the resolution process, but a scheme that is effective in one field can be applied to other fields basically as it is.

For example, the device for resolving polysemy or ambiguity according to the present invention can be applied not only to translation between Japanese and English but to the translation of words between any pair of languages, regardless of the differences between them. The homograph resolution mechanism of the first embodiment can likewise be applied almost as it is regardless of language. Language-specific adjustments may of course be needed (for example, morphological analysis in Japanese), but provided that such processing is always performed beforehand as a prerequisite of natural language processing, the mechanism of the part that resolves polysemy or ambiguity can be the same regardless of language.

The present invention can therefore be applied across a wide area of natural language processing, and porting it from one area to another is very simple.

<Possible Modifications>
In the embodiments described above, a decision tree is used to determine the appropriate kana notation, acronym definition, or translation. However, the present invention is not limited to decision trees. Any classification method can be used provided that it can learn by machine learning, from learning data collected from the Internet, which of a plurality of targets to select according to the context (environment) in which the word or word string of interest is placed. For example, naive Bayes, decision lists, the k-nearest neighbor method, online algorithms, the maximum entropy method, support vector machines, and boosting can be used.
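As one illustration of swapping the classifier, the following is a minimal multinomial naive Bayes sketch in Python with add-one smoothing. It is an assumption-laden example, not a method described in the embodiments: the class name NaiveBayesDisambiguator, the representation of a context as a list of window words, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesDisambiguator:
    """Multinomial naive Bayes over context words, with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # training examples per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def fit(self, examples):
        # examples: iterable of (context_words, label) pairs
        for context_words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(context_words)
            self.vocabulary.update(context_words)
        return self

    def predict(self, context_words):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + vocab_size
            for word in context_words:
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy learning data: window words paired with the definition used in the search.
model = NaiveBayesDisambiguator().fit([
    (["knee", "injury", "surgery"], "Anterior Cruciate Ligament"),
    (["file", "permission", "router"], "Access Control List"),
    (["network", "file", "access"], "Access Control List"),
])
choice = model.predict(["knee", "surgery"])
```

The interface (fit on context and label pairs, predict from a context) mirrors the role the decision tree plays in the embodiments, so the surrounding collection and feature extraction steps would not need to change.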

In the embodiments described above, snippets of web pages are collected as learning data, but the present invention is of course not limited to such embodiments. For example, entire web pages may be processed. Also, the upper limit MAX on the number of web pages collected for a combination of one word W and a kana notation Rk is limited to 1000, for example, but this number can of course be changed freely. It is also possible to impose no such limit.

Further, in the embodiments described above, when a word in question and several candidates to be paired with that word are given, the Internet is accessed at that point and a decision tree is created. However, the present invention is not limited to such embodiments. For example, the processing described above may be applied in advance to some test sentences so that classifiers for resolving the ambiguity of certain ambiguous words contained in those test sentences are prepared beforehand. If such classifiers are prepared in advance, one for each of many words, there is no need to train a classifier after a word is given, and an appropriate answer can be returned immediately. For a word whose ambiguity cannot be resolved by the classifiers prepared in advance, a new classifier may be created at that point, as shown in the above embodiments, to obtain an appropriate answer.
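Preparing classifiers in advance and falling back to on-demand training amounts to a cache keyed by word. A minimal Python sketch follows, assuming a hypothetical train_classifier callable that performs the collection and learning steps and returns a function from context to answer; ClassifierStore and its method names are illustrative only.

```python
class ClassifierStore:
    """Keeps one classifier per word; trains a new one only when none exists."""

    def __init__(self, train_classifier):
        # train_classifier(word, candidates) -> callable(context) -> answer
        # (hypothetical interface standing in for the learning pipeline)
        self._train = train_classifier
        self._classifiers = {}
        self.trainings = 0  # how many times training was actually run

    def prepare(self, word, candidates):
        """Train ahead of time, e.g. for ambiguous words found in test sentences."""
        if word not in self._classifiers:
            self._classifiers[word] = self._train(word, candidates)
            self.trainings += 1

    def resolve(self, word, candidates, context):
        """Answer immediately if a classifier exists, otherwise train on demand."""
        self.prepare(word, candidates)
        return self._classifiers[word](context)


def toy_train(word, candidates):
    # Stand-in trainer: the returned classifier always picks the first candidate.
    return lambda context: candidates[0]


store = ClassifierStore(toy_train)
store.prepare("ACL", ["Access Control List", "Anterior Cruciate Ligament"])
answer = store.resolve("ACL", ["Access Control List", "Anterior Cruciate Ligament"],
                       ["router", "file"])
```

Since classifiers created on demand are also kept, the set of words that can be answered without training grows over time.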

In the first embodiment described above, the kana notation candidates selected by the sort and selection unit 104 are the top N candidates (N being plural) with the largest numbers of web pages hit by the search unit 100. The same applies to the second and third embodiments. However, the present invention is not limited to such embodiments. For example, only the single top candidate with the largest number of hits in the processing of the sort and selection unit 104 may be adopted as the kana notation of the word W. In this case, the decision tree functions as a 1:1 classifier. With this method, however, the context of the word W is not considered at all, so the reliability of the result is low and it cannot be called a resolution of ambiguity.

In the processing of the sort and selection unit 104 of the first embodiment, instead of the top N candidates with the largest numbers of hit web pages, all candidates for which at least a predetermined threshold number of web pages were hit may be selected as kana notation candidates. Alternatively, each candidate's share of the total number of hits may be accumulated from the top, and all candidates up to the point where the accumulated share exceeds a predetermined ratio may be adopted as kana notation candidates, regardless of how many there are.
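The three selection criteria (top N, hit-count threshold, cumulative share) can be compared side by side in a short Python sketch. The function names and the hit counts are hypothetical, and one detail is an assumption: in the cumulative-share variant the candidate that pushes the accumulated share over the ratio is itself included, which the text does not settle.

```python
def ranked_candidates(hits):
    """Candidates ordered by hit count, largest first."""
    return sorted(hits, key=hits.get, reverse=True)


def select_top_n(hits, n):
    """Keep the N candidates with the most hits."""
    return ranked_candidates(hits)[:n]


def select_by_threshold(hits, threshold):
    """Keep every candidate with at least `threshold` hits."""
    return [c for c in ranked_candidates(hits) if hits[c] >= threshold]


def select_by_cumulative_share(hits, ratio):
    """Accumulate shares from the top until the running share exceeds `ratio`."""
    total = sum(hits.values())
    if total == 0:
        return []
    selected, running = [], 0
    for candidate in ranked_candidates(hits):
        selected.append(candidate)  # assumption: the crossing candidate is kept
        running += hits[candidate]
        if running > ratio * total:
            break
    return selected


# Illustrative hit counts for the readings of one word (not measured data).
hits = {"オオヒラ": 600, "タイヘイ": 300, "オオダイラ": 100}
top2 = select_top_n(hits, 2)
frequent = select_by_threshold(hits, 300)
covering = select_by_cumulative_share(hits, 0.8)
```

With these counts all three criteria happen to keep the same two readings; they diverge when the hit distribution is flatter or more skewed.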

Further, in the above embodiments, semantic candidates are determined with a single word as the unit. However, the present invention is not limited to such embodiments. For example, by providing phrases consisting of a plurality of words as headwords of the dictionary used to create the set of semantic candidates, the appropriate meaning of such a phrase can also be selected from among its plurality of semantic candidates.

Accumulating the classifiers obtained in this way as they are created is preferable, because it increases the number of words whose ambiguity can be resolved immediately.

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.

FIG. 1 is a block diagram of the speech synthesis system 30 according to the first embodiment of the present invention.
FIG. 2 is a block diagram of the homograph resolution processing unit 50 shown in FIG. 1.
FIG. 3 is a flowchart showing the control structure of a computer program for implementing the search unit 100 of FIG. 2.
FIG. 4 is a block diagram of the learning feature vector creation unit 108 of FIG. 2.
FIG. 5 is a diagram for explaining the concept of a "window".
FIG. 6 is a diagram schematically showing an example of a decision tree.
FIG. 7 is a diagram showing the external appearance of the computer system 250 that implements the speech synthesis system 30 according to the first embodiment.
FIG. 8 is a block diagram showing the internal configuration of the computer system 250 shown in FIG. 7.
FIG. 9 is a block diagram of the ambiguous acronym resolution system 330 according to the second embodiment of the present invention.
FIG. 10 is a block diagram of the automatic translation system 430 according to the third embodiment of the present invention.

Explanation of symbols

30 speech synthesis system
40, 340 input sentence storage unit
42, 342, 442 input sentence buffer
44 kana conversion unit
46, 346, 446 dictionary group
48 speech database
50 homograph resolution processing unit
52 Internet
54 kana-notated input sentence storage unit
56 speech synthesis unit
58 speaker
80 decision tree creation unit
82 decision tree
84 classification feature vector creation unit
86 classification execution unit
100 search unit
102 search result storage unit
104 sort and selection unit
106 learning data storage unit
108 learning feature vector creation unit
110 feature vector storage unit
112 decision tree learning unit
344 acronym interpretation unit
350 ambiguous acronym resolution processing unit
354 acronym-definition-annotated input sentence storage unit
440 Japanese sentence storage unit
444 translation determination unit
450 ambiguous translation resolution processing unit
454 translation-annotated Japanese sentence storage unit
456 automatic translation device

Claims

An ambiguity resolution device in natural language which, given a certain word in an input sentence consisting of a natural language sentence, the context in which the certain word is placed in the input sentence, and a semantic candidate set including a plurality of semantic candidates that may represent the meaning of the certain word, selects from the semantic candidate set the most appropriate meaning of the certain word in the context, the device including:
document collection means for collecting from a predetermined corpus, for each combination of the certain word and a semantic candidate in the semantic candidate set, a set of documents in which the words constituting the combination co-occur;
classifier creating means for automatically creating, using the sets of documents collected for each of the combinations by the document collection means as learning data, a classifier which, given the certain word and the context of the certain word in a document, selects from the semantic candidate set the semantic candidate most suitable as the meaning of the certain word in the context; and
classification execution means for selecting, using the classifier, the most suitable meaning of the certain word from the semantic candidate set on the basis of the context in which the certain word is placed in the input sentence,
wherein the document collection means includes search means for searching, for each combination of the certain word and a semantic candidate in the semantic candidate set, a virtual corpus composed of documents existing on the Internet, using both the certain word and the semantic candidate as search keywords, and collecting a set of documents in which the words constituting the combination co-occur, and
the classifier creating means includes:
semantic candidate selection means for selecting, according to a predetermined criterion, those of the sets of documents collected for each of the combinations by the document collection means that contain a large number of documents, and selecting only the semantic candidates corresponding to the selected sets of documents as the candidates of the semantic candidate set; and
machine learning means for automatically creating, by machine learning using the sets of documents selected by the semantic candidate selection means as learning data, a classifier which, given the certain word and the context of the certain word in a document, selects from the semantic candidate set the semantic candidate most suitable as the meaning of the certain word in the context.

The ambiguity resolution device in natural language according to claim 1, wherein the machine learning means includes:
feature vector calculating means for calculating, for each document included in the document sets selected by the document set selection unit, a learning feature vector of a predetermined configuration that represents the features of the context of the certain word in the document, from the word string existing in a predetermined range before and after the position of the certain word in the document; and
means for creating learning data by pairing the learning feature vector calculated by the feature vector calculating means for each document included in the document sets selected by the document set selection unit with the semantic candidate used when the document was retrieved, and for automatically creating, by machine learning using the learning data, a predetermined classifier which, when given a classification feature vector having the same configuration as the learning feature vector, selects from the semantic candidate set the most suitable meaning of the certain word in the context corresponding to the classification feature vector.

The ambiguity resolution device in natural language according to claim 1 or 2, wherein the semantic candidate selection means includes means for selecting, from the sets of documents collected for each of the combinations by the document collection means, a predetermined number of sets containing the largest numbers of documents, and selecting only the semantic candidates corresponding to the selected sets as the semantic candidates.

The ambiguity resolution device in natural language according to claim 1 or 2, wherein the semantic candidate selection means includes means for selecting, from the sets of documents collected for each of the combinations by the document collection means, those sets in which the number of documents is larger than a predetermined threshold, and selecting only the semantic candidates corresponding to the selected sets of documents as the semantic candidates.

The ambiguity resolution device in natural language according to claim 1, wherein the means for collecting includes means for searching, for each combination of the certain word and a semantic candidate in the semantic candidate set, a virtual corpus consisting of web pages existing on the Internet for a set of web pages in which the words constituting the combination co-occur, and collecting them as a set whose number of elements has a predetermined constant as its upper limit.

A computer program that, when executed by a computer, causes the computer to function as the ambiguity resolution device in natural language according to any one of claims 1 to 5.