JP3952964B2 - Reading information determination method, apparatus and program

Info

Publication number: JP3952964B2
Application number: JP2003046042A
Authority: JP
Inventors: 久子浅野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2002-11-07
Filing date: 2003-02-24
Publication date: 2007-08-01
Anticipated expiration: 2023-02-24
Also published as: JP2004206659A

Description

【０００１】
【発明の属する技術分野】
本発明は、読み情報決定方法及び装置及びプログラムに係り、特に、日本語のテキスト音声合成を行う際に、日本語文章内に含まれる英数字列の読みクラスを判別することにより英数字列の読み精度を向上させるための読み情報決定方法及び装置及びプログラムに関する。
【０００２】
また、インターネット検索エンジンにおいて、日本語のページを検索対象とする際に、検索キーワードとして入力されたアルファベット列をカナに音訳して質問拡張する場合の拡張キーワードの精度向上のために利用される。
【０００３】
【従来の技術】
日本語テキスト音声合成は、日本語テキストに対して、読み、及び、アクセント、ポーズ等の韻律情報を設定し、これらを元に、音声波形を生成して合成音声を出力する。正しく自然な音声合成を出力するためには、この読みや韻律情報を正しく設定する必要がある。
【０００４】
読みとアクセントの付与は、単語に対する情報（単語情報）を用いることで、高精度に設定することができる。単語情報は、一般に日本語形態素解析を用いて得ることができる。日本語形態素解析は、成熟した技術であり、新聞記事などを対象にした場合、９９％以上の精度を実現しているものが数多く存在する。これらの形態素解析は、一般に単語情報を登録した単語辞書を用いて、解析を行う（例えば、非特許文献１参照）。
【０００５】
また、アルファベット列が未知語となった場合は、読みが付与されていないため、１文字ずつアルファベット読みをしたり（例えば、非特許文献２参照）、英単語と仮定して、英日音訳を行ったりしている（例えば、非特許文献３参照）。
また、入力されたテキストにおいて、アルファベット母音の出現頻度とアルファベット子音の出現頻度の割合により、そのテキストをローマ字読みするか英語読みするか判別する手法（例えば、特許文献１参照）がある。この方法は、アルファベット母音（ａ，ｉ，ｕ，ｅ，ｏ）及びアルファベット子音について、それぞれ毎に出現頻度を抽出して、アルファベット母音の出現頻度÷アルファベット子音の出現頻度の値が予め定められた値より大きいとき、テキスト中のアルファベット文字列をローマ字として、そうでないときには、英語として読み上げる技術である。
【０００６】
また、数字列に関しては、整数型、小数型など（以後、これを数字読みクラスと記す）に分類し、読み分ける方法が確立されている（例えば、非特許文献４参照）。
【０００７】
【特許文献１】
特開２０００−１０５７９号公報
【０００８】
【非特許文献１】
渕武志，他２名、「保守性を考慮した形態素解析システム」、情報処理学会研究報告：自然言語処理，1997年 1月20日、pp.59-66.
【０００９】
【非特許文献２】
宮崎正弘，他１名「日本文音声出力のための言語処理方式」、情報処理学会論文誌，1986年11月、第27巻、第11号、pp.1053-1061.
【００１０】
【非特許文献３】
高木伸一郎，他４名「電子メールを電話で確認できる通信秘書技術」，ＮＴＴ技術ジャーナル、日本電信電話株式会社、平成９年６月１日、第９巻、第６号、pp.63-68.
【００１１】
【非特許文献４】
宮崎正弘、「日本文音声変換のための数詞読み規則」、情報処理学会論文誌、１９８４年６月、第２５巻、第６号、pp.1035-1043.
【００１２】
【発明が解決しようとする課題】
しかしながら、日本語テキスト中に現れるアルファベット列（アルファベットとまとまって単語を構成しているアポストロフィーなどの記号類も含む）に対しては、辞書登録されている割合が低く、未知語となる割合が高い。また、数字列（数字とまとまってある情報を表している、小数点や市外局番前後のかっこなどの記号類も含む）は、前後の文脈により読み方が変わる場合があるが、これは、上記従来の形態素解析では対応できない。
【００１３】
また、アルファベット列が未知語となった場合に、１文字ずつアルファベット読みをしたり、英日音訳を行う場合、実際には、アルファベット読みや英語読みしない単語の場合には読み誤りとなる（以後、アルファベット読み、英単語読み、ローマ字読み、フランス語読み…などをアルファベット読みクラスと記す）。
また、アルファベット母音の出現頻度とアルファベット子音の出現頻度の割合によりローマ字読みまたは、英語読みするかを判断する方法は、英語とローマ字が混在する日本語テキストにたいしても、どちらか片方の読み方に固定され、読み誤りが生じる可能性がある。例えば、「ＹＯＫＯＨＡＭＡＴＥＡＨＯＵＳＥは、来月１日にオープンします。」という文では、アルファベット母音の割合が大きいため、ローマ字読みと決定され、「ＹＯＫＯＨＡＭＡＴＥＡＨＯＵＳＥ」は、「ヨコハマテアホウセ」という読みが付与されてしまう。
【００１４】
また、数字読みクラスに分類して読み分ける方法では、前後の文脈に応じてこの型を正しく推定する手法は解決されていない。
【００１５】
上記のように、ある種の日本語テキストには、英数字列が数多く含まれているものがある。例えば、インターネット上の店舗紹介のページなどでは、店名やサービス名、製品名が、アルファベット表記されているものが多く、その読み方もアルファベット読みするもの（例：ＣＤ）、ローマ字読みするもの（例：ＹＯＫＯＨＡＭＡ）、英語読みするもの（例：Ｒｅｓｔａｕｒａｎｔ）、フランス語読みするもの（例：ＴＥＲＲＡＳＳＥ）、イタリア語読みするもの（例：ＴＲＡＴＴＯＲＩＡ）等多彩である。また、テキストの前後の状況に応じて数字列の読み方にもバリエーションがある。例えば、「６１１」という数字列は、「６１１番」の場合は「ロッピャクジュウイチ」という読み、「Ａ６１１ｉｔ」（品番など）では、「ロクイチイチ」という読みになる。
【００１６】
しかし、これらのアルファベット列は固有名詞が多く新しい語も増えていくため、形態素解析の単語辞書に全てを登録するのは不可能であり、また、収集できる範囲で辞書登録するにしても、ローマ字や各種外来語などを登録しなければならず、単語辞書サイズが膨大になる。また、数字列は無限に存在し、さらにその前後の文字列まで考慮して登録するのは、非現実的である。
【００１７】
このため、アルファベット列に対しては、アルファベット列から読み（カナ列）へ変換する音訳が必要となるが、この音訳は、ある範囲のテキストに対して、英語読みやローマ字読みなどの特定アルファベット読みクラスを仮定して音訳を行っていたため、仮定と異なるクラスの場合には、正しく読みが付与されないという問題がある。
【００１８】
また、数字列に対しては、小数点などの数少ない文字を手掛かりに、数字読みクラスを判定し、数字列に読みを付与していたため、数字読みクラスを誤った場合に正しく読みが付与されないという問題がある。
【００１９】
本発明は、上記の点に鑑みなされたもので、アルファベットや数字からなる単語に対する日本語読みを決定する際に、アルファベット読み、英語読み等が一概に決定されない文字列に対する読みを自動的に付与するための読み情報決定方法及び装置及びプログラムを提供することを目的とする。
【００２０】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００２１】
本発明（請求項１）は、形態素解析手段、対象単語抽出手段、読みクラス候補抽出手段、対象単語情報利用型判定手段、文脈利用型判定手段、最終判定手段、読み付与手段、単語情報出力手段を有し、
処理対象のテキストが入力されると、各単語の読み、品詞を含む単語情報を出力する装置における読み情報決定方法であって、
形態素解析手段が、入力手段から入力、または、記憶手段から読み出された処理対象のテキストと設定情報を受け付け（ステップ１）、単語情報を付与するための情報が記憶された単語辞書を用いてテキストを形態素解析して単語情報を取得する（ステップ２）形態解析ステップと、
対象単語抽出手段が、設定情報として入力された読みクラスの判定を行う単語の指定により、単語情報の中から読みクラスの判定を行う対象単語を抽出する対象単語抽出ステップ（ステップ３）と、
読みクラス候補抽出手段が、各対象単語に対して、文字列を構成する文字種やその並びを以て読み方の種別を示す読みクラス候補となり得る読み候補を抽出する読みクラス候補抽出ステップ（ステップ４）と、
対象単語情報利用型判定手段が、抽出された対象単語がアルファベット列である場合は、対象単語情報利用型読みクラス判定モデルを用いた対象単語情報利用型判定を行う対象単語情報利用型判定ステップ（ステップ５）と、
文脈利用型判定手段が、読みクラスの第１候補のスコアが所定の信頼度閾値未満あるいは、抽出された対象単語が数字列の場合には、文脈利用型読みクラス判定モデルを用いた文脈利用型判定を行う文脈利用型判定ステップ（ステップ６）と、
最終判定手段が、
対象単語情報利用型判定ステップと文脈利用型判定ステップの第１候補のスコアを比較して、該対象単語情報利用型判定ステップにおける読みクラスの第１の候補のスコアと、該文脈利用型判定ステップで判定された第１候補の読みのクラスのスコアとスコアの重み（但し、スコアの重みは定数）を乗算した値のうち、値の大きい読みクラスを最終結果とし（ステップ８）、
対象単語情報利用型判定ステップの読みクラスの第１候補のスコアが所定の信頼度閾値以上、あるいは、対象単語情報利用型判定ステップと文脈利用型判定ステップの第１候補が同一、あるいは、対象単語が数字列の場合には、該第１候補を読みクラス判定の最終結果とし（ステップ７）、
対象単語が数字列の場合には、文脈利用型判定ステップの第１候補を読みクラス判定の最終結果とする（ステップ７）最終判定ステップと、
読み付与手段が、最終判定ステップで判定された読みクラスに応じて読み付与を行う読み付与ステップ（ステップ９）と、
単語情報出力手段が、設定情報として入力された出力する単語情報の形式に基づいて、単語情報を出力する単語情報出力ステップ（ステップ１０）と、を行う。
【００２２】
また、本発明（請求項２）は、対象単語情報利用型読みクラス判定モデルにおいて、
少なくとも単語の文字数、第１音節表記、末尾音節表記を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数の出力値より各読みクラスの順位とスコアを定める予め定められた順位関数を有し、
対象単語情報利用型判定ステップにおいて、対象単語情報利用型判定手段が、
対象単語情報利用型読み判定モデルに対して、抽出された対象単語の単語情報から得られる少なくとも単語の文字数、第１音節表記、末尾音節表記を含む属性を入力し、各属性ベクトル表現に変換して識別関数の計算を行い、該識別関数の出力値を順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する。
【００２３】
また、本発明（請求項３）は、文脈利用型読みクラス判定モデルにおいて、
少なくとも、単語の文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数の出力値より各読みクラスの順位とスコアを定める予め定められた順位関数を有し、
文脈利用型判定ステップにおいて、文脈利用型判定手段が、
文脈利用型読み判定モデルに対して、抽出された対象単語、及び該対象単語の前方Ｍ個の単語（Ｍ>０，任意に設定可能）、後方Ｎ個の単語（Ｎ>０，任意に設定可能）の単語情報から得られる少なくとも各単語の文字数、字種（アルファベット列は全て大文字、先頭大文字、すべて小文字、その他に分ける）、品詞を含む属性を入力し、各属性をベクトル表現に変換して識別関数の計算を行い、該識別関数の出力値を順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する。
【００２４】
本発明（請求項４）は、形態素解析手段、対象単語抽出手段、読みクラス候補抽出手段、一括判定手段、読み付与手段、単語情報出力手段を有し、処理対象のテキストが入力されると、各単語の読み、品詞を含む単語情報を出力する装置における読み情報決定方法であって、
形態素解析手段が、処理対象のテキストと設定情報を入力として受け付け、単語情報を付与するための情報が記憶された単語辞書を用いてテキストを形態素解析して単語情報を取得する形態解析ステップと、
対象単語抽出手段が、設定情報として入力された読みクラスの判定を行う単語の指定により、単語情報の中から読みクラスの判定を行う対象単語を抽出する対象単語抽出ステップと、
読みクラス候補抽出手段が、各対象単語に対して、文字列を構成する文字種やその並びを以て読み方の種別を示す読みクラス候補となり得る読み候補を抽出する読みクラス候補抽出ステップと、
一括判定手段が、一括読みクラス判定モデルを用いた一括判定を行い、第１候補を読みクラス判定の結果とする一括判定ステップと、
読み付与手段が、判定された読みクラスに応じて読み付与を行う読み付与ステップと、
単語情報出力手段が、設定情報として入力された出力する単語情報の形式に基づいて、単語情報を出力する単語情報出力ステップと、
を行い、
一括読みクラス判定モデルは、
少なくとも、アルファベット列用のみとしては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）を含む属性、数字列用のみとしては、文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性、アルファベット列と数字列共用としては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞数字タイプ（先頭文字が " ０ " かどうか）を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数と出力値より第１位の候補を選択する順位関数を有し、
一括判定ステップにおいて、一括判定手段が、
一括読み判定モデルに対して、抽出された対象単語、及び該対象単語の前方Ｍ個の単語（Ｍ > ０，任意に設定可能）、後方Ｎ個の単語（Ｎ > ０，任意に設定可能）の単語情報から得られる、少なくとも、
アルファベット列用のみとしては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）を含む属性、
数字列用のみとしては、文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性、
アルファベット列と数字列共用としては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性
を入力し、各属性をベクトル表現に変換して識別関数の計算を行い、識別関数の出力値を一括読み判定モデルの順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する。
【００２６】
図２は、本発明の原理構成図である。
【００２７】
本発明（請求項５）は、処理対象のテキストを入力して、各単語の読み、品詞を含む単語情報を出力する読み情報決定装置であって、
処理対象のテキストと設定情報を入力として受け付け、単語辞書を用いてテキストを形態素解析して単語情報を取得する形態素解析手段２と、
設定情報として入力された読みクラスの判定を行う単語の指定により、単語情報の中から読みクラスの判定を行う対象単語を抽出する対象単語抽出手段３と、
対象単語抽出手段３で抽出された各対象単語に対して、文字列を構成する文字種やその並びを以て読み方の種別を示す読みクラス候補となり得る読み候補を抽出する読みクラス候補抽出手段４１と、
対象単語抽出手段３で抽出された対象単語がアルファベット列である場合は、対象単語情報利用型読みクラス判定モデルを用いた対象単語情報利用型判定を行う対象単語情報利用型判定手段４２と、
読みクラスの第１候補のスコアが所定の信頼度閾値未満、あるいは、対象単語が数字列の場合には、文脈利用型読みクラス判定モデルを用いた文脈利用型判定を行う文脈利用型判定手段４３と、
対象単語情報利用型判定手段４２と該文脈利用型判定手段４３の第１候補のスコアを比較して、該対象単語情報利用型判定手段４２の読みクラスの第１の候補のスコアと、該文脈利用型判定手段４３で判定された第１候補の読みのクラスのスコアとスコアの重み（但し、スコアの重みは定数）を乗算した値のうち、値の大きい読みクラスを最終結果とし、対象単語情報利用型判定手段４２の読みクラスの第１候補のスコアが所定の信頼度閾値以上、あるいは、対象単語情報利用型判定手段４２と文脈利用型判定手段４３の第１候補が同一、あるいは、対象単語が数字列の場合には、該第１候補を読みクラス判定の最終結果とする最終判定手段４４と、
最終判定手段４４で判定された読みクラスに応じて読み付与を行う読み付与手段５と、
設定情報として入力された出力する単語情報の形式に基づいて、単語情報を出力する単語情報出力手段６と、を有する。
【００２８】
また、本発明（請求項６）は、対象単語情報利用型読み判定モデルにおいて、
少なくとも、単語の文字数、第１音節表記、末尾音節表記を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数の出力値より各読みクラスの順位とスコアを定める予め定められた順位関数を有し、
対象単語情報利用型判定手段４２は、
対象単語情報利用型読み判定モデルに対して、抽出された対象単語の単語情報から得られる少なくとも単語の文字数、第１音節表記、末尾音節表記を含む属性を入力し、各属性ベクトル表現に変換して識別関数の計算を行い、識別関数の出力値を順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する手段を含む。
【００２９】
また、本発明（請求項７）は、文脈利用型読みクラス判定モデルにおいて、
少なくとも、単語の文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数の出力値より各読みクラスの順位とスコアを定める予め定められた順位関数を有し、
文脈利用型判定手段４３は、
文脈利用型読み判定モデルに対して、抽出された対象単語、及び該対象単語の前方Ｍ個の単語（Ｍ>０，任意に設定可能）、後方Ｎ個の単語（Ｎ>０，任意に設定可能）の単語情報から得られる少なくとも各単語の文字数、字種（アルファベット列は全て大文字、先頭大文字、すべて小文字、その他に分ける）、品詞を含む属性を入力し、各属性をベクトル表現に変換して識別関数の計算を行い、該識別関数の出力値を順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する手段を含む。
【００３０】
本発明（請求項８）は、処理対象のテキストを入力して、各単語の読み、品詞を含む単語情報を出力する読み情報決定装置であって、
処理対象のテキストと設定情報を入力として受け付け、単語辞書を用いて該テキストを形態素解析して単語情報を取得する形態素解析手段と、
設定情報として入力された読みクラスの判定を行う単語の指定により、単語情報の中から読みクラスの判定を行う対象単語を抽出する対象単語抽出手段と、
各対象単語に対して、文字列を構成する文字種やその並びを以て読み方の種別を示す読みクラス候補となり得る読み候補を抽出する読みクラス候補抽出手段と、
一括読みクラス判定モデルを用いた一括判定を行い、第１候補を読みクラス判定の結果とする一括判定手段と、
判定された読みクラスに応じて読み付与を行う読み付与手段と、
設定情報として入力された出力する単語情報の形式に基づいて、単語情報を出力する単語情報出力手段と、を有し、
一括読みクラス判定モデルは、
少なくとも、アルファベット列用のみとしては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）を含む属性、数字列用のみとしては、文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性、アルファベット列と数字列共用としては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞数字タイプ（先頭文字が " ０ " かどうか）を含む属性に対応するパラメータを持つ予め定められた識別関数と、該識別関数と出力値より第１位の候補を選択する順位関数を有し、
一括判定手段は、
一括読み判定モデルに対して、抽出された対象単語、及び該対象単語の前方Ｍ個の単語（Ｍ > ０，任意に設定可能）、後方Ｎ個の単語（Ｎ > ０，任意に設定可能）の単語情報から得られる、少なくとも、
アルファベット列用のみとしては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）を含む属性、
数字列用のみとしては、文字数、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性、
アルファベット列と数字列共用としては、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が " ０ " かどうか）を含む属性、
を入力し、各属性をベクトル表現に変換して識別関数の計算を行い、識別関数の出力値を一括読み判定モデルの順位関数に入力して、各読みクラス候補の推定順位をスコア付きで出力する手段を含む。
【００３２】
本発明（請求項９）は、コンピュータを、請求項５乃至８記載の読み情報決定装置として機能させるプログラムである。
【００３６】
上記のように本発明は、アルファベット列及び数字列に対して、各種辞書等より収集が容易な当該文字列自身の情報、及び、コーパス等を作成するコストが必要な当該文字列近辺の文字列情報を利用した統計モデルを用いて、文字列を構成する文字種やその並びを以て読み方の種別を示す読みクラス候補を決め、前後の単語の文脈との関係から属性を判定して読みクラスを絞り込むことを可能にする。
【００３７】
【発明の実施の形態】
以下、図面と共に本発明の実施の形態を説明する。
【００３８】
最初に読み情報決定装置の概要を説明する。
【００３９】
図３は、本発明の一実施の形態における読み情報決定装置の構成を示す。
【００４０】
同図に示す読み情報決定装置は、テキスト入力部１、形態素解析部２、対象単語抽出部３、読みクラス判定部４、読み付与部５、単語情報出力部６、単語辞書７、及び読みクラス判定モデル８から構成される。
【００４１】
テキスト入力部１は、テキストと設定情報を入力する。
【００４２】
ここで、テキストは、キーボードから入力される、あるいはハードディスクやメモリ等に格納されている等の、読み等の単語情報を付与する対象となる任意のテキストであり、形態素解析部２に渡す。
【００４３】
また、設定情報（対象単語列抽出部３で用いられる）として、読みクラスの判定を行う単語を構成する文字列の条件（指定された字種列（アルファベット、全文大文字、小文字等））であり、例えば、全アルファベット列、全数字列、未知語のあったアルファベット列と全数字列、未知語または、読みの多義のあるアルファベット列、または、全く判定しない等）、出力する単語情報の形式（例えば、全ての単語情報をメモリに出力、読みだけを標準出力に出力、表記と読みをハードディスク上のファイルに出力等）からなり、キーボードから入力される、あるいは、ハードディスクやメモリ等に格納されている情報である。読みクラスの判定を行う字種の指定は、対象単語列抽出部３に渡す。出力する単語情報の形式は、単語情報出力部６に渡す。
【００４４】
形態素解析部２は、テキスト入力部１から受け取ったテキストを、単語表記、品詞、読み、アクセント型等を対応付けて記憶した単語辞書７を用いて、単語に区切り、表記、品詞、読み、アクセント型などからなる単語情報を付与する。ここで、単語辞書７に登録されておらず、未知語となった単語は字種単位でまとめて１語として扱う。また、数字はまとめて１語として扱う。
【００４５】
対象単語列抽出部３は、テキスト入力部１から得られた読みクラスの判定を行う単語の指定により、指定された単語を、形態素解析部２から得られた単語情報の中から抽出して、読みクラスの判定を行う対象単語の抽出を行う。
【００４６】
読みクラス判定部４は、対象単語列抽出部３が抽出した各対象単語に対して、読みクラス判定モデル８を利用して、読みクラスの判定を行う。ここで判定された読みクラスは、形態素解析部２が出力した単語情報に追加する。読みクラス判定部４及び読みクラス判定モデル８の詳細については後述する。
【００４７】
読み付与部５は、対象単語列抽出部３で抽出された各対象単語に対して、付与された読みクラスに応じて、読みを付与する。
【００４８】
具体的には、数字列に対しては、判定された数字読みクラスに応じて、例えば、表記のゆれを吸収するための日本語の数表記を七つの形式に分類し、数表記の標準形を定め、これらに標準的な音韻とアクセント、ポーズを付与する規則を作成し、また、数字に助数詞が連接した場合の数詞、助数詞の音韻変化とアクセント結合についての規則化を行う、「宮崎正弘，『日本文音声変換のための数詞読み規則』，情報処理学会論文誌，１９８４年６月，第２５巻、第６号、pp.1035-1043」に示されるような規則を適用して読みを付与する。アルファベット列に対しては、アルファベット読みと判定された単語には、アルファベット各文字とその読みを対応させたアルファベット読み対応表（例：Ａ＝エー，Ｂ＝ビー）を用いて読みを付与し、ローマ字読みと判定された単語には、ローマ字とその読みを対応させたローマ字読み対応表（例：Ａ＝ア，ＫＡ＝カ）を用いて読みを付与し、英語読み、フランス語読みなどの各国語に対しては、それぞれの言語毎に、例えば、特開２００１−１４２８７７号公報に示される方法などを用いて読みを付与する。この方法は、英文字とカタカナ対応データから作成された音訳モデルに基づき、英単語とカタカナの同時出現確率が最大となる経路を探索することにより、任意の英単語について最適なカタカナ音訳を行うものである。
【００４９】
ここで付与された読みは、形態素解析部２で出力した単語情報を上書きする（単語情報が読みの多義を持つ構造の場合には、ここで付与された読みを第一位とする）。なお、当該読み付与部５が読みを付与するためには、ローマ字読みの場合にはローマ字表、アルファベット読みの場合にはアルファベット表、英語読み、フランス語読み等で特開２００１−１４２８７７号公報に示される方法を用いる場合には音訳モデルが必要となるため、これらの表及びモデルは、当該読み付与部５の内部または外部にデータベースとして設けられるものとする。
【００５０】
単語情報出力部６は、テキスト入力部１から得られた出力する単語情報の形式に従って単語情報を指定された出力先に指定された形式で出力する。
【００５１】
［第１の実施の形態］
上記の読みクラス判定部４の詳細な処理について説明する。
【００５２】
図４は、本発明の第１の実施の形態における読みクラス判定部の構成を示す。同図に示す読みクラス判定部４は、読みクラス候補抽出部４１、対象単語情報利用型判定部４２、文脈利用型判定部４３、最終判定部４４からなる。また、読みクラス判定モデル８は、対象単語情報利用型読みクラス判定モデル８１と文脈利用型読みクラス判定モデル８２を有し、対象単語情報利用型読みクラス判定モデル８１は、対象単語情報利用型判定部４２により参照され、文脈利用型読みクラス判定モデル８２は、文脈利用型判定部４３により参照される。
【００５３】
読みクラス候補抽出部４１は、対象とする読みクラスのうち、対象単語列抽出部３が抽出した対象単語が取り得る読みクラスを抽出する。例えば、数字列の場合には、アルファベット読みやローマ字読みといった読みクラスにはなり得ないので、これらのクラスを除外する。また、アルファベット列では棒読みや桁読みというクラスが除外され、さらに、ローマ字になり得ないもの、例えば、ローマ字で用いられない文字が存在（例：ＬＥＭＯＮ）、ローマ字であり得ない文字列の並びが存在（例：ＲＥＳＴＡＵＲＡＮＴ）した場合には、ローマ字読みというクラスも除外される。
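The candidate extraction described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the class names, the function names `can_be_romaji` and `candidate_classes`, and the simplified romaji syllable pattern (which ignores geminate consonants and long vowels) are hypothetical and do not come from the patent.

```python
import re
import unicodedata

# Hypothetical class labels; the patent names the classes only in Japanese.
ALPHABET_CLASSES = ["alphabet", "english", "romaji", "french", "italian"]
NUMERIC_CLASSES = ["integer", "decimal", "fraction", "approximate",
                   "digit_by_digit", "range", "juxtaposed", "english_number"]

# Simplified romaji syllable (an assumption): an optional consonant or
# consonant cluster followed by one vowel, or a moraic "n" that is not
# followed by a vowel or "y".
_ROMAJI_SYLLABLE = (
    r"(?:(?:ky|gy|sh|sy|ch|ty|ts|ny|hy|by|py|my|ry|zy|dy|"
    r"[kgsztdnhbpmyrwfj])?[aiueo]|n(?![aiueoy]))"
)
_ROMAJI_WORD = re.compile(rf"(?:{_ROMAJI_SYLLABLE})+")


def _normalize(word):
    """Fold full-width letters and digits to ASCII and lower-case them."""
    return unicodedata.normalize("NFKC", word).lower()


def can_be_romaji(word):
    """True if the whole word can be segmented into romaji syllables."""
    return _ROMAJI_WORD.fullmatch(_normalize(word)) is not None


def candidate_classes(word):
    """Return the reading classes the target word can still take."""
    if _normalize(word).isdigit():
        # Numeric strings can never take an alphabet reading class.
        return list(NUMERIC_CLASSES)
    candidates = list(ALPHABET_CLASSES)
    if not can_be_romaji(word):
        # e.g. LEMON (letter "L") or RESTAURANT (sequence "ST").
        candidates.remove("romaji")
    return candidates
```

With this sketch, `candidate_classes("YOKOSUKA")` keeps the romaji class, while `candidate_classes("AIR")` drops it because a word cannot end in "R" under the simplified pattern.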
【００５４】
対象単語情報利用型判定部４２は、対象単語列抽出部３が抽出した対象単語の単語情報から得られる属性を対象単語情報利用型読みクラス判定モデル８１に入力する。
【００５５】
ここでは、アルファベット列のみを対象としている。これは、アルファベット列は対象単語の情報だけで読みクラスが決定できる場合が数多くあり得るが（例：「ｂｅａｕｔｉｆｕｌ」＝英語読み、「ＳＶＭ」＝アルファベット読みなど）、数字列は先に挙げた「６１１」の例のように、対象単語の情報のみでは読みクラスが決定できないからである。
【００５６】
対象単語情報利用型読みクラス判定モデル８１は、以下に述べる属性を入力とする識別関数と、識別関数の出力値を入力して、各読みクラス候補の推定順位をスコア付きで出力する順位関数からなる。日本語テキストコーパス（または、辞書）等を用いて学習データを作成し、例えば、「山田寛康、他１名、『Support Vector Machineの多値分類問題への適用法について』、情報処理学会研究報告: 自然言語処理、２００１年１１月２０日、pp.33-38」に数種類示されるSupport Vector Machine（ＳＶＭ）を多値分類拡張したもの等を学習器として用いて、識別関数のパラメータは予め決定しておく。利用する属性は、少なくとも単語の文字数と、第１音節、末尾音節の表記を含む。それ以外の音節の表記を属性に加えても構わない。ここでの音節の境界は、“母音(aiueo) ＋それ以外の文字”となる位置とする。なお、順位関数としては、例えば、前述の山田他の文献に示されるpairwise法により順位を決定し、投票されたクラスの距離の緩和をスコアとするものなどが考えられる。
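As a rough illustration of the attributes named in this paragraph (character count, first-syllable spelling, last-syllable spelling), the sketch below splits a word at every position where a vowel (a, i, u, e, o) is followed by a non-vowel, which is the boundary rule stated above. The function names and the dictionary keys are assumptions made for the example; the patent does not prescribe a data format.

```python
import re

# Zero-width boundary: just after a vowel and just before a non-vowel.
# Splitting on empty matches requires Python 3.7 or later.
_BOUNDARY = re.compile(r"(?<=[aiueoAIUEO])(?=[^aiueoAIUEO])")


def split_syllables(word):
    """Split a word at each 'vowel + other character' position."""
    return [part for part in _BOUNDARY.split(word) if part]


def word_attributes(word):
    """Attributes used by the word-information-based model (sketch)."""
    if not word:
        raise ValueError("word must not be empty")
    syllables = split_syllables(word)
    return {
        "length": len(word),
        "first_syllable": syllables[0],
        "last_syllable": syllables[-1],
    }
```

For "YOKOSUKA" this yields a length of 8 with first syllable "YO" and last syllable "KA", the same values the worked example later in the description uses for step 103.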
【００５７】
文脈利用型判定部４３は、対象単語列抽出部３が抽出した対象単語及びその隣接単語の単語情報から得られる属性を文脈利用型読みクラス判定モデル８２に入力して、各読みクラス候補の推定順序をスコア付で出力する。
文脈利用型読みクラス判定モデル８２は、以下に述べる属性を入力とする識別関数と、識別関数の出力値を入力して、各読みクラス候補の推定順位をスコア付きで出力する順位関数からなる。日本語テキストコーパス（または、辞書）等を用いて学習データを作成し、対象単語情報利用型判定モデル８１で用いた学習器を用いて、日本語テキストコーパス等から学習データを収集し、予め作成しておく。利用する属性は、対象単語、及びその前方Ｍ個の単語（Ｍ＞０、任意に設定可能）、後方Ｎ個の単語（Ｎ＞０、任意に設定可能）の文字数、字種（アルファベット列は、すべて大文字、先頭大文字、その他に分ける）、品詞等である。
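The context attributes listed in this paragraph can be gathered roughly as shown below. This is a sketch only: the word records (`surface`, `pos`), the label strings, the `None` padding at sentence edges, and the sample sentence in the usage note are all hypothetical. Note also that the claims additionally list "all lower case" as a character type, so it is included here.

```python
def char_type(surface):
    """Coarse character type of a word; alphabetic words get four subtypes."""
    if surface.isascii() and surface.isalpha():
        if surface.isupper():
            return "all_upper"
        if surface.islower():
            return "all_lower"
        if surface[0].isupper() and surface[1:].islower():
            return "initial_upper"
        return "other_alpha"
    if surface.isdigit():
        return "digits"
    return "non_alpha"


def context_attributes(words, index, m=2, n=2):
    """Collect attributes of the target word, M words before, N words after.

    `words` is a list of dicts with hypothetical keys "surface" and "pos".
    Positions outside the sentence are padded with None.
    """
    features = {}
    for offset in range(-m, n + 1):
        j = index + offset
        if 0 <= j < len(words):
            w = words[j]
            features[offset] = {
                "length": len(w["surface"]),
                "char_type": char_type(w["surface"]),
                "pos": w["pos"],
            }
        else:
            features[offset] = None
    return features
```

For a made-up three-word input such as `[{"surface": "YOKOSUKA", "pos": "unknown"}, {"surface": "の", "pos": "particle"}, {"surface": "店", "pos": "noun"}]`, calling `context_attributes(words, 0)` returns padded entries for offsets -2 and -1 and real attributes for offsets 0 to 2. Each attribute would then be converted to a vector before being passed to the discriminant function.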
【００５８】
なお、順位関数としては、例えば、前述の山田他の文献に示されるpairwise法により順位を決定し、投票されたクラスの距離の緩和をスコアとするものなどが考えられる。
【００５９】
最終判定部４４は、対象単語情報利用型判定部４２と文脈利用型判定部４３の判定結果より、最終的に判定した読みクラスを出力する。
【００６０】
図５は、本発明の第１の実施の形態における読みクラス判定処理動作のフローチャートである。
【００６１】
ステップ１０１）まず、現在の処理対象単語から、取り得る読みクラスを抽出する。
【００６２】
ステップ１０２）対象単語が数字列であるか判定し、数字列である場合にはステップ１０５に移行する。また、数字列でない場合にはステップ１０３に移行する。
【００６３】
ステップ１０３）対象単語が数字列でない場合には、対象単語情報利用型判定を行い、ステップ１０１で抽出された各読みクラス候補の推定順位をスコア付きで出力する。
【００６４】
ステップ１０４）ステップ１０３で出力された読みクラス候補第１位のスコアが信頼性閾値以上であるか判定し、信頼性閾値以上である場合には、ステップ１０８に移行し、信頼性閾値未満である場合には、ステップ１０５に移行する。ここで、信頼性閾値は、経験的に予め設定しておく値である。
【００６５】
ステップ１０５）読みクラス候補第１位のスコアが信頼性閾値以上でない場合、あるいは、対象単語が数字列の場合は、文脈利用型判定を行い、各読みクラス候補の推定順位をスコア付きで出力する。ここで、判定を行う読みクラスの候補は、ステップ１０１で抽出された読みクラスの候補のすべてとしてもよいし、ステップ１０３で順位付けされた読みクラスのうちの上位いくつかとする、あるいは、ステップ１０３で得られたスコアがある値以上の読みクラスのみとする等の絞り込みを行ってもよい（この場合でも、ステップ１０３を通らない場合は、ステップ１０１で抽出された読みクラス候補すべてとする）。
【００６６】
ステップ１０６）ステップ１０３が行われているかどうかを判定し、行われている場合には、ステップ１０３とステップ１０５で判定された各第１位の読みクラスが同じであるか判定する。ステップ１０３が行われなかった場合と、ステップ１０３が行われ、ステップ１０５と第１位の読みクラスが同じ場合には、ステップ１０８へ移行する。それ以外の場合にはステップ１０７に移行する。
【００６７】
ステップ１０７）ステップ１０３で判定された第１位の読みクラスのスコアと、ステップ１０５で判定された第１位の読みクラスの“スコア＊スコアの重み”（但し、スコアの重みは定数）の値のうち、値の大きい読みクラスを最終的な読みクラスとし、処理を終了する。スコア重みは、経験的に予め設定しておく定数である。
【００６８】
ステップ１０８）ステップ１０３あるいはステップ１０５（行われたもの）で判定された第１位の読みクラスを最終的な読みクラスとし、処理を終了する。
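Steps 102 to 108 above amount to a small decision procedure, sketched below. The two models are passed in as plain callables that return `(class, score)` pairs sorted best first; that interface, the function name, and the tie-breaking rule in step 107 (the word-information result wins on an exact tie) are assumptions, since the description only says that the class with the larger value is chosen.

```python
def decide_reading_class(word, candidates, word_model, context_model,
                         threshold=1.0, weight=1.0):
    """Sketch of the first embodiment's final determination (steps 102-108).

    word_model(word, candidates) and context_model(word, candidates) are
    hypothetical callables returning [(reading_class, score), ...] with the
    best candidate first.
    """
    # Step 102: numeric strings skip the word-information-based model.
    if word.isdigit():
        # Steps 105, 106, 108: only the context-based result is available.
        return context_model(word, candidates)[0][0]

    # Step 103: word-information-based determination.
    top_word, score_word = word_model(word, candidates)[0]

    # Step 104: accept immediately if the score reaches the threshold.
    if score_word >= threshold:
        return top_word  # step 108

    # Step 105: context-based determination.
    top_ctx, score_ctx = context_model(word, candidates)[0]

    # Step 106: agreement between the two models settles the class.
    if top_ctx == top_word:
        return top_word  # step 108

    # Step 107: compare the word score with the weighted context score.
    # Assumption: an exact tie goes to the word-information result.
    return top_word if score_word >= score_ctx * weight else top_ctx
```

The threshold and weight defaults of 1.00 match the values used in the worked example of the description; in practice both are set empirically.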
［第２の実施の形態］
図６は、本発明の第２の実施の形態における読みクラス判定部の構成図である。同図に示す読みクラス判定部４は、読みクラス候補抽出部４１と一括判定部４５を有し、一括判定部４５は一括読みクラス判定モデル８３を参照する。
【００６９】
読みクラス候補抽出部４１は、一括判定部４５が出力対象とする読みクラスのうち、対象単語列抽出部３が抽出した対象単語が取り得る読みクラスを抽出する。これは、前述の第１の実施の形態と全く同一である。
【００７０】
一括判定部４５は、対象単語列抽出部３が抽出した対象単語及びその隣接単語の単語情報から得られる属性を一括読みクラス判定モデル８３に入力して、各読みクラス候補の推定順位を得て、その第１位となった読みクラスを最終的な読みクラスとし、出力する。
一括読みクラス判定モデル８３は、対象単語情報利用型読みクラス判定モデル８１で用いた学習器を用いて、日本語テキストコーパス（または、辞書）等から抽出した属性と読みクラスのセットを学習データとして予め作成される識別関数と、識別関数の出力値を入力して、各読みクラス候補の推定順位をスコア付きで出力する順位関数からなる。ここで一括読みクラス判定モデル８３は、アルファベット列と数字列をまとめて１つのモデルとしてもよいし、アルファベット列用と数字列用の２つのモデルに分けてもよい。
【００７１】
利用する属性は、対象単語、及び対象単語前方Ｍ個の単語（Ｍ＞０，任意に設定可能）、及び対象単語後方Ｎ個の単語（Ｎ＞０，任意に設定可能）に対する単語属性と、対象単語前方Ｍ個の読みクラスである。
【００７２】
アルファベット列用の単語属性としては、少なくとも、文字数、第１音節表記、末尾音節表記、字種（アルファベット列は、全て大文字、先頭大文字、全て小文字、その他に分ける）を含む。ここで、単語がアルファベット列以外の場合には、第１音節表記、末尾音節表記はなしとなる。
【００７３】
数字列用の単語属性としては、少なくとも、文字数、文字種（アルファベット列は全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が“０”かどうか）を含む。
【００７４】
アルファベット列と数字列用の（１つにまとめた）属性としては、少なくとも、文字数、第１音節表記、末尾音節表記、文字種（アルファベット列は、全て大文字、先頭大文字、全て小文字、その他に分ける）、品詞、数字タイプ（先頭文字が“０”かどうか）を含む。
【００７５】
【実施例】
以下では、図７に示すテキストを入力例として、図７から図１２を用いて本発明の実施例を説明する。
【００７６】
図７は、本発明の一実施例の入力から対象単語抽出までのデータ例を示し、図８は、本発明の一実施例の文脈利用型判定の属性例を示し、図９〜図１１は、本発明の一実施例の一括判定の属性例を示し、図１２は、本発明の一実施例の最終出力する単語情報例を示す。
【００７７】
ここでは、入力される設定情報は、『読みクラスの判定を行う単語＝全アルファベット列・全数字列、出力する単語の形式＝すべての単語情報をメモリに出力である』としておくが、以下では、部分的に他の設定情報の場合にはどうなるかについても説明を加える。
【００７８】
テキスト入力部１では、『読みクラスの判定を行う単語＝全アルファベット列・全数字列』を対象単語抽出部３に渡す。また、『出力する単語の形式＝全ての単語情報をメモリに出力』を単語情報出力部６に渡す。また、テキストを形態素解析部２に渡す。
【００７９】
次に、形態素解析部２は、単語辞書７を用いて、図７に示すように単語の認定を行い、各単語毎に、表記、品詞、読み、字種などからなる単語情報が得られる。
【００８０】
次に、対象単語抽出部３は、『単語情報と、読みクラスの判定を行う単語＝全アルファベット列・全数字列』という指定より、図７に示す対象単語を抽出する。
【００８１】
ちなみに、設定情報として、『読みクラスの判定を行う単語＝未知語のアルファベット』が入力された場合には、「１：ＹＯＫＯＳＵＫＡ」と「１３：ＡＩＲ」のみを対象単語として抽出する。
【００８２】
以下、読みクラス判定部４として、前述の第１の実施の形態における図４に示した読みクラスの判定処理について説明する。ここでは、「１：ＹＯＫＯＳＵＫＡ」、「４：１０」の例を用いて図５のフローチャートに基づいて説明する。
【００８３】
ここでは、アルファベット読みクラスとして、アルファベット読み、英語読み、ローマ字読み、数字読みクラスとして、整数型、小数型、分数型、概数型、棒読み型、範囲型、併記型、英語型（「宮崎正弘、「日本文音声変換のための数詞読み規則」、情報処理学会論文誌、１９８４年６月、第２５巻、第６号、pp.1035-1043. 」の分類に英語型を加えたもの）を扱うこととする。
【００８４】
対象単語情報利用型読みクラス判定モデル８１は、単語文字数と第１音節・末尾音節表記を属性として、ＳＶＭをペアワイズ法により、多値分類に拡張したモデルを利用するものとする。
【００８５】
文脈利用型読みクラス判定モデル８２は、対象単語及び前後２単語それぞれについての文字数、字種、単語表記、先頭文字表記、末尾文字表記、品詞、及び前方２つの読みクラス（それらが読みクラス判定の対象単語の場合のみ）を属性として、ＳＶＭをペアワイズ法により多値分類に拡張したモデルを利用するものとする。
【００８６】
また、ステップ１０４の信頼度閾値＝１．００、ステップ１０７のスコアの重み＝１．００とする。
【００８７】
ステップ１０５では、読みクラスを限定して、ステップ１０３を通る場合には、ステップ１０３の上位２位の読みクラスに対する判定を行うものとし、ステップ１０３、ステップ１０５のスコアとしては、第１解＝第２解との距離、それ以外＝０とする。
【００８８】
まず、「１：ＹＯＫＯＳＵＫＡ」の場合を示す。
【００８９】
図５のステップ１０１において、「ＹＯＫＯＳＵＫＡ」は、アルファベット列であるため、全数字読みクラスを除外する。また、ローマ字になり得る綴りかをチェックして、なり得ると判定する。この結果、読みクラス候補は、アルファベット読み、英語読み、ローマ字読みの３種類となる。
【００９０】
次に、ステップ１０２で、「ＹＯＫＯＳＵＫＡ」は数字列ではないので、ステップ１０３に移行する。
【００９１】
ステップ１０３では、単語文字数＝８、第１音節表記＝ＹＯ、末尾音節表記＝ＫＡを属性として抽出し、アルファベット読み、英語読み、ローマ字読みを読みクラス候補として、対象単語情報利用型読みクラス判定モデル８１に適用する。この結果、
１位：ローマ字読み、スコア＝２．５４
２位：英語読み、スコア＝０
３位：アルファベット読み、スコア＝０
が得られたとする。
【００９２】
ステップ１０４では、第１解スコア＝２．５４、信頼度閾値＝１．００であるので、ステップ１０８に移行し、ローマ字読みと判定して処理を終了する。
【００９３】
次に、「４：１０」の場合を示す。
【００９４】
ステップ１０１において、「１０」は数字列であるため、全アルファベット読みクラスを除外する。この結果、読みクラス候補は、整数型、小数型、分数型、概数型、棒読み型、範囲型、併記型、英語型となる。
【００９５】
次に、ステップ１０２で「１０」は数字列なので、ステップ１０５に移行する。
【００９６】
ステップ１０５で、判定に用いる属性を図８に示す。読みクラス候補を整数型、小数型、分数型、概数型、棒読み型、範囲型、併記型、英語型として、この属性を、文脈利用型読み判定モデル８２に適用し、この結果、
１位：英語型、スコア＝０．０３
２位：整数型、スコア＝０
３位：小数型、スコア＝０
（以下、略）
が得られたとする。
【００９７】
ステップ１０６では、ステップ１０３の判定を行っていないので、ステップ１０８に移行し、英語型と判定して処理を終了する。
【００９８】
次に、読みクラス判定部４として、図６に示す前述の第２の実施の形態を用いた場合の実施例を「４：１０」，「１３：ＡＩＲ」の例を用いて説明する。
【００９９】
ここでは、アルファベット読みクラスとして、アルファベット読み、英語読み、ローマ字読み、フランス語読み、イタリア語読み、数字読みクラスとして、整数型、小数型、分数型、概数型、棒読み型、範囲型、併記型、英語型（「宮崎正弘、「日本文音声変換のための数詞読み規則」、情報処理学会論文誌、１９８４年６月、第２５巻、第６号、pp.1035-1043. 」の分類に英語型を加えたもの）を扱うこととする。
【０１００】
一括読みクラス判定モデル８３は、ここでは、アルファベット列用と数字列用の２つのモデルに分けるものとする。いずれのモデルもＳＶＭをペアワイズ法により多値分類に拡張したモデルを利用するものとし、対象単語及び前後２単語についての以下に示すそれぞれの単語属性、及び、前方２単語の読みクラスを属性とするものとする。
【０１０１】
アルファベット列用の単語属性は、文字数、第１、第２、末尾−１、末尾音節表記（アルファベット列以外は値なし）、文字種（アルファベット列は、全て大文字、先頭大文字、全て小文字、その他に分ける）とする。
【０１０２】
数字列用の単語属性は、表記、文字数、数字タイプ（先頭文字が“０”かどうか）、主品詞、文字種（アルファベット列は、すべて大文字、先頭大文字、全て小文字、その他に分ける）とする。
【０１０３】
図６の読みクラス候補抽出部４１において、「４：１０」は、数字列であるため、全アルファベット読みクラスを除外する。この結果、読みクラス候補は、整数型、小数型、分数型、概数型、棒読み型、範囲型、併記型、英語型の８種類となる。
【０１０４】
次に、一括判定部４５では、上記８種類を読みクラスの候補として、図９に示す属性を、数字列用の一括読みクラス判定モデル８３に適用し、この結果、
１位：英語型
２位：整数型
（以下略）
が得られたとする。これにより、英語型と判定して処理を終了する。
【０１０５】
図６の読みクラス候補抽出部４１において「ＡＩＲ」は、アルファベット列であるため、全数字読みクラスを除外する。また、ローマ字では「Ｒ」が語尾となることはあり得ないので、ローマ字読みも読みクラスから除外する。この結果、読みクラスの候補は、アルファベット読み、英語読み、イタリア語読み、フランス語読みの４種類となる。
【０１０６】
次に、一括判定部４５では、アルファベット読み、英語読み、フランス語読み、イタリア語読みを読みクラスの候補として、図１０に示す属性を、一括読みクラス判定モデル８３に適用し、この結果、
１位：英語読み
２位：アルファベット読み
３位：イタリア語読み
４位：フランス語読み
が得られたとする。これにより、英語読みと判定して処理を終了する。
【０１０７】
次に、一括読みクラス判定モデル８３として、アルファベット列と数字列を纏めて１つにした場合の具体例を「１：ＹＯＫＯＳＵＫＡ」の例を用いて説明する。
【０１０８】
このモデルはＳＶＭをペアワイズ法により多値分類に拡張したモデルを利用するものとし、対象単語及び前後２単語についての以下に示す単語属性、及び、前方２単語の読みクラス属性とするものである。
【０１０９】
単語属性は、表記、文字数、第１、第２、末尾−１、末尾音節表記（アルファベット列以外は値なし）、文字種（アルファベット列は、すべて大文字、先頭大文字、すべて小文字、その他に分ける）、品詞、数字タイプ（先頭文字が“０”かどうか）とする。
【０１１０】
図６の読みクラス候補抽出部４１において、「１：ＹＯＫＯＳＵＫＡ」は、アルファベット列であるため、全数字読みクラスを除外する。この結果、読みクラスの候補は、アルファベット読み、英語読み、ローマ字読み、フランス語読み、イタリア語読みとなる。
【０１１１】
次に、一括判定部４５では、上記読みクラスを候補として、図１１に示す属性を、一括読みクラス判定モデル８３に適用し、この結果、
１位：ローマ字読み
２位：英語読み
３位：イタリア語読み
４位：フランス語読み
５位：アルファベット読み
が得られたとする。これにより、ローマ字読みと判定して処理を終了する。
【０１１２】
図３において、読みクラス判定部４は、上記に示したように、対象単語抽出部３で抽出された単語すべてに読みクラスを付与する（図１２の読みクラス参照）。
【０１１３】
次に読み付与部５は、付与した読みクラスに基づき読みを付与する。
【０１１４】
例えば、「１：ＹＯＫＯＳＵＫＡ」はローマ字読みと判定されているので、「ＹＯ→ヨ」、「ＫＯ→コ」、「ＳＵ→ス」、「ＫＡ→カ」と変換され、「ヨコスカ」という読みを得る。
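Table-driven reading assignment of this kind can be sketched as a longest-match lookup. The tables below are deliberately tiny illustrative subsets: the description only gives the entries A=エー, B=ビー, A=ア, KA=カ and the YO, KO, SU, KA mappings, so the entries for C and D and the function names are additions made for this example.

```python
# Illustrative subsets only; a real system needs complete tables.
ROMAJI_TABLE = {"a": "ア", "ka": "カ", "ko": "コ", "su": "ス", "yo": "ヨ"}
ALPHABET_TABLE = {"A": "エー", "B": "ビー", "C": "シー", "D": "ディー"}


def read_romaji(word, table=ROMAJI_TABLE):
    """Convert a romaji word to katakana by greedy longest-match lookup."""
    text = word.lower()
    longest = max(len(key) for key in table)
    reading = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            kana = table.get(text[i:i + size])
            if kana is not None:
                reading.append(kana)
                i += size
                break
        else:
            raise ValueError(f"no romaji entry matches {text[i:]!r}")
    return "".join(reading)


def read_alphabet(word, table=ALPHABET_TABLE):
    """Read a word letter by letter using the alphabet reading table."""
    return "".join(table[letter] for letter in word.upper())
```

Here `read_romaji("YOKOSUKA")` follows the YO, KO, SU, KA segmentation described above, and `read_alphabet("CD")` reads each letter separately. English or French readings would instead go through a transliteration model, which is outside the scope of this sketch.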
【０１１５】
「４：１０」は、英語型と判定されているので、予め用意しておいた、英語読み変換表により、「テン」という読みを得る。
【０１１６】
「１３：ＡＩＲ」は英語読みと判定されているので、英語用に作られた「特開２００１−１４２８７７号公報」等を利用して、「エア」という読みを得る。なお、当該「特開２００１−１４２８７７号公報」による方法を用いる場合には、各国語音訳モデルを用いるものとする。
【０１１７】
最後に、単語情報出力部６では、設定情報で『出力する単語の形式＝すべての単語情報をメモリに出力』としてあるので、図１２の単語情報をメモリに出力する。
【０１１８】
この出力された単語情報は、例えば、音声合成装置へ入力すれば、合成音声が出力できる。
【０１１９】
なお、上記の第１の実施の形態及び第２の実施の形態における読みクラス判定部の動作をプログラムとして構築し、読み情報決定装置として利用されるコンピュータにインストールし、ＣＰＵ等の制御手段で実行することも可能である。また、図３に示す単語辞書をデータベースとして構築し、記憶手段に記憶しておき、他の構成要素についてもプログラムとして構築し、読み情報決定装置として利用されるコンピュータにインストールし、ＣＰＵ等の制御手段で実行することも可能である。
【０１２０】
また、構築されたプログラムを読み情報決定装置として利用されるコンピュータに接続されるハードディスクや、フレキシブルディスクやＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施するコンピュータにインストールすることも可能である。
【０１２１】
なお、本発明は上記の実施の形態及び実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【０１２２】
【発明の効果】
上述のように、本発明によれば、アルファベット列及び数字列に対して、各種辞書等により、収集が容易な当該文字列自身の情報、及びコーパス等を作成するコストが必要な当該文字列近辺の文字列情報を利用した統計モデルを用いて、アルファベット読みクラス、数字読みクラスを推定することにより、日本語テキスト中に含まれる英数字列の読み精度を向上させることができる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の原理構成図である。
【図３】本発明の一実施の形態における読み情報決定装置の構成図である。
【図４】本発明の第１の実施の形態における読みクラス判定部の構成図である。
【図５】本発明の第１の実施の形態における読みクラス判定処理動作のフローチャートである。
【図６】本発明の第２の実施の形態における読みクラス判定部の構成図である。
【図７】本発明の一実施例の入力から対象単語抽出までのデータ例である。
【図８】本発明の一実施例の文脈利用型判定の属性例である。
【図９】本発明の一実施例の一括判定の属性例（その１）である。
【図１０】本発明の一実施例の一括判定の属性例（その２）である。
【図１１】本発明の一実施例の一括判定の属性例（その３）である。
【図１２】本発明の一実施例の最終出力する単語情報例である。
【符号の説明】
１テキスト入力部
２形態素解析手段、形態素解析部
３対象単語抽出手段、対象単語抽出部
４読みクラス判定部
５読み付与手段、読み付与部
６単語情報出力手段、単語情報出力部
７単語辞書
８読みクラス判定モデル
４１読みクラス候補抽出部
４２対象単語情報利用型判定手段、対象単語情報利用型判定部
４３文脈利用型判定手段、文脈利用型判定部
４４最終判定手段、最終判定部
４５一括判定部
８１対象単語情報利用型読みクラス判定モデル
８２文脈利用型読みクラス判定モデル
８３一括読みクラス判定モデル
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a reading information determination method, apparatus, and program, and more particularly to a reading information determination method, apparatus, and program that improve the reading accuracy of alphanumeric strings in Japanese text-to-speech synthesis by determining the reading class of each alphanumeric string contained in a Japanese sentence.
[0002]
The invention can also be used in Internet search engines that target Japanese pages: when an alphabetic string entered as a search keyword is transliterated into kana for query expansion, it improves the accuracy of the expanded keywords.
[0003]
[Prior art]
Japanese text-to-speech synthesis assigns readings and prosodic information such as accents and pauses to Japanese text, generates a speech waveform from them, and outputs synthesized speech. To output correct and natural synthesized speech, the readings and prosodic information must be set correctly.
[0004]
Readings and accents can be assigned with high accuracy by using information about each word (word information). Word information is generally obtained by Japanese morphological analysis. Japanese morphological analysis is a mature technology, and many analyzers achieve an accuracy of 99% or higher on text such as newspaper articles. These analyzers generally rely on a word dictionary in which word information is registered (see, for example, Non-Patent Document 1).
[0005]
When an alphabetic string is an unknown word, no reading is assigned to it, so it is either read letter by letter (see, for example, Non-Patent Document 2) or assumed to be an English word and transliterated from English into Japanese (see, for example, Non-Patent Document 3).
There is also a technique (see, for example, Patent Document 1) that decides whether input text should be read as romaji or as English from the ratio of the frequency of alphabetic vowels to the frequency of alphabetic consonants. This method counts the occurrences of the vowels (a, i, u, e, o) and of the consonants; when the vowel frequency divided by the consonant frequency exceeds a predetermined value, the alphabetic strings in the text are read as romaji, and otherwise they are read as English.
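The prior-art ratio test can be illustrated with a few lines of code. This is a sketch under assumptions: the threshold of 1.0 and the function names are hypothetical, since the text only says that the ratio is compared with a predetermined value.

```python
VOWELS = set("aiueo")


def vowel_consonant_ratio(text):
    """Ratio of vowel count to consonant count over ASCII letters in text."""
    letters = [c for c in text.lower() if c.isascii() and c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    if consonants == 0:
        # No consonants: treat an all-vowel text as an unbounded ratio.
        return float("inf") if vowels else 0.0
    return vowels / consonants


def prior_art_reading(text, threshold=1.0):
    """Pick one reading for the whole text, as the prior-art method does."""
    return "romaji" if vowel_consonant_ratio(text) > threshold else "english"
```

For "YOKOHAMATEAHOUSE" the string has 9 vowels and 7 consonants, so the ratio exceeds 1 and the whole string is treated as romaji, even though "TEAHOUSE" is English. This is the failure mode the description goes on to criticize.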
[0006]
For numeric strings, a method has been established that classifies them into types such as integer and decimal (hereinafter called numeric reading classes) and reads them accordingly (see, for example, Non-Patent Document 4).
[0007]
[Patent Document 1]
JP 2000-10579 A
[0008]
[Non-Patent Document 1]
Takeshi Fuchi and two others, "A morphological analysis system considering maintainability", IPSJ Research Report: Natural Language Processing, January 20, 1997, pp.59-66.
[0009]
[Non-Patent Document 2]
Masahiro Miyazaki and 1 other "Language Processing Method for Japanese Speech Output", Transactions of the Information Processing Society of Japan, November 1986, Vol. 27, No. 11, pp.1053-1061.
[0010]
[Non-Patent Document 3]
Shinichiro Takagi and four others, "Communication secretary technology that can confirm e-mail by telephone", NTT Technical Journal, Nippon Telegraph and Telephone Corporation, June 1, 1997, Vol. 9, No. 6, pp.63-68.
[0011]
[Non-Patent Document 4]
Masahiro Miyazaki, "Numeral Reading Rules for Japanese Text-to-Speech Conversion", Transactions of Information Processing Society of Japan, June 1984, Vol. 25, No. 6, pp.1035-1043.
[0012]
[Problems to be solved by the invention]
However, alphabetic strings that appear in Japanese text (including symbols such as apostrophes that form a word together with the letters) are rarely registered in the dictionary and therefore often become unknown words. Numeric strings (including symbols that combine with digits to express a single piece of information, such as decimal points and the parentheses around an area code) may be read differently depending on the surrounding context, and the conventional morphological analysis described above cannot handle this.
[0013]
Moreover, when an alphabetic string that is an unknown word is read letter by letter or transliterated as English, the result is a misreading if the word is actually one that should be read neither letter by letter nor as English (hereinafter, alphabet reading, English word reading, romaji reading, French reading, and so on are called alphabet reading classes).
In addition, the method that chooses between romaji reading and English reading from the ratio of vowel frequency to consonant frequency fixes a single reading for the whole text, even for Japanese text in which English and romaji are mixed, so misreadings can occur. For example, in the sentence 「ＹＯＫＯＨＡＭＡＴＥＡＨＯＵＳＥは、来月１日にオープンします。」 ("YOKOHAMATEAHOUSE will open on the 1st of next month."), the proportion of vowels is high, so romaji reading is selected and "YOKOHAMATEAHOUSE" is given the reading 「ヨコハマテアホウセ」 (yo-ko-ha-ma-te-a-ho-u-se).
[0014]
As for the method of classifying numeric strings into numeric reading classes and reading them accordingly, no technique has been established for correctly estimating the class from the surrounding context.
[0015]
As described above, some kinds of Japanese text contain many alphanumeric strings. For example, on Internet pages that introduce shops, shop names, service names, and product names are often written in the alphabet, and their readings vary widely: some are read letter by letter (example: CD), some as romaji (example: YOKOHAMA), some as English (example: Restaurant), some as French (example: TERRASSE), and some as Italian (example: TRATTORIA). The reading of a numeric string also varies with the surrounding text. For example, the numeric string "611" is read 「ロッピャクジュウイチ」 (roppyaku-juu-ichi) in 「６１１番」 (number 611), but 「ロクイチイチ」 (roku-ichi-ichi) in "A611it" (a product number or the like).
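The two readings of "611" correspond to two different numeric reading classes. The sketch below contrasts a digit-by-digit reading with a place-value integer reading for numbers up to 9999. It is only an illustration: the function names are hypothetical, the sound-change tables cover just the standard irregular hundreds and thousands, and counters, decimals, and larger numbers are left out.

```python
DIGITS = ["ゼロ", "イチ", "ニ", "サン", "ヨン", "ゴ", "ロク", "ナナ", "ハチ", "キュウ"]
# Irregular readings caused by sound changes in standard Japanese.
HUNDREDS = {1: "ヒャク", 3: "サンビャク", 6: "ロッピャク", 8: "ハッピャク"}
THOUSANDS = {1: "セン", 3: "サンゼン", 8: "ハッセン"}


def read_digit_by_digit(digits):
    """Read each digit separately, as for product numbers."""
    return "".join(DIGITS[int(d)] for d in digits)


def read_as_integer(digits):
    """Read a string of digits as an integer between 0 and 9999."""
    n = int(digits)
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0 to 9999")
    if n == 0:
        return DIGITS[0]
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        parts.append(THOUSANDS.get(thousands, DIGITS[thousands] + "セン"))
    if hundreds:
        parts.append(HUNDREDS.get(hundreds, DIGITS[hundreds] + "ヒャク"))
    if tens:
        parts.append("ジュウ" if tens == 1 else DIGITS[tens] + "ジュウ")
    if ones:
        parts.append(DIGITS[ones])
    return "".join(parts)
```

Which of the two functions applies to a given occurrence of "611" is exactly what the reading class determination has to decide from context.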
[0016]
However, many of these alphabetic strings are proper nouns and new ones keep appearing, so registering all of them in the word dictionary for morphological analysis is impossible. Even if only the collectable ones were registered, romaji and various loanwords would have to be included, and the dictionary would become enormous. Numeric strings are unlimited in number, and registering them together with the strings before and after them is unrealistic.
[0017]
For this reason, alphabetic strings require transliteration from the alphabetic string to a reading (kana string). Conventionally, however, transliteration has assumed one particular alphabet reading class, such as English reading or romaji reading, for a given range of text, so a correct reading is not assigned when the actual class differs from the assumed one.
[0018]
For numeric strings, the numeric reading class has been determined from only a few clue characters such as the decimal point, and the reading has been assigned accordingly, so a correct reading is not assigned when the class is misjudged.
[0019]
The present invention has been made in view of the above points. Its object is to provide a reading information determination method, apparatus, and program that, when determining Japanese readings for words consisting of letters or digits, automatically assign readings to strings for which alphabet reading, English reading, and so on cannot be determined uniformly.
[0020]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0021]
  The present invention (claim 1) is a reading information determination method in an apparatus that has morphological analysis means, target word extraction means, reading class candidate extraction means, target-word-information-based determination means, context-based determination means, final determination means, reading assignment means, and word information output means, and that, when text to be processed is input, outputs word information including the reading and part of speech of each word, the method comprising:
  a morphological analysis step in which the morphological analysis means accepts the text to be processed and setting information, input from input means or read from storage means (step 1), and obtains word information by morphologically analyzing the text using a word dictionary that stores information for assigning word information (step 2);
  a target word extraction step (step 3) in which the target word extraction means extracts, from the word information, the target words whose reading class is to be determined, according to the specification of such words input as setting information;
  a reading class candidate extraction step (step 4) in which the reading class candidate extraction means extracts, for each target word, the reading candidates that can be reading class candidates, a reading class indicating the type of reading according to the character types making up the string and their arrangement;
  a target-word-information-based determination step (step 5) in which the target-word-information-based determination means, when the extracted target word is an alphabetic string, performs determination using a target-word-information-based reading class determination model;
  a context-based determination step (step 6) in which the context-based determination means, when the score of the first candidate reading class is below a predetermined confidence threshold or the extracted target word is a numeric string, performs determination using a context-based reading class determination model;
  a final determination step in which the final determination means
  compares the first-candidate scores of the target-word-information-based determination step and the context-based determination step, and takes as the final result the reading class with the larger value between the score of the first candidate from the target-word-information-based determination step and the score of the first candidate from the context-based determination step multiplied by a score weight (where the score weight is a constant) (step 8),
  takes the first candidate as the final result of reading class determination when the score of the first candidate from the target-word-information-based determination step is at or above the predetermined confidence threshold, when the first candidates of the two determination steps are the same, or when the target word is a numeric string (step 7), and
  takes the first candidate of the context-based determination step as the final result of reading class determination when the target word is a numeric string (step 7);
  a reading assignment step (step 9) in which the reading assignment means assigns a reading according to the reading class determined in the final determination step; and
  a word information output step (step 10) in which the word information output means outputs the word information in the output format input as setting information.
[0022]
  In addition, in the present invention (Claim 2), the target word information utilization type reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the first syllable notation, and the last syllable notation, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
  in the target word information utilization type determination step, the target word information utilization type determination means
  inputs, to the target word information utilization type reading class determination model, attributes including at least the number of characters, the first syllable notation, and the last syllable notation of the word, obtained from the word information of the extracted target word, converts the attributes into a vector representation, calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.
[0023]
  In the present invention (Claim 3), the context usage type reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), and the part of speech, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
  in the context usage type determination step, the context usage type determination means
  inputs, to the context usage type reading class determination model, attributes including at least the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), and the part of speech of each word, obtained from the word information of the extracted target word, the M words in front of the target word (M > 0, can be set arbitrarily), and the N words behind it (N > 0, can be set arbitrarily), converts each attribute into a vector representation, calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.
[0024]
  The present invention (Claim 4) is a reading information determination method in a device that has morphological analysis means, target word extraction means, reading class candidate extraction means, batch determination means, reading assignment means, and word information output means, and that takes text to be processed as input and outputs word information including the reading and part of speech of each word, wherein
  the morphological analysis means accepts the text to be processed and the setting information as input and obtains word information by morphologically analyzing the text using a word dictionary that stores the information used to assign word information, as a morphological analysis step,
  the target word extraction means extracts, from the word information, the target words for reading class determination according to the designation of words subject to reading class determination that was input as setting information, as a target word extraction step,
  the reading class candidate extraction means extracts, for each target word, the reading class candidates, that is, the reading classes indicating the type of reading that the word can take given the character types constituting its character string and their arrangement, as a reading class candidate extraction step,
  the batch determination means performs batch determination using the batch reading class determination model and takes the first candidate as the result of reading class determination, as a batch determination step,
  the reading assignment means assigns a reading according to the determined reading class, as a reading assignment step, and
  the word information output means outputs the word information in the output format of word information that was input as setting information, as a word information output step,
  and wherein
  the batch reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least: for alphabet strings only, the number of characters, the first syllable notation, the last syllable notation, and the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.); for numeric strings only, the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"); and for alphabet strings and numeric strings handled together, the number of characters, the first syllable notation, the last syllable notation, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"); and a rank function that selects the first candidate from the output value of the discriminant function, and
  in the batch determination step, the batch determination means
  inputs, to the batch reading class determination model, for the extracted target word, the M words in front of the target word (M > 0, can be set arbitrarily), and the N words behind it (N > 0, can be set arbitrarily), attributes including at least:
  for alphabet strings only, the number of characters, the first syllable notation, the last syllable notation, and the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.),
  for numeric strings only, the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"),
  for alphabet strings and numeric strings handled together, the number of characters, the first syllable notation, the last syllable notation, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"),
  converts each attribute into a vector representation, calculates the discriminant function, inputs the output value of the discriminant function to the rank function of the batch reading class determination model, and outputs the estimated rank of each reading class candidate with a score.
[0026]
FIG. 2 is a principle configuration diagram of the present invention.
[0027]
  The present invention (Claim 5) is a reading information determination device that takes text to be processed as input and outputs word information including the reading and part of speech of each word, comprising:
  A morpheme analysis unit 2 that accepts text to be processed and setting information as input, and obtains word information by morphologically analyzing the text using a word dictionary;
  A target word extraction means 3 for extracting a target word for determining the reading class from the word information by designating a word for determining the reading class input as setting information;
  Reading class candidate extraction means 41 for extracting a reading candidate that can be a reading class candidate indicating the type of reading by using the character type constituting the character string and its arrangement for each target word extracted by the target word extracting means 3;
  When the target word extracted by the target word extraction unit 3 is an alphabet string, the target word information use type determination unit 42 that performs the target word information use type determination using the target word information use type reading class determination model;
  Context usage type determination means 43 for performing context usage type determination using the context usage type reading class determination model when the score of the first candidate reading class is less than a predetermined reliability threshold or the target word is a numeric string;
  Final determination means 44 for comparing the first candidates of the target word information utilization type determination means 42 and the context usage type determination means 43 and, of the score of the first candidate reading class of the target word information utilization type determination means 42 and the value obtained by multiplying the score of the first candidate reading class determined by the context usage type determination means 43 by a score weight (where the score weight is a constant), taking the reading class with the larger value as the final result, and for taking the first candidate as the final result of reading class determination when the score of the first candidate reading class of the target word information utilization type determination means 42 is equal to or higher than the predetermined reliability threshold, when the first candidates of the target word information utilization type determination means 42 and the context usage type determination means 43 are the same, or when the target word is a numeric string;
  Reading imparting means 5 for imparting reading according to the reading class determined by the final determination means 44;
  Word information output means 6 for outputting word information based on the format of the word information to be output input as setting information.
[0028]
  In addition, in the present invention (Claim 6), the target word information utilization type reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the first syllable notation, and the last syllable notation, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
  the target word information utilization type determination means 42 is
  means for inputting, to the target word information utilization type reading class determination model, attributes including at least the number of characters, the first syllable notation, and the last syllable notation of the word, obtained from the word information of the extracted target word, converting the attributes into a vector representation, calculating the discriminant function, inputting the output value of the discriminant function to the rank function, and outputting the estimated rank of each reading class candidate with a score.
[0029]
  In addition, in the present invention (Claim 7), the context usage type reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), and the part of speech, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
  the context usage type determination means 43 is
  means for inputting, to the context usage type reading class determination model, attributes including at least the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), and the part of speech of each word, for the extracted target word, the M words in front of the target word (M > 0, can be set arbitrarily), and the N words behind it (N > 0, can be set arbitrarily), converting each attribute into a vector representation, calculating the discriminant function, inputting the output value of the discriminant function to the rank function, and outputting the estimated rank of each reading class candidate with a score.
[0030]
  The present invention (Claim 8) is a reading information determination device that takes text to be processed as input and outputs word information including the reading and part of speech of each word, comprising:
  A morpheme analyzer that accepts text to be processed and setting information as input, and obtains word information by morphologically analyzing the text using a word dictionary;
  A target word extracting means for extracting a target word for judging the reading class from the word information by specifying a word for judging the reading class inputted as setting information;
  For each target word, a reading class candidate extracting means for extracting a reading candidate that can be a reading class candidate indicating the type of reading with the character types constituting the character string and the arrangement thereof;
  Batch determination means that performs batch determination using the batch reading class determination model and sets the first candidate as a result of reading class determination;
  A reading granting means for giving a reading according to the determined reading class;
  Word information output means for outputting word information in the output format of word information that was input as setting information,
  wherein the batch reading class determination model has
  a predetermined discriminant function having parameters corresponding to attributes including at least: for alphabet strings only, the number of characters, the first syllable notation, the last syllable notation, and the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.); for numeric strings only, the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"); and for alphabet strings and numeric strings handled together, the number of characters, the first syllable notation, the last syllable notation, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"); and a rank function that selects the first candidate from the output value of the discriminant function, and
  the batch determination means is
  means for inputting, to the batch reading class determination model, for the extracted target word, the M words in front of the target word (M > 0, can be set arbitrarily), and the N words behind it (N > 0, can be set arbitrarily), attributes including at least:
  for alphabet strings only, the number of characters, the first syllable notation, the last syllable notation, and the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.),
  for numeric strings only, the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"),
  for alphabet strings and numeric strings handled together, the number of characters, the first syllable notation, the last syllable notation, the character type (alphabet strings are divided into all uppercase, initial uppercase only, all lowercase, etc.), the part of speech, and the numeric type (whether the first character is "0"),
  converting each attribute into a vector representation, calculating the discriminant function, inputting the output value of the discriminant function to the rank function of the batch reading class determination model, and outputting the estimated rank of each reading class candidate with a score.
[0032]
  The present invention (Claim 9) is a program that causes a computer to function as the reading information determination device according to claims 5 to 8.
[0036]
  As described above, the present invention uses statistical models that draw on two kinds of information: information on the character string itself, which can be collected easily from various dictionaries and the like, and information on the character strings in its vicinity, which is costly to obtain because it requires creating a corpus or the like. With these models, the invention determines the reading class candidates, which indicate the type of reading, from the character types constituting the character string and their arrangement, and makes it possible to narrow down the reading class by judging attributes from the relationship with the context of the preceding and following words.
[0037]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0038]
First, an outline of the reading information determination device will be described.
[0039]
FIG. 3 shows the configuration of the reading information determination apparatus in one embodiment of the present invention.
[0040]
The reading information determination apparatus shown in the figure comprises a text input unit 1, a morpheme analysis unit 2, a target word extraction unit 3, a reading class determination unit 4, a reading assignment unit 5, a word information output unit 6, a word dictionary 7, and a reading class determination model 8.
[0041]
The text input unit 1 inputs text and setting information.
[0042]
Here, the text is any text to which word information such as readings is to be added. It is input from the keyboard or stored on a hard disk, in memory, or the like, and is passed to the morphological analysis unit 2.
[0043]
The setting information consists of two items. The first is the designation of the words whose reading class is to be determined (used in the target word string extraction unit 3), given as a condition on the character string constituting the word, that is, a designated sequence of character types (alphabet letters, uppercase letters, lowercase letters, etc.); examples are all alphabet strings, all numeric strings, alphabet strings that are unknown words together with all numeric strings, unknown words, alphabet strings with ambiguous readings, or no determination at all. The second is the format of the word information to be output (for example, output all word information to memory, output only the readings to standard output, or output the notation and readings to a file on the hard disk). This information is input from the keyboard or stored on a hard disk, in memory, or the like. The designation of the character types for reading class determination is passed to the target word string extraction unit 3. The format of the word information to be output is passed to the word information output unit 6.
[0044]
The morphological analysis unit 2 divides the text received from the text input unit 1 into words using the word dictionary 7, which stores word notations in association with part of speech, reading, accent type, and so on, and assigns to each word the word information consisting of notation, part of speech, reading, and accent type. Here, character sequences that are not registered in the word dictionary 7 and therefore become unknown words are grouped by character type, and each group is treated as one word. A numeric string is likewise treated as one word.
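The grouping of unknown words by character type can be sketched as follows. This is only an illustration under stated assumptions: the coarse type names and the helper `group_by_char_type` are hypothetical and are not taken from the disclosure, and a real analyzer would first consult the word dictionary 7.

```python
import itertools
import unicodedata


def char_type(ch: str) -> str:
    """Return a coarse character type (type names are illustrative assumptions)."""
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "digit"
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED"):
        return "kanji"
    return "other"


def group_by_char_type(text: str) -> list[tuple[str, str]]:
    """Treat each maximal run of one character type as a single word."""
    return [
        ("".join(run), ctype)
        for ctype, run in itertools.groupby(text, key=char_type)
    ]
```

For example, `group_by_char_type("SVM611")` yields one alphabet word `SVM` and one numeric word `611`.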
[0045]
The target word string extraction unit 3 extracts, from the word information obtained from the morpheme analysis unit 2, the words designated by the designation of words for reading class determination obtained from the text input unit 1, as the target words for reading class determination.
[0046]
The reading class determination unit 4 determines the reading class for each target word extracted by the target word string extraction unit 3 using the reading class determination model 8. The reading class determined here is added to the word information output by the morphological analysis unit 2. Details of the reading class determination unit 4 and the reading class determination model 8 will be described later.
[0047]
The reading giving unit 5 gives readings to each target word extracted by the target word string extracting unit 3 according to the given reading class.
[0048]
Specifically, a numeric string is assigned a reading according to the determined numeric reading class using, for example, the technique of Masahiro Miyazaki, "Numeric reading rules for Japanese-speech conversion", Journal of Information Processing Society of Japan, June 1984, Vol. 25, No. 6, pp. 1035-1043. That technique classifies Japanese numeral notations into seven forms in order to absorb notational variation, defines rules that give standard phonemes, accents, and pauses to the standard form of each numeral notation, and formulates as rules the phonological changes of numerals and counters and the accent combination that occur when a counter is attached to a numeral. For an alphabet string, a word determined to be alphabet reading is assigned a reading using an alphabet reading correspondence table (for example, A = エー, B = ビー) that maps each letter to its reading; a word determined to be romaji reading is assigned a reading using a romaji reading correspondence table (for example, A = ア, KA = カ) that maps romaji to readings; and a word determined to be read in a particular language, such as English reading or French reading, is assigned a reading for that language using, for example, the method disclosed in Japanese Patent Laid-Open No. 2001-142877. That method transliterates an arbitrary English word into the optimal katakana by searching for the path that maximizes the joint occurrence probability of the English spelling and the katakana, based on a transliteration model created from paired English spelling and katakana data.
[0049]
The reading assigned here overwrites the word information output by the morphological analysis unit 2 (when the word information has ambiguous readings, the reading assigned here is placed first). To assign readings, the reading assignment unit 5 needs a romaji reading correspondence table for romaji reading, an alphabet reading correspondence table for alphabet reading, and, when the above method is used for English reading, French reading, and the like, a transliteration model. These tables and models are therefore provided as a database inside or outside the reading assignment unit 5.
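A table-driven alphabet reading can be sketched as below. The table contents are an assumption for illustration (the common katakana names of the letters), not the correspondence table of the disclosure, and `alphabet_reading` is a hypothetical helper.

```python
# Assumed letter-name table; the actual correspondence table is not given here.
ALPHABET_READINGS = {
    "A": "エー", "B": "ビー", "C": "シー", "D": "ディー", "E": "イー",
    "F": "エフ", "G": "ジー", "H": "エイチ", "I": "アイ", "J": "ジェー",
    "K": "ケー", "L": "エル", "M": "エム", "N": "エヌ", "O": "オー",
    "P": "ピー", "Q": "キュー", "R": "アール", "S": "エス", "T": "ティー",
    "U": "ユー", "V": "ブイ", "W": "ダブリュー", "X": "エックス",
    "Y": "ワイ", "Z": "ゼット",
}


def alphabet_reading(word: str) -> str:
    """Spell a word letter by letter; raises KeyError for non-letters."""
    return "".join(ALPHABET_READINGS[ch] for ch in word.upper())
```

A romaji reading table could be applied the same way, with a longest-match lookup over romaji units instead of single letters.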
[0050]
The word information output unit 6 outputs the word information in the format specified for the specified output destination according to the format of the word information to be output obtained from the text input unit 1.
[0051]
[First Embodiment]
Detailed processing of the reading class determination unit 4 will be described.
[0052]
FIG. 4 shows the configuration of the reading class determination unit in the first embodiment of the present invention. The reading class determination unit 4 shown in the figure includes a reading class candidate extraction unit 41, a target word information utilization type determination unit 42, a context usage type determination unit 43, and a final determination unit 44. The reading class determination model 8 includes a target word information utilization type reading class determination model 81 and a context usage type reading class determination model 82. The target word information utilization type reading class determination model 81 is referred to by the target word information utilization type determination unit 42, and the context usage type reading class determination model 82 is referred to by the context usage type determination unit 43.
[0053]
The reading class candidate extraction unit 41 extracts, from the target reading classes, the reading classes that the target word extracted by the target word string extraction unit 3 can take. For example, a numeric string cannot take classes such as alphabet reading or romaji reading, so these classes are excluded. Likewise, classes such as bar reading and digit reading are excluded for an alphabet string. Furthermore, when an alphabet string contains a character that is not used in romaji (e.g., LEMON) or a character sequence that cannot occur in romaji (e.g., RESTAURANT), the romaji reading class is also excluded.
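The exclusion rules above can be sketched as follows. The class names, the set of classes, and the romaji pattern are assumptions for illustration; the pattern is a rough approximation of romaji spelling (it ignores, for example, doubled "ssh" or "tch" and long-vowel marks) and is not the rule set of the disclosure.

```python
import re

# Illustrative class names only.
ALPHABET_CLASSES = {"alphabet", "romaji", "english"}
NUMERIC_CLASSES = {"bar", "digit"}

# Rough romaji approximation: optional (possibly doubled) consonant,
# optional "y", then a vowel; or a lone vowel; or a syllabic "n".
ROMAJI_WORD = re.compile(
    r"(?:(?:([kgsztdhbpmr])\1?y?|sh|ch|ts|[yfjw])?[aiueo]|n)+",
    re.IGNORECASE,
)


def candidate_classes(word: str) -> set[str]:
    """Return the reading classes a target word can take."""
    if word.isascii() and word.isdigit():
        return set(NUMERIC_CLASSES)      # no alphabet/romaji classes
    if word.isascii() and word.isalpha():
        candidates = set(ALPHABET_CLASSES)  # no bar/digit classes
        if not ROMAJI_WORD.fullmatch(word):
            candidates.discard("romaji")    # cannot be spelled as romaji
        return candidates
    return set()
```

With this pattern, `LEMON` fails because "l" is not a romaji letter, and `RESTAURANT` fails on the sequence "st", so both lose the romaji class.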
[0054]
The target word information utilization type determination unit 42 inputs an attribute obtained from the word information of the target word extracted by the target word string extraction unit 3 to the target word information utilization type reading class determination model 81.
[0055]
Here, only alphabet strings are targeted. This is because the reading class of an alphabet string can often be determined from the information of the target word alone (for example, "beautiful" = English reading, "SVM" = alphabet reading), whereas the reading class of a numeric string, as in the example of "611", cannot be determined from the target word information alone.
[0056]
The target word information utilization type reading class determination model 81 consists of a discriminant function that takes the attributes described below as input, and a rank function that takes the output value of the discriminant function as input and outputs the estimated rank of each reading class candidate with a score. The discriminant function is created in advance by building learning data from a Japanese text corpus (or a dictionary) and training a learner on it, for example, the Support Vector Machine (SVM) extended to multi-value classification by one of the several methods shown in "Yamada Hiroyasu, one other person, "How to apply Support Vector Machine to multivalued classification problems", Information Processing Society of Japan Report: Natural Language Processing, November 20, 2001, pp.33-38". The attributes used include at least the number of characters of the word and the notations of its first and last syllables. The notations of other syllables may also be added to the attributes. A syllable boundary here is a position where a vowel (a, i, u, e, o) is followed by any other character. As the rank function, for example, the rank is determined by the pairwise method shown in the above-mentioned Yamada et al. document, and the relaxation of the distance of the voted class is used as the score.
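The attribute extraction for the target word can be sketched as below, applying the syllable boundary rule (a vowel followed by a non-vowel character) literally. The function and key names are hypothetical, and the conversion of these attributes into a vector is left out.

```python
VOWELS = set("aiueo")


def split_syllables(word: str) -> list[str]:
    """Cut after every vowel that is followed by a non-vowel character."""
    w = word.lower()
    syllables, start = [], 0
    for i in range(len(w) - 1):
        if w[i] in VOWELS and w[i + 1] not in VOWELS:
            syllables.append(w[start:i + 1])
            start = i + 1
    syllables.append(w[start:])  # remainder after the last boundary
    return syllables


def target_word_attributes(word: str) -> dict:
    """Minimum attribute set named in the text: length, first and last syllable."""
    syllables = split_syllables(word)
    return {
        "num_chars": len(word),
        "first_syllable": syllables[0],
        "last_syllable": syllables[-1],
    }
```

For "beautiful" the boundaries fall after "beau", "ti", and "fu", so the first syllable notation is `beau` and the last is `l`; a vowel-less string such as "SVM" forms a single syllable.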
[0057]
The context usage type determination unit 43 inputs the attributes obtained from the word information of the target word extracted by the target word string extraction unit 3 and of its neighboring words to the context usage type reading class determination model 82, and outputs the estimated rank of each reading class candidate with a score.
The context usage type reading class determination model 82 consists of a discriminant function that takes the attributes described below as input, and a rank function that takes the output value of the discriminant function as input and outputs the estimated rank of each reading class candidate with a score. The discriminant function is created in advance by collecting learning data from a Japanese text corpus (or a dictionary) or the like and training the same kind of learner as used for the target word information utilization type reading class determination model 81. The attributes used are, for the target word, the M words in front of it (M > 0, can be set arbitrarily), and the N words behind it (N > 0, can be set arbitrarily), the number of characters, the character type (alphabet strings are divided into all uppercase, initial uppercase only, etc.), and the part of speech.
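The context window attributes can be sketched as follows, assuming each word is given as a (notation, part of speech) pair. The names `letter_case_type` and `context_attributes`, the type labels, and the padding used outside the sentence are all illustrative assumptions; only the case subdivision of alphabet strings is modeled here.

```python
def letter_case_type(s: str) -> str:
    """Subdivide alphabet strings by case; other strings get one label."""
    if not (s.isascii() and s.isalpha()):
        return "non_alphabet"
    if s.isupper():
        return "all_upper"
    if s.islower():
        return "all_lower"
    if s[0].isupper() and s[1:].islower():
        return "first_upper"
    return "mixed"


def context_attributes(words, index, m, n):
    """Attributes of the target word, M words before it and N words after it.

    words: list of (notation, part_of_speech) tuples for one sentence.
    Positions outside the sentence are padded with empty attributes.
    """
    attrs = []
    for pos in range(index - m, index + n + 1):
        if 0 <= pos < len(words):
            notation, part_of_speech = words[pos]
            attrs.append({
                "offset": pos - index,
                "num_chars": len(notation),
                "char_type": letter_case_type(notation),
                "pos": part_of_speech,
            })
        else:
            attrs.append({"offset": pos - index, "num_chars": 0,
                          "char_type": None, "pos": None})
    return attrs
```

Each dictionary would then be converted into a vector representation (for instance by one-hot encoding the categorical values) before the discriminant function is evaluated.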
[0058]
As the rank function, for example, the rank is determined by the pairwise method shown in the above-mentioned Yamada et al. document, and the relaxation of the distance of the voted class is used as the score.
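A generic pairwise voting scheme can be sketched as below. It is not a reproduction of the cited method: the binary decision function is passed in as a hypothetical callable, and the score used here (number of votes, with the summed margin of the won comparisons as a tie-breaker) is an assumption chosen for illustration.

```python
from itertools import combinations


def pairwise_rank(classes, decision):
    """Rank classes by pairwise voting.

    decision(a, b) returns a signed value: positive favours a,
    zero or negative favours b (assumed convention).
    Returns (class, votes, margin) tuples, best first.
    """
    votes = {c: 0 for c in classes}
    margin = {c: 0.0 for c in classes}
    for a, b in combinations(classes, 2):
        d = decision(a, b)
        winner = a if d > 0 else b
        votes[winner] += 1
        margin[winner] += abs(d)   # accumulated distance of won comparisons
    return sorted(
        ((c, votes[c], margin[c]) for c in classes),
        key=lambda item: (item[1], item[2]),
        reverse=True,
    )
```

With K classes this evaluates K(K-1)/2 binary classifiers, which is the usual cost of the pairwise approach.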
[0059]
The final determination unit 44 outputs the reading class finally determined based on the determination results of the target word information usage type determination unit 42 and the context usage type determination unit 43.
[0060]
FIG. 5 is a flowchart of the reading class determination processing operation in the first embodiment of the present invention.
[0061]
Step 101) First, a possible reading class is extracted from the current processing target word.
[0062]
Step 102) It is determined whether the target word is a numeric string. If it is a numeric string, the process proceeds to Step 105. On the other hand, if it is not a numeric string, the process proceeds to step 103.
[0063]
Step 103) If the target word is not a numeric string, the target word information utilization type determination is performed, and the estimated rank of each reading class candidate extracted in Step 101 is output with a score.
[0064]
Step 104) It is determined whether the score of the first reading class candidate output in Step 103 is equal to or higher than the reliability threshold. If it is, the process proceeds to Step 108; if it is less than the reliability threshold, the process proceeds to Step 105. Here, the reliability threshold is a value set empirically in advance.
[0065]
Step 105) When the score of the first reading class candidate is less than the reliability threshold, or when the target word is a numeric string, context usage type determination is performed, and the estimated rank of each reading class candidate is output with a score. Here, the reading class candidates to be determined may be all of the candidates extracted in Step 101, only a certain number of the top-ranked reading classes from Step 103, or only the reading classes whose score in Step 103 is equal to or higher than a certain value (in the latter two cases, if Step 103 was not passed through, all reading class candidates extracted in Step 101 are used).
[0066]
Step 106) It is determined whether Step 103 was performed and, if so, whether the first reading classes determined in Step 103 and Step 105 are the same. If Step 103 was not performed, or if it was performed and its first reading class is the same as that of Step 105, the process proceeds to Step 108. Otherwise, the process proceeds to Step 107.
[0067]
Step 107) Of the score of the first reading class determined in Step 103 and the value "score * score weight" of the first reading class determined in Step 105, the reading class with the larger value is set as the final reading class, and the process is terminated. The score weight is a constant set empirically in advance.
[0068]
Step 108) The first reading class determined in Step 103 or Step 105 (whichever was performed) is set as the final reading class, and the process is terminated.
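Steps 102 to 108 can be sketched as a single function. The two models are passed in as hypothetical callables that return (reading class, score) pairs sorted best first; the function name, the parameter names, and the handling of an exact tie in Step 107 (which the text does not specify) are assumptions.

```python
def decide_reading_class(word, candidates, word_model, context_model,
                         threshold, score_weight):
    """Return the final reading class for one target word (Steps 102-108)."""
    word_top = None
    if not word.isdigit():                          # Step 102
        word_top = word_model(word, candidates)[0]  # Step 103
        if word_top[1] >= threshold:                # Step 104
            return word_top[0]                      # Step 108
    ctx_class, ctx_score = context_model(word, candidates)[0]  # Step 105
    if word_top is None or word_top[0] == ctx_class:           # Step 106
        return ctx_class                                       # Step 108
    # Step 107: compare the word-model score with the weighted context score.
    # On an exact tie this sketch keeps the word-model class (assumption).
    if word_top[1] >= ctx_score * score_weight:
        return word_top[0]
    return ctx_class
```

This sketch always passes every candidate to the context model; narrowing the candidates to the top-ranked classes of Step 103, as Step 105 permits, would be a small extension.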
[Second Embodiment]
FIG. 6 is a configuration diagram of the reading class determination unit in the second embodiment of the present invention. The reading class determination unit 4 shown in the figure has a reading class candidate extraction unit 41 and a batch determination unit 45, and the batch determination unit 45 refers to the batch reading class determination model 83.
[0069]
The reading class candidate extraction unit 41 extracts reading classes that can be taken by the target word extracted by the target word string extraction unit 3 from the reading classes to be output by the collective determination unit 45. This is exactly the same as the first embodiment described above.
[0070]
The collective determination unit 45 inputs the attribute obtained from the target word extracted by the target word string extraction unit 3 and the word information of the adjacent word to the collective reading class determination model 83 to obtain the estimated rank of each reading class candidate. The first reading class is set as the final reading class and is output.
The batch reading class determination model 83 uses, as learning data, a set of attributes and reading classes extracted from a Japanese text corpus (or dictionary) using the learning device used in the target word information utilization type reading class determination model 81. It consists of a discriminant function created in advance and a rank function that inputs an output value of the discriminant function and outputs the estimated rank of each reading class candidate with a score. Here, the batch reading class determination model 83 may combine the alphabet string and the numeric string into one model, or may be divided into two models for the alphabet string and the numeric string.
[0071]
The attributes used are the word attributes of the target word, of the M words preceding it (M > 0, arbitrarily settable), and of the N words following it (N > 0, arbitrarily settable), together with the reading classes of the M preceding words.
[0072]
The word attributes for an alphabet string include at least the number of characters, the first syllable notation, the last syllable notation, and the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on). When the word is not an alphabet string, the first and last syllable notations have no value.
[0073]
The word attributes for a numeric string include at least the number of characters, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), the part of speech, and the numeric type (whether the first character is “0”).
[0074]
The word attributes shared by alphabet strings and numeric strings (when the two are handled together) include at least the number of characters, the first syllable notation, the last syllable notation, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), the part of speech, and the numeric type (whether the first character is “0”).
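As an illustration only, the attribute extraction described in the preceding paragraphs can be sketched as follows. The dictionary keys, the character type labels, and the function names `char_type`, `word_attributes`, and `context_attributes` are assumptions introduced for this sketch and do not appear in the specification; syllable notations are omitted, and the sample words are hypothetical.

```python
def char_type(notation: str) -> str:
    """Coarse character type of a word (label names are illustrative)."""
    if notation.isdigit():
        return "digits"
    if notation.isascii() and notation.isalpha():
        if notation.isupper():
            return "all_upper"
        if notation.islower():
            return "all_lower"
        if notation[0].isupper() and notation[1:].islower():
            return "first_upper"
        return "mixed"
    return "other"


def word_attributes(word: dict) -> dict:
    """Attributes of a single word: length, character type, POS, numeric type."""
    notation = word["notation"]
    return {
        "num_chars": len(notation),
        "char_type": char_type(notation),
        "pos": word.get("pos"),
        # Numeric type: whether the first character is "0".
        "leading_zero": notation.startswith("0"),
    }


def context_attributes(words: list, i: int, m: int = 2, n: int = 2) -> dict:
    """Attributes of the target word, the M preceding and the N following words,
    plus the reading classes already assigned to the M preceding words."""
    feats = {}
    for off in range(-m, n + 1):
        j = i + off
        if 0 <= j < len(words):
            for key, value in word_attributes(words[j]).items():
                feats[f"{off:+d}:{key}"] = value
    for off in range(-m, 0):
        j = i + off
        feats[f"{off:+d}:reading_class"] = (
            words[j].get("reading_class") if j >= 0 else None
        )
    return feats


# Hypothetical word sequence (not the text of FIG. 7).
words = [
    {"notation": "YOKOSUKA", "pos": "noun", "reading_class": "romaji"},
    {"notation": "の", "pos": "particle"},
    {"notation": "10", "pos": "number"},
]
features = context_attributes(words, 2)
```

Words outside the sentence boundary simply contribute no attributes here; a real model would need a fixed convention for such padding.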
[0075]
【Example】
In the following, the embodiment of the present invention will be described with reference to FIGS. 7 to 12, taking the text shown in FIG. 7 as an input example.
[0076]
FIG. 7 shows an example of data from input to extraction of the target word in one embodiment of the present invention, FIG. 8 shows an example of context use type determination attributes in one embodiment of the present invention, FIGS. 9 to 11 show examples of batch determination attributes in one embodiment of the present invention, and FIG. 12 shows an example of word information to be finally output according to an embodiment of the present invention.
[0077]
Here, the input setting information is assumed to be “word for reading class determination = all alphabet strings / all numeric strings, output word format = all word information is output to memory”. In places below, the behavior under other setting information is also described.
[0078]
The text input unit 1 passes “word for reading class determination = all alphabet strings / all numeric strings” to the target word extraction unit 3, and passes “output word format = all word information is output to memory” to the word information output unit 6. It also passes the text to the morphological analysis unit 2.
[0079]
Next, the morphological analysis unit 2 uses the word dictionary 7 to identify words as shown in FIG. 7, and word information including notation, part of speech, reading, character type, and the like is obtained for each word.
[0080]
Next, the target word extraction unit 3 extracts the target words shown in FIG. 7 based on the word information and the designation “word for reading class determination = all alphabet strings / all numeric strings”.
[0081]
Incidentally, when “word for reading class determination = alphabet strings of unknown words” is input as setting information, only “1: YOKOSUKA” and “13: AIR” are extracted as target words.
[0082]
The following describes the case where the first embodiment shown in FIG. 4 is used as the reading class determination unit 4. The description follows the flowchart of FIG. 5 and uses “1: YOKOSUKA” and “4: 10” as examples.
[0083]
Here, the alphabet reading classes handled are alphabet reading, English reading, and Romaji reading, and the number reading classes handled are the integer type, decimal type, fraction type, approximate number type, bar reading type, range type, parallel writing type, and English type (Masahiro Miyazaki, “Numeral Reading Rules for Japanese Text-to-Speech Conversion”, IPSJ Transactions, June 1984, Vol. 25, No. 6, pp. 1035-1043).
[0084]
The target word information utilization type reading class determination model 81 takes the number of characters of the word and the first and last syllable notations as attributes, and is an SVM extended to multi-class classification by the pairwise method.
[0085]
The context-based reading class determination model 82 takes as attributes the number of characters, character type, word notation, first character notation, last character notation, and part of speech of the target word and of the two words before and after it, together with the reading classes of the two preceding words (target words only), and is likewise assumed to be an SVM extended to multi-class classification by the pairwise method.
[0086]
Further, the reliability threshold value in step 104 is set to 1.00, and the score weight in step 107 is set to 1.00.
[0087]
In step 105, when the process has passed through step 103, the reading classes are limited and the determination is made only for the top two reading classes from step 103. The scores in step 103 and step 105 are defined as follows: the score of the first solution is its distance from the second solution, and the score of every other solution is 0.
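The pairwise extension and the scoring convention above can be illustrated with a small sketch. It assumes, as one possible interpretation, that each pair of reading classes has a binary classifier whose signed output is a margin, that candidates are ranked by the number of pairwise wins, and that the “distance from the second solution” is the magnitude of the margin between the first and second solutions. The function name `rank_pairwise` and the toy margins are hypothetical; the specification does not give this level of detail.

```python
from itertools import combinations


def rank_pairwise(candidates, decision):
    """Rank candidates by pairwise voting.

    decision(a, b) returns a signed margin: positive means a beats b.
    Returns a list of (reading_class, score) in rank order, where only the
    first solution has a non-zero score (its margin over the second solution).
    """
    votes = {c: 0 for c in candidates}
    margin = {}
    for a, b in combinations(candidates, 2):
        d = decision(a, b)
        margin[(a, b)] = d
        margin[(b, a)] = -d
        votes[a if d > 0 else b] += 1
    # sorted() is stable, so ties keep the original candidate order.
    order = sorted(candidates, key=lambda c: votes[c], reverse=True)
    ranked = [(c, 0.0) for c in order]
    if len(order) >= 2:
        ranked[0] = (order[0], abs(margin[(order[0], order[1])]))
    return ranked


# Toy margins standing in for trained binary SVMs (values are made up).
strength = {"romaji": 3.0, "english": 1.0, "alphabet": 0.0}
ranked = rank_pairwise(
    ["alphabet", "english", "romaji"],
    lambda a, b: strength[a] - strength[b],
)
```

With these toy margins the Romaji class wins both of its pairwise contests and receives a score of 2.0, while the remaining classes receive 0.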
[0088]
First, the case of “1: YOKOSUKA” is shown.
[0089]
In step 101 of FIG. 5, “YOKOSUKA” is an alphabet string, so all number reading classes are excluded. In addition, whether the spelling is possible as Romaji is checked to decide whether Romaji reading can be a candidate. As a result, there are three reading class candidates: alphabet reading, English reading, and Romaji reading.
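The candidate extraction of step 101 can be sketched as below. This is a minimal illustration under stated assumptions: the class labels are invented names, any word containing a digit is treated as a numeric string, and the Romaji check is a simplified syllable pattern (no geminate consonants or long vowels) rather than the check actually used in the invention.

```python
import re

ALPHABET_CLASSES = ["alphabet", "english", "romaji"]
NUMBER_CLASSES = [
    "integer", "decimal", "fraction", "approximate",
    "bar", "range", "parallel", "english_number",
]

# Simplified Romaji syllables: optional consonant (cluster) + vowel, or a lone "n".
_ROMAJI = re.compile(
    r"(?:(?:ky|gy|sh|ch|ts|ny|hy|by|py|my|ry|[kgsztdnhbpmyrwjf])?[aiueo]|n)+"
)


def can_be_romaji(notation: str) -> bool:
    """True if the whole string can be segmented into simplified Romaji syllables."""
    return _ROMAJI.fullmatch(notation.lower()) is not None


def reading_class_candidates(notation: str) -> list:
    """Reading classes the word can take, judged from its characters alone."""
    if any(ch.isdigit() for ch in notation):
        return list(NUMBER_CLASSES)  # numeric string: drop all alphabet classes
    candidates = list(ALPHABET_CLASSES)  # alphabet string: drop all number classes
    if not can_be_romaji(notation):
        candidates.remove("romaji")
    return candidates
```

Under this pattern “YOKOSUKA” keeps all three alphabet reading classes, whereas a word ending in “R” such as “AIR” loses the Romaji candidate.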
[0090]
Next, in step 102, since “YOKOSUKA” is not a numeric string, the process proceeds to step 103.
[0091]
In step 103, the number of characters of the word = 8, the first syllable notation = YO, and the last syllable notation = KA are extracted as attributes and applied to the target word information utilization type reading class determination model 81, with alphabet reading, English reading, and Romaji reading as the reading class candidates. As a result, the following is obtained:
1st place: Romaji reading, score = 2.54
2nd place: English reading, score = 0
3rd place: Alphabet reading, score = 0
[0092]
In step 104, since the score of the first solution, 2.54, is not less than the reliability threshold of 1.00, the process proceeds to step 108, where the reading class is determined to be Romaji reading, and the process ends.
[0093]
Next, the case of “4: 10” is shown.
[0094]
In step 101, since “10” is a numeric string, all alphabet reading classes are excluded. As a result, the reading class candidates are the integer type, decimal type, fraction type, approximate number type, bar reading type, range type, parallel writing type, and English type.
[0095]
Next, since “10” is a numeric string in step 102, the process proceeds to step 105.
[0096]
The attributes used for determination in step 105 are shown in FIG. 8. These attributes are applied to the context-based reading class determination model 82, with the integer type, decimal type, fraction type, approximate number type, bar reading type, range type, parallel writing type, and English type as the reading class candidates. As a result, the following is obtained:
1st place: English type, score = 0.03
2nd place: integer type, score = 0
3rd place: decimal type, score = 0
(remainder omitted)
[0097]
In step 106, since the determination in step 103 has not been performed, the process proceeds to step 108, where the English type is determined, and the process ends.
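The two walkthroughs above follow the control flow of steps 102 to 108, which can be summarized in a short sketch. The function `decide_reading_class` and its parameters are hypothetical names; the two models are passed in as callables that return a ranked list of (reading class, score) pairs. Resolving an exact tie in step 107 in favor of the step 103 result is an assumption, since the text only speaks of the larger value.

```python
def decide_reading_class(candidates, is_numeric, target_model, context_model,
                         threshold=1.0, weight=1.0):
    """Return the final reading class for one target word.

    target_model(candidates) and context_model(candidates) each return a list
    of (reading_class, score) in rank order (stand-ins for models 81 and 82).
    """
    if is_numeric:                                    # step 102: numeric string
        return context_model(candidates)[0][0]        # steps 105, 106, 108
    ranked_t = target_model(candidates)               # step 103
    first_t, score_t = ranked_t[0]
    if score_t >= threshold:                          # step 104: reliable enough
        return first_t                                # step 108
    top_two = [c for c, _ in ranked_t[:2]]            # step 105: limit to top two
    first_c, score_c = context_model(top_two)[0]
    if first_c == first_t:                            # step 106: same first solution
        return first_t                                # step 108
    # Step 107: compare the step 103 score with "score x score weight" of step 105.
    return first_t if score_t >= score_c * weight else first_c


# "1: YOKOSUKA": score 2.54 >= threshold 1.00, so the context model is never consulted.
yokosuka = decide_reading_class(
    ["alphabet", "english", "romaji"], False,
    lambda c: [("romaji", 2.54), ("english", 0.0), ("alphabet", 0.0)],
    lambda c: [],
)
# "4: 10": a numeric string goes straight to the context-based model.
ten = decide_reading_class(
    ["integer", "decimal", "english_number"], True,
    lambda c: [],
    lambda c: [("english_number", 0.03), ("integer", 0.0), ("decimal", 0.0)],
)
```

The scores 2.54 and 0.03 are those of the examples in the text; everything else in the call is illustrative.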
[0098]
Next, the case where the second embodiment shown in FIG. 6 is used as the reading class determination unit 4 is described, using “4: 10” and “13: AIR” as examples.
[0099]
Here, the alphabet reading classes handled are alphabet reading, English reading, Romaji reading, French reading, and Italian reading, and the number reading classes handled are the integer type, decimal type, fraction type, approximate number type, bar reading type, range type, parallel writing type, and English type (the classification of Masahiro Miyazaki, “Numeral Reading Rules for Japanese Text-to-Speech Conversion”, IPSJ Transactions, June 1984, Vol. 25, No. 6, pp. 1035-1043, with types added).
[0100]
Here, the batch reading class determination model 83 is divided into two models, one for alphabet strings and one for numeric strings. Each model is an SVM extended to multi-class classification by the pairwise method, and takes as attributes the word attributes listed below for the target word and the two words before and after it, together with the reading classes of the two preceding words.
[0101]
The word attributes for the alphabet string model are the number of characters; the first, second, second-to-last, and last syllable notations (no value for words other than alphabet strings); and the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on).
[0102]
The word attributes for the numeric string model are the notation, the number of characters, the numeric type (whether the first character is “0”), the main part of speech, and the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on).
[0103]
The reading class candidate extraction unit 41 of FIG. 6 excludes all alphabet reading classes because “4: 10” is a numeric string. As a result, there are eight reading class candidates: the integer type, decimal type, fraction type, approximate number type, bar reading type, range type, parallel writing type, and English type.
[0104]
Next, the batch determination unit 45 applies the attributes shown in FIG. 9 to the batch reading class determination model 83 for numeric strings, with the above eight types as the reading class candidates. As a result, the following is obtained:
1st place: English type
2nd place: Integer type
(remainder omitted)
The reading class is therefore determined to be the English type, and the process ends.
[0105]
The reading class candidate extraction unit 41 of FIG. 6 excludes all number reading classes because “AIR” is an alphabet string. In addition, since a Romaji spelling cannot end in “R”, Romaji reading is also excluded from the reading class candidates. As a result, there are four reading class candidates: alphabet reading, English reading, Italian reading, and French reading.
[0106]
Next, the batch determination unit 45 applies the attributes shown in FIG. 10 to the batch reading class determination model 83 for alphabet strings, with alphabet reading, English reading, French reading, and Italian reading as the reading class candidates. As a result, the following is obtained:
1st place: English reading
2nd place: Alphabet reading
3rd place: Italian reading
4th place: French reading
The reading class is therefore determined to be English reading, and the process ends.
[0107]
Next, a specific example in which the batch reading class determination model 83 combines alphabet strings and numeric strings into a single model is described, using “1: YOKOSUKA” as an example.
[0108]
This model uses a model in which SVM is expanded to multi-value classification by the pairwise method, and uses the following word attributes for the target word and the two words before and after, and the reading class attribute of the two words ahead.
[0109]
The word attributes are the notation; the number of characters; the first, second, second-to-last, and last syllable notations (no value for words other than alphabet strings); the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on); the part of speech; and the numeric type (whether the first character is “0”).
[0110]
The reading class candidate extraction unit 41 of FIG. 6 excludes all number reading classes because “1: YOKOSUKA” is an alphabet string. As a result, the reading class candidates are alphabet reading, English reading, Romaji reading, French reading, and Italian reading.
[0111]
Next, the batch determination unit 45 applies the attributes shown in FIG. 11 to the batch reading class determination model 83, with the above reading classes as candidates. As a result, the following is obtained:
1st place: Romaji reading
2nd place: English reading
3rd place: Italian reading
4th place: French reading
5th place: Alphabet reading
The reading class is therefore determined to be Romaji reading, and the process ends.
[0112]
In FIG. 3, the reading class determination unit 4 assigns reading classes to all the words extracted by the target word extraction unit 3 as described above (see the reading class in FIG. 12).
[0113]
Next, the reading giving unit 5 gives readings based on the given reading class.
[0114]
For example, since “1: YOKOSUKA” is determined to be Romaji reading, the conversions “YO → ヨ”, “KO → コ”, “SU → ス”, and “KA → カ” are applied, and the reading “ヨコスカ” (Yokosuka) is obtained.
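The syllable-by-syllable conversion for Romaji reading can be sketched with a lookup table and a longest-match scan. The table below is deliberately partial and the function name `romaji_to_katakana` is hypothetical; the specification does not describe how the reading giving unit 5 implements the conversion.

```python
# Partial Romaji-to-katakana table, enough for short examples only.
KANA = {
    "a": "ア", "i": "イ", "u": "ウ", "e": "エ", "o": "オ",
    "ka": "カ", "ki": "キ", "ku": "ク", "ke": "ケ", "ko": "コ",
    "sa": "サ", "shi": "シ", "su": "ス", "se": "セ", "so": "ソ",
    "ta": "タ", "chi": "チ", "tsu": "ツ", "te": "テ", "to": "ト",
    "na": "ナ", "ni": "ニ", "nu": "ヌ", "ne": "ネ", "no": "ノ",
    "ya": "ヤ", "yu": "ユ", "yo": "ヨ", "n": "ン",
}


def romaji_to_katakana(notation: str) -> str:
    """Convert a Romaji spelling to katakana by greedy longest match."""
    s = notation.lower()
    out = []
    i = 0
    while i < len(s):
        for size in (3, 2, 1):
            chunk = s[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts at position i.
            raise ValueError(f"cannot convert {notation!r} at position {i}")
    return "".join(out)


reading = romaji_to_katakana("YOKOSUKA")  # ヨコスカ
```

A word such as “AIR”, which is not a Romaji spelling, raises `ValueError` here, which is consistent with Romaji reading having been excluded for it at the candidate extraction stage.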
[0115]
Since “4: 10” is determined to be the English type, the reading “ten” is obtained from an English reading conversion table prepared in advance.
[0116]
Since “13: AIR” is determined to be English reading, the reading “air” is obtained using, for example, the method of Japanese Patent Laid-Open No. 2001-142877, which was devised for English. When the method of Japanese Patent Laid-Open No. 2001-142877 is used, the national language transliteration model is used.
[0117]
Finally, the word information output unit 6 outputs the word information of FIG. 12 to the memory because the setting information indicates “format of output word = all word information is output to memory”.
[0118]
For example, if the output word information is input to a speech synthesizer, a synthesized speech can be output.
[0119]
The operation of the reading class determination unit in the first and second embodiments may be constructed as a program, installed on a computer used as a reading information determination device, and executed by control means such as a CPU. Further, the word dictionary shown in FIG. 3 may be constructed as a database and stored in storage means, and the other components may likewise be constructed as programs, installed on a computer used as a reading information determination device, and executed by control means such as a CPU.
[0120]
Further, the constructed program may be stored on a hard disk connected to a computer used as a reading information determination device, or on a portable storage medium such as a flexible disk or a CD-ROM, and installed on a computer that implements the present invention.
[0121]
The present invention is not limited to the above-described embodiments and examples, and various modifications and applications are possible within the scope of the claims.
[0122]
【Effect of the Invention】
As described above, according to the present invention, alphabet reading classes and number reading classes are estimated with statistical models that use information on the character string itself, which is easy to collect from various dictionaries and the like, and information on neighboring character strings, which requires the cost of creating a corpus and the like. The reading accuracy of alphanumeric strings contained in Japanese text can thereby be improved.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of a reading information determination device according to an embodiment of the present invention.
FIG. 4 is a configuration diagram of a reading class determination unit according to the first embodiment of the present invention.
FIG. 5 is a flowchart of a reading class determination processing operation according to the first embodiment of the present invention.
FIG. 6 is a configuration diagram of a reading class determination unit according to the second embodiment of the present invention.
FIG. 7 is an example of data from input to target word extraction according to an embodiment of the present invention.
FIG. 8 is an example of context use type determination attributes according to an embodiment of the present invention.
FIG. 9 is an example (part 1) of batch determination attributes according to an embodiment of the present invention.
FIG. 10 is an example (part 2) of batch determination attributes according to an embodiment of the present invention.
FIG. 11 is an example (part 3) of batch determination attributes according to an embodiment of the present invention.
FIG. 12 is an example of word information to be finally output according to an embodiment of the present invention.
[Explanation of symbols]
1 Text input unit
2 Morphological analysis means, morphological analysis unit
3 Target word extraction means, target word extraction unit
4 Reading class determination unit
5 Reading giving means, reading giving unit
6 Word information output means, word information output unit
7 Word dictionary
8 Reading class determination model
41 Reading class candidate extraction unit
42 Target word information utilization type determination means, target word information utilization type determination unit
43 Context usage type determination means, context usage type determination unit
44 Final determination means, final determination unit
45 Batch determination unit
81 Target word information utilization type reading class determination model
82 Context-based reading class determination model
83 Batch reading class determination model

Claims

A reading information determination method in an apparatus that has morphological analysis means, target word extraction means, reading class candidate extraction means, target word information utilization type determination means, context usage type determination means, final determination means, reading giving means, and word information output means, and that receives text to be processed and outputs word information including the reading and part of speech of each word, the method comprising:
a morphological analysis step in which the morphological analysis means accepts the text to be processed and setting information, input from input means or read from storage means, and obtains word information by morphologically analyzing the text using a word dictionary in which information for giving word information is stored;
a target word extraction step in which the target word extraction means extracts, from the word information, target words for reading class determination according to the designation of words for reading class determination input as the setting information;
a reading class candidate extraction step in which the reading class candidate extraction means extracts, for each of the target words, the reading class candidates, each indicating a type of reading, that are possible given the character types constituting the character string and their arrangement;
a target word information utilization type determination step in which the target word information utilization type determination means performs target word information utilization type determination using a target word information utilization type reading class determination model when the extracted target word is an alphabet string;
a context usage type determination step in which the context usage type determination means performs context usage type determination using a context-based reading class determination model when the score of the first candidate of the reading class is less than a predetermined reliability threshold or the extracted target word is a numeric string;
a final determination step in which the final determination means
compares the scores of the first candidates of the target word information utilization type determination step and the context usage type determination step and, of the score of the first candidate reading class of the target word information utilization type determination step and the value obtained by multiplying the score of the first candidate reading class determined in the context usage type determination step by a score weight (where the score weight is a constant), sets the reading class with the larger value as the final result,
sets the first candidate as the final result of reading class determination when the score of the first candidate reading class of the target word information utilization type determination step is equal to or greater than the predetermined reliability threshold, when the first candidates of the target word information utilization type determination step and the context usage type determination step are identical, or when the target word is a numeric string, and
sets the first candidate of the context usage type determination step as the final result of reading class determination when the target word is a numeric string;
a reading giving step in which the reading giving means gives a reading according to the reading class determined in the final determination step; and
a word information output step in which the word information output means outputs word information based on the format of the word information to be output, input as the setting information.

The reading information determination method according to claim 1, wherein
the target word information utilization type reading class determination model has a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the first syllable notation, and the end syllable notation, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
in the target word information utilization type determination step, the target word information utilization type determination means
inputs, to the target word information utilization type reading class determination model, attributes including at least the number of characters, the first syllable notation, and the end syllable notation of the word obtained from the word information of the extracted target word, converts each attribute into a vector representation and calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.

The reading information determination method according to claim 1, wherein
the context-based reading class determination model has a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), and the part of speech, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
in the context usage type determination step, the context usage type determination means
inputs, to the context-based reading class determination model, attributes including at least the number of characters, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), and the part of speech of each word, obtained from the word information of the extracted target word, the M words preceding the target word (M > 0, arbitrarily settable), and the N words following it (N > 0, arbitrarily settable), converts each attribute into a vector representation and calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.

A reading information determination method in an apparatus that has morphological analysis means, target word extraction means, reading class candidate extraction means, batch determination means, reading giving means, and word information output means, and that receives text to be processed and outputs word information including the reading and part of speech of each word, the method comprising:
a morphological analysis step in which the morphological analysis means accepts the text to be processed and setting information as input, and obtains word information by morphologically analyzing the text using a word dictionary in which information for giving word information is stored;
a target word extraction step in which the target word extraction means extracts, from the word information, target words for reading class determination according to the designation of words for reading class determination input as the setting information;
a reading class candidate extraction step in which the reading class candidate extraction means extracts, for each target word, the reading class candidates, each indicating a type of reading, that are possible given the character types constituting the character string and their arrangement;
a batch determination step in which the batch determination means performs batch determination using a batch reading class determination model and sets the first candidate as the result of reading class determination;
a reading giving step in which the reading giving means gives a reading according to the determined reading class; and
a word information output step in which the word information output means outputs word information based on the format of the word information to be output, input as the setting information,
wherein
the batch reading class determination model has
a discriminant function having parameters corresponding to at least: for alphabet strings only, attributes including the number of characters, first syllable notation, end syllable notation, and character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on); for numeric strings only, attributes including the number of characters, character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), part of speech, and numeric type (whether the first character is “0”); and for alphabet strings and numeric strings in common, attributes including the number of characters, first syllable notation, end syllable notation, character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), part of speech, and numeric type (whether the first character is “0”); and a rank function for selecting the first candidate from the output value of the discriminant function, and
in the batch determination step, the batch determination means
inputs, to the batch reading class determination model, from the word information of the extracted target word, the M words preceding the target word (M > 0, arbitrarily settable), and the N words following it (N > 0, arbitrarily settable), at least:
for alphabet strings only, attributes including the number of characters, first syllable notation, end syllable notation, and character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on);
for numeric strings only, attributes including the number of characters, character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), part of speech, and numeric type (whether the first character is “0”); and
for alphabet strings and numeric strings in common, attributes including the number of characters, first syllable notation, end syllable notation, character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), part of speech, and numeric type (whether the first character is “0”);
converts each attribute into a vector representation and calculates the discriminant function, inputs the output value of the discriminant function to the rank function of the batch reading class determination model, and outputs the estimated rank of each reading class candidate with a score.

A reading information determination device that receives text to be processed and outputs word information including the reading and part of speech of each word, comprising:
Morphological analysis means for receiving the text to be processed and setting information as input, and obtaining word information by morphological analysis of the text using a word dictionary;
A target word extracting means for extracting a target word for determining the reading class from the word information by specifying a word for determining the reading class input as the setting information;
Reading class candidate extraction means for extracting a reading candidate that can be a reading class candidate indicating the type of reading with the character type constituting the character string and its arrangement for each target word extracted by the target word extracting means;
When the target word extracted by the target word extraction means is an alphabet string, target word information use type determination means for performing target word information use type determination using a target word information use type reading class determination model;
A context usage type determination means for performing context usage type determination using a context usage type reading class determination model when the score of the first candidate of the reading class is less than a predetermined reliability threshold or the target word is a numeric string; ,
Final determination means that compares the scores of the first candidates of the target word information utilization type determination means and the context usage type determination means and, of the score of the first candidate reading class of the target word information utilization type determination means and the value obtained by multiplying the score of the first candidate reading class determined by the context usage type determination means by a score weight (where the score weight is a constant), sets the reading class with the larger value as the final result, and that sets the first candidate as the final result of reading class determination when the score of the first candidate reading class of the target word information utilization type determination means is equal to or greater than a predetermined reliability threshold, when the first candidates of the target word information utilization type determination means and the context usage type determination means are identical, or when the target word is a numeric string;
A reading imparting means for imparting a reading according to the reading class determined by the final determining means;
A reading information determination apparatus comprising: word information output means for outputting word information based on a format of word information to be output input as the setting information.

The reading information determination device according to claim 5, wherein
the target word information utilization type reading class determination model has a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the first syllable notation, and the end syllable notation, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
the target word information utilization type determination means includes
means that inputs, to the target word information utilization type reading class determination model, attributes including at least the number of characters, the first syllable notation, and the end syllable notation of the word obtained from the word information of the extracted target word, converts each attribute into a vector representation and calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.

The reading information determination device according to claim 5, wherein
the context-based reading class determination model has a predetermined discriminant function having parameters corresponding to attributes including at least the number of characters of the word, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), and the part of speech, and a predetermined rank function that determines the rank and score of each reading class from the output value of the discriminant function, and
the context usage type determination means includes
means that inputs, to the context-based reading class determination model, attributes including at least the number of characters, the character type (an alphabet string is classified as all uppercase, initial uppercase, all lowercase, and so on), and the part of speech of each word, obtained from the word information of the extracted target word, the M words preceding the target word (M > 0, arbitrarily settable), and the N words following it (N > 0, arbitrarily settable), converts each attribute into a vector representation and calculates the discriminant function, inputs the output value of the discriminant function to the rank function, and outputs the estimated rank of each reading class candidate with a score.

A reading information determination device that takes text to be processed as input and outputs word information including the reading and part of speech of each word, comprising:
Morphological analysis means that accepts the text to be processed and setting information as input, and obtains word information by morphologically analyzing the text using a word dictionary;
Target word extraction means that extracts, from the word information, the target words whose reading class is to be determined, according to the specification of such words input as the setting information;
Reading class candidate extraction means that, for each target word, extracts the candidate reading classes, each indicating a type of reading, that are possible given the character types constituting the character string and their arrangement;
Batch determination means that performs determination using a batch reading class determination model and sets the first candidate as the reading class determination result;
Reading assignment means that assigns a reading according to the determined reading class;
Word information output means that outputs the word information in the output format input as the setting information,
The batch reading class determination model has:
A discriminant function having parameters corresponding to attributes including at least: for alphabet strings only, the number of characters, first syllable notation, last syllable notation, and character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.); for numeric strings only, the number of characters, character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.), part of speech, and numeric type (whether the first character is "0"); and, shared by alphabet strings and numeric strings, the number of characters, first syllable notation, last syllable notation, character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.), part of speech, and numeric type (whether the first character is "0"); and a ranking function for selecting the first candidate from the output value of the discriminant function,
The batch determination means includes:
Means for inputting, to the batch reading class determination model, attributes obtained from the word information of the extracted target word, the M words preceding the target word (M > 0, arbitrarily settable), and the N words following it (N > 0, arbitrarily settable), including at least:
For alphabet strings only, attributes including the number of characters, first syllable notation, last syllable notation, and character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.),
For numeric strings only, attributes including the number of characters, character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.), part of speech, and numeric type (whether the first character is "0"),
Shared by alphabet strings and numeric strings, attributes including the number of characters, first syllable notation, last syllable notation, character type (an alphabet string is classified as all uppercase, initial uppercase only, all lowercase, etc.), part of speech, and numeric type (whether the first character is "0");
Converting each attribute into a vector representation and calculating the discriminant function; inputting the output value of the discriminant function to the ranking function of the batch reading class determination model; and outputting an estimated rank of each reading class candidate with a score.
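
The three attribute groups of the batch model (alphabet only, numeric only, and shared) can be pictured as one sparse vector whose feature names carry a group prefix, so that a single discriminant function serves both kinds of string. The sketch below is limited to the target word and omits the M preceding and N following words; the group prefixes, reading class names, weights, and the two-letter stand-in for syllable notation are all hypothetical.

```python
from typing import Dict, List, Tuple

ALPHA_ONLY = ("len", "first", "last", "type")
NUMERIC_ONLY = ("len", "type", "pos", "lead0")
SHARED = ("len", "first", "last", "type", "pos", "lead0")

def base_attributes(surface: str, pos: str) -> Dict[str, str]:
    """Raw attribute values before grouping."""
    if surface.isdigit():
        ctype = "digit"
    elif surface.isupper():
        ctype = "all_upper"
    elif surface.islower():
        ctype = "all_lower"
    else:
        ctype = "mixed"
    return {
        "len": str(len(surface)),
        "first": surface[:2].lower(),   # assumed stand-in for first syllable notation
        "last": surface[-2:].lower(),   # assumed stand-in for last syllable notation
        "type": ctype,
        "pos": pos,
        "lead0": str(surface.startswith("0")),  # numeric type: first character is "0"
    }

def batch_vector(surface: str, pos: str) -> Dict[str, float]:
    """One-hot vector with shared features plus the group matching the string kind."""
    base = base_attributes(surface, pos)
    groups = [("shared", SHARED)]
    if surface.isdigit():
        groups.append(("num", NUMERIC_ONLY))
    elif surface.isalpha():
        groups.append(("alpha", ALPHA_ONLY))
    return {f"{g}:{k}={base[k]}": 1.0 for g, keys in groups for k in keys}

def first_candidate(vec: Dict[str, float],
                    models: Dict[str, Dict[str, float]],
                    candidates: List[str]) -> Tuple[str, float]:
    """Score each candidate with a linear discriminant and return the top one."""
    scores = {c: sum(models[c].get(k, 0.0) * v for k, v in vec.items())
              for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores[best]

# Hypothetical weights for two numeric reading classes.
models = {
    "digit_by_digit": {"num:lead0=True": 2.0},
    "cardinal": {"num:lead0=False": 2.0, "shared:type=digit": 0.5},
}
result = first_candidate(batch_vector("0120", "number"), models,
                         ["digit_by_digit", "cardinal"])
```

With these illustrative weights, a numeric string beginning with "0" is steered toward the digit-by-digit class, which is the kind of distinction the numeric type attribute is meant to capture.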

A program that causes a computer to function as the reading information determination device according to claim 5.