JP3932912B2

JP3932912B2 - Character string shaping device, method and program

Info

Publication number: JP3932912B2
Application number: JP2002019038A
Authority: JP
Inventors: 明男山下
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-01-28
Filing date: 2002-01-28
Publication date: 2007-06-20
Anticipated expiration: 2022-01-28
Also published as: JP2003223441A

Description

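[0001]
[Technical Field of the Invention]
The present invention relates to a technique for shaping text. In particular, it relates to a technique that removes erroneous characters from the portions of a text subject to language processing that contain duplicated characters, restores the text to its normal form, and then performs language processing such as search and summarization.
[0002]
[Prior Art]
As computer performance has improved, a variety of language processing techniques have been proposed. Many of them rely on morphological analysis, in which the text to be processed is segmented into words and analyzed. When the text contains erroneous characters, however, those characters are treated as unregistered words during morphological analysis and a correct result cannot be obtained. For example, if character decoration such as shadowed text was applied in the original application to a passage the author wanted to emphasize, and the document is then converted to another data format and its text extracted, the decorated characters are sometimes extracted in duplicate. As one example, when boldface is used in Word 2000 (registered trademark) from Microsoft Corporation of the United States to emphasize a particular word such as "あいう", and the document is converted to the PDF (Portable Document Format) format of Adobe Systems Incorporated of the United States, the text extracted from the PDF file with the text selection tool of Acrobat Reader may contain runs of repeated characters such as "ああいいうう". Techniques proposed for correcting erroneous characters include spell checkers and the simple approach of replacing each run of consecutive identical characters with a single character.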
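[0003]
[Problems to Be Solved by the Invention]
A spell checker, however, has difficulty even detecting such duplicated characters as misspellings, let alone suggesting corrections. And if every run of identical characters is simply replaced with one character, characters that belong to the word itself may be deleted; for example, the word "ふたたび" becomes "ふたび". Passages decorated with shadowed text or the like are the ones the author considers important, yet with the conventional techniques the author's intent may not be adequately reflected in the result of language processing.
[0004]
The present invention was made to solve the above problems. Its object is to remove duplicated characters that appear to be errors, shape the text into correct words, and then make it the target of language processing.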
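[0005]
[Means for Solving the Problems]
To achieve the above object, the program according to a first feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the character string received by the data receiving means into a plurality of character strings based on word information and part-of-speech information; single-character determination means for determining whether a first character string among the segmented character strings consists of one character; same-character determination means for determining, when the single-character determination means determines that the first character string is one character, whether the first character string is identical to the first character of the character string segmented immediately after it; character shaping means for shaping the identical characters when the same-character determination means determines that they are identical; part-of-speech determination means for determining the part-of-speech information of the first character string when the single-character determination means determines that the first character string is not one character; and duplicate character removal means for merging the duplicated identical characters in the first character string into one when the part-of-speech determination means determines that the first character string is not registered as part-of-speech information.
[0006]
The program according to a second feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the received character string into a plurality of character strings based on word information and part-of-speech information; single-character determination means for determining whether a first character string among the segmented character strings consists of one character; same-character determination means for determining, when the first character string is determined to be one character, whether the first character string is identical to the first character of the character string segmented immediately after it; and character shaping means for shaping the identical characters when the same-character determination means determines that they are identical.
[0007]
The program according to a third feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the received character string into a plurality of character strings based on word information and part-of-speech information; part-of-speech determination means for determining the part-of-speech information of a first character string among the segmented character strings; and duplicate character removal means for merging the duplicated identical characters in the first character string into one when the part-of-speech determination means determines that the first character string is not registered as part-of-speech information.
[0008]
The program according to a fourth feature of the present invention includes shaped character string output means for outputting the character string shaped by the character shaping means, and deduplicated character string output means for outputting the character string from which the duplicate character removal means has removed the duplicate characters.
[0009]
The apparatus and the method of the present invention have the same features as the program described above. That is, data of a character string in which identical characters appear in duplicate is received, and the character string is segmented into a plurality of character strings based on word information and part-of-speech information. It is then determined whether a first character string among the segmented character strings is one character; if so, it is determined whether the first character string is identical to the first character of the character string segmented immediately after it; and if they are identical, the identical characters are shaped. In addition, the part-of-speech information of the first character string is determined, and if the first character string is not registered as part-of-speech information, the duplicated identical characters in it are merged into one. The content after the identical characters have been shaped and the content after the duplicated identical characters have been merged are output.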
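[0010]
[Embodiments of the Invention]
The present invention is described concretely on the basis of one embodiment. In this embodiment, duplicate characters are removed using the result of morphological analysis. The embodiment considers the character string "「武士ゼロフウウCadrCdr（カダアクダア）」を開発、4月13日より発売" in Word 2000 (registered trademark) from Microsoft Corporation of the United States, in which the portion "「武士ゼロフウウCadrCdr(カダアクダア)」を開発" is set in shadowed text, and describes how duplicate characters are removed when the character string "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" is extracted from the document after conversion to another file format.
[0011]
FIG. 1 is a functional block diagram showing the schematic configuration of a language processing system according to the present invention. The system comprises sentence storage means 1, word storage means 2, connection relation storage means 3, sentence analysis means 4, analysis result storage means 5, duplicate character removal means 6, and shaped character string storage means 7.
[0012]
Sentence storage means 1 stores the Japanese text to be processed. Word storage means 2 is a Japanese dictionary in which words and their attribute information are registered; the attribute information of each word includes part-of-speech information and the like. Connection relation storage means 3 stores connection information indicating whether two words may be connected. Sentence analysis means 4 searches word storage means 2 and connection relation storage means 3 and analyzes the text stored in sentence storage means 1 word by word. Analysis result storage means 5 stores the word-by-word result produced by sentence analysis means 4. Duplicate character removal means 6 examines the result in analysis result storage means 5 and stores the shaped character string in shaped character string storage means 7, which holds the shaped character string.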
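[0013]
The processing performed by this language processing system when identical characters appear in duplicate in the input sentence is described with reference to the flowchart of FIG. 2 and to FIGS. 3 to 7. In FIG. 2, sentence analysis means 4 attempts to retrieve text from sentence storage means 1 (step 101) and determines whether any text exists (step 102). If there is none, processing ends; if there is, one sentence is read from it (step 103). In the processing that follows, this whole sentence is the unit of processing. As a concrete example, assume that the sentence "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" has been read.
[0014]
Sentence analysis means 4 looks up words in the input sentence from beginning to end using the Japanese dictionary in word storage means 2 (step 105). FIG. 3 shows an example of the contents of word storage means 2. The lookup retrieves the information for the entry corresponding to each word (headword) of the input sentence. Next, the connection relations between words are checked on the basis of the part-of-speech information of the retrieved words and the connection information in connection relation storage means 3 (step 106). FIG. 4 shows an example of the contents of connection relation storage means 3. In FIG. 4, the connection information is a value of 1 for a combination that can be connected and a value of 0 for a combination that cannot.
[0015]
Using this information, sentence analysis means 4 analyzes the sentence by the minimum phrase count method (step 107). The minimum phrase count method is a sentence analysis method that outputs, among the connectable word sequences, the one with the fewest phrases (bunsetsu). The longest-match method or the minimum-cost method may be used instead. As a result, the word sequence "「／「／武／武士／士／ゼゼロロフフウウウウ／CCaaddrrCCddrr／（／（／カカダダアアククダダアア／）／）／」／」／を／を／開／開発／発" is segmented. Sentence analysis means 4 then stores the surface form, part of speech, and position information of each segmented word in analysis result storage means 5 (step 108). FIG. 5 shows the information stored in analysis result storage means 5. As FIG. 5 shows, words that are registered in the dictionary are not segmented correctly.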
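[0016]
Next, duplicate character removal means 6 removes the duplicate characters (step 200). FIG. 6 shows the flowchart of the duplicate character removal of step 200. Three variables are used: i holds the word number in analysis result storage means 5, end holds the position of the end of the input sentence, and len holds the number of characters output to shaped character string storage means 7.
[0017]
Duplicate character removal means 6 sets i and len to 0 (step 201) and determines whether len has reached or exceeded end (step 202). If the result is Yes, duplicate removal ends (return to step 300 in FIG. 2). If the result is No, processing proceeds to step 210. In step 210, duplicate character removal means 6 determines whether the surface form of the i-th word is a single character. If Yes, processing proceeds to step 220; if No, to step 230.
[0018]
In step 220, duplicate character removal means 6 determines whether the surface form of the i-th word is the same as the first character of the surface form of the (i+1)-th word. If Yes, processing proceeds to step 240; if No, to step 270.
[0019]
In step 240, duplicate character removal means 6 stores the surface form of the (i+1)-th word in shaped character string storage means 7. In step 241, it adds the length of the surface form of the (i+1)-th word to len, adds 1 to i, and proceeds to step 270.
[0020]
In step 230, duplicate character removal means 6 determines whether the part of speech of the i-th word is "unregistered word". If Yes, processing proceeds to step 250; if No, to step 260. In step 250, duplicate character removal means 6 merges consecutive identical characters (duplicate characters) in the surface form of the i-th word into one. For example, "ゼゼロロフフウウウウ" is reduced to "ゼロフウウ". In step 251, the character string with the duplicate characters removed is stored in shaped character string storage means 7. In step 252, the length of that character string is added to len, and processing proceeds to step 270.
[0021]
In step 260, duplicate character removal means 6 stores the surface form of the i-th word in shaped character string storage means 7. In step 261, it adds the length of the surface form of the i-th morpheme to len and proceeds to step 270. In step 270, duplicate character removal means 6 adds 1 to i and returns to step 202.
[0022]
Once duplicate characters have been removed in this way, sentence analysis means 4 outputs the content of shaped character string storage means 7 to a storage means or storage device (not shown) in step 300 of FIG. 2 and then returns to step 102.
[0023]
In this embodiment, step 200 yields "「武士ゼロフウウCadrCdr（カダアクダア）」を開発", which is stored in shaped character string storage means 7.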
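[0024]
Next, a second embodiment of the present invention is described concretely. In the second embodiment, duplicate characters are removed using the result of morphological analysis and the text is then reanalyzed. As before, the embodiment considers the character string "「武士ゼロフウウCadrCdr（カダアクダア）」を開発、4月13日より発売" in Word 2000 (registered trademark) from Microsoft Corporation of the United States, in which the portion "「武士ゼロフウウCadrCdr(カダアクダア)」を開発" is set in shadowed text, and describes the reanalysis performed after duplicate characters are removed when the character string "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" is extracted from the document after conversion to another file format.
[0025]
As shown in FIG. 9, the second embodiment comprises sentence storage means 91, word storage means 92, connection relation storage means 93, sentence analysis means 94, analysis result storage means 95, duplicate character removal means 96, and reanalysis character string storage means 97. Sentence storage means 91, word storage means 92, connection relation storage means 93, sentence analysis means 94, and analysis result storage means 95 correspond respectively to sentence storage means 1, word storage means 2, connection relation storage means 3, sentence analysis means 4, and analysis result storage means 5 of FIG. 1 in the first embodiment. The second embodiment differs from the first in duplicate character removal means 96 and reanalysis character string storage means 97. Duplicate character removal means 96 examines the result in analysis result storage means 95 and stores the character string to be reanalyzed in reanalysis character string storage means 97. Reanalysis character string storage means 97 holds the character string that sentence analysis means 94 is to reanalyze.
[0026]
Next, the processing performed by this language processing system when identical characters appear in duplicate in the input sentence is described with reference to the flowcharts of FIGS. 8 and 10 and to FIGS. 3, 4, 5, and 7. In FIG. 8, sentence analysis means 94 attempts to retrieve text from sentence storage means 91 (step 8101) and determines whether any text exists (step 8102). If there is none, processing ends; if there is, one sentence is read from it (step 8103). In the processing that follows, this whole sentence is the unit of processing. As a concrete example, assume that the sentence "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" has been read.
[0027]
Sentence analysis means 94 sets the reanalysis flag to Off (step 8104). It then looks up words in the input sentence from beginning to end using the Japanese dictionary in word storage means 92 (step 8105). FIG. 3 shows an example of the contents of word storage means 92. The lookup retrieves the information for the entry corresponding to each word (headword) of the input sentence. Next, the connection relations between words are checked on the basis of the part-of-speech information of the retrieved words and the connection information in connection relation storage means 93 (step 8106). FIG. 4 shows an example of the contents of connection relation storage means 93. In FIG. 4, the connection information is a value of 1 for a combination that can be connected and a value of 0 for a combination that cannot.
[0028]
Using this information, sentence analysis means 94 analyzes the sentence by the minimum phrase count method, which outputs, among the connectable word sequences, the one with the fewest phrases (step 8107). The longest-match method or the minimum-cost method may be used instead. As a result, the word sequence "「／「／武／武士／士／ゼゼロロフフウウウウ／CCaaddrrCCddrr／（／（／カカダダアアククダダアア／）／）／」／」／を／を／開／開発／発" is segmented. Sentence analysis means 94 then stores the surface form, part of speech, and position information of each segmented word in analysis result storage means 95 (step 8108). FIG. 5 shows the information stored in analysis result storage means 95. As FIG. 5 shows, words that are registered in the dictionary are not segmented correctly.
[0029]
Next, sentence analysis means 94 determines whether the reanalysis flag is Off (step 8109). If the flag is On, the result is No and step 8113 is executed. If the flag is Off, the result is Yes and step 8200 is executed.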
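[0030]
FIG. 10 shows the flowchart of the duplicate character removal of step 8200. Three variables are used: i holds the word number in analysis result storage means 95, end holds the position of the end of the input sentence, and len holds the number of characters output to reanalysis character string storage means 97.
[0031]
Duplicate character removal means 96 sets i and len to 0 (step 1201) and determines whether len has reached or exceeded end (step 1202). If the result is Yes, duplicate removal ends (return to step 8110 in FIG. 8). If the result is No, processing proceeds to step 1210. In step 1210, duplicate character removal means 96 determines whether the surface form of the i-th word is a single character. If Yes, processing proceeds to step 1220; if No, to step 1230.
[0032]
In step 1220, duplicate character removal means 96 determines whether the surface form of the i-th word is the same as the first character of the surface form of the (i+1)-th word. If Yes, processing proceeds to step 1240; if No, to step 1270.
[0033]
In step 1240, duplicate character removal means 96 stores the surface form of the (i+1)-th word in reanalysis character string storage means 97. In step 1241, it adds the length of the surface form of the (i+1)-th word to len, adds 1 to i, and proceeds to step 1270.
[0034]
In step 1230, duplicate character removal means 96 determines whether the part of speech of the i-th word is "unregistered word". If Yes, processing proceeds to step 1250; if No, to step 1260. In step 1250, duplicate character removal means 96 merges consecutive identical characters (duplicate characters) in the surface form of the i-th word into one. For example, "ゼゼロロフフウウウウ" is reduced to "ゼロフウウ". In step 1251, the character string with the duplicate characters removed is stored in reanalysis character string storage means 97. In step 1252, the length of that character string is added to len, and processing proceeds to step 1270.
[0035]
In step 1260, duplicate character removal means 96 stores the surface form of the i-th word in reanalysis character string storage means 97. In step 1261, it adds the length of the surface form of the i-th morpheme to len and proceeds to step 1270. In step 1270, duplicate character removal means 96 adds 1 to i and returns to step 1202.
[0036]
Returning to FIG. 8, sentence analysis means 94 next compares the content of reanalysis character string storage means 97 with the input sentence (step 8110). If the two match, no reanalysis is needed and step 8113 is executed. If they do not match, the result is No: sentence analysis means 94 sets the reanalysis flag to On (step 8111), makes the reanalysis character string the new input sentence, and clears the content of analysis result storage means 95 (step 8112). Processing then returns to step 8105 and the input character string is reanalyzed. Steps 8105, 8106, 8107, and 8108 are executed, and this time the result of step 8109 is No, so sentence analysis means 94 executes step 8113.
[0037]
In step 8113, sentence analysis means 94 outputs the content of analysis result storage means 95 to a storage means or storage device (not shown) and returns to step 8102.
[0038]
For the input used in this embodiment, step 8200 leaves "「武士ゼロフウウCadrCdr（カダアクダア）」を開発" in reanalysis character string storage means 97, so the result of step 8110 is No. After steps 8111, 8112, 8105, 8106, 8107, and 8108, analysis result storage means 95 holds the result shown in FIG. 7.
[0039]
Compared with the first embodiment, the second embodiment compares the input sentence with the reanalysis character string after duplicate characters have been removed and reanalyzes the input when the two differ. Because the corrected text is segmented again, dictionary words that were split by the duplicated characters can be recognized, which gives a more accurate shaping result.
[0040]
The present invention is not limited to the embodiments above. For example, in step 8113 of FIG. 8, sentence analysis means 94 may be configured to output the character string converted by the duplicate character removal means together with the analysis result.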
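[0041]
[Effects of the Invention]
Because consecutive characters are replaced on the basis of the morphological analysis result, consecutive characters can be removed without mistakenly deleting characters that belong to a word. As a result, text containing consecutive duplicated characters can be corrected, and a morphological analysis result based on the corrected text can be obtained.
[Brief Description of the Drawings]
[FIG. 1] A functional block diagram showing the schematic configuration of the language processing system in the first embodiment.
[FIG. 2] A flowchart showing the processing procedure of the language processing system in the first embodiment.
[FIG. 3] An explanatory diagram showing the contents of the word storage means.
[FIG. 4] An explanatory diagram showing the contents of the connection relation storage means.
[FIG. 5] An explanatory diagram showing the information stored in the analysis result storage means after the first analysis.
[FIG. 6] A flowchart showing the duplicate character removal procedure in the first embodiment.
[FIG. 7] An explanatory diagram showing the information stored in the analysis result storage means after reanalysis.
[FIG. 8] A flowchart showing the processing procedure of the language processing system in the second embodiment.
[FIG. 9] A functional block diagram showing the schematic configuration of the language processing system in the second embodiment.
[FIG. 10] A flowchart showing the duplicate character removal procedure in the second embodiment.
[Explanation of Reference Signs]
1 Sentence storage means
2 Word storage means
3 Connection relation storage means
4 Sentence analysis means
5 Analysis result storage means
6 Duplicate character removal means
7 Shaped character string storage means
91 Sentence storage means
92 Word storage means
93 Connection relation storage means
94 Sentence analysis means
95 Analysis result storage means
96 Duplicate character removal means
97 Reanalysis character string storage means
[0001]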
BACKGROUND OF THE INVENTION
The present invention relates to a technique for shaping text. In particular, it relates to a technique that removes erroneous characters from the portions of a text subject to language processing that contain duplicated characters, restores the text to its normal form, and then performs language processing such as search and summarization.
[0002]
[Prior art]
As computer performance has improved, a variety of language processing techniques have been proposed. Many of them rely on morphological analysis, in which the text to be processed is segmented into words and analyzed. When the text contains erroneous characters, however, those characters are treated as unregistered words during morphological analysis and a correct result cannot be obtained. For example, if character decoration such as shadowed text was applied in the original application to a passage the author wanted to emphasize, and the document is then converted to another data format and its text extracted, the decorated characters are sometimes extracted in duplicate. As one example, when boldface is used in Microsoft Word 2000 (registered trademark) to emphasize a particular word such as "あいう", and the document is converted to the Adobe Systems PDF (Portable Document Format) format, the text extracted from the PDF file with the text selection tool of Acrobat Reader may contain runs of repeated characters such as "ああいいうう". Techniques proposed for correcting erroneous characters include spell checkers and the simple approach of replacing each run of consecutive identical characters with a single character.
[0003]
[Problems to be solved by the invention]
A spell checker, however, has difficulty even detecting such duplicated characters as misspellings, let alone suggesting corrections. And if every run of identical characters is simply replaced with one character, characters that belong to the word itself may be deleted; for example, the word "ふたたび" becomes "ふたび". Passages decorated with shadowed text or the like are the ones the author considers important, yet with the conventional techniques the author's intent may not be adequately reflected in the result of language processing.
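The failure of the simple approach can be reproduced in a few lines. The sketch below is only an illustration of the prior-art behavior criticized here, not part of the claimed method; the function name `collapse_runs` is hypothetical.

```python
from itertools import groupby


def collapse_runs(text: str) -> str:
    """Replace every run of identical consecutive characters with one character."""
    return "".join(ch for ch, _run in groupby(text))


# Repairs text that was doubled during extraction ...
print(collapse_runs("ああいいうう"))  # あいう
# ... but also damages a word that legitimately repeats a character.
print(collapse_runs("ふたたび"))  # ふたび
```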
[0004]
The present invention was made to solve the above problems. Its object is to remove duplicated characters that appear to be errors, shape the text into correct words, and then make it the target of language processing.
[0005]
[Means for Solving the Problems]
To achieve the above object, the program according to a first feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the character string received by the data receiving means into a plurality of character strings based on word information and part-of-speech information; single-character determination means for determining whether a first character string among the segmented character strings consists of one character; same-character determination means for determining, when the single-character determination means determines that the first character string is one character, whether the first character string is identical to the first character of the character string segmented immediately after it; character shaping means for shaping the identical characters when the same-character determination means determines that they are identical; part-of-speech determination means for determining the part-of-speech information of the first character string when the single-character determination means determines that the first character string is not one character; and duplicate character removal means for merging the duplicated identical characters in the first character string into one when the part-of-speech determination means determines that the first character string is not registered as part-of-speech information.
[0006]
The program according to a second feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the received character string into a plurality of character strings based on word information and part-of-speech information; single-character determination means for determining whether a first character string among the segmented character strings consists of one character; same-character determination means for determining, when the first character string is determined to be one character, whether the first character string is identical to the first character of the character string segmented immediately after it; and character shaping means for shaping the identical characters when the same-character determination means determines that they are identical.
[0007]
The program according to a third feature of the present invention causes a computer to function as: data receiving means for receiving data of a character string in which identical characters appear in duplicate; sentence analysis means for segmenting the received character string into a plurality of character strings based on word information and part-of-speech information; part-of-speech determination means for determining the part-of-speech information of a first character string among the segmented character strings; and duplicate character removal means for merging the duplicated identical characters in the first character string into one when the part-of-speech determination means determines that the first character string is not registered as part-of-speech information.
[0008]
The program according to a fourth feature of the present invention includes shaped character string output means for outputting the character string shaped by the character shaping means, and deduplicated character string output means for outputting the character string from which the duplicate character removal means has removed the duplicate characters.
[0009]
The apparatus and the method of the present invention have the same features as the program described above. That is, data of a character string in which identical characters appear in duplicate is received, and the character string is segmented into a plurality of character strings based on word information and part-of-speech information. It is then determined whether a first character string among the segmented character strings is one character; if so, it is determined whether the first character string is identical to the first character of the character string segmented immediately after it; and if they are identical, the identical characters are shaped. In addition, the part-of-speech information of the first character string is determined, and if the first character string is not registered as part-of-speech information, the duplicated identical characters in it are merged into one. The content after the identical characters have been shaped and the content after the duplicated identical characters have been merged are output.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
The present invention is described concretely on the basis of one embodiment. In this embodiment, duplicate characters are removed using the result of morphological analysis. The embodiment considers the character string "「武士ゼロフウウCadrCdr（カダアクダア）」を開発、4月13日より発売" in Microsoft Word 2000 (registered trademark), in which the portion "「武士ゼロフウウCadrCdr(カダアクダア)」を開発" is set in shadowed text, and describes how duplicate characters are removed when the character string "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" is extracted from the document after conversion to another file format.
[0011]
FIG. 1 is a functional block diagram showing the schematic configuration of a language processing system according to the present invention. The system comprises sentence storage means 1, word storage means 2, connection relation storage means 3, sentence analysis means 4, analysis result storage means 5, duplicate character removal means 6, and shaped character string storage means 7.
[0012]
The sentence storage means 1 stores Japanese sentences that are subject to language processing. The word storage means 2 is a Japanese dictionary in which words and their attribute information are registered. The attribute information of each word includes part-of-speech information. The connection relationship storage means 3 stores connection information indicating whether or not a connection between words is possible. The sentence analysis unit 4 searches the word storage unit 2 and the connection relation storage unit 3 and analyzes the sentence stored in the sentence storage unit 1 in units of words. The analysis result storage means 5 stores the result analyzed in units of words by the sentence analysis means 4. The duplicate character removing unit 6 examines the result of the analysis result storage unit 5 and stores the formatted character string in the formatted character string storage unit 7. The formatted character string storage means 7 stores a formatted character string.
[0013]
The processing performed by this language processing system when identical characters appear in duplicate in the input sentence is described with reference to the flowchart of FIG. 2 and to FIGS. 3 to 7. In FIG. 2, sentence analysis means 4 attempts to retrieve text from sentence storage means 1 (step 101) and determines whether any text exists (step 102). If there is none, processing ends; if there is, one sentence is read from it (step 103). In the processing that follows, this whole sentence is the unit of processing. As a concrete example, assume that the sentence "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" has been read.
[0014]
The sentence analysis unit 4 searches the input sentence from the beginning to the end of the sentence using the Japanese dictionary of the word storage unit 2 (step 105). An example of the contents of the word storage means 2 is shown in FIG. By word search, word information corresponding to each word (entry word) of the input sentence is read. Subsequently, the connection relation between words is checked based on the part of speech information of the searched word and the connection information in the connection relation storage means 3 (step 106). An example of the contents of the connection relationship storage means 3 is shown in FIG. In FIG. 4, as connection information indicating whether or not connection between words is possible, a value of 1 is assigned to a combination that can be connected, and a value of 0 is assigned to a combination that cannot be connected.
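The 1/0 connection information can be pictured as a lookup table keyed by a pair of parts of speech. The following is a minimal sketch under that assumption; the contents of FIG. 4 are not reproduced in the text, so the part-of-speech labels and table entries here are hypothetical placeholders.

```python
# Hypothetical connection table: (left part of speech, right part of speech) -> 1 or 0.
CONNECTION_TABLE = {
    ("noun", "particle"): 1,
    ("particle", "noun"): 1,
    ("noun", "noun"): 1,
    ("particle", "particle"): 0,
}


def can_connect(left_pos: str, right_pos: str) -> bool:
    """Return True when the table marks the pair as connectable (value 1).

    Pairs missing from the table are treated as not connectable; that
    default is an assumption made for this sketch.
    """
    return CONNECTION_TABLE.get((left_pos, right_pos), 0) == 1


print(can_connect("noun", "particle"))      # True
print(can_connect("particle", "particle"))  # False
```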
[0015]
Using this information, sentence analysis means 4 analyzes the sentence by the minimum phrase count method (step 107). The minimum phrase count method is a sentence analysis method that outputs, among the connectable word sequences, the one with the fewest phrases (bunsetsu). The longest-match method or the minimum-cost method may be used instead. As a result, the word sequence "「／「／武／武士／士／ゼゼロロフフウウウウ／CCaaddrrCCddrr／（／（／カカダダアアククダダアア／）／）／」／」／を／を／開／開発／発" is segmented. Sentence analysis means 4 then stores the surface form, part of speech, and position information of each segmented word in analysis result storage means 5 (step 108). FIG. 5 shows the information stored in analysis result storage means 5. As FIG. 5 shows, words that are registered in the dictionary are not segmented correctly.
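To see how such a segmentation can arise, the sketch below uses the longest-match method that the text names as an alternative, not the minimum phrase count method, and it ignores the connection check. The dictionary, the part-of-speech labels, and the rule that an unregistered run ends where the script changes are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical dictionary: surface form -> placeholder part-of-speech label.
DICTIONARY = {
    "「": "symbol", "」": "symbol", "（": "symbol", "）": "symbol",
    "武": "word", "武士": "word", "士": "word",
    "を": "particle", "開": "word", "開発": "word", "発": "word",
}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)
UNKNOWN = "unknown"  # label for unregistered words


def script_of(ch):
    """First word of the Unicode character name, e.g. 'KATAKANA' or 'LATIN'."""
    return unicodedata.name(ch, "UNNAMED").split()[0]


def longest_match_at(text, pos):
    """Longest dictionary word starting at pos, or None if there is none."""
    for size in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
        if text[pos:pos + size] in DICTIONARY:
            return text[pos:pos + size]
    return None


def segment(text):
    """Return a list of (surface, part_of_speech) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        word = longest_match_at(text, pos)
        if word is not None:
            tokens.append((word, DICTIONARY[word]))
            pos += len(word)
            continue
        # No dictionary word starts here: collect an unregistered run that
        # stays in one script and stops before the next dictionary word.
        start, script = pos, script_of(text[pos])
        pos += 1
        while (pos < len(text)
               and longest_match_at(text, pos) is None
               and script_of(text[pos]) == script):
            pos += 1
        tokens.append((text[start:pos], UNKNOWN))
    return tokens


sentence = "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発"
print("／".join(surface for surface, _pos in segment(sentence)))
```

With this toy dictionary the printed sequence has the same word boundaries as the example in the text, with the three doubled katakana and Latin runs labeled as unregistered.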
[0016]
Next, duplicate character removal means 6 removes the duplicate characters (step 200). FIG. 6 shows the flowchart of the duplicate character removal of step 200. Three variables are used: i holds the word number in analysis result storage means 5, end holds the position of the end of the input sentence, and len holds the number of characters output to shaped character string storage means 7.
[0017]
The duplicate character removing means 6 sets 0 to i and len (step 201). The duplicate character removal means 6 determines whether or not the value of len is equal to or greater than end (step 202). If the determination result is Yes, the deduplication process ends (return to step 300 in FIG. 2). If the determination result is No, the process proceeds to step 210. In step 210, the duplicate character removing unit 6 determines whether the notation of the i-th word is one character. If the determination result is Yes, the process proceeds to step 220; if the determination result is No, the process proceeds to step 230.
[0018]
In step 220, the duplicate character removal unit 6 determines whether the first character of the i-th word notation is the same as the first character of the i + 1-th word notation. If the determination result is Yes, the process proceeds to step 240. If the determination result is No, the process proceeds to step 270.
[0019]
In step 240, duplicate character removal means 6 stores the surface form of the (i+1)-th word in shaped character string storage means 7. In step 241, it adds the length of the surface form of the (i+1)-th word to len, adds 1 to i, and proceeds to step 270.
[0020]
In step 230, duplicate character removal means 6 determines whether the part of speech of the i-th word is "unregistered word". If Yes, processing proceeds to step 250; if No, to step 260. In step 250, duplicate character removal means 6 merges consecutive identical characters (duplicate characters) in the surface form of the i-th word into one. For example, "ゼゼロロフフウウウウ" is reduced to "ゼロフウウ". In step 251, the character string with the duplicate characters removed is stored in shaped character string storage means 7. In step 252, the length of that character string is added to len, and processing proceeds to step 270.
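Step 250 can be sketched as follows. One detail is an assumption: collapsing each whole run to a single character would turn "ゼゼロロフフウウウウ" into "ゼロフウ", whereas the example in the text gives "ゼロフウウ". To reproduce the stated result, this sketch halves each run instead (a run of n identical characters becomes n/2 characters, rounded up). The function name `halve_runs` is hypothetical.

```python
from itertools import groupby


def halve_runs(surface: str) -> str:
    """Shrink every run of n identical characters to ceil(n / 2) characters."""
    return "".join(
        ch * ((len(list(run)) + 1) // 2) for ch, run in groupby(surface)
    )


print(halve_runs("ゼゼロロフフウウウウ"))      # ゼロフウウ
print(halve_runs("CCaaddrrCCddrr"))            # CadrCdr
print(halve_runs("カカダダアアククダダアア"))  # カダアクダア
```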
[0021]
In step 260, the duplicate character removing unit 6 stores the i-th word notation in the formatted character string storage unit 7. In step 261, the length of the i-th morpheme is added to len, and the process proceeds to 270. In step 270, the duplicate character removing unit 6 adds 1 to i and advances the process to step 202.
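Steps 201 to 270 can be put together as a single loop over the analysis result. This is a sketch under stated assumptions rather than the patented implementation: tokens are (surface, part of speech) pairs, `UNKNOWN` is a hypothetical label for unregistered words, the loop stops when the tokens run out instead of comparing len with end, and step 250 halves each run so that the example output in the text is reproduced.

```python
from itertools import groupby

UNKNOWN = "unknown"  # hypothetical label for unregistered words


def halve_runs(surface):
    """Shrink every run of n identical characters to ceil(n / 2) characters."""
    return "".join(ch * ((len(list(run)) + 1) // 2) for ch, run in groupby(surface))


def remove_duplicates(tokens):
    """tokens: list of (surface, part_of_speech) pairs; returns the shaped string."""
    shaped = []
    i = 0                                        # step 201
    while i < len(tokens):                       # stands in for step 202
        surface, pos = tokens[i]
        if len(surface) == 1:                    # step 210
            nxt = tokens[i + 1][0] if i + 1 < len(tokens) else ""
            if nxt[:1] == surface:               # step 220
                shaped.append(nxt)               # step 240
                i += 1                           # step 241
            # On No, nothing is stored for this token (straight to step 270).
        elif pos == UNKNOWN:                     # step 230
            shaped.append(halve_runs(surface))   # steps 250 and 251
        else:
            shaped.append(surface)               # step 260
        i += 1                                   # step 270
    return "".join(shaped)


tokens = [
    ("「", "symbol"), ("「", "symbol"), ("武", "word"), ("武士", "word"),
    ("士", "word"), ("ゼゼロロフフウウウウ", UNKNOWN), ("CCaaddrrCCddrr", UNKNOWN),
    ("（", "symbol"), ("（", "symbol"), ("カカダダアアククダダアア", UNKNOWN),
    ("）", "symbol"), ("）", "symbol"), ("」", "symbol"), ("」", "symbol"),
    ("を", "particle"), ("を", "particle"), ("開", "word"), ("開発", "word"),
    ("発", "word"),
]
print(remove_duplicates(tokens))  # 「武士ゼロフウウCadrCdr（カダアクダア）」を開発
```

As in the flowchart, a single-character token whose successor does not start with the same character is not written to the output, which is how the stray "士" and "発" disappear in this example.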
[0022]
Once duplicate characters have been removed in this way, sentence analysis means 4 outputs the content of shaped character string storage means 7 to a storage means or storage device (not shown) in step 300 of FIG. 2 and then returns to step 102.
[0023]
In this embodiment, step 200 yields "「武士ゼロフウウCadrCdr（カダアクダア）」を開発", which is stored in shaped character string storage means 7.
[0024]
Next, a second embodiment of the present invention is described concretely. In the second embodiment, duplicate characters are removed using the result of morphological analysis and the text is then reanalyzed. As before, the embodiment considers the character string "「武士ゼロフウウCadrCdr（カダアクダア）」を開発、4月13日より発売" in Microsoft Word 2000 (registered trademark), in which the portion "「武士ゼロフウウCadrCdr(カダアクダア)」を開発" is set in shadowed text, and describes the reanalysis performed after duplicate characters are removed when the character string "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" is extracted from the document after conversion to another file format.
[0025]
As shown in FIG. 9, the second embodiment comprises sentence storage means 91, word storage means 92, connection relation storage means 93, sentence analysis means 94, analysis result storage means 95, duplicate character removal means 96, and reanalysis character string storage means 97. Sentence storage means 91, word storage means 92, connection relation storage means 93, sentence analysis means 94, and analysis result storage means 95 correspond respectively to sentence storage means 1, word storage means 2, connection relation storage means 3, sentence analysis means 4, and analysis result storage means 5 of FIG. 1 in the first embodiment. The second embodiment differs from the first in duplicate character removal means 96 and reanalysis character string storage means 97. Duplicate character removal means 96 examines the result in analysis result storage means 95 and stores the character string to be reanalyzed in reanalysis character string storage means 97. Reanalysis character string storage means 97 holds the character string that sentence analysis means 94 is to reanalyze.
[0026]
Next, the processing performed by this language processing system when identical characters appear in duplicate in the input sentence is described with reference to the flowcharts of FIGS. 8 and 10 and to FIGS. 3, 4, 5, and 7. In FIG. 8, sentence analysis means 94 attempts to retrieve text from sentence storage means 91 (step 8101) and determines whether any text exists (step 8102). If there is none, processing ends; if there is, one sentence is read from it (step 8103). In the processing that follows, this whole sentence is the unit of processing. As a concrete example, assume that the sentence "「「武武士士ゼゼロロフフウウウウCCaaddrrCCddrr（（カカダダアアククダダアア））」」をを開開発発" has been read.
[0027]
The sentence analysis means 94 sets the reanalysis flag to Off (step 8104). The sentence analysis means 94 searches the input sentence from the beginning to the end of the sentence using the Japanese dictionary of the word storage means 92 (step 8105). An example of the contents of the word storage means 92 is shown in FIG. By word search, word information corresponding to each word (entry word) of the input sentence is read. Subsequently, the connection relation between words is checked based on the part of speech information of the searched word and the connection information in the connection relation storage means 93 (step 8106). An example of the contents of the connection relation storage means 93 is shown in FIG. In FIG. 4, as connection information indicating whether or not connection between words is possible, a value of 1 is assigned to a combination that can be connected, and a value of 0 is assigned to a combination that cannot be connected.
[0028]
Using this information, sentence analysis means 94 analyzes the sentence by the minimum phrase count method, which outputs, among the connectable word sequences, the one with the fewest phrases (step 8107). The longest-match method or the minimum-cost method may be used instead. As a result, the word sequence "「／「／武／武士／士／ゼゼロロフフウウウウ／CCaaddrrCCddrr／（／（／カカダダアアククダダアア／）／）／」／」／を／を／開／開発／発" is segmented. Sentence analysis means 94 then stores the surface form, part of speech, and position information of each segmented word in analysis result storage means 95 (step 8108). FIG. 5 shows the information stored in analysis result storage means 95. As FIG. 5 shows, words that are registered in the dictionary are not segmented correctly.
[0029]
Next, the sentence analysis means 94 determines whether or not the reanalysis flag is Off (step 8109). If the flag is On and the determination result is No, Step 8113 is executed. If the flag is Off and the determination result is Yes, Step 8200 is executed.
[0030]
FIG. 10 shows the flowchart of the duplicate character removal of step 8200. Three variables are used: i holds the word number in analysis result storage means 95, end holds the position of the end of the input sentence, and len holds the number of characters output to reanalysis character string storage means 97.
[0031]
The duplicate character removing unit 96 sets 0 to i and len (step 1201). The duplicate character removing unit 96 determines whether or not the value of len is equal to or greater than end (step 1202). If the determination result is Yes, the deduplication process ends (return to step 8110 in FIG. 8). If the determination result is No, the process proceeds to step 1210. In step 1210, the duplicate character removing unit 96 determines whether the notation of the i-th word is one character. If the determination result is Yes, the process proceeds to step 1220. If the determination result is No, the process proceeds to step 1230.
[0032]
In step 1220, the duplicate character removing unit 96 determines whether the notation of the i-th word and the notation of the (i + 1)-th word are the same. If the determination result is Yes, the process proceeds to step 1240. If the determination result is No, the process proceeds to step 1270.
[0033]
In step 1240, the duplicate character removal unit 96 stores the notation of the (i + 1)-th word in the reanalysis character string storage unit 97. In step 1241, the length of the notation of the (i + 1)-th word is added to len, 1 is added to i, and the process proceeds to step 1270.
[0034]
In step 1230, the duplicate character removing unit 96 determines whether the part of speech of the i-th word is an unregistered word. If the determination result is Yes, the process proceeds to step 1250; if the determination result is No, the process proceeds to step 1260. In step 1250, the duplicate character removal unit 96 collapses each run of identical consecutive characters (duplicate characters) in the notation of the i-th word into one character. For example, "Zero Fufuu Ou" is collapsed to "Zero Fu". In step 1251, the character string with the duplicate characters removed is stored in the reanalysis character string storage unit 97. In step 1252, the length of that character string is added to len, and the process proceeds to step 1270.
[0035]
In step 1260, the duplicate character removal unit 96 stores the notation of the i-th word in the reanalysis character string storage unit 97. In step 1261, the length of the notation of the i-th word is added to len, and the process proceeds to step 1270. In step 1270, the duplicate character removing unit 96 adds 1 to i and returns the process to step 1202.
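The loop of steps 1201 to 1270 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Word` tuple, the `UNREGISTERED` tag, and the function names are hypothetical. Following the wording of the claims, the sketch compares a one-character word with the first character of the next word. It also assumes that a one-character word that does not match is kept in the output, and it ends when the words are exhausted instead of testing len against end.

```python
from itertools import groupby
from typing import List, Tuple

Word = Tuple[str, str]          # (notation, part of speech); hypothetical layout
UNREGISTERED = "unregistered"   # hypothetical tag for words missing from the dictionary


def collapse_runs(notation: str) -> str:
    """Step 1250: keep one character from each run of identical characters."""
    return "".join(char for char, _run in groupby(notation))


def remove_duplicates(words: List[Word]) -> str:
    """Build the reanalysis character string from the first analysis result."""
    output: List[str] = []
    i = 0                                        # step 1201
    while i < len(words):                        # simplified form of step 1202
        notation, part_of_speech = words[i]
        if len(notation) == 1:                   # step 1210
            next_notation = words[i + 1][0] if i + 1 < len(words) else ""
            if next_notation and next_notation[0] == notation:   # step 1220
                output.append(next_notation)     # step 1240: store the next word
                i += 1                           # step 1241: skip the duplicate
            else:
                output.append(notation)          # assumption: keep the word
        elif part_of_speech == UNREGISTERED:     # step 1230
            output.append(collapse_runs(notation))   # steps 1250 to 1252
        else:
            output.append(notation)              # steps 1260 and 1261
        i += 1                                   # step 1270
    return "".join(output)


# Hypothetical first analysis result, loosely modelled on the example input.
first_pass = [
    ("(", "symbol"), ("(", "symbol"),
    ("CCaaddrr", UNREGISTERED),
    ("開", "noun"), ("開発", "noun"),
    ("ふたたび", "adverb"),
]
cleaned = remove_duplicates(first_pass)
```

Because only unregistered words are collapsed, a registered word such as ふたたび keeps its legitimately repeated character, which is the point of using the morphological analysis result instead of collapsing every run in the raw text.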
[0036]
Returning to the description of FIG. 8, the sentence analysis unit 94 compares the contents of the reanalysis character string storage unit 97 with the contents of the input sentence (step 8110). If they match, the determination result is Yes and reanalysis is unnecessary, so step 8113 is executed. If they do not match, the determination result is No: the sentence analysis unit 94 sets the reanalysis flag to On (step 8111), replaces the input sentence with the reanalysis character string, and erases the contents of the analysis result storage unit 95 (step 8112). The input character string is then reanalyzed from step 8105. The process proceeds through steps 8105, 8106, 8107, and 8108, and because the reanalysis flag is now On, the determination result in step 8109 is No. The sentence analysis unit 94 therefore executes step 8113.
[0037]
In step 8113, the sentence analysis unit 94 outputs the contents of the analysis result storage unit 95 to a storage unit or storage device (not shown), and the process returns to step 8102.
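The control flow of steps 8105 to 8113 for one input sentence can be sketched as follows. This is only an illustration under assumptions: the function `analyze_with_reanalysis`, its parameters, and the two toy helpers are hypothetical stand-ins for the sentence analysis unit 94 and the duplicate character removing unit 96. The reanalysis flag of the flowchart is replaced here by straight-line code that analyzes at most twice.

```python
from itertools import groupby
from typing import Callable, List, Tuple

Word = Tuple[str, str]  # (notation, part of speech); hypothetical layout


def analyze_with_reanalysis(
    sentence: str,
    analyze: Callable[[str], List[Word]],
    remove_duplicates: Callable[[List[Word]], str],
) -> List[Word]:
    """Analyze once, remove duplicate characters, and reanalyze if the text changed."""
    result = analyze(sentence)                      # steps 8105 to 8108
    reanalysis_string = remove_duplicates(result)   # step 8200
    if reanalysis_string == sentence:               # step 8110: nothing was removed
        return result                               # step 8113
    # Steps 8111 and 8112: the first result is discarded and the cleaned
    # string becomes the input of the second and final analysis pass.
    return analyze(reanalysis_string)               # steps 8105 to 8108, then 8113


def toy_analyze(text: str) -> List[Word]:
    """Toy analyzer: treats the whole text as one unregistered word."""
    return [(text, "unregistered")] if text else []


def toy_remove_duplicates(words: List[Word]) -> str:
    """Toy remover: collapses every run of identical characters."""
    return "".join(
        char for notation, _pos in words for char, _run in groupby(notation)
    )


final_result = analyze_with_reanalysis("CCaaddrr", toy_analyze, toy_remove_duplicates)
```

Limiting the flow to one reanalysis pass mirrors the role of the reanalysis flag, which makes step 8109 skip duplicate removal on the second pass so that the procedure cannot loop indefinitely.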
[0038]
For the input in the present embodiment, step 8200 leaves “Development of Samurai Zero Fu CadrCdr” in the reanalysis character string storage means 97, so the determination result of step 8110 is No. After steps 8111, 8112, 8105, 8106, 8107, and 8108 are processed, the result shown in FIG. 7 is stored in the analysis result storage unit 95.
[0039]
In the second embodiment, unlike the first embodiment, the contents of the input sentence and the reanalysis character string are compared after duplicate characters are removed, and the input character string is reanalyzed if they do not match. A highly accurate shaping result is thereby obtained.
[0040]
The present invention is not limited to the above embodiments. For example, in step 8113 in FIG. 8, the sentence analyzing means 4 may also be configured to output the character string converted by the duplicate character removing means together with the analysis result.
[0041]
[Effects of the invention]
Because duplicated characters are replaced using the result of morphological analysis, consecutive identical characters can be removed without accidentally deleting characters that belong to a word. As a result, text containing duplicated characters can be corrected, and a morphological analysis result can be obtained from the corrected text.
[Brief description of the drawings]
FIG. 1 is a functional block diagram illustrating a schematic configuration of a language processing system according to a first embodiment.
FIG. 2 is a flowchart showing a processing procedure of the language processing system in the first embodiment.
FIG. 3 is an explanatory diagram showing the contents of word storage means.
FIG. 4 is an explanatory diagram showing the contents of a connection relationship storage unit.
FIG. 5 is an explanatory diagram showing information stored in an analysis result storage unit as a result of the first analysis.
FIG. 6 is a flowchart showing a processing procedure of duplicate character removal in the first embodiment.
FIG. 7 is an explanatory diagram showing information stored in an analysis result storage unit as a result of reanalysis.
FIG. 8 is a flowchart showing a processing procedure of the language processing system in the second embodiment.
FIG. 9 is a functional block diagram showing a schematic configuration of a language processing system in a second embodiment.
FIG. 10 is a flowchart showing a processing procedure of duplicate character removal in the second embodiment.
[Explanation of symbols]
1 sentence storage means
2 word storage means
3 connection relation storage means
4 sentence analysis means
5 analysis result storage means
6 duplicate character removal means
7 shaped character string storage means
91 sentence storage means
92 word storage means
93 connection relation storage means
94 sentence analysis means
95 analysis result storage means
96 duplicate character removal means
97 reanalysis character string storage means

Claims

A program for causing a computer to function as: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; part-of-speech information determination means for determining, among the character strings not determined to be one character by the one-character determination means, an unregistered word whose corresponding word information is not registered in the dictionary information; and duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string, with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

A program for causing a computer to function as: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; and duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string.

A program for causing a computer to function as: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; part-of-speech information determination means for determining, among the plurality of character strings cut out by the sentence analysis means, an unregistered word whose corresponding word information is not registered in the dictionary information; and duplicate character removal means for outputting the plurality of cut-out character strings with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

The program according to any one of claims 1 to 3, further causing the computer to function as storage means for storing the character string output by the duplicate character removal means.

A character string shaping method implemented by a character string shaping device having data receiving means, sentence analysis means, one-character determination means, same-character determination means, part-of-speech information determination means, and duplicate character removal means, wherein: the data receiving means receives character string data; the sentence analysis means cuts out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; the one-character determination means determines a character string of one character among the plurality of character strings cut out by the sentence analysis means; the same-character determination means determines whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; the part-of-speech information determination means determines, among the character strings not determined to be one character by the one-character determination means, an unregistered word whose corresponding word information is not registered in the dictionary information; and the duplicate character removal means outputs the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string, with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

A character string shaping method implemented by a character string shaping device having data receiving means, sentence analysis means, one-character determination means, same-character determination means, and duplicate character removal means, wherein: the data receiving means receives character string data; the sentence analysis means cuts out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; the one-character determination means determines a character string of one character among the plurality of character strings cut out by the sentence analysis means; the same-character determination means determines whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; and the duplicate character removal means outputs the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string.

A character string shaping method implemented by a character string shaping device having data receiving means, sentence analysis means, part-of-speech information determination means, and duplicate character removal means, wherein: the data receiving means receives character string data; the sentence analysis means cuts out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; the part-of-speech information determination means determines, among the plurality of character strings cut out by the sentence analysis means, an unregistered word whose corresponding word information is not registered in the dictionary information; and the duplicate character removal means outputs the plurality of cut-out character strings with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

8. The character string shaping method according to claim 5, wherein the character string output by the duplicate character removing unit is stored in a storage unit.

A character string shaping device comprising: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; part-of-speech information determination means for determining, among the character strings not determined to be one character by the one-character determination means, an unregistered word whose corresponding word information is not registered in the dictionary information; and duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string, with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

A character string shaping device comprising: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; and duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string.

A character string shaping device comprising: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; part-of-speech information determination means for determining, among the plurality of character strings cut out by the sentence analysis means, an unregistered word whose corresponding word information is not registered in the dictionary information; and duplicate character removal means for outputting the plurality of cut-out character strings with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character.

12. The character string shaping device according to claim 9, further comprising storage means for storing the character string output by the duplicate character removing means.

A program for causing a computer to function as: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; part-of-speech information determination means for determining, among the character strings not determined to be one character by the one-character determination means, an unregistered word whose corresponding word information is not registered in the dictionary information; duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string, with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character; and morphological reanalysis execution means for comparing the character string output by the duplicate character removal means with the character string received by the data receiving means and executing morphological analysis again if they do not match.

A character string shaping device comprising: data receiving means for receiving character string data; sentence analysis means for cutting out the character string received by the data receiving means into a plurality of character strings based on dictionary information in which word information and part-of-speech information are registered; one-character determination means for determining a character string of one character among the plurality of character strings cut out by the sentence analysis means; same-character determination means for determining whether a character string determined to be one character by the one-character determination means is the same as the first character of the character string cut out next to that character string; part-of-speech information determination means for determining, among the character strings not determined to be one character by the one-character determination means, an unregistered word whose corresponding word information is not registered in the dictionary information; duplicate character removal means for outputting the plurality of cut-out character strings, excluding any one-character string determined by the same-character determination means to be the same as the first character of the next character string, with identical characters that appear consecutively in a character string determined to be an unregistered word by the part-of-speech information determination means output as one character; and morphological reanalysis execution means for comparing the character string output by the duplicate character removal means with the character string received by the data receiving means and executing morphological analysis again if they do not match.