JP4039583B2 - Language information processing device

Info

Publication number: JP4039583B2
Application number: JP12756094A
Authority: JP
Inventors: 直人中村; 智子瀬川; 郁子長澤; くにお松井; 誠塩津
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1994-06-09
Filing date: 1994-06-09
Publication date: 2008-01-30
Anticipated expiration: 2023-01-30
Also published as: JPH07334513A

Description

【０００１】
【産業上の利用分野】
本発明は、言語表現を扱う情報処理システムにおいて、区間を指定して入力された言語表現を表す文字列に基づいて、言語に関する規則や知識の登録処理やキーワードを用いた検索処理などを行う言語情報処理装置に関するものである。
【０００２】
言語表現を扱う情報処理システムとしては、機械翻訳システム，文章推敲システムや文章の特徴抽出システムなどがあり、このような情報処理システムでは、言語情報処理装置によって、熟語や言い回しなどの言語表現を表す文字列を区間を指定して入力し、この文字列を解析することにより、言語に関する規則や知識を登録して活用している。
【０００３】
また、言語情報処理装置は、文書中の特定の言語表現をキーワードとしてデータベースを検索するような用途にも利用することができる。
【０００４】
【従来の技術】
図８に、従来の言語情報処理装置の構成例を示す。
図８において、抽出処理部３０１は、利用者からの指示に応じて、文書を表す文字列全体のなかから区間指定で示された文字列を抽出し、分解処理部３０２は、抽出された文字列を処理単位に分解して、解析処理部３１１に送出する構成となっている。
【０００５】
この分解処理部３０２は、例えば、言語表現を表す文字列を自立語の語幹と活用語尾などの語尾情報とからなる形態素を蓄積している形態素辞書３０３に基づいて、指定区間の文字列を形態素に分解している。
【０００６】
このようにして得られた形態素の集まりの入力を受けて、解析処理部３１１は、形態素相互間のつながりを解析し、この結果を用いて、登録処理部３１２は、所定の規則で結びついた形態素の連なりとして、熟語および言い回しを熟語辞書３１３に登録する。
【０００７】
ここで、熟語辞書３１３への登録作業では、膨大な数の熟語をまとめて登録する場合が多い。
このような場合には、利用者は文書中の該当する区間を次々に指定していき、これに応じて、抽出処理部３０１が抽出した文字列を順次に蓄積していき、全ての区間の指定が終了したのちに、分解処理部３０２による分解処理および熟語辞書３１３への登録処理を一括してバッチ処理している。
【０００８】
【発明が解決しようとする課題】
ところで、上述したように、解析処理部３１１は、形態素相互間の関係を解析するのだから、この解析処理部３１１への入力は、形態素の連なりに分解されていなければならない。
【０００９】
このため、従来は、分解処理部３０２において、指定区間の文字列の全てを形態素に分解できなかった場合は、その時点で該当する指定区間の文字列についての処理を中止し、エラーメッセージなどでその文字列の指定を受け付けることができなかった旨を利用者に通知していた。
【００１０】
また、熟語や言い回しを動詞や形容詞の語幹として登録するためには、例えば「指定区間は、自立語で始まって、自立語で終わっていなければならない」というような制約条件が必要となる。
【００１１】
このような制約条件についての検討は、従来は、解析処理部３１１で行っており、分解処理部３０２から受け取った形態素の連なりが制約条件を満たしていない場合は、その指定区間の文字列についての処理は直ちに中止される。そして、この場合も、形態素に分解できなかった場合と同様に、その文字列の指定を受け付けることができなかった旨などを利用者に通知していた。
【００１２】
このように、従来の言語情報処理装置は、利用者が言語情報処理装置における処理単位や制約条件を意識して、これらに整合するように文字列の区間を正確に指定することを前提としている。
【００１３】
したがって、言語情報処理装置を使いこなすためには、利用者が、形態素など言語情報処理装置における処理単位に関する十分な知識と経験を身につけている必要があった。
【００１４】
しかしながら、一般の利用者は、そのような知識や経験を持っていない場合が多く、また、上述した形態素などの処理単位は、常識的な言語の単位と同一ではないため、処理単位の境界や制約条件に整合する区間を正確に指定することは非常に難しい。
【００１５】
また、十分な知識を持った利用者が指定区間の入力を行った場合でも、膨大な数の熟語や言い回しを一括して登録しようとした場合などには、利用者による指定にミスが発生しやすくなるため、多数の指定区間が受け付けられずに排除されてしまう。
【００１６】
従来の言語情報処理装置においては、受け付けを拒否された指定区間に対応する熟語や言い回しを登録するためには、利用者が指定区間の入力を訂正して登録作業を繰り返すしかなかった。しかし、この作業は利用者にとって煩わしいものであり、利用者の負担を大きくしていた。
【００１７】
本発明は、処理条件との不整合を含んだ言語表現の入力を柔軟に受け付ける言語情報処理装置を提供することを目的とする。
【００１８】
【課題を解決するための手段】
図１に、請求項１の言語情報処理装置の原理ブロック図を示す。
請求項１の発明は、処理の対象となる言語表現の入力を受けて、所定の処理を実行する言語情報処理装置において、処理の対象となる言語表現を含んだ文字列を入力する文字列入力手段１１１と、文字列に含まれている言語表現の範囲を示す指定区間を入力する指定区間入力手段１１２と、文字列入力手段１１１によって入力された文字列を処理単位に分解する分解手段１１３と、指定区間の境界が、分解手段１１３によって得られる一連の処理単位のいずれかの境界に一致しているか否かに基づいて、指定区間の正当性を判定する第１の判定手段１１４と、第１の判定手段１１４によって指定区間が正当でないと判定されたときに、指定区間の境界位置を処理単位のいずれかの境界に一致するように移動することによって前記指定区間を修正する第１の修正手段１１５と、第１の修正手段によって修正された指定区間に含まれる少なくとも一つの処理単位が、処理対象の言語表現において出現する順序的な位置とその位置に配置されるべき処理単位の種類とに関する規則を示す所定の制約条件を満たしているか否かに基づいて、指定区間の正当性を判定する第２の判定手段１２１と、第２の判定手段１２１によって指定区間が正当でないと判定されたときに、指定区間の境界位置を修正後の指定区間に含まれる処理単位の配列が制約条件を満たすように移動することによって指定区間を修正する第２の修正手段１２２と、分解手段１１３によって得られた一連の処理単位から、修正によって得られた指定区間に含まれる文字列に対応する処理単位を抽出する抽出手段１１６とを備えたことを特徴とする。
【００２０】
【作用】
請求項１の発明は、文字列入力手段１１１によって入力された文字列の全てを分解手段１１３による分解処理に供しているから、指定区間入力手段１１２によって示された指定区間の文字列とともに、その前後の文字列に関する情報を得ることができる。
【００２１】
したがって、第１の判定手段１１４により、指定区間が正当でない旨の判定結果が得られた場合に、第１の修正手段１１５は、指定区間およびその前後の文字列に関する情報に基づいて、指定区間の境界をこの指定区間の前後の文字列を含めた範囲で移動することが可能である。
【００２２】
このとき、第１の修正手段１１５が、指定区間の境界によって分けられてしまった処理単位について、指定範囲に含めるか排除するかを決定するための適切な規則にしたがって指定区間の境界を移動すれば、指定区間の境界と処理単位の境界との不整合を解消し、修正された指定区間に含まれる複数の処理単位を処理対象の言語表現に関する情報として後段の処理に供することができる。
【００２３】
つまり、このようにして修正された指定区間に含まれる複数の処理単位が、第２の判定手段１２１に入力され、この第２の判定手段１２１による判定処理、すなわち、これらの処理単位の配列が制約条件を満たしているか否かを判定する処理に供される。
【００２４】
第２の判定手段１２１が、制約条件に照らして指定区間が正当でないと判定した場合に、第２の修正手段１２２は、上述した制約条件に基づいて、指定区間の境界をこの指定区間に含まれる処理単位ごとに移動する。
【００２５】
このとき、第２の修正手段１２２が、例えば、言語表現の先頭や末尾の処理単位が満たすべき条件などを示す適切な規則にしたがって指定区間の境界を移動すれば、指定区間に含まれる処理単位の並び方と上述した制約条件によって示される言語表現における構造との不整合を解消することができる。これにより、抽出手段１１６は、正当な指定区間に基づいて言語表現に関する情報を抽出し、登録処理などの処理に供することができる。
【００２６】
【実施例】
以下、図面に基づいて本発明の実施例について詳細に説明する。
図２に、本発明の言語情報処理装置の実施例構成図を示す。
【００２７】
図２において、言語表現保持部２０１は、登録したい熟語や言い回しを含んだ文などの言語表現をそれぞれ１つの単位として蓄積しており、表示データ作成部２０２は、この言語表現保持部２０１に蓄積された言語表現を表示するための表示データを作成し、表示用メモリ２０３を介して、ディスプレイ装置２０４に送出する構成となっている。
【００２８】
ここで、上述した言語表現保持部２０１は、例えば、句点で区切られた１つの文を言語表現の１単位とし、各文に通し番号を付けて蓄積しておけばよい。
また、このとき、表示データ作成部２０２は、言語表現保持部２０１から少なくとも１つの文を順次に読み出して、ディスプレイ装置２０４による表示画面の行数や桁数に合わせて文字コードを配置した表示データを作成し、表示用メモリ２０３に格納すればよい。
【００２９】
この場合は、利用者はディスプレイ装置２０４によって表示された文を見ながら、マウス２０５やキーボード２０６を操作して、これらの文に含まれている熟語や言い回しを表す文字列の区間を指定すればよい。
【００３０】
このようにして指定された区間を示す情報は、表示画面上での位置の範囲を例えば行および桁で示す情報として、入力制御部２０７を介して、まず、表示データ作成部２０２に送出される。
【００３１】
この情報に基づいて、表示データ作成部２０２が、該当する文の指定された区間に含まれる文字に対応する属性情報を変更することにより、例えば、指定区間に含まれる文字に下線が施され、これにより、利用者が区間の指定を確認できるようになっている。
【００３２】
また、このとき、入力制御部２０７は、利用者からの区間指定があった旨を読出処理部２１１に通知し、これに応じて、この読出処理部２１１は、表示用メモリ２０３から該当する文に含まれる全ての文字列に対応する文字コードをその属性情報とともに読み出して、文字コード列を文字列保持部２１２に送出して保持するとともに、属性情報を区間情報検出部２１３に送出する。
【００３３】
この区間情報検出部２１３は、受け取った属性情報の中から区間指定を示す属性情報を検出し、この検出結果に基づいて、指定された区間の範囲を示す区間情報を作成して、区間情報保持部２１４に送出すればよい。
【００３４】
このとき、区間情報検出部２１３は、例えば、表示データ作成部２０２から１行の桁数などの文の表示形式に関する情報を受け取り、この情報に基づいて、指定区間の文における位置を文頭からの文字数として算出すればよい。また、１つの文のなかに、複数の指定区間がある場合は、各指定区間に番号を付けて、その番号とともに、区間情報保持部２１４に保持すればよい。
【００３５】
したがって、図３ (a) に示すように、「彼は腹を立てました。」という文の下線を付して示した区間が指定された場合は、言語表現の蓄積単位であるこの文「彼は腹を立てました。」が文字情報保持部２１２に送出されるとともに、区間情報検出部２１３により、下線で示された区間の範囲を示す区間情報が検出され、区間情報保持部２１４に、表１に示すように、区間番号０に対応する区間情報が格納される。
【００３６】
表１

ここで、表１においては、文の先頭文字から各文字に順に第０番から番号を付し、区間に含まれる番号の範囲を示すことにより、その区間の範囲を文字位置の範囲として示している。
【００３７】
このように、表示用メモリ２０３の内容を読出処理部２１１が読み出して、文字コード列と属性情報とに分離し、区間情報検出部２１３が属性情報から区間情報を抽出することにより、熟語などを含んだ言語表現そのものに関する文字情報と、登録すべき熟語などの範囲を示す区間情報との入力をそれぞれ受け付けることができる。
【００３８】
すなわち、マウス２０５やキーボード２０６の操作に応じて、入力制御部２０７が表示データ作成部２０２や読出処理部２１１を制御して上述した動作を起動することにより、これらの各部により、文字列入力手段１１１および指定区間入力手段１１２の機能を実現することができる。
【００３９】
このようにして入力された文字情報は、言語表現保持部２０１に蓄積された１つの単位の言語表現全体に相当するものであるから、分解手段１１３に相当する分解処理部２２１が、形態素辞書２２２を参照しながらこの文字情報を従来と同様にして形態素に分解することにより、指定区間の言語表現とともにその前後の言語表現に関する情報を得ることができる。
【００４０】
ここで、上述した形態素辞書２２２には、図４に示すように、「彼」，「腹」，「立て」などの自立語の語幹である形態素とともに、「は」，「を」，「ました」，「。」などの非自立語である形態素が、それぞれの属性などの情報とともに蓄積されている。但し、図４においては、各形態素に対応する情報の一部として、自立語である場合には丸印を付し、非自立語である場合にはバツ印を付して示した。
【００４１】
例えば、図３ (a) に示した例文“彼は腹を立てました。”を分解処理部２２１によって形態素に分解すると、図３ (b) にハイフンで区切って示すような各形態素が得られ、形態素保持部２２３を介して不整合検出部２２４に送出される。
【００４２】
この不整合検出部２２４は第１の判定手段１１４に相当するものであり、分解処理部２２１で得られた分解結果と、対応する区間情報とを照合して、指定区間の境界が形態素の境界と一致しているか否かを判定し、一致しない旨の判定結果を得たときに、不整合を検出したとして、修正処理部２２５を起動する構成となっている。
【００４３】
このとき、不整合検出部２２４は、指定区間の開始位置が形態素の前側の境界に一致しているか否かおよび指定区間の終了位置が形態素の後ろ側の境界に一致しているか否かをそれぞれ判定すればよい。
【００４４】
例えば、図３に示した例について不整合の検出処理を行うと、指定区間の開始位置は形態素の境界に一致しているが、指定区間の終了位置は非自立語である「ました」にかかっており、形態素の境界に一致していないことが分かる。
【００４５】
この場合に、不整合検出部２２４は、不整合を検出した指定区間の境界を指定して修正処理部２２５を起動し、該当する指定区間の境界と形態素の境界との不整合の修正処理を依頼する。
【００４６】
この修正処理部２２５は、修正規則保持部２２６内の修正規則に従って、後述する修正処理を行う構成となっている。
ここで、修正規則保持部２２６には、例えば、次に挙げる２つの規則▲１▼および規則▲２▼を保持しておき、指定区間の境界が含まれている形態素が自立語であるか否かに応じて適用すればよい。
【００４７】
規則▲１▼ 該当する形態素が自立語である場合は、指定区間を該当する形態素全体に拡張する。
規則▲２▼ 該当する形態素が非自立語である場合は、その形態素を指定区間から排除する。
【００４８】
図３に示した指定区間の例を修正する際には、該当する形態素である「ました」が非自立語であることから規則▲２▼が適用され、指定区間から形態素「ました」が排除される。この場合に、修正処理部２２５は、区間情報保持部２１４の該当する区間番号に対応する区間情報を文字位置「２〜５」に修正して、図３ (c) に示すように、指定区間の終了位置を形態素「ました」の直前の形態素である「立て」の後ろ側に移動すればよい。
【００４９】
このように、修正処理部２２５が修正規則保持部２２６内の修正規則に従って動作することにより、図１に示した第１の修正手段１１５の機能を実現し、指定区間の境界と形態素の境界との不整合を解消することができる。
【００５０】
また、図２において、転送処理部２２７は抽出手段１１６として動作し、不整合が無い旨の検出結果あるいは上述した修正処理部２２５による修正処理が終了した旨の通知に応じて、区間情報保持部２１４に保持された区間情報に従って、形態素保持部２２３から指定区間に含まれる形態素を読み出し、順次に解析処理部３１１に送出すればよい。
【００５１】
このようにして、不整合を含んだ区間指定も柔軟に受け付けて、該当する文字列を形態素に分解し、この分解結果を解析処理および登録処理に供することができる。
【００５２】
この場合は、解析処理部３１１に入力される文字列は全て形態素に分解されているから、解析処理部３１１および登録処理部３１２は、従来と同様の解析処理および登録処理を行って、指定区間の文字列によって表された熟語や言い回しを熟語辞書３１３に登録すればよい。
【００５３】
上述したようにして、利用者による指定区間の境界を自動的に修正することを可能としたことにより、利用者が言語表現を登録する際に、言語情報処理装置における処理単位を意識する必要を無くし、利用者が直観的に判断した文字列の区間を受け付けて、該当する言語表現を確実に入力することが可能となる。
【００５４】
したがって、同じ言語表現を繰り返し入力する手間を省いて利用者の作業負担を軽減し、専門的な知識の少ない利用者にとっても使いやすい言語情報処理装置を実現することができる。
【００５５】
なお、言語表現保持部２０１に言語表現を蓄積する単位としては、文法的に完結したいわゆる「文」に限らず、登録すべき熟語などを含んだ文の一部などでもよい。ただし、蓄積する言語表現の１単位は、全て形態素に分解可能であることが必要である。
【００５６】
また、指定区間の境界を修正するための規則としては、更に、次に挙げる規則▲３▼のような例も考えられる。
規則▲３▼ 形態素に分解できなかった文字列の途中に指定区間の境界がある場合には、その文字列全体に指定区間を拡張する。
【００５７】
この規則▲３▼は、形態素に分解できなかった文字列を固有名詞として捉え、その文字列全体を指定区間に含めることにより、利用者の意図をくみ取ろうとするものである。
【００５８】
これにより、言語表現入力装置に備えられた形態素辞書２２２に蓄積されていない固有名詞などを含んでいる場合においても、不完全な区間指定を柔軟に受け付けることができる。
【００５９】
また、上述した実施例のように、会話的に言語表現の入力処理および解析，登録処理を進める場合には、修正処理部２２５による修正結果を表示データ作成部２０２を介してディスプレイ装置２０４に表示することにより、利用者に専門的な知識を経験的に習得させることも可能である。
【００６０】
一方、利用者が多数の言語表現を一括して入力し、これらの言語表現に関する解析，登録処理をバッチ的に処理する場合もある。
図５に、本発明にかかわる言語情報処理装置の別実施例構成図を示す。
【００６１】
図５において、言語情報処理装置は、図２に示した文字情報保持部２１２の代わりに、文情報保持部２１５と読出処理部２２８とを備えて構成されている。
この場合は、利用者によって区間が指定されたときに、区間指定が施された文の言語表現保持部２０１における格納場所を示す文情報を文情報保持部２１５に保持しておき、解析，登録処理を行う際に、読出処理部２２８が、この文情報に基づいて、言語表現保持部２０１から該当する文を読み出して、その全ての文字列を分解処理部２２１に送出すればよい。
【００６２】
例えば、登録すべき熟語などを含んだ言語表現にそれぞれ文番号が与えられており、この文番号に対応して言語表現保持部２０１に蓄積されている場合は、文情報保持部２１５は、表示データ作成部２０２から該当する文番号を受け取り、この文番号を上述した文情報として保持しておけばよい。
【００６３】
この文番号に基づいて、読出処理部２２８が言語表現保持部２０１を検索すれば、該当する文を構成する全ての文字列を読み出すことができ、指定された区間の文字列とともにその前後の文字列を分解処理部２２１による形態素への分解処理に供することができる。
【００６４】
したがって、指定区間に含まれる文字列に関する情報とともに、その前後の文字列の情報を用いて、指定区間の境界と形態素の境界との整合性を判断し、検出された不整合を指定区間の境界を移動することによって解消することができ、不整合を含んだ区間の指定を柔軟に受け付けて登録処理を行うことができる。
【００６５】
また、この場合は、解析処理や登録処理とともに、分解処理や修正処理を一括して行うことができるから、情報処理装置のプロセッサの処理能力を有効に活用することができる。
【００６６】
更に、熟語や言い回しを動詞や形容詞の語幹として登録する場合などに必要とされる制約条件に関する整合性をチェックし、そのチェック結果に応じて指定区間の境界を修正することもできる。
【００６７】
図６に、本発明にかかわる言語情報処理装置の別実施例構成図を示す。
図６において、言語情報処理装置は、図２に示した言語情報処理装置に、条件保持部２３１と条件チェック部２３２と修正処理部２３３と修正規則保持部２３４とを付加し、上述した指定区間と形態素の境界との不整合の検出および解消を経たのちに動作し、その処理結果を転送処理部２２７を介して解析処理部３１１に送出する構成となっている。
【００６８】
図６において、条件保持部２３１は指定区間に含まれる形態素の順序や種類について、例えば、「先頭および末尾の形態素は自立語である」などの制約条件を保持しており、条件チェック部２３２は、受け取った一連の形態素がこの制約条件を満たしているか否かを判定すればよい。すなわち、条件保持部２３１と条件チェック部２３２とによって、第２の判定手段１２１の機能が果たされている。
【００６９】
例えば、図３ (c) に示した修正結果が入力された場合は、先頭の形態素「腹」および末尾の形態素「立て」の両方が自立語であるから、条件チェック部２３２は、この指定区間は制約条件を満たしていると判断し、これらの形態素を解析処理部３１１に送出する。
【００７０】
一方、図３ (d) に示すように、文字列「腹を立てました」が指定区間とされた場合は、指定区間の開始位置および終了位置共に形態素の境界と整合しているから、条件チェック部２３２には、先頭の形態素「腹」から末尾の形態素「ました」までの４つの形態素が入力される。
【００７１】
この場合は、末尾の形態素が自立語ではないから、条件チェック部２３２は制約条件を満たしていないと判断し、修正処理部２３３に指定区間の修正処理を依頼する。
【００７２】
ここで、修正規則保持部２３４は、例えば、次に挙げる２つの規則▲４▼，規則▲５▼を保持しており、修正処理部２３３による修正処理に供している。
規則▲４▼ 先頭の形態素が非自立語である場合は、自立語が現れるまで文頭に向かって指定区間を拡張する。
【００７３】
規則▲５▼ 末尾の形態素が非自立語である場合は、自立語が現れるまで文頭に向かって指定区間を縮小する。
例えば、図３ (d) に示した例の場合は、修正処理部２３３が、規則▲５▼を適用して指定区間の終了位置を修正し、図３ (e) に示すように、指定区間から形態素「ました」を削除して、指定区間の終了位置を「立て」の後ろ側とすることにより、上述した制約条件を満たす形態素の連なりを得ることができる。
【００７４】
このように、修正処理部２３３が修正規則保持部２３４内の修正規則に従って修正処理を行うことにより、図１に示した第２の修正手段１２２の機能を実現することができる。
【００７５】
これにより、制約条件との不整合を含んだ指定区間も柔軟に受け付けて、解析，登録処理を進めることが可能となり、入力した言語表現を確実に登録することができるから、同じ言語表現を繰り返し入力する手間を省くことができる。
【００７６】
また、利用者が制約条件を意識する必要性を除去するので、利用者の作業負担を大幅に軽減するとともに、専門的な知識の少ない利用者にも使いやすい言語情報処理装置を提供することができる。
【００７７】
ここで、登録しようとする言語表現が上述した制約条件「先頭および末尾の形態素は自立語である」を満たしていれば、その言語表現にそのまま活用語尾を付けたり、また、接頭語を付加したりすることができ、該当する言語表現を有効に活用することができる。特に、言語表現を動詞や形容詞として登録したい場合には、上述したような制約条件を満たしていることが望まれる。
【００７８】
したがって、上述した制約条件についてのチェックおよび修正機能は、動詞や形容詞などのように、語尾が活用する言語表現を登録する際に、特に有効である。
【００７９】
なお、キーワード検索などの場合は、例えば「指定区間が自立語を一つだけ含む」というような制約条件が考えられる。
この場合は、修正規則保持部２３４に、規則▲６▼として「先頭の自立語以外の指定を無視する」を保持しておき、先頭の自立語のみをキーワードとして検索処理部に送出すればよい。
【００８０】
【発明の効果】
以上説明したように本発明は、利用者が指定した区間の文字列およびその前後の文字列に関する情報に基づいて、指定区間の境界と形態素の境界との不整合や制約条件との不整合を検出し、該当する指定区間の境界位置を移動することにより、これらの不整合を解消することができる。これにより、不整合を含んだ指定区間を柔軟に受け付けて言語表現の解析，登録処理を行うことが可能となり、利用者の作業負担を大幅に軽減することができる。
【図面の簡単な説明】
【図１】本発明にかかわる言語情報処理装置の原理ブロック図である。
【図２】本発明にかかわる言語情報処理装置の実施例構成図である。
【図３】指定区間の修正動作を説明する図である。
【図４】形態素辞書の説明図である。
【図５】本発明にかかわる言語情報処理装置の別実施例構成図である。
【図６】本発明にかかわる言語情報処理装置の更に別の実施例構成図である。
【図７】従来の言語情報処理装置の構成例を示す図である。
【符号の説明】
１１１文字列入力手段
１１２指定区間入力手段
１１３分解手段
１１４第１の判定手段
１１５第１の修正手段
１１６抽出手段
１２１第２の判定手段
１２２第２の修正手段
２０１言語表現保持部
２０２表示データ作成部
２０３表示用メモリ
２０４ディスプレイ装置
２０５マウス
２０６キーボード
２０７入力制御部
２１１読出処理部
２１２文字列保持部
２１３区間情報検出部
２１４区間情報保持部
２１５文情報保持部
２２１，３０２分解処理部
２２２，３０３形態素辞書
２２３形態素保持部
２２４不整合検出部
２２５，２３３修正処理部
２２６，２３４修正規則保持部
２２７転送処理部
２２８読出処理部
２３１条件保持部
２３２条件チェック部
３０１抽出処理部
３１１解析処理部
３１２登録処理部
３１３熟語辞書
[0001]
[Industrial application fields]
The present invention relates to a language information processing apparatus that, in an information processing system handling linguistic expressions, performs processing such as registering linguistic rules and knowledge or searching by keyword, based on a character string that represents a linguistic expression and is input by designating a section.
[0002]
Information processing systems that handle linguistic expressions include machine translation systems, text revision systems, and text feature extraction systems. In such systems, a language information processing apparatus takes as input a character string representing a linguistic expression such as an idiom or phrase, designated as a section, and analyzes that character string so that rules and knowledge about the language can be registered and put to use.
[0003]
The language information processing apparatus can also be used for searching a database using a specific language expression in a document as a keyword.
[0004]
[Prior art]
FIG. 8 shows a configuration example of a conventional language information processing apparatus.
In FIG. 8, the extraction processing unit 301 extracts, in response to an instruction from the user, the character string indicated by a section designation from the entire character string representing the document, and the decomposition processing unit 302 decomposes the extracted character string into processing units and sends them to the analysis processing unit 311.
[0005]
The decomposition processing unit 302 decomposes the character string of the designated section into morphemes based on, for example, a morpheme dictionary 303 that stores morphemes consisting of the stems of independent words and ending information such as inflectional endings.
[0006]
On receiving the set of morphemes obtained in this way, the analysis processing unit 311 analyzes the connections between the morphemes. Using this result, the registration processing unit 312 registers idioms and phrases in the idiom dictionary 313 as sequences of morphemes linked according to predetermined rules.
[0007]
Here, in the registration work to the idiom dictionary 313, an enormous number of idioms are often registered together.
In such a case, the user designates the relevant sections in the document one after another, and the character strings extracted by the extraction processing unit 301 are accumulated in order. After all the sections have been designated, the decomposition processing by the decomposition processing unit 302 and the registration processing into the idiom dictionary 313 are carried out together as a batch.
[0008]
[Problems to be solved by the invention]
As described above, since the analysis processing unit 311 analyzes the relationship between morphemes, the input to the analysis processing unit 311 must be decomposed into a series of morphemes.
[0009]
For this reason, conventionally, when the decomposition processing unit 302 could not decompose the entire character string of a designated section into morphemes, processing of that character string was stopped at that point, and the user was notified, by an error message or the like, that the designation of the character string could not be accepted.
[0010]
In addition, in order to register an idiom or phrase as the stem of a verb or adjective, a constraint condition such as “the designated section must start with an independent word and end with an independent word” is necessary.
[0011]
Conventionally, such constraint conditions have been checked by the analysis processing unit 311. If the sequence of morphemes received from the decomposition processing unit 302 does not satisfy the constraint condition, processing of the character string of that designated section is stopped immediately. In this case as well, as when the character string could not be decomposed into morphemes, the user was notified that the designation of the character string could not be accepted.
[0012]
Thus, the conventional language information processing apparatus presupposes that the user is aware of the processing units and constraint conditions of the apparatus and designates character string sections accurately so as to match them.
[0013]
Therefore, in order to make full use of the language information processing apparatus, the user needs to have sufficient knowledge and experience regarding processing units in the language information processing apparatus such as morphemes.
[0014]
However, general users often do not have such knowledge and experience, and processing units such as the morphemes described above are not the same as the units of language as commonly understood. It is therefore very difficult to accurately designate a section that matches the processing unit boundaries and the constraint conditions.
[0015]
In addition, even when a user with sufficient knowledge inputs the designated sections, mistakes in designation become likely when, for example, a large number of idioms and phrases are to be registered at once, so that many designated sections are rejected without being accepted.
[0016]
In the conventional language information processing apparatus, the only way to register an idiom or phrase corresponding to a rejected designated section was for the user to correct the input of the designated section and repeat the registration work. This work was troublesome and increased the burden on the user.
[0017]
An object of the present invention is to provide a language information processing apparatus that flexibly accepts input of language expressions including inconsistencies with processing conditions.
[0018]
[Means for Solving the Problems]
FIG. 1 shows a principle block diagram of the language information processing apparatus of claim 1.
The invention of claim 1 is a language information processing apparatus that receives input of a linguistic expression to be processed and executes predetermined processing, the apparatus comprising: character string input means 111 for inputting a character string that includes the linguistic expression to be processed; designated section input means 112 for inputting a designated section indicating the range of the linguistic expression included in the character string; decomposition means 113 for decomposing the character string input by the character string input means 111 into processing units; first determination means 114 for determining the validity of the designated section based on whether the boundaries of the designated section coincide with any of the boundaries of the series of processing units obtained by the decomposition means 113; first correction means 115 for correcting the designated section, when the first determination means 114 determines that the designated section is not valid, by moving the boundary position of the designated section so that it coincides with one of the boundaries of the processing units; second determination means 121 for determining the validity of the designated section based on whether at least one processing unit included in the designated section corrected by the first correction means satisfies a predetermined constraint condition expressing a rule about the sequential position at which a processing unit appears in the linguistic expression to be processed and the type of processing unit that should be placed at that position; second correction means 122 for correcting the designated section, when the second determination means 121 determines that the designated section is not valid, by moving the boundary position of the designated section so that the arrangement of processing units included in the corrected designated section satisfies the constraint condition; and extraction means 116 for extracting, from the series of processing units obtained by the decomposition means 113, the processing units corresponding to the character string included in the designated section obtained by the correction.
[0020]
[Action]
In the invention of claim 1, the entire character string input by the character string input means 111 is subjected to the decomposition processing by the decomposition means 113, so information can be obtained not only about the character string of the designated section indicated by the designated section input means 112 but also about the character strings before and after it.
[0021]
Therefore, when the first determination means 114 determines that the designated section is not valid, the first correction means 115 can move the boundary of the designated section within a range that includes the character strings before and after the designated section, based on the information about the designated section and the character strings before and after it.
[0022]
At this time, if the first correction means 115 moves the boundary of the designated section according to an appropriate rule for deciding whether a processing unit split by the boundary of the designated section is to be included in or excluded from the designated range, the inconsistency between the boundary of the designated section and the boundaries of the processing units is resolved, and the processing units included in the corrected designated section can be passed to subsequent processing as information about the linguistic expression to be processed.
[0023]
That is, the processing units included in the designated section corrected in this way are input to the second determination means 121 and subjected to its determination processing, namely, processing that determines whether the arrangement of these processing units satisfies the constraint condition.
[0024]
When the second determination means 121 determines, in light of the constraint condition, that the designated section is not valid, the second correction means 122 moves the boundary of the designated section, based on the constraint condition described above, in steps of the processing units included in the designated section.
[0025]
At this time, if the second correction means 122 moves the boundary of the designated section according to an appropriate rule indicating, for example, the conditions that the processing units at the beginning and end of a linguistic expression should satisfy, the inconsistency between the arrangement of the processing units included in the designated section and the structure of the linguistic expression indicated by the constraint condition can be resolved. As a result, the extraction means 116 can extract information about the linguistic expression based on a valid designated section and pass it to processing such as registration.
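The second determination and correction can be illustrated with a minimal Python sketch. It assumes the constraint quoted in paragraph [0010] (the section must start and end with an independent word) and the two boundary-moving rules given as rules (4) and (5) in paragraphs 【００７２】 and 【００７３】 of the Japanese text. The data layout and the function names `satisfies_constraint` and `correct_by_constraint` are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

# (surface, is_independent) pairs for "彼は腹を立てました。"; the flags follow FIG. 4.
MORPHEMES: List[Tuple[str, bool]] = [
    ("彼", True), ("は", False), ("腹", True), ("を", False),
    ("立て", True), ("ました", False), ("。", False),
]

def satisfies_constraint(first: int, last: int, morphemes: List[Tuple[str, bool]]) -> bool:
    """Assumed constraint: the first and last morphemes are independent words."""
    return morphemes[first][1] and morphemes[last][1]

def correct_by_constraint(first: int, last: int,
                          morphemes: List[Tuple[str, bool]]) -> Tuple[int, int]:
    """Move the section boundaries one morpheme at a time toward the sentence head."""
    # Rule (4): a non-independent leading morpheme extends the section toward the head.
    while first > 0 and not morphemes[first][1]:
        first -= 1
    # Rule (5): a non-independent trailing morpheme shrinks the section toward the head.
    while last > first and not morphemes[last][1]:
        last -= 1
    return first, last

first, last = 2, 5  # the section "腹を立てました" as morpheme indices 2 through 5
if not satisfies_constraint(first, last, MORPHEMES):
    first, last = correct_by_constraint(first, last, MORPHEMES)
print("-".join(m[0] for m in MORPHEMES[first:last + 1]))  # 腹-を-立て
```

A corrected section can still fail the constraint, for instance when no independent word precedes the section, so a caller would call `satisfies_constraint` again before passing the result on.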
[0026]
[Embodiment]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 2 shows a configuration diagram of an embodiment of the language information processing apparatus of the present invention.
[0027]
In FIG. 2, the linguistic expression holding unit 201 stores linguistic expressions, such as sentences containing the idioms and phrases to be registered, each as one unit. The display data creation unit 202 creates display data for displaying the linguistic expressions stored in the linguistic expression holding unit 201 and sends the data to the display device 204 via the display memory 203.
[0028]
Here, the linguistic expression holding unit 201 described above may, for example, treat one sentence delimited by a full stop as one unit of linguistic expression and store the sentences with a serial number assigned to each.
At this time, the display data creation unit 202 may sequentially read at least one sentence from the linguistic expression holding unit 201, create display data in which the character codes are arranged according to the number of lines and columns of the display screen of the display device 204, and store the data in the display memory 203.
[0029]
In this case, the user may operate the mouse 205 and the keyboard 206 while looking at the sentences displayed on the display device 204 to designate the section of a character string representing an idiom or phrase included in these sentences.
[0030]
Information indicating the section designated in this way is first sent to the display data creation unit 202 via the input control unit 207 as information that expresses the range of positions on the display screen by, for example, line and column.
[0031]
Based on this information, the display data creation unit 202 changes the attribute information corresponding to the characters included in the designated section of the relevant sentence. As a result, for example, the characters included in the designated section are underlined, so that the user can confirm the designation of the section.
[0032]
At this time, the input control unit 207 also notifies the read processing unit 211 that the user has designated a section. In response, the read processing unit 211 reads from the display memory 203 the character codes corresponding to all the character strings included in the relevant sentence, together with their attribute information, sends the character code string to the character string holding unit 212 to be held there, and sends the attribute information to the section information detection unit 213.
[0033]
The section information detection unit 213 may detect, from the received attribute information, the attribute information indicating section designation, create section information indicating the range of the designated section based on the detection result, and send it to the section information holding unit 214.
[0034]
At this time, the section information detection unit 213 may, for example, receive from the display data creation unit 202 information about the display format of the sentence, such as the number of columns per line, and based on this information calculate the position of the designated section in the sentence as a number of characters from the head of the sentence. When one sentence contains a plurality of designated sections, each designated section may be numbered and held in the section information holding unit 214 together with its number.
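The detection of section information from attribute information can be sketched as follows. This is a minimal illustration that assumes the attribute information is available as one underline flag per character and that the sentence is displayed from column 0 with wrapping at a fixed width. The function names `detect_sections` and `screen_to_offset` are hypothetical.

```python
from typing import List, Tuple

def detect_sections(underlined: List[bool]) -> List[Tuple[int, int, int]]:
    """Return (section_number, start, end) for each run of underlined characters.

    Positions are 0-based character offsets from the head of the sentence,
    and `end` is inclusive.
    """
    sections: List[Tuple[int, int, int]] = []
    start = None
    for pos, flag in enumerate(underlined):
        if flag and start is None:
            start = pos  # a run of underlined characters begins here
        elif not flag and start is not None:
            sections.append((len(sections), start, pos - 1))
            start = None
    if start is not None:  # the run reaches the end of the sentence
        sections.append((len(sections), start, len(underlined) - 1))
    return sections

def screen_to_offset(row: int, col: int, columns_per_line: int, first_row: int = 0) -> int:
    """Convert a (row, column) screen position into an offset from the sentence head,
    assuming the sentence starts at column 0 of `first_row` and wraps at a fixed width."""
    return (row - first_row) * columns_per_line + col

# One flag per character of a 10-character sentence; characters 2 to 6 are underlined.
flags = [False, False, True, True, True, True, True, False, False, False]
print(detect_sections(flags))  # [(0, 2, 6)]
```

Numbering the sections in order of appearance mirrors the section numbers held in the section information holding unit 214; the underlined range used here is only an example.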
[0035]
Therefore, as shown in FIG. 3 (a), when the underlined section of the sentence “彼は腹を立てました。” (“He got angry.”) is designated, this sentence, which is the storage unit of the linguistic expression, is sent to the character string holding unit 212. In addition, the section information detection unit 213 detects section information indicating the range of the underlined section, and section information corresponding to section number 0 is stored in the section information holding unit 214 as shown in Table 1.
[0036]
Table 1

Here, in Table 1, the characters of the sentence are numbered in order starting from 0 at the first character, and the range of a section is expressed as a range of character positions by giving the range of numbers included in the section.
[0037]
In this way, the read processing unit 211 reads the content of the display memory 203 and separates it into a character code string and attribute information, and the section information detection unit 213 extracts the section information from the attribute information. This makes it possible to accept, as separate inputs, the character information on the linguistic expression itself, which contains an idiom or the like, and the section information indicating the range of the idiom or the like to be registered.
[0038]
That is, in response to operation of the mouse 205 and the keyboard 206, the input control unit 207 controls the display data creation unit 202 and the read processing unit 211 to start the operations described above, and these units thereby realize the functions of the character string input means 111 and the designated section input means 112.
[0039]
Since the character information input in this way corresponds to the whole of one unit of linguistic expression stored in the linguistic expression holding unit 201, the decomposition processing unit 221, which corresponds to the decomposition means 113, decomposes this character information into morphemes in the conventional manner while referring to the morpheme dictionary 222. Information can thus be obtained about the linguistic expression of the designated section and also about the linguistic expressions before and after it.
[0040]
Here, as shown in FIG. 4, the morpheme dictionary 222 described above stores morphemes that are stems of independent words, such as “彼” (he), “腹” (belly), and “立て” (stand), together with morphemes that are non-independent words, such as “は”, “を”, “ました”, and “。”, each with information such as its attributes. In FIG. 4, as part of the information corresponding to each morpheme, a circle marks an independent word and a cross marks a non-independent word.
[0041]
For example, when the example sentence “彼は腹を立てました。” shown in FIG. 3 (a) is decomposed into morphemes by the decomposition processing unit 221, the morphemes shown separated by hyphens in FIG. 3 (b) are obtained and sent to the inconsistency detection unit 224 via the morpheme holding unit 223.
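The decomposition step can be illustrated with a toy example. The patent does not specify the decomposition algorithm, so the greedy longest-match strategy below is an assumption made only for illustration, and the dictionary contains just the entries named for FIG. 4. The names `MORPHEME_DICT` and `decompose` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Toy dictionary modeled on FIG. 4: surface form -> True if it is an independent word.
MORPHEME_DICT: Dict[str, bool] = {
    "彼": True, "腹": True, "立て": True,
    "は": False, "を": False, "ました": False, "。": False,
}

def decompose(sentence: str,
              dictionary: Dict[str, bool]) -> Optional[List[Tuple[str, int, int, bool]]]:
    """Greedy longest-match split (an assumed strategy, not the patent's).

    Returns (surface, start, end, is_independent) tuples with inclusive 0-based
    positions, or None if some part of the sentence matches no dictionary entry.
    """
    longest = max(len(key) for key in dictionary)
    result: List[Tuple[str, int, int, bool]] = []
    pos = 0
    while pos < len(sentence):
        for length in range(min(longest, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + length]
            if candidate in dictionary:
                result.append((candidate, pos, pos + length - 1, dictionary[candidate]))
                pos += length
                break
        else:
            return None  # no entry matches at this position
    return result

morphemes = decompose("彼は腹を立てました。", MORPHEME_DICT)
print("-".join(m[0] for m in morphemes))  # 彼-は-腹-を-立て-ました-。
```

Keeping the start and end position of every morpheme, including those outside the designated section, is what lets the later steps move a section boundary into the neighboring text.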
[0042]
The inconsistency detection unit 224 corresponds to the first determination means 114. It collates the decomposition result obtained by the decomposition processing unit 221 with the corresponding section information to determine whether the boundaries of the designated section coincide with morpheme boundaries, and when it determines that they do not coincide, it regards an inconsistency as detected and activates the correction processing unit 225.
[0043]
At this time, the inconsistency detection unit 224 may determine separately whether the start position of the designated section coincides with the front boundary of a morpheme and whether the end position of the designated section coincides with the rear boundary of a morpheme.
[0044]
For example, when inconsistency detection is performed on the example shown in FIG. 3, the start position of the designated section coincides with a morpheme boundary, but the end position falls within the non-independent word “ました” and does not coincide with a morpheme boundary.
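The boundary check can be sketched in a few lines of Python. Table 1 is not reproduced in this text, so the designated section of character positions 2 to 6 used here is an assumption chosen to end inside “ました”, as the paragraph describes. The name `check_boundaries` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (surface, start, end, is_independent) with inclusive 0-based character positions.
Morpheme = Tuple[str, int, int, bool]

MORPHEMES: List[Morpheme] = [
    ("彼", 0, 0, True), ("は", 1, 1, False), ("腹", 2, 2, True), ("を", 3, 3, False),
    ("立て", 4, 5, True), ("ました", 6, 8, False), ("。", 9, 9, False),
]

def check_boundaries(section: Tuple[int, int], morphemes: List[Morpheme]) -> Tuple[bool, bool]:
    """Return (start_ok, end_ok) for an inclusive (start, end) character section.

    The start must coincide with the front boundary of some morpheme and the
    end with the rear boundary of some morpheme.
    """
    start, end = section
    start_ok = any(m[1] == start for m in morphemes)
    end_ok = any(m[2] == end for m in morphemes)
    return start_ok, end_ok

print(check_boundaries((2, 6), MORPHEMES))  # (True, False)
```

Reporting the two boundaries separately lets the caller hand only the mismatched boundary to the correction step.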
[0045]
In this case, the inconsistency detection unit 224 activates the correction processing unit 225, specifying the boundary of the designated section at which the inconsistency was detected, and requests correction of the inconsistency between that boundary and the morpheme boundary.
[0046]
The correction processing unit 225 is configured to perform correction processing described later in accordance with the correction rules in the correction rule holding unit 226.
Here, the correction rule holding unit 226 may hold, for example, the following two rules, rule (1) and rule (2), which are applied according to whether the morpheme containing the boundary of the designated section is an independent word.
[0047]
Rule (1) When the morpheme in question is an independent word, the designated section is extended to cover the whole of that morpheme.
Rule (2) When the morpheme in question is a non-independent word, that morpheme is excluded from the designated section.
[0048]
When the example designated section shown in FIG. 3 is corrected, rule (2) is applied because the morpheme in question, “ました”, is a non-independent word, and the morpheme “ました” is excluded from the designated section. In this case, the correction processing unit 225 may correct the section information for the relevant section number in the section information holding unit 214 to character positions “2 to 5”, moving the end position of the designated section to just after “立て”, the morpheme immediately preceding “ました”, as shown in FIG. 3 (c).
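Rules (1) and (2) can be expressed as a short Python sketch. The starting section of positions 2 to 6 is an assumption, since Table 1 is not reproduced, and the function names `containing` and `correct_section` are hypothetical.

```python
from typing import List, Tuple

# (surface, start, end, is_independent) with inclusive 0-based character positions.
Morpheme = Tuple[str, int, int, bool]

MORPHEMES: List[Morpheme] = [
    ("彼", 0, 0, True), ("は", 1, 1, False), ("腹", 2, 2, True), ("を", 3, 3, False),
    ("立て", 4, 5, True), ("ました", 6, 8, False), ("。", 9, 9, False),
]

def containing(pos: int, morphemes: List[Morpheme]) -> Morpheme:
    """Return the morpheme whose character span covers `pos`."""
    return next(m for m in morphemes if m[1] <= pos <= m[2])

def correct_section(section: Tuple[int, int], morphemes: List[Morpheme]) -> Tuple[int, int]:
    """Snap both boundaries of an inclusive (start, end) section to morpheme boundaries."""
    start, end = section
    m = containing(start, morphemes)
    if m[1] != start:  # the start falls inside a morpheme
        # Rule (1): include the whole independent word. Rule (2): drop the non-independent word.
        start = m[1] if m[3] else m[2] + 1
    m = containing(end, morphemes)
    if m[2] != end:  # the end falls inside a morpheme
        end = m[2] if m[3] else m[1] - 1
    return start, end

print(correct_section((2, 6), MORPHEMES))  # (2, 5)
```

With this input the end boundary lies inside the non-independent word “ました”, so rule (2) moves it back to position 5, which matches the corrected range “2 to 5” in the text. Exclusion can in principle leave an empty section (start greater than end), which a caller would need to handle.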
[0049]
As described above, the correction processing unit 225 operates according to the correction rules in the correction rule holding unit 226, thereby realizing the function of the first correction means 115 shown in the drawing, so that the inconsistency between the boundary of the designated section and the boundary of the morpheme can be resolved.
[0050]
Also, in the drawing, the transfer processing unit 227 operates as the extraction means 116. In response to a detection result indicating that there is no inconsistency, or to a notification that the correction processing by the correction processing unit 225 described above has been completed, it may read the morphemes included in the designated section from the morpheme holding unit 223 according to the section information held in the section information holding unit 214, and send them sequentially to the analysis processing unit 311.
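The extraction step can be illustrated with a short sketch that selects the morphemes lying entirely inside an already corrected section. The record layout, the function name, and the sample data are hypothetical; the text does not specify how the transfer processing unit 227 stores or scans morphemes.

```python
from typing import List, Tuple

Morpheme = Tuple[str, bool]  # hypothetical record: (surface, is_independent)


def extract_morphemes(morphemes: List[Morpheme], start: int, end: int) -> List[Morpheme]:
    """Return the morphemes fully contained in the half-open section [start, end)."""
    selected: List[Morpheme] = []
    pos = 0
    for surface, independent in morphemes:
        m_end = pos + len(surface)
        if pos >= start and m_end <= end:
            selected.append((surface, independent))
        pos = m_end
    return selected


sample = [("hara", True), ("wo", False), ("tate", True), ("ta", False)]
print(extract_morphemes(sample, 0, 10))
# [('hara', True), ('wo', False), ('tate', True)]
```

Because the section has already been aligned to morpheme boundaries, no morpheme is ever cut in half at this stage.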
[0051]
In this way, it is possible to flexibly accept section designations including inconsistencies, decompose the corresponding character string into morphemes, and use the decomposition results for analysis processing and registration processing.
[0052]
In this case, since every character string input to the analysis processing unit 311 has been decomposed into morphemes, the analysis processing unit 311 and the registration processing unit 312 may perform the same analysis and registration processing as in the conventional apparatus, and register the idiom or phrase expressed by the character string of the specified section in the idiom dictionary 313.
[0053]
As described above, the boundary of the section designated by the user can be corrected automatically, so the user need not be aware of the processing unit used in the language information processing apparatus when registering a language expression. The apparatus can accept a section of the character string that the user has determined intuitively, and the corresponding language expression can be input reliably.
[0054]
Therefore, the trouble of repeatedly inputting the same language expression is eliminated and the burden on the user is reduced, so that a language information processing apparatus that is easy to use even for a user with little specialized knowledge can be realized.
[0055]
The unit for accumulating language expressions in the language expression holding unit 201 is not limited to a so-called “sentence” that is grammatically complete, but may be a part of a sentence including idioms to be registered. However, one unit of language expression to be accumulated must be decomposable into morphemes.
[0056]
Further, as a rule for correcting the boundary of the designated section, an example such as the following rule (3) can be considered.
Rule (3) When a boundary of the specified section lies in the middle of a character string that could not be decomposed into morphemes, the specified section is extended to the entire character string.
[0057]
Rule (3) is intended to reflect the user's intention by treating a character string that could not be decomposed into morphemes as a proper noun and including the entire character string in the designated section.
[0058]
Thereby, even when a proper noun that is not stored in the morpheme dictionary 222 provided in the language expression input device is included, incomplete section designation can be flexibly accepted.
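Rule (3) can be sketched in the same style. Here each segment carries a `kind` label, and an `"unknown"` segment stands for a character string that could not be decomposed into morphemes. The labels, the function name, and the sample sentence are assumptions made purely for illustration.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # hypothetical: (surface, kind), kind is "independent", "dependent" or "unknown"


def apply_rule_3(segments: List[Segment], start: int, end: int) -> Tuple[int, int]:
    """Extend a half-open section [start, end) over any unknown segment it cuts into."""
    new_start, new_end = start, end
    pos = 0
    for surface, kind in segments:
        seg_start, seg_end = pos, pos + len(surface)
        if kind == "unknown":
            if seg_start < start < seg_end:
                new_start = seg_start  # pull the start back to cover the whole string
            if seg_start < end < seg_end:
                new_end = seg_end      # push the end forward to cover the whole string
        pos = seg_end
    return new_start, new_end


# "Kunitachi" plays the role of a proper noun missing from the dictionary.
sample = [("Kunitachi", "unknown"), ("ni", "dependent"), ("iku", "independent")]
print(apply_rule_3(sample, 4, 14))  # (0, 14)
```

Segments of other kinds are ignored here; they would be handled by rules (1) and (2).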
[0059]
In addition, as in the above-described embodiment, when the input processing, analysis, and registration processing of language expressions are performed interactively, the correction result by the correction processing unit 225 is displayed on the display device 204 via the display data creation unit 202. By doing so, it is also possible to let the user acquire specialized knowledge empirically.
[0060]
On the other hand, there is a case where a user inputs a large number of language expressions at once, and the analysis and registration processes related to these language expressions are processed in batches.
FIG. 5 is a block diagram showing another embodiment of the language information processing apparatus according to the present invention.
[0061]
In this embodiment, the language information processing apparatus is provided with a sentence information holding unit 215 and a read processing unit 228 in place of the character information holding unit 212 shown in the drawings.
In this case, when a section is designated by the user, sentence information indicating where the sentence containing the designated section is stored in the language expression holding unit 201 is held in the sentence information holding unit 215. When the analysis and registration processing is performed, the read processing unit 228 may read the corresponding sentence from the language expression holding unit 201 based on the sentence information and send its entire character string to the decomposition processing unit 221.
[0062]
For example, when a sentence number is given to each language expression containing an idiom to be registered, and the expression is stored in the language expression holding unit 201 in association with that sentence number, the sentence information holding unit 215 may receive the corresponding sentence number from the display data creation unit 202 and hold it as the sentence information described above.
[0063]
If the read processing unit 228 searches the language expression holding unit 201 based on the sentence number, the entire character string constituting the corresponding sentence can be read, and the character strings before and after the character string of the specified section can also be subjected to morpheme decomposition by the decomposition processing unit 221.
[0064]
Therefore, using the information on the character string included in the specified section and on the character strings before and after it, the consistency between the boundary of the specified section and the morpheme boundaries can be determined, any detected inconsistency can be eliminated by moving the boundary of the specified section, and the registration processing can be performed while flexibly accepting section designations that include inconsistencies.
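The deferred, batch-style flow of this embodiment can be sketched as follows: designations are recorded as a sentence number plus section offsets, and the whole sentence is read and decomposed only when the batch runs. The class `BatchQueue`, the toy dictionary, and the greedy longest-match decomposer are all stand-ins assumed for illustration; the text does not describe the decomposition algorithm in this passage.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, bool]  # hypothetical record: (surface, is_independent)

# Toy stand-in for a morpheme dictionary: surface -> is_independent.
TOY_DICTIONARY: Dict[str, bool] = {"hara": True, "wo": False, "tate": True, "ta": False}


def decompose(text: str) -> List[Morpheme]:
    """Greedy longest-match decomposition against the toy dictionary."""
    morphemes: List[Morpheme] = []
    pos = 0
    while pos < len(text):
        for length in range(len(text) - pos, 0, -1):
            candidate = text[pos:pos + length]
            if candidate in TOY_DICTIONARY:
                morphemes.append((candidate, TOY_DICTIONARY[candidate]))
                pos += length
                break
        else:
            raise ValueError(f"cannot decompose text at offset {pos}")
    return morphemes


class BatchQueue:
    """Holds (sentence number, start, end) designations until the batch is run."""

    def __init__(self, sentences: Dict[int, str]) -> None:
        self.sentences = sentences  # plays the role of the stored language expressions
        self.pending: List[Tuple[int, int, int]] = []

    def designate(self, sentence_no: int, start: int, end: int) -> None:
        self.pending.append((sentence_no, start, end))

    def run(self) -> List[Tuple[int, Tuple[int, int], List[Morpheme]]]:
        results = []
        for sentence_no, start, end in self.pending:
            whole_sentence = self.sentences[sentence_no]  # read the full sentence, not just the section
            results.append((sentence_no, (start, end), decompose(whole_sentence)))
        self.pending.clear()
        return results


queue = BatchQueue({1: "harawotateta"})
queue.designate(1, 0, 11)
print(queue.run())
```

Because the full sentence is decomposed, the morphemes surrounding the section are available when its boundaries are later checked and corrected.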
[0065]
Further, in this case, since the decomposition process and the correction process can be performed together with the analysis process and the registration process, the processing capability of the processor of the information processing apparatus can be effectively utilized.
[0066]
Furthermore, it is possible to check whether the constraint conditions required when registering idioms and phrases as stems of verbs and adjectives are satisfied, and to correct the boundaries of the designated section according to the check result.
[0067]
FIG. 6 is a block diagram showing another embodiment of the language information processing apparatus according to the present invention.
In the drawing, the language information processing apparatus is obtained by adding a condition holding unit 231, a condition check unit 232, a correction processing unit 233, and a correction rule holding unit 234 to the language information processing apparatus shown in FIG. 2. After the mismatch between the specified section and the morpheme boundaries described above has been detected and eliminated, these units operate, and the processing result is sent to the analysis processing unit 311 via the transfer processing unit 227.
[0068]
In the drawing, the condition holding unit 231 holds a constraint condition on the order and type of the morphemes included in the specified section, such as “the first and last morphemes are independent words”, and the condition check unit 232 may determine whether or not the received series of morphemes satisfies the constraint condition. That is, the function of the second determination means 121 is fulfilled by the condition holding unit 231 and the condition check unit 232.
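The split between holding a constraint and checking it can be sketched with a list of predicates. The names `CONDITIONS`, `first_and_last_independent`, and `check_conditions`, as well as the sample morphemes, are hypothetical and chosen only to mirror the roles of the condition holding unit 231 and the condition check unit 232.

```python
from typing import Callable, List, Tuple

Morpheme = Tuple[str, bool]  # hypothetical record: (surface, is_independent)
Constraint = Callable[[List[Morpheme]], bool]


def first_and_last_independent(section: List[Morpheme]) -> bool:
    """Constraint: the first and last morphemes are independent words."""
    return bool(section) and section[0][1] and section[-1][1]


# Stand-in for the held constraint conditions; more predicates could be appended.
CONDITIONS: List[Constraint] = [first_and_last_independent]


def check_conditions(section: List[Morpheme], conditions: List[Constraint] = CONDITIONS) -> bool:
    """Return True only if every held constraint is satisfied by the section."""
    return all(condition(section) for condition in conditions)


ok_section = [("hara", True), ("wo", False), ("tate", True)]
bad_section = ok_section + [("ta", False)]
print(check_conditions(ok_section))   # True
print(check_conditions(bad_section))  # False
```

Keeping the constraints as data makes it possible to swap in a different condition, such as one for keyword search, without changing the checking code.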
[0069]
For example, when the correction result shown in FIG. (c) is input, both the first morpheme “antinode” and the last morpheme “stand” are independent words, so the condition check unit 232 determines that the specified section satisfies the constraint condition, and these morphemes are sent to the analysis processing unit 311.
[0070]
On the other hand, when the character string “I got angry” is set as the specified section as shown in FIG. (d), the start and end positions of the specified section match morpheme boundaries, and four morphemes, from the first morpheme to the last morpheme “M”, are obtained.
[0071]
In this case, since the last morpheme is not an independent word, the condition check unit 232 determines that the constraint condition is not satisfied, and requests the correction processing unit 233 to perform the correction process for the specified section.
[0072]
Here, the correction rule holding unit 234 holds, for example, the following two rules (4) and (5), which are used in the correction processing by the correction processing unit 233.
Rule (4) If the first morpheme is a non-independent word, the designated section is extended toward the beginning of the sentence until the independent word appears.
[0073]
Rule (5) If the last morpheme is a non-independent word, the designated section is reduced toward the beginning of the sentence until the independent word appears.
For example, in the case of the example shown in FIG. (d), the correction processing unit 233 applies rule (5) to correct the end position of the designated section. As shown in FIG. (e), by deleting the morpheme “sata” from the designated section and setting the end position of the designated section just after “stand”, a series of morphemes satisfying the above-described constraint condition can be obtained.
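Rules (4) and (5) operate on whole morphemes, so they can be sketched over morpheme indices rather than character offsets. The function name, the inclusive index convention, and the sample sentence are assumptions. The stopping behaviour when no independent word is found is also an assumption, as the text does not address it.

```python
from typing import List, Tuple

Morpheme = Tuple[str, bool]  # hypothetical record: (surface, is_independent)


def apply_rules_4_and_5(sentence: List[Morpheme], first: int, last: int) -> Tuple[int, int]:
    """Adjust an inclusive morpheme index range [first, last] within the sentence."""
    # Rule (4): extend toward the beginning of the sentence until an independent word appears.
    while first > 0 and not sentence[first][1]:
        first -= 1
    # Rule (5): shrink toward the beginning of the sentence until an independent word appears.
    while last > first and not sentence[last][1]:
        last -= 1
    # Assumption: if no independent word is found, the loops simply stop at the
    # sentence start or at `first`, and the constraint may remain unsatisfied.
    return first, last


sentence = [("hara", True), ("wo", False), ("tate", True), ("ta", False)]
print(apply_rules_4_and_5(sentence, 0, 3))  # (0, 2): trailing "ta" removed by rule (5)
print(apply_rules_4_and_5(sentence, 1, 3))  # (0, 2): rule (4) pulls in "hara" as well
```

A caller would typically re-run the constraint check on the result, since the rules do not guarantee success for every input.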
[0074]
In this way, the correction processing unit 233 performs the correction processing according to the correction rules in the correction rule holding unit 234, whereby the function of the second correction means 122 shown in the drawing can be realized.
[0075]
As a result, specified sections that are inconsistent with the constraint conditions can be accepted flexibly and the analysis and registration processing can proceed, so the input language expression can be registered reliably. This saves the user the trouble of repeatedly entering language expressions.
[0076]
In addition, since the user no longer needs to be aware of the constraint conditions, the workload of the user can be greatly reduced, and a language information processing apparatus that is easy to use even for users with little specialized knowledge can be provided.
[0077]
Here, if the language expression to be registered satisfies the above-mentioned constraint condition “the first and last morphemes are independent words”, an inflectional ending or a prefix can be attached to the language expression as it is, so that the language expression can be used effectively. In particular, when a language expression is to be registered as a verb or an adjective, it is desirable that the above-described constraint condition be satisfied.
[0078]
Therefore, the check and correction functions for the constraint conditions described above are particularly effective when registering language expressions whose endings inflect, such as verbs and adjectives.
[0079]
In the case of keyword search or the like, for example, a constraint condition that “the specified section includes only one independent word” can be considered.
In this case, the correction rule holding unit 234 may hold “ignore designations other than the first independent word” as rule (6), and only the first independent word may be sent as a keyword to the search processing unit.
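Rule (6) reduces to picking the first independent word in the designated section. The sketch below assumes the same hypothetical `(surface, is_independent)` record as a stand-in for morpheme data. Returning `None` when no independent word exists is an illustrative choice, not something the text specifies.

```python
from typing import List, Optional, Tuple

Morpheme = Tuple[str, bool]  # hypothetical record: (surface, is_independent)


def keyword_from_section(section: List[Morpheme]) -> Optional[str]:
    """Rule (6): keep only the first independent word and ignore the rest."""
    for surface, independent in section:
        if independent:
            return surface
    return None  # assumption: no keyword when the section has no independent word


section = [("hara", True), ("wo", False), ("tate", True), ("ta", False)]
print(keyword_from_section(section))  # hara
```

The returned string would then be passed on as the search keyword.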
[0080]
[Effect of the invention]
As described above, according to the present invention, an inconsistency between the boundary of the specified section and a morpheme boundary, or an inconsistency with the constraint condition, is detected based on information about the character string of the section specified by the user and the character strings before and after it, and the inconsistency can be resolved by moving the boundary position of the specified section. As a result, specified sections that include inconsistencies can be accepted flexibly for analysis and registration of language expressions, and the work burden on the user can be greatly reduced.
[Brief description of the drawings]
[Figure 1] A principle block diagram of a language information processing apparatus related to the present invention.
[Figure 2] A block diagram of an embodiment of the language information processing apparatus according to the present invention.
[Fig. 3] A diagram explaining the correction operation for a designated section.
[Fig. 4] An explanatory diagram of a morpheme dictionary.
[Figure 5] A block diagram of another embodiment of the language information processing apparatus according to the present invention.
[Fig. 6] A block diagram of another embodiment of the language information processing apparatus according to the present invention.
[Fig. 7] A diagram showing a configuration example of a conventional language information processing apparatus.
[Explanation of symbols]
111 Character string input means
112 Specified section input means
113 Decomposing means
114 First determination means
115 First correction means
116 Extraction means
121 Second determination means
122 Second correction means
201 Language expression holding unit
202 Display data creation unit
203 Display memory
204 Display device
205 Mouse
206 Keyboard
207 Input controller
211 Read processing unit
212 Character string holding unit
213 Section information detection unit
214 Section information holding unit
215 Sentence information holding unit
221, 302 Decomposition processing unit
222, 303 Morpheme dictionary
223 Morpheme holding unit
224 Inconsistency detection unit
225, 233 Correction processing unit
226, 234 Correction rule holding unit
227 Transfer processing unit
228 Read processing unit
231 Condition holding unit
232 Condition check unit
301 Extraction processing unit
311 Analysis processing unit
312 Registration processing unit
313 Idiom dictionary

Claims

A language information processing apparatus that receives an input of a language expression to be processed and executes a predetermined process, comprising:
Character string input means for inputting a character string including the language expression to be processed;
Designated section input means for inputting a designated section indicating the range of the language expression included in the character string;
Decomposing means for decomposing the character string input by the character string input means into processing units, which are the units of semantic analysis processing of language expressions;
First determination means for determining the validity of the specified section based on whether or not the boundary of the specified section matches any boundary of the series of processing units obtained by the decomposing means;
First correction means for correcting the specified section, when the first determination means determines that the specified section is not valid, by moving the boundary position of the specified section so that it coincides with one of the boundaries of the processing units;
Second determination means for determining the validity of the specified section based on whether or not at least one processing unit included in the specified section corrected by the first correction means satisfies a predetermined constraint condition indicating a rule relating to the sequential position at which the processing unit appears in the language expression to be processed and the type of processing unit to be arranged at that position;
Second correction means for correcting the specified section, when the second determination means determines that the specified section is not valid, by moving the boundary position of the specified section so that the array of processing units included in the corrected specified section satisfies the constraint condition; and
Extraction means for extracting, from the series of processing units obtained by the decomposing means, the processing units corresponding to the character string included in the specified section obtained by the correction.