JP3740541B2 - Language processing system and program

Info

Publication number: JP3740541B2
Application number: JP2004017845A
Authority: JP
Inventors: Masaki Murata (村田真樹); Hitoshi Isahara (井佐原均)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2004-01-27
Filing date: 2004-01-27
Publication date: 2006-02-01
Anticipated expiration: 2021-10-09
Also published as: JP2004158038A

Description

The present invention relates to a language processing system and program that use a system, such as the diff command for difference detection, which aligns multiple pieces of data so that the matching portion is maximized while order information is preserved.

Conventionally, tasks such as aligning Japanese sentences with English sentences, aligning a lecture with its draft, and aligning a question with knowledge data have required complicated decision-making programs.

These conventional approaches to aligning bilingual corpora, aligning lectures with drafts, and aligning questions with knowledge data require complicated programs, which makes simple alignment difficult.

An object of the present invention is to solve the above problem by using the diff command so that multiple pieces of language data can be aligned easily, for example bilingual corpora, or a lecture and its draft.

FIG. 1 shows the language processing system of the present invention. In FIG. 1, 1 is input means, 2 is a processing unit, 3 is morphological analysis means, 4 is difference detection means, and 5 is similarity calculation means.

To solve the conventional problem described above, the present invention has the following means.

(1): The system comprises input means 1 for inputting multiple pieces of language data, namely first language data that contains symbols serving as alignment markers and second language data that does not contain those symbols, and difference detection means 4 for detecting the common parts and difference parts of the multiple pieces of language data by using a system that aligns multiple pieces of data so that the matching portion is maximized while order information is preserved. The difference detection means 4 takes the language data from the common parts as is, takes from each difference part the language data on the side corresponding to the second language data, takes from the side corresponding to the first language data only the symbols, and arranges all of these in their original order. This yields second language data into which the symbols have been inserted, and thereby aligns the multiple pieces of language data. Consequently, for example, by erasing the draft data (the first language data) while keeping only the chapter information, which is the symbol serving as the alignment marker, chapter information can be inserted into the lecture data (the second language data), and the draft and the lecture can be aligned easily.

(2): The language processing system of (1) is a system for aligning draft data with its lecture data, and the multiple pieces of language data are the draft data and its lecture data. Consequently, by erasing the draft data while keeping only the chapter information, which is the symbol serving as the alignment marker, chapter information can be inserted into the lecture data, and the draft and the lecture can be aligned easily.

(3): In the language processing system of (1) or (2), the diff command is used as the system that aligns multiple pieces of data so that the matching portion is maximized while order information is preserved. Consequently, multiple pieces of language data can be aligned easily.

(4): A program causes a computer to execute: a process of inputting multiple pieces of language data, namely first language data that contains symbols serving as alignment markers and second language data that does not contain those symbols; a process of detecting the common parts and difference parts of the multiple pieces of language data by using a system that aligns multiple pieces of data so that the matching portion is maximized while order information is preserved; and a process of aligning the multiple pieces of language data by taking the language data from the common parts as is, taking from each difference part the language data on the side corresponding to the second language data, taking from the side corresponding to the first language data only the symbols, and arranging all of these in their original order, which yields second language data into which the symbols have been inserted. Consequently, installing this program on a computer readily provides a language processing system that can easily align multiple pieces of language data such as a draft and a lecture.

The present invention has the following effects.

(1): The difference detection means takes the language data from the common parts as is, takes from each difference part the language data on the side corresponding to the second language data, takes from the side corresponding to the first language data only the symbols serving as alignment markers, and arranges all of these in their original order. This yields second language data into which the symbols have been inserted, and thereby aligns the multiple pieces of language data. For example, by erasing the draft data (the first language data) while keeping only the chapter information, which is the symbol serving as the alignment marker, chapter information can be inserted into the lecture data (the second language data), and the draft and the lecture can be aligned easily.

(2): Because the multiple pieces of language data are draft data and its lecture data, erasing the draft data while keeping only the chapter information, which is the symbol serving as the alignment marker, inserts chapter information into the lecture data, and the draft and the lecture can be aligned easily.

(3): Because the diff command is used as the system that aligns multiple pieces of data so that the matching portion is maximized while order information is preserved, multiple pieces of language data can be aligned easily.

(1): Description of the language processing system
FIG. 1 is an explanatory diagram of the language processing system of the present invention. In FIG. 1, the input means 1 inputs the language data on which difference detection is performed. The processing unit 2 processes the input data. The morphological analysis means 3 splits the input language data into an optimal word sequence using a dictionary and a grammar. The difference detection means 4 uses the diff command to detect differences between multiple pieces of language data that correspond to each other. The similarity calculation means 5 uses the difference detection means 4 to obtain the similarity between a question and the knowledge data, and outputs the part of highly similar knowledge data that corresponds to the interrogative word of the question.

(2): Description of diff, mdiff, and mdiffc
(1) Description of diff
diff refers to the UNIX (registered trademark) file comparison tool diff. Given two files, the diff command outputs their differences line by line while preserving order information.

For example, suppose that file (1) contains the three lines
今日
学校へ
いく
("today", "to school", "go") and that file (2) contains the three lines
今日
大学へ
いく
("today", "to university", "go"). Taking the diff of these two files outputs the difference in the following form.
< 学校へ
> 大学へ
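The line-level comparison above can be sketched in Python. This is only an illustration under stated assumptions: Python's `difflib.SequenceMatcher` stands in for the UNIX diff command (it finds long matching runs but does not guarantee the maximal match), the function name `simple_diff` is hypothetical, and the line-number headers that real diff prints are omitted.

```python
import difflib

def simple_diff(first, second):
    """Return only the differing lines, marked '<' (first file) or '>' (second file)."""
    out = []
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # plain diff output does not show common lines
        out += [f"< {line}" for line in first[i1:i2]]
        out += [f"> {line}" for line in second[j1:j2]]
    return out

file1 = ["今日", "学校へ", "いく"]  # file (1), one entry per line
file2 = ["今日", "大学へ", "いく"]  # file (2)
print("\n".join(simple_diff(file1, file2)))
```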

(2) Description of mdiff
The diff command has a convenient option, -D. When diff is run with this option, it outputs the common parts as well as the difference parts; in other words, it merges the files. The difference parts are expressed with the ifdef directives used by the C preprocessor and similar tools, but because ifdef directives are hard to read, the difference parts are displayed here as follows.

；▽▽▽▽▽▽
(part found only in the first file)
；●●●
(part found only in the second file)
；△△△△△△
Here, "；▽▽▽▽▽▽" marks the start of a difference part, "；△△△△△△" marks its end, and "；●●●" marks the boundary between the two pieces of data that make up the difference.

In this embodiment, diff run with the -D option, with the ifdef parts further displayed (formatted) as above so that the files are merged, is called mdiff (the m stands for merge).

Applying mdiff to the data above (file (1) and file (2)) gives the following result.

今日
；▽▽▽▽▽▽
学校へ
；●●●
大学へ
；△△△△△△
いく
Here 今日 matches, 学校へ and 大学へ form a difference, and いく is again a common part. Because mdiff, unlike diff, also outputs the matching parts, its output is easier to follow.
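A merged output of this shape can be sketched in Python as follows. The sketch assumes `difflib.SequenceMatcher` as a stand-in for `diff -D` (so the alignment is not guaranteed to be the maximal one), and the names `mdiff`, `START`, `MID`, and `END` are illustrative rather than taken from the patent.

```python
import difflib

START, MID, END = "；▽▽▽▽▽▽", "；●●●", "；△△△△△△"

def mdiff(first, second):
    """Merge two line lists: common lines once, differences between markers."""
    out = []
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(first[i1:i2])   # common part, shown once
        else:
            out.append(START)
            out.extend(first[i1:i2])   # part found only in the first file
            out.append(MID)
            out.extend(second[j1:j2])  # part found only in the second file
            out.append(END)
    return out

file1 = ["今日", "学校へ", "いく"]
file2 = ["今日", "大学へ", "いく"]
print("\n".join(mdiff(file1, file2)))
```

Keeping the result as a list of lines, rather than one string, makes it easy to post-process the difference blocks in later steps.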

The data of the two original files can also be restored completely from the mdiff result. Taking the common parts together with only the part above the black circles (；●●●) in each difference gives
今日
学校へ
いく
which is the content of the first file. Taking the common parts together with only the part below the black circles (；●●●) in each difference gives
今日
大学へ
いく
which is the content of the second file. In this way the original information can be restored completely.
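The restoration described above can be sketched as a small state machine over the merged lines. The function name `split_mdiff` is hypothetical, and the sketch assumes that no data line is identical to one of the three marker lines.

```python
START, MID, END = "；▽▽▽▽▽▽", "；●●●", "；△△△△△△"

def split_mdiff(merged):
    """Recover the two original line lists from an mdiff-style merged list."""
    first, second = [], []
    side = "common"
    for line in merged:
        if line == START:
            side = "first"    # lines above ；●●● belong to the first file
        elif line == MID:
            side = "second"   # lines below ；●●● belong to the second file
        elif line == END:
            side = "common"
        else:
            if side in ("common", "first"):
                first.append(line)
            if side in ("common", "second"):
                second.append(line)
    return first, second

merged = ["今日", START, "学校へ", MID, "大学へ", END, "いく"]
file1, file2 = split_mdiff(merged)
print(file1)
print(file2)
```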

In addition, mdiff shows a matching part only once, as it appeared in one of the inputs, and shows both versions only for non-matching parts, so the amount of data is smaller than that of the two originals together. Because the original information can nevertheless be restored completely, as described above, mdiff can be said to achieve data compression in the sense of reducing data volume while keeping the data restorable.

FIG. 2 is an explanatory diagram of the mdiff language processing system. In FIG. 2, the mdiff language processing system includes a UNIX diff processing unit 11 and a formatting unit 12. The UNIX diff processing unit 11 outputs the difference parts and common parts that diff finds in the two input files. The formatting unit 12 reformats the output of the UNIX diff processing unit 11 into an easy-to-read representation.

FIG. 3 is a flowchart of mdiff. The description below follows steps S1 to S4 of FIG. 3.

S1: Two files are given to the UNIX diff processing unit 11 as input.

S2: The UNIX diff processing unit 11 detects the matching and non-matching parts of the two files (diff is run here with the -D option so that matching parts are also detected).

S3: The formatting unit 12 formats the output of the UNIX diff processing unit 11. Specifically, it converts the ifdef directives in the output of "diff -D symbol" into easy-to-read markers such as "；▽▽▽▽▽▽".

S4: The formatting unit 12 outputs the formatted result.
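Step S3 can be sketched as a line-by-line rewrite of the ifdef directives. This is a rough illustration under several assumptions: the sample text is written by hand in the layout that GNU diff is understood to produce for `-D X` (`#ifndef X`, `#else /* X */`, `#endif /* X */`, and `#ifdef X` for lines present only in the second file), it is not captured tool output, and data lines that themselves begin with `#else` or `#endif` would be misread. The function name `reformat_ifdef` is hypothetical.

```python
START, MID, END = "；▽▽▽▽▽▽", "；●●●", "；△△△△△△"

def reformat_ifdef(diff_d_output, name):
    """Replace ifdef directives in `diff -D name` style text with mdiff markers."""
    out = []
    seen_mid = False
    for line in diff_d_output.splitlines():
        if line.startswith(f"#ifndef {name}"):
            out.append(START)          # first-file-only lines follow
            seen_mid = False
        elif line.startswith(f"#ifdef {name}"):
            out += [START, MID]        # nothing on the first-file side
            seen_mid = True
        elif line.startswith("#else"):
            out.append(MID)
            seen_mid = True
        elif line.startswith("#endif"):
            if not seen_mid:
                out.append(MID)        # nothing on the second-file side
            out.append(END)
        else:
            out.append(line)
    return out

# Hand-written sample in the assumed `diff -D X` layout.
sample = "今日\n#ifndef X\n学校へ\n#else /* X */\n大学へ\n#endif /* X */\nいく\n"
print("\n".join(reformat_ifdef(sample, "X")))
```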

(3) Description of mdiffc

Next, consider mdiff at the character level. In language processing, differences are often wanted character by character. In that case, the contents of each file can first be written out with a line break after every character, and mdiff applied to the resulting files. For example, the contents of file (1) above are first put into the form
今
日
学
校
へ
い
く
and mdiff is then applied. Applying mdiff one character at a time in this way is called mdiffc (the c stands for character).
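A character-level variant can be sketched by splitting each text into single characters before merging. As before, `difflib.SequenceMatcher` is assumed as a substitute for diff, the function names are illustrative, and line breaks in the input are simply dropped, as in the example above.

```python
import difflib

START, MID, END = "；▽▽▽▽▽▽", "；●●●", "；△△△△△△"

def mdiff(first, second):
    """Merge two sequences: common items once, differences between markers."""
    out = []
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(first[i1:i2])
        else:
            out += [START, *first[i1:i2], MID, *second[j1:j2], END]
    return out

def mdiffc(first_text, second_text):
    """Character-level mdiff: one character per 'line', newlines removed."""
    return mdiff(list(first_text.replace("\n", "")),
                 list(second_text.replace("\n", "")))

result = mdiffc("今日\n学校へ\nいく", "今日\n大学へ\nいく")
print("\n".join(result))
```

With this input the shared character 学 is aligned, so only 大 and 校 end up inside difference blocks, which is finer than the line-level result.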

Because diff output is hard to read and mdiff output contains all the information that diff displays, the rest of the description uses mdiff.

(3): Description of difference detection and acquisition of rewrite rules
(1) Description of detecting differences between the outputs of multiple systems
In the past, when several versions of the juman system coexisted, the outputs of those versions were merged with mdiff to improve the quality of the morphological analysis results (see Masaki Murata (村田真樹), "Estimation of the referents of nouns in Japanese text," doctoral dissertation, Faculty of Engineering, Kyoto University (1995); and Mamoru Ishima (石間衛), Atsushi Fujii (藤井敦), Tetsuya Ishikawa (石川徹也), "Development and evaluation of the Japanese morphological and syntactic analysis system JEMONI," IPSJ SIG Natural Language Processing, 98-NL-127 (1998)). The example 「といったこと」 is used here.

Suppose that 「といったこと」 is analyzed and that version A of juman outputs
と と 助詞
いった 言う 動詞
こと こと 名詞
while version B outputs
と と 助詞
いった 行く 動詞
こと こと 名詞
(助詞 is "particle", 動詞 is "verb", and 名詞 is "noun"). The word いった is ambiguous between 行く ("go") and 言う ("say"), and version B has wrongly output it as 行く. Taking mdiff here gives the following result.

と と 助詞
；▽▽▽▽▽▽
いった 言う 動詞
；●●●
いった 行く 動詞
；△△△△△△
こと こと 名詞
Taking mdiff makes it easy to detect differences between the outputs of multiple systems. In this case, the outputs differ at いった. The person correcting the output then judges which side is correct at each detected difference, under a convention such as: if the upper side is correct, do nothing; if the lower side is correct, put "x" at the head of "；●●●". With this convention, deleting the lower side of each difference when there is no "x" and the upper side when there is one, together with the delimiter symbols, automatically selects the better result at each difference from the annotated data and produces a result more accurate than either version alone. It also often happens that both sides of a difference are wrong. In that case, the data above "；●●●" should be rewritten by hand.
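The selection step after manual annotation can be sketched as follows. The sketch assumes the convention described above with an ASCII "x" placed directly in front of the "；●●●" line; the function name `resolve` and the exact marker strings are illustrative assumptions.

```python
START, MID, END = "；▽▽▽▽▽▽", "；●●●", "；△△△△△△"

def resolve(annotated):
    """Keep the upper side of each difference, or the lower side if ；●●● is prefixed with 'x'."""
    out, upper = [], []
    state, keep_lower = "common", False
    for line in annotated:
        if line == START:
            state, upper = "upper", []
        elif line in (MID, "x" + MID):
            keep_lower = line.startswith("x")
            if not keep_lower:
                out.extend(upper)      # no "x": the upper side is correct
            state = "lower"
        elif line == END:
            state = "common"
        elif state == "upper":
            upper.append(line)
        elif state == "lower":
            if keep_lower:
                out.append(line)       # "x": the lower side is correct
        else:
            out.append(line)
    return out

merged = ["と と 助詞", START, "いった 言う 動詞", MID,
          "いった 行く 動詞", END, "こと こと 名詞"]
print(resolve(merged))  # the annotator left the upper side as correct
```

Marking the boundary line itself keeps the annotation effort to one keystroke per difference, which matches the workflow the text describes.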

With this method, the only errors that cannot be corrected are those that both versions make in the same way, so many morphological errors can be corrected. Note that several systems with different characteristics must be prepared; if the systems err in the same way, many errors will be overlooked.

When there are three systems, the diff3 command can be used. diff3 detects the differences among three files.

Morphological analysis was used as the example above, but for other kinds of analysis too, differences can be taken with mdiff by putting the analysis results into a line-by-line form. If character-level comparison is needed, mdiffc can be used.

The discussion so far concerned differences between the outputs of multiple systems, but if one input is a tagged corpus and it is compared with the result of analyzing the same text with some system, errors in the tagged corpus can also be detected and corrected.

(2) Description of examining differences and acquiring rewrite rules
This section describes a study of the diff between spoken and written language. Using aligned spoken-language and written-language data, the differences between the two are examined to study how spoken and written language differ, and to acquire rules for paraphrasing spoken language into written language and vice versa. As data, oral presentations at an academic conference were used as the spoken-language data, and the conference drafts describing the content of those presentations were used as the written-language data.

FIG. 4 is an explanatory diagram of an example of written-language data and spoken-language data. Suppose, for example, that written and spoken data are given in the form shown in FIG. 4. To make differences easy to take, the data has been converted with a morphological analysis system (the morphological analysis means 3) or the like so that each line holds one word. Taking mdiff of such written and spoken data gives a result like that of FIG. 5. FIG. 5 is an explanatory diagram of the diff result for the written-language and spoken-language data. Extracting only the difference parts from the result of FIG. 5 gives a result like that of FIG. 6. FIG. 6 is an explanatory diagram of the extraction of the difference parts.

The result of FIG. 6 shows that fillers such as 「え」 ("uh") are inserted in spoken language, and that spoken language smooths the utterance by inserting expressions such as 「っていうの」. It also shows that the plain form 「述べる」 ("state") is rephrased as the polite form 「述べます」. Thus mdiff can detect the differences between spoken and written language, and examining those differences makes it possible to study how the two differ. These differences can also be regarded as paraphrasing rules between spoken and written language.

For example, the 「え」 part can be read as a rule that, when converting to spoken language, 「え」 is inserted at a position where the written language has nothing. Likewise, the 「述べる」 and 「述べます」 part can be read as a rule that, when converting to spoken language, 「述べる」 is rephrased as 「述べます」. In that sense, mdiff can be used to detect paraphrasing rules or conversion rules.
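Reading difference blocks as rewrite rules can be sketched as below. The word lists are invented for illustration (the data of FIG. 4 is not reproduced here), `difflib.SequenceMatcher` again stands in for diff, and the function name `rewrite_rules` is hypothetical.

```python
import difflib

def rewrite_rules(written, spoken):
    """Return (written_side, spoken_side) pairs, one per difference block."""
    rules = []
    matcher = difflib.SequenceMatcher(a=written, b=spoken, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((" ".join(written[i1:i2]), " ".join(spoken[j1:j2])))
    return rules

# Hypothetical one-word-per-line data after morphological analysis.
written = ["本稿", "で", "は", "述べる"]
spoken = ["え", "本稿", "で", "は", "述べます"]
for src, dst in rewrite_rules(written, spoken):
    print(repr(src), "->", repr(dst))
```

An empty string on the written side corresponds to a pure insertion rule such as the filler 「え」.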

Spoken and written data were used as the example here, but the same can be done in many settings. For example, taking mdiff of an English text before and after proofreading shows what kinds of differences should be corrected and how, and yields something like rules for English proofreading. Taking mdiff of a text before and after summarization shows concretely how the text was summarized and yields something like summarization rules. More generally, taking mdiff of aligned data with differing characteristics supports various analyses and the acquisition of paraphrasing rules.

(4): Description of data merging
(1) Description of aligning a bilingual corpus
This section considers the alignment of a bilingual corpus. As a precondition, each corpus is assumed to contain the same symbols at corresponding positions. The unit of alignment is the portion delimited by these symbols.

FIG. 7 is an explanatory diagram of the structure of the corpora. Suppose that a Japanese corpus and an English corpus still exist separately and have not been aligned. As in the example of FIG. 7, both are given section information of the same form, such as Section 1. Text in the same section is assumed to have the same content in Japanese and in English.

In this case, taking mdiff of the two sets of data gives a result like that of FIG. 8. FIG. 8 is an explanatory diagram of the bilingual corpus aligned by mdiff. In the result of FIG. 8, Section 1 and the like become common parts and everything else becomes non-matching parts. In each non-matching part, the Japanese and the English are stored separately, one above and one below. In this way mdiff produces aligned bilingual data.
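Section-level alignment of this kind can be sketched as follows. The sentences are invented stand-ins for the corpora of FIG. 7, `difflib.SequenceMatcher` is assumed in place of diff, and `align_sections` is a hypothetical name. The sketch assumes that the section symbols are the only lines the two files share.

```python
import difflib

def align_sections(first, second):
    """Map each shared section symbol to (first-file lines, second-file lines)."""
    pairs = {}
    current = None  # most recent shared line, expected to be a section symbol
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            current = first[i2 - 1]
        else:
            pairs[current] = (first[i1:i2], second[j1:j2])
    return pairs

# Hypothetical corpora with matching section symbols.
japanese = ["Section 1", "今日は晴れです。", "Section 2", "明日は雨です。"]
english = ["Section 1", "It is sunny today.", "Section 2", "It will rain tomorrow."]
for section, (ja, en) in align_sections(japanese, english).items():
    print(section, ja, en)
```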

What is shown here does not produce a fine-grained alignment such as sentence by sentence, only a coarse one at the level of sections, and may therefore seem of little use at first glance. Sentence alignment is a difficult problem, however, and one approach is to align first at the level of chapters and paragraphs, where correspondence is known for certain in advance, and then refine the alignment. In that sense a coarse alignment like this one is also useful.

The same alignment of bilingual data could also be achieved by writing a program that recognizes information such as Section 1 and splits the text accordingly. With mdiff, however, the alignment is obtained easily without writing such a complicated program.

FIG. 9 is an explanatory diagram of the language processing system for a bilingual corpus. In FIG. 9, the language processing system for a bilingual corpus includes an mdiff processing unit 21. The mdiff processing unit 21 outputs the mdiff of the two input files, the source-text data and the translation data.

FIG. 10 is a flowchart of applying mdiff to a bilingual corpus. The description below follows steps S11 to S13 of FIG. 10.

S11: Two files are given to the mdiff processing unit 21 as input. Here the two files contain English text and Japanese text, respectively.

S12: The mdiff processing unit 21 takes the mdiff of the two files.

S13: The mdiff processing unit 21 outputs the mdiff result.

(2) Description of aligning a lecture with its draft
This section considers aligning a lecture with its draft. The lecture and the draft correspond to the spoken-language data and the written-language data described in the earlier section on acquiring rewrite rules. That is, the lecture is an oral presentation at an academic conference, and the draft is the paper that corresponds to that presentation. Given such a lecture and draft, if each part of the lecture can be matched with each part of the draft, a listener can refer to the corresponding part of the draft while listening to the lecture, and a reader can refer to the corresponding part of the lecture while reading the draft, which is convenient. The following explains how to align the lecture and the draft with mdiff.

In particular, mdiff is used here to find which part of the lecture each chapter of the draft corresponds to. The draft and the lecture are assumed to proceed in the same order. To make the chapters of the draft easy to recognize, symbols such as "<Chapter 1>" are inserted into the draft data as shown in FIG. 11. FIG. 11 is an explanatory diagram of the structure of the draft data. With the data in this form, morphological analysis is applied to the draft and lecture data so that each line holds one word, and mdiff is applied, or alternatively mdiffc is used; this gives a result like that of FIG. 12. FIG. 12 is an explanatory diagram of the mdiff result for the draft and the lecture. If everything in the upper half of each difference part, which corresponds to the draft, is erased except for symbols such as "<Chapter 1>", a result like that of FIG. 13 is obtained. FIG. 13 is an explanatory diagram of the result of inserting chapter information into the lecture data. In FIG. 13, only symbols such as "<Chapter 1>" have been inserted into the original lecture data. In other words, it becomes clear which part of the lecture corresponds to which chapter of the draft.

Put simply, the matching capability of mdiff is used to match the draft against the lecture, and the draft information is erased except for the chapter information, so that chapter information is inserted into the lecture data. With mdiff, this kind of alignment between a draft and a lecture is also easy to perform.

FIG. 14 is an explanatory diagram of the language processing system for aligning a lecture with its draft. In FIG. 14, the language processing system includes an mdiff processing unit 21 and a draft deletion unit 22. The mdiff processing unit 21 takes the mdiff of the two input files and outputs it. The draft deletion unit 22 deletes all draft data in the draft-side difference parts, keeping only chapter information such as <Chapter 1> (it also deletes mdiff symbols such as "；△△△△△△").

FIG. 15 is a processing flowchart for associating a lecture with its draft. The following description follows steps S21 to S24 of FIG. 15.

S21: Two files are given as input to the mdiff processing unit 21. The two files contain the text of the draft and of the lecture, respectively. The draft data is assumed to carry markers such as <Chapter 1> that indicate the extent of each chapter.

S22: The mdiff processing unit 21 takes the mdiff of the two files.

S23: The draft deletion unit 22 deletes all draft data on the draft side of each difference part, leaving only chapter information such as <Chapter 1>. It also deletes mdiff markers such as “; △△△△△△”.

S24: The draft deletion unit 22 outputs the result of deleting all draft data on the draft side of each difference part while leaving only the chapter information.
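
Steps S21 to S24 can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's standard `difflib.SequenceMatcher` stands in for mdiff (it preserves order but is not guaranteed to find the largest possible match), the function name `insert_chapter_marks` is hypothetical, and the word lists are made-up sample data rather than the contents of FIG. 11 to FIG. 13.

```python
import re
from difflib import SequenceMatcher

# Chapter markers as they appear in the draft data, e.g. "<Chapter 1>".
CHAPTER = re.compile(r"^<Chapter \d+>$")

def insert_chapter_marks(draft_units, lecture_units):
    """Return the lecture units with the draft's chapter markers inserted.

    Both arguments are lists with one unit (word) per element, as after
    morphological analysis.
    """
    out = []
    matcher = SequenceMatcher(None, draft_units, lecture_units, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Common part: keep the text as it is.
            out.extend(lecture_units[j1:j2])
        else:
            # Difference part: from the draft side keep only chapter markers,
            # then keep everything from the lecture side.
            out.extend(u for u in draft_units[i1:i2] if CHAPTER.match(u))
            out.extend(lecture_units[j1:j2])
    return out

# Made-up sample data (assumption, not taken from the patent figures).
draft = ["<Chapter 1>", "we", "propose", "mdiff",
         "<Chapter 2>", "results", "are", "good"]
lecture = ["today", "we", "propose", "mdiff",
           "well", "the", "results", "are", "good"]
marked = insert_chapter_marks(draft, lecture)
print("\n".join(marked))
```

The markers land in front of the lecture-side text of the same difference part, mirroring the description in which the draft occupies the upper half of each difference part.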

(5): Description of a question answering system using the optimal matching capability
This section describes a question answering system (a question answering language processing system) that uses the optimal matching capability of mdiff. A question answering system is one that, when asked for example 「日本の首都はどこですか」 ("Where is the capital of Japan?"), returns the answer itself, 「東京」 ("Tokyo").

Assuming that knowledge is written in natural language, the basic approach is to match the question sentence against a knowledge sentence and output the part of the match result that corresponds to the interrogative word. For the question above, the system finds the sentence 「日本の首都は東京です」 ("The capital of Japan is Tokyo") and outputs 「東京」, which corresponds to the interrogative word, as the answer. Here we consider doing this with mdiff.

First, the interrogative part of the question sentence is replaced with X and the sentence ending is converted to declarative form, giving 「日本の首都はXです」 ("The capital of Japan is X"). The sentence 「日本の首都は東京です」 is obtained from the knowledge base. Taking the mdiffc of these two sentences gives the following result.

日
本
の
首
都
は
; ▽▽▽▽▽▽
X
; ●●●
東
京
; △△△△△△
で
す
If the string paired with X in the difference part is taken as the answer, 「東京」 (Tokyo) is extracted correctly.
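
The character-level listing above can be reproduced with a short sketch. It assumes that Python's standard `difflib.SequenceMatcher` is an acceptable stand-in for the diff-based mdiffc (it preserves order but does not guarantee the largest possible match), and the function name `mdiff_units` is hypothetical. The three marker strings are copied from the listing.

```python
from difflib import SequenceMatcher

# Marker lines as they appear in the listing above.
OPEN, MID, CLOSE = "; ▽▽▽▽▽▽", "; ●●●", "; △△△△△△"

def mdiff_units(a, b):
    """Merge two unit sequences into mdiff-style output lines.

    Common units are emitted as they are. Each difference part is emitted as
    OPEN, units only in `a`, MID, units only in `b`, CLOSE.
    """
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        else:
            out.append(OPEN)
            out.extend(a[i1:i2])
            out.append(MID)
            out.extend(b[j1:j2])
            out.append(CLOSE)
    return out

# One character per unit, as mdiffc does.
lines = mdiff_units(list("日本の首都はXです"), list("日本の首都は東京です"))
print("\n".join(lines))
```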

When mdiffc is used, the answer can still be extracted correctly even if the sentences differ slightly. For example, suppose the knowledge base sentence is 「日本国の首都は東京です」 (the same sentence with 日本国, "the State of Japan", in place of 日本, "Japan"). In this case the mdiffc result is as follows.

日
本
; ▽▽▽▽▽▽
; ●●●
国
; △△△△△△
の
首
都
は
; ▽▽▽▽▽▽
X
; ●●●
東
京
; △△△△△△
で
す
The difference parts increase slightly, but the part corresponding to X is still 「東京」, so the answer is extracted correctly.
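
A minimal sketch of the extraction rule, "take whatever is paired with X in a difference part", is given below. It again assumes `difflib.SequenceMatcher` as a stand-in for mdiffc, and `answer_for_placeholder` is a hypothetical name.

```python
from difflib import SequenceMatcher

def answer_for_placeholder(question, knowledge, placeholder="X"):
    """Return the knowledge-side text paired with the placeholder.

    Looks for the difference part whose question side contains the
    placeholder and returns the knowledge side of that same part.
    Returns None if no such difference part exists.
    """
    matcher = SequenceMatcher(None, question, knowledge, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal" and placeholder in question[i1:i2]:
            return knowledge[j1:j2]
    return None

exact = answer_for_placeholder("日本の首都はXです", "日本の首都は東京です")
loose = answer_for_placeholder("日本の首都はXです", "日本国の首都は東京です")
print(exact, loose)
```

Both calls use the two knowledge sentences discussed above, so the extra 国 in the second one ends up in a separate difference part and does not disturb the part paired with X.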

In the question answering system we propose, transformations guided by similarity are applied repeatedly, and the matching described above is performed once the question sentence and the knowledge data sentence agree more closely. This requires a definition of similarity.

Because mdiff identifies matching and non-matching parts, similarity can be defined in a form such as (number of characters in matching parts) / (total number of characters). Here the similarity is computed with mdiff. In this way mdiff is also useful for measuring the similarity of sentences.
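
One way to read this definition is sketched below. The text does not say exactly what "total number of characters" covers, so the sketch assumes it means every character in the merged mdiff output: the matched characters counted once, plus the unmatched characters of both sentences. `difflib.SequenceMatcher` stands in for mdiff, and `mdiff_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher

def mdiff_similarity(a, b):
    """(characters in matching parts) / (all characters in the merged output).

    Assumption: the denominator counts matched characters once and adds the
    unmatched characters from both inputs.
    """
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    total = len(a) + len(b) - matched
    return matched / total if total else 1.0

sim_exact = mdiff_similarity("日本の首都はXです", "日本の首都は東京です")
sim_loose = mdiff_similarity("日本の首都はXです", "日本国の首都は東京です")
print(sim_exact, sim_loose)
```

Under this reading the first pair has 8 matched characters out of 11 and the second has 8 out of 12, so the extra 国 lowers the score slightly.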

If there is a rule for paraphrasing between 「日本国」 and 「日本」, then 「日本の首都はXです」 can be paraphrased as 「日本国の首都はXです」 before matching, which reduces the non-matching parts and makes it more certain that the answer is obtained.
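
The effect of such a paraphrase rule can be illustrated as follows. The rule table, the function names, and the one-way plain string replacement are all assumptions made for this sketch; the text does not specify how rules are represented. The similarity uses the same assumed reading as above (matched characters over all characters in the merged output), with `difflib.SequenceMatcher` standing in for mdiff.

```python
from difflib import SequenceMatcher

def mdiff_similarity(a, b):
    # Matched characters over all characters of the merged output (assumed reading).
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    total = len(a) + len(b) - matched
    return matched / total if total else 1.0

# Hypothetical one-way rewrite rules: (source, replacement).
PARAPHRASE_RULES = [("日本", "日本国")]

def paraphrases(sentence, rules=PARAPHRASE_RULES):
    """Apply each rule whose source string occurs in the sentence."""
    return [sentence.replace(src, dst) for src, dst in rules if src in sentence]

def best_variant(question, knowledge):
    """Pick the original or paraphrased question that matches best."""
    candidates = [question] + paraphrases(question)
    return max(candidates, key=lambda q: mdiff_similarity(q, knowledge))

question = "日本の首都はXです"
knowledge = "日本国の首都は東京です"
best = best_variant(question, knowledge)
print(best, mdiff_similarity(question, knowledge), mdiff_similarity(best, knowledge))
```

A naive substring rule like this would misfire on text that already contains 日本国, which is why the sketch applies it only once to the question.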

Being able to build a language processing system that answers simple questions just by using mdiff is valuable from the standpoint of simplicity.

FIG. 16 is an explanatory diagram of the question answering system. In FIG. 16, the question answering system includes a question sentence conversion unit 31, a question sentence storage unit 32, a keyword extraction unit 33, a database sentence search unit 34, a database sentence storage unit 35, a similarity calculation unit 36, an mdiff processing unit 37, a question sentence transformation unit 38, a database sentence transformation unit 39, and a corresponding part output unit 40. The question sentence conversion unit 31 converts an interrogative sentence into a declarative sentence and the interrogative word into X. The question sentence storage unit 32 stores question sentences. The keyword extraction unit 33 extracts keywords from the question sentence. The database sentence search unit 34 searches for database sentences that contain many of the keywords. The database sentence storage unit 35 stores the database sentences found by the database sentence search unit 34. The similarity calculation unit 36 compares a question sentence with a database sentence and computes their similarity. The mdiff processing unit 37 takes the mdiff of a question sentence and a database sentence in units of one character. The question sentence transformation unit 38 transforms question sentences. The database sentence transformation unit 39 transforms database sentences. The corresponding part output unit 40 extracts and outputs, from the output of the mdiff processing unit 37, the database-side expression that corresponds to "X".

FIG. 17 is a processing flowchart for question answering. The following description follows steps S31 to S39 of FIG. 17.

S31: An initial value is assigned to the similarity S0 of the question answering system (normally "0" at first), and processing moves to S32.

S32: A question sentence is given as input to the question answering system, and processing moves to S33.

S33: The question sentence conversion unit 31 converts the given interrogative sentence into a declarative sentence and the interrogative word into X, and passes the converted sentence to the question sentence storage unit 32.

S34: The input question sentence is also passed to the keyword extraction unit 33, which extracts keywords from it. The extracted keywords are passed to the database sentence search unit 34, and processing moves to S35.

S35: The database sentence search unit 34 retrieves from the knowledge base (a database, not shown) several database sentences that contain many of the keywords. The results are stored in the database sentence storage unit 35, and processing moves to S36.

S36: The similarity calculation unit 36 compares question sentences with database sentences and computes their similarity, using the mdiff processing unit 37. That is, the question sentence and the database sentence whose similarity is wanted are input to the mdiff processing unit 37, and the mdiff result gives the number of characters in the matching parts and in the non-matching parts. The similarity is defined in advance as (number of characters in matching parts) / (total number of characters). The similarity is computed for every pair of sentences in the question sentence storage unit 32 and the database sentence storage unit 35, and the highest value found is taken as S. The similarity calculation unit 36 then determines whether S is greater than S0. Processing moves to S37 if S is greater than S0, and to S38 if S is equal to S0.

S37: The question sentence transformation unit 38 and the database sentence transformation unit 39 transform the question sentences and the database sentences, respectively, and store the results in their respective storage units. The similarity calculation unit 36 sets S0 to the value of S, and processing returns to S36. The transformation here means, for example, paraphrasing 「日本国の首都はXです」 as 「日本の首都はXです」.

S38: The similarity calculation unit 36 passes to the corresponding part output unit 40 the output of the mdiff processing unit 37 that was used to obtain this value of S, and processing moves to S39.

S39: The corresponding part output unit 40 extracts and outputs, from the output of the mdiff processing unit 37, the database-side expression that corresponds to "X".
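
The control flow of S31 to S39 can be sketched as below. Everything concrete in it is an assumption made for illustration: `difflib.SequenceMatcher` stands in for mdiff, the question conversion is a single toy string rule, keywords and rewrite rules are passed in by hand, the similarity denominator counts all characters of the merged output, and a `max_rounds` guard is added so that badly chosen rules cannot loop forever.

```python
from difflib import SequenceMatcher

def opcodes(a, b):
    return SequenceMatcher(None, a, b, autojunk=False).get_opcodes()

def similarity(a, b):
    # Matched characters over all characters of the merged output (assumed reading).
    matched = sum(i2 - i1 for tag, i1, i2, _, _ in opcodes(a, b) if tag == "equal")
    total = len(a) + len(b) - matched
    return matched / total if total else 1.0

def extract_answer(question, sentence, placeholder="X"):
    # S39: database-side text paired with the placeholder.
    for tag, i1, i2, j1, j2 in opcodes(question, sentence):
        if tag != "equal" and placeholder in question[i1:i2]:
            return sentence[j1:j2]
    return None

def to_declarative(question):
    # S33: toy rule covering only "どこですか" questions (assumption).
    return question.replace("どこですか", "Xです")

def retrieve(knowledge_base, keywords, top_n=3):
    # S35: rank sentences by how many keywords they contain.
    ranked = sorted(knowledge_base,
                    key=lambda s: sum(k in s for k in keywords), reverse=True)
    return ranked[:top_n]

def transform(sentences, rules):
    # S37: keep the originals and add one-step rewrites.
    result = set(sentences)
    for s in sentences:
        for src, dst in rules:
            if src in s:
                result.add(s.replace(src, dst))
    return result

def answer(question, knowledge_base, keywords, rules, max_rounds=10):
    s0 = 0.0                                           # S31
    questions = {to_declarative(question)}             # S32, S33
    candidates = set(retrieve(knowledge_base, keywords))  # S34, S35
    best_q = best_d = None
    for _ in range(max_rounds):
        # S36: best similarity over all stored sentence pairs.
        s, best_q, best_d = max((similarity(q, d), q, d)
                                for q in questions for d in candidates)
        if s <= s0:                                    # no improvement: go to S38
            break
        s0 = s                                         # S37
        questions = transform(questions, rules)
        candidates = transform(candidates, rules)
    return extract_answer(best_q, best_d)              # S38, S39

kb = ["日本国の首都は東京です", "フランスの首都はパリです"]  # made-up knowledge base
result = answer("日本の首都はどこですか", kb, ["日本", "首都"], [("日本国", "日本")])
print(result)
```

The rewrite rule here follows the direction given in S37 (日本国 to 日本), which cannot be applied twice to the same sentence and so lets the loop settle after one transformation.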

In the embodiments above, the diff command was used for difference detection, but any other difference detection means may be used as long as it is a system that, in predetermined units, performs the association that maximizes the matching part while retaining order information.

(6): Description of program installation
The input means 1, processing unit 2, morphological analysis means 3, difference detection means 4, similarity calculation means 5, UNIX diff processing unit 11, formatting unit 12, mdiff processing unit 21, draft deletion unit 22, question sentence conversion unit 31, question sentence storage unit 32, keyword extraction unit 33, database sentence search unit 34, database sentence storage unit 35, similarity calculation unit 36, mdiff processing unit 37, question sentence transformation unit 38, database sentence transformation unit 39, corresponding part output unit 40, and so on can be implemented as programs, which are stored in main memory and executed by the main control unit (CPU). These programs are processed by an ordinary computer. The computer consists of hardware such as a main control unit, main memory, a file device, a display device, and an input device serving as input means, such as a keyboard.

The program of the present invention is installed on this computer. For installation, the programs are stored on a portable recording (storage) medium such as a floppy disk or magneto-optical disk and are installed into the file device of the computer, either through a drive device for accessing the recording medium or over a network such as a LAN. The program steps needed for processing are then read from the file device into main memory and executed by the main control unit.

FIG. 1 is an explanatory diagram of the language processing system of the present invention.
FIG. 2 is an explanatory diagram of the mdiff language processing system in the embodiment.
FIG. 3 is a flowchart of mdiff in the embodiment.
FIG. 4 is an explanatory diagram of an example of written-language data and spoken-language data in the embodiment.
FIG. 5 is an explanatory diagram of the diff result for the written-language data and spoken-language data in the embodiment.
FIG. 6 is an explanatory diagram of the extraction of difference parts in the embodiment.
FIG. 7 is an explanatory diagram of the structure of the corpus in the embodiment.
FIG. 8 is an explanatory diagram of the bilingual corpus aligned by mdiff in the embodiment.
FIG. 9 is an explanatory diagram of the language processing system for the bilingual corpus in the embodiment.
FIG. 10 is a flowchart of mdiff for the bilingual corpus in the embodiment.
FIG. 11 is an explanatory diagram of the structure of the draft data in the embodiment.
FIG. 12 is an explanatory diagram of the mdiff result for the draft and the lecture in the embodiment.
FIG. 13 is an explanatory diagram of the result of inserting chapter information into the lecture data in the embodiment.
FIG. 14 is an explanatory diagram of the language processing system for associating a lecture with its draft in the embodiment.
FIG. 15 is a processing flowchart for associating a lecture with its draft in the embodiment.
FIG. 16 is an explanatory diagram of the question answering system in the embodiment.
FIG. 17 is a processing flowchart for question answering in the embodiment.

Explanation of symbols

1 Input means
2 Processing unit
3 Morphological analysis means
4 Difference detection means
5 Similarity calculation means

Claims

1. A language processing system comprising:
input means for inputting a plurality of language data consisting of first language data that includes symbols serving as correspondence marks and second language data that does not include the symbols; and
difference detection means for detecting common parts and difference parts of the plurality of language data using a system that associates a plurality of data so as to maximize the matching part while retaining order information,
wherein the difference detection means takes the language data as it is from the common parts, takes from the difference parts the language data on the side corresponding to the second language data, takes the symbols from the difference parts on the side corresponding to the first language data, and arranges them in their original order, thereby obtaining the second language data with the symbols inserted and associating the plurality of language data.

2. The language processing system according to claim 1, wherein the system associates draft data with the corresponding lecture data, and the plurality of language data are the draft data and the lecture data.

3. The language processing system according to claim 1, wherein a diff command is used as a system for associating a plurality of data so as to maximize a matching portion while maintaining the order information.

4. A program for causing a computer to execute:
a process of inputting a plurality of language data consisting of first language data that includes symbols serving as correspondence marks and second language data that does not include the symbols;
a process of detecting common parts and difference parts of the plurality of language data using a system that associates a plurality of data so as to maximize the matching part while retaining order information; and
a process of associating the plurality of language data by taking the language data as it is from the common parts, taking from the difference parts the language data on the side corresponding to the second language data, taking the symbols from the difference parts on the side corresponding to the first language data, and arranging them in their original order, thereby obtaining the second language data with the symbols inserted.