JP6187745B2 - Document analysis system, method and program

Info

Publication number: JP6187745B2
Application number: JP2013104648A
Authority: JP
Inventors: 綾子久野; 英司平尾
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-05-17
Filing date: 2013-05-17
Publication date: 2017-08-30
Anticipated expiration: 2033-05-17
Also published as: JP2014225174A

Description

The present invention relates to a document analysis system, method, and program, and in particular to a document analysis system, method, and program that extract ambiguous expressions from a document written in natural language.

In recent years, document analysis systems have been developed that use an information processing apparatus to analyze a document written in natural language and to identify the points that most need correction and the quality of the document.

One example of a technique related to such document analysis systems is disclosed in Patent Document 1 as "Japanese sentence correcting device, Japanese sentence correcting method, and program for Japanese sentence correction". The Japanese sentence correcting device disclosed in Patent Document 1 has a sentence structure database in which rules about structures that are easy to machine translate are registered, a character and terminology database in which rules about characters and terms that are easy to machine translate are registered, a style database in which rules about styles that are easy to machine translate are registered, a correction location extraction means, a display means, a correction means, and an output means.

The device having this configuration operates as follows. The correction location extraction means reads the original Japanese text and extracts any structure, character or term usage, or style that violates a rule. Such rules include structure rules such as "create as many sentences as there are predicates", character and terminology rules such as "a sentence contains no phonetic-equivalent characters, typos, or omitted characters", and style rules such as "a sentence contains no ambiguous phrases". The display means then displays the extracted structure, character or term usage, or style. Next, the correction means uses externally input data to correct, add to, or delete the extracted items, thereby revising the original Japanese text. Finally, the output means outputs the easy-to-translate text obtained by the revision. With this configuration, the device extracts and displays the points in the original Japanese text that violate the rules and supports external revisions such as correction, addition, and deletion.

Another example of a technique related to document analysis systems is disclosed in Non-Patent Document 1 as "Prototype and evaluation of a tool to detect ambiguity in specifications". This ambiguity detection method proceeds in steps: it searches for phrases registered in a dictionary to extract ambiguous word candidates, and then classifies the ambiguity level of each candidate as ambiguous, semi-ambiguous, or non-ambiguous according to rules about usage. By excluding unambiguous phrases and selectively detecting only highly ambiguous ones, the method makes correction work more efficient.

JP 2007-316834 A

Prototype and evaluation of a tool to detect ambiguity in specifications, Proceedings of the IEICE General Conference 2012, Information and Systems, 27, 2012-03-06

The problem with these disclosed techniques is that, even when their analysis methods are applied to extract ambiguous expressions from a document written in natural language, highly ambiguous locations cannot be detected accurately. The reason is that whether an expression generally regarded as ambiguous is truly ambiguous depends heavily on the usage example in which it appears. A method that merely detects the presence of registered ambiguous words, as in Patent Document 1, therefore also detects expressions that are not highly ambiguous in the context where they are used.

The same applies to a method such as that of Non-Patent Document 1, which detects generally ambiguous expressions based on phrases registered in a dictionary in advance and then narrows them down by identifying the usage example through pattern matching. With the accuracy of current natural language processing technology, it is difficult to identify only the truly ambiguous usage examples without error, so the detection results still include many expressions that are not highly ambiguous.

Here, the ambiguity of a document refers to characteristics that affect how well the document conveys information, including at least the possibility that multiple interpretations arise: for example, whether misunderstandings between the writer and the reader are unlikely, and whether the document is easy for the reader to understand.

In view of the above problem, an object of the present invention is to provide a document analysis system, method, and program that accurately extract the ambiguous expressions that most need correction from a document written in natural language. To do so, the invention uses as information the degree to which the document under analysis was written with attention to ambiguity, and changes the conditions used to identify only the truly ambiguous usage examples among the ambiguous expressions in the document, thereby reducing false detections.

A document analysis system according to the present invention is a document analysis system that extracts ambiguous expressions from a document, and comprises: a document input unit that receives input of a target document or document group; a document analysis unit that extracts word information about each word used in the sentences constituting the document or document group and about where each word is used; a document quality evaluation unit that evaluates document quality using features of the input document based on a predetermined document quality indexing rule and calculates a document quality index; an ambiguous example database that stores the character strings of ambiguous words that may be ambiguous, the features of usage examples with different ambiguity for each ambiguous word, and the ambiguity, that is, the degree of ambiguity, of each usage example; an example analysis unit that, based on the word information, queries the ambiguous example database for the presence of ambiguous words in the document and, when ambiguous words are present, determines the features of the usage example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and queries the ambiguous example database, thereby extracting each ambiguous word, its ambiguity, and its position in the document as ambiguous example information for that ambiguous word; a classification accuracy database that collects and stores, for each ambiguity analysis rule, the classification accuracy of ambiguous word usage examples obtained when the rule is applied to documents; an ambiguity analysis condition change unit that changes the conditions for analyzing the ambiguity of each ambiguous word based on a predetermined ambiguity analysis condition change rule, using the document quality index and the classification accuracy obtained from the classification accuracy database in response to a query about the ambiguity analysis rule for a specific word, where, for an ambiguity analysis rule that determines whether a usage example has higher ambiguity, the higher the document quality index, the more the rules with poor classification accuracy are withheld, and, for an ambiguity analysis rule that determines whether a usage example has lower ambiguity, the higher the document quality index, the more the rules with poor classification accuracy are also applied; an ambiguity determination unit that determines the ambiguity of each ambiguous word based on the changed analysis conditions; and an ambiguous information output unit that, for each ambiguous word whose ambiguity the ambiguity determination unit has determined to be at or above a threshold, outputs the corresponding ambiguity and position in the document as ambiguity information.

According to the present invention, extraction from a document containing ambiguous words written in natural language can be limited to highly ambiguous usage examples. Feedback can thus be limited to the highly ambiguous locations that most need correction, which provides a document analysis system, method, and program that reduce the burden of revising a document and make reviews more efficient.

FIG. 1 is a block diagram showing the configuration of a document analysis system according to an embodiment of the present invention.
FIG. 2 is a sequence diagram showing an operation example of the document analysis system shown in FIG. 1.
FIG. 3 is a block diagram showing the configuration of a document analysis system according to an example of the present invention.
FIG. 4 is an explanatory diagram showing some examples of ambiguous words Waj, usage examples Faj, ambiguity analysis rules Rafj, and ambiguities Aafj.
FIG. 5 is an explanatory diagram showing part of the classification accuracy database for the ambiguous words Waj of FIG. 4.
FIG. 6 is an explanatory diagram showing an example of usage example determination results for the ambiguous words Waj of FIGS. 4 and 5 whose ambiguity Aafj is 1 or more.
FIG. 7 is an explanatory diagram showing an example of usage example determination results for the ambiguous words Waj of FIGS. 4 and 5 whose ambiguity Aafj is less than 1.

[Embodiment]
First, an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a document analysis system 100 according to an embodiment of the present invention.

Referring to FIG. 1, the document analysis system 100 according to the embodiment is basically implemented within an electronic device, or within a system consisting of a server, an electronic device, and an information communication network such as the Internet that connects them. It includes at least a document input unit 10, a document analysis unit 20, a document quality evaluation unit 30, an example analysis unit 40, an ambiguity analysis condition change unit 50, an ambiguity determination unit 60, an ambiguous information output unit 70, an ambiguous example database 110, and a classification accuracy database 120.

The illustrated document analysis system 100 accurately extracts the ambiguous expressions that most need correction from a document written in natural language. It uses as information the degree to which the document under analysis was written with attention to ambiguity, and changes the conditions used to identify only the truly ambiguous usage examples among the ambiguous expressions in the document, thereby reducing false detections.

When the document analysis system is implemented on an electronic device, the document analysis system 100 can be realized by a computer that operates under program control. Although not illustrated, such a computer has, as is well known, an input device for inputting data, a data processing device, an output device that outputs the results of processing by the data processing device, and an auxiliary storage device that serves as the various databases. The data processing device consists of a read-only memory (ROM) that stores a program, a random access memory (RAM) used as a work area for temporarily storing data, and a central processing unit (CPU) that processes the data stored in the RAM according to the program stored in the ROM.

In this case, the input device operates as the document input unit 10; the data processing device operates as the document analysis unit 20, the document quality evaluation unit 30, the example analysis unit 40, the ambiguity analysis condition change unit 50, and the ambiguity determination unit 60; the auxiliary storage device operates as the ambiguous example database 110 and the classification accuracy database 120; and the output device operates as the ambiguous information output unit 70.

Next, the operation of each component of the document analysis system 100 will be described.

The document input unit 10 receives input of a document or document group that may contain ambiguous words and is to be analyzed for highly ambiguous locations that most need correction.

The document analysis unit 20 applies morphological analysis to each sentence of the document or document group and extracts word information for every word used in each sentence. Here, words include not only independent words that carry meaning on their own, such as nouns, verbs, and adjectives, but also dependent words such as particles, each treated as a separate word. Words with the same character string are extracted separately if they appear at different locations. The word information includes at least the character string of each word, its part of speech, the relative positions of words within a sentence, and the position of the word in the document. The position of a word in the document may be any information that identifies where the word is used, such as the order of appearance of the sentence containing the word, the page, or the chapter, section, or item in the table of contents.
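The word information described above can be pictured as one record per word occurrence. The following is a minimal sketch, assuming the sentences have already been split into (surface, part-of-speech) pairs by some morphological analyzer; the `WordInfo` type, its field names, and the input format are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordInfo:
    surface: str         # character string of the word
    pos: str             # part of speech
    sentence_index: int  # order of appearance of the sentence in the document
    token_index: int     # position of the word within its sentence


def extract_word_info(tagged_sentences):
    """Build one WordInfo per occurrence.

    tagged_sentences is a list of sentences, each a list of
    (surface, pos) pairs (a hypothetical analyzer output format).
    """
    records = []
    for s_idx, sentence in enumerate(tagged_sentences):
        for t_idx, (surface, pos) in enumerate(sentence):
            records.append(WordInfo(surface, pos, s_idx, t_idx))
    return records


# The same string "about" appears twice and yields two separate records.
doc = [
    [("about", "adverb"), ("10", "numeral"), ("items", "noun")],
    [("think", "verb"), ("about", "particle"), ("it", "pronoun")],
]
words = extract_word_info(doc)
```

Keeping the sentence and token indices in each record is one way to satisfy the requirement that identical strings at different locations are handled separately.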

The document quality evaluation unit 30 evaluates document quality using features of the input document based on a predetermined document quality indexing rule and calculates a document quality index. Here, the "document quality indexing rule" may be any method that quantifies how effectively the document conveys its content to the reader. Possible indices are those for which a higher value indicates better document quality, such as the ratio of sentences shorter than a certain number of characters to the total number of sentences in the document, the ratio of sentences in which subject and predicate correspond one-to-one, the ratio of sentences for which dependency parsing yields no competing dependency candidates, the ratio of sentences without typos or omitted characters, and a value that decreases monotonically with the number of notation variants.
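Two of the indices listed above can be sketched as follows. This is only an illustration: the 50-character limit and the `1 / (1 + n)` form are assumptions chosen for the example, and both function names are hypothetical.

```python
def short_sentence_ratio(sentences, max_chars=50):
    """Ratio of sentences shorter than max_chars; higher means better quality.

    The default limit of 50 characters is an assumed value.
    """
    if not sentences:
        return 0.0
    short = sum(1 for s in sentences if len(s) < max_chars)
    return short / len(sentences)


def notation_variant_score(num_variants):
    """A value that decreases monotonically as notation variants increase.

    The 1 / (1 + n) form is one possible choice, not the patent's formula.
    """
    return 1.0 / (1.0 + num_variants)
```

For instance, a document with one 10-character sentence and one 60-character sentence gives `short_sentence_ratio(...) == 0.5` with the default limit.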

The ambiguous example database 110 stores the character strings of ambiguous words that may be ambiguous, the features of usage examples with different ambiguity for each ambiguous word, and the ambiguity, that is, the degree of ambiguity, of each usage example. In response to a query about an arbitrary word and the features of its usage example, the ambiguous example database 110 checks whether the queried word matches an ambiguous word stored as a character string and, if it does, returns the ambiguity that corresponds to the features of that usage example.

Here, the "ambiguity" is an index that represents the degree of ambiguity of a usage example. It may be a continuous-valued index or an index with discrete values such as 0 and 1 that indicate whether or not the usage example is ambiguous. A "usage example of a word" is information that classifies the semantically different ways in which each word is used. A database on a network may be used as the ambiguous example database 110.
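The query behavior of the ambiguous example database can be sketched with an in-memory dictionary. The entries, the usage feature labels, the ambiguity values, and the fallback to an `"other"` usage are all invented for illustration.

```python
# word -> {usage feature -> ambiguity}; every entry here is hypothetical.
AMBIGUOUS_EXAMPLES = {
    "about": {"before_number": 1.0, "other": 0.0},
    "etc.": {"after_list": 0.5, "other": 1.0},
}


def is_ambiguous_word(word):
    """True if the word is registered as a possibly ambiguous word."""
    return word in AMBIGUOUS_EXAMPLES


def lookup_ambiguity(word, usage_feature):
    """Return the ambiguity for the given usage, or None if the word is
    not registered. Unknown usages fall back to the "other" entry
    (an assumption of this sketch)."""
    usages = AMBIGUOUS_EXAMPLES.get(word)
    if usages is None:
        return None
    return usages.get(usage_feature, usages.get("other"))
```

With this table, `lookup_ambiguity("about", "before_number")` returns 1.0, while a word that is not registered returns `None`.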

Based on the word information for all words extracted by the document analysis unit 20, the example analysis unit 40 queries the ambiguous example database 110 for the presence of ambiguous words in the document. When ambiguous words are present, it determines the features of the usage example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and queries the ambiguous example database 110 again. In this way it extracts each ambiguous word, its ambiguity, and its position in the document as ambiguous example information for that ambiguous word.

Here, an "ambiguity analysis rule" is a rule that identifies the usage example of a word from its character string, its part of speech, the relative positions of words within the sentence, and similar information. Natural language semantic analysis techniques and combination patterns of a word and its surrounding words are suitable for this purpose. A "combination pattern of a word and its surrounding words" is information that classifies whether predetermined information (for example, a specific word, part of speech, symbol, or numerical expression) is present at a specific position in the sentence where the word is used (for example, immediately before or after the word, somewhere before or after it, at the beginning or end of the sentence, or in the preceding or following sentence).
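A combination pattern of this kind can be sketched as a small predicate over a token list. The example below checks a single pattern, "a numerical expression immediately after the word"; the function names and the usage labels `"before_number"` and `"other"` are hypothetical.

```python
import re

# Matches integers and simple decimals such as "10" or "3.5".
_NUMBER = re.compile(r"\d+(\.\d+)?")


def followed_by_number(tokens, index):
    """True if the token immediately after tokens[index] is numeric."""
    nxt = index + 1
    return nxt < len(tokens) and _NUMBER.fullmatch(tokens[nxt]) is not None


def classify_usage(tokens, index):
    """Label the usage of tokens[index] with one illustrative pattern."""
    return "before_number" if followed_by_number(tokens, index) else "other"
```

Other positions named in the text (immediately before, sentence start, preceding sentence, and so on) could be handled with analogous predicates.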

The classification accuracy database 120 collects and stores, for each ambiguity analysis rule, the classification accuracy of ambiguous word usage examples obtained when the rule is applied to documents. In response to a query about the ambiguity analysis rule for a specific word, the classification accuracy database 120 looks up the classification accuracy and returns it.

Here, "classification accuracy" is an index that represents how accurately an ambiguity analysis rule, when applied to a document, separates usage examples that have different ambiguity even though the ambiguous word has the same character string. For example, the classification accuracy may be a continuous-valued index obtained by applying the rule to actual documents and statistically calculating the rate at which usage examples were separated correctly, or an index with discrete values such as high, medium, and low determined from the experience of an analyst. A database on a network may be used as the classification accuracy database 120.
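The continuous-valued variant, a correct-answer rate over labeled usage examples, can be sketched as below. The function name and the choice to return 0.0 for empty input are assumptions of this sketch.

```python
def classification_accuracy(predicted, actual):
    """Fraction of usage examples that a rule labeled correctly.

    predicted and actual are equal-length sequences of usage labels.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

For example, three correct labels out of four give an accuracy of 0.75.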

The ambiguity analysis condition change unit 50 changes the conditions for analyzing the ambiguity of each ambiguous word based on a predetermined ambiguity analysis condition change rule. It uses the document quality index calculated by the document quality evaluation unit 30 and the classification accuracy obtained from the classification accuracy database 120 in response to a query about the ambiguity analysis rule for a specific word.

Here, a suitable "ambiguity analysis condition change rule" is one that, when distinguishing usage examples with different ambiguity, changes the determination conditions so that the higher the document quality index, the less likely a usage example is to be classified as highly ambiguous. That is, for an ambiguity analysis rule that determines whether a usage example has higher ambiguity, the higher the document quality index, the more the rules with poor classification accuracy should be withheld. For an ambiguity analysis rule that determines whether a usage example has lower ambiguity, the higher the document quality index, the more the rules with poor classification accuracy should also be applied. For example, ambiguity analysis rules that determine whether a usage example is highly ambiguous can be limited to those whose classification accuracy is at or above a classification accuracy threshold set to increase monotonically with the document quality index. Ambiguity analysis rules that determine whether a usage example has low ambiguity can be limited to those whose classification accuracy is at or above a classification accuracy threshold set to decrease monotonically with the document quality index.
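The threshold-based example above can be sketched as a rule filter. This assumes a document quality index normalized to the range 0 to 1 and a linear threshold between 0.5 and 0.9; the normalization, the linear form, the bounds, and the `Rule` type are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    detects_high_ambiguity: bool  # True: rule identifies highly ambiguous usage
    accuracy: float               # classification accuracy of the rule


def accuracy_threshold(quality, detects_high_ambiguity, lo=0.5, hi=0.9):
    """Classification accuracy threshold as a function of document quality.

    Increases with quality for rules that detect high ambiguity and
    decreases with quality for rules that detect low ambiguity.
    """
    q = min(max(quality, 0.0), 1.0)  # clamp to [0, 1]
    if detects_high_ambiguity:
        return lo + (hi - lo) * q
    return hi - (hi - lo) * q


def select_rules(rules, quality):
    """Keep only the rules whose accuracy meets the threshold."""
    return [
        r for r in rules
        if r.accuracy >= accuracy_threshold(quality, r.detects_high_ambiguity)
    ]


rules = [Rule("high_a", True, 0.7), Rule("low_a", False, 0.7)]
```

With these two rules, a high-quality document (`quality=1.0`) keeps only `low_a`, and a low-quality document (`quality=0.0`) keeps only `high_a`, so a well-written document is less likely to have its words flagged as highly ambiguous.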

As another example, the ambiguity analysis condition change rule may leave the classification accuracy in the classification accuracy database 120 unused and instead change the ambiguity determination conditions for ambiguous words by raising the ambiguity determination threshold, which is the boundary ambiguity at or above which a usage example is regarded as ambiguous, so that the higher the document quality index, the less likely each usage example is to be regarded as ambiguous.

The ambiguity determination unit 60 determines the ambiguity of each ambiguous word based on the analysis conditions that the ambiguity analysis condition change unit 50 has changed according to the ambiguity analysis condition change rule.

Suppose that the ambiguity analysis condition change rule changes the determination conditions for usage examples so that the higher the document quality index, the less likely a usage example is to be classified as highly ambiguous. In this case, the ambiguity determination unit 60 analyzes the usage examples of the ambiguous words again under the changed analysis conditions, updates the ambiguity of each usage example, and determines that the ambiguous words whose usage example has an ambiguity at or above a predetermined value are truly ambiguous.

Alternatively, suppose that the ambiguity analysis condition change rule raises the ambiguity determination threshold, the boundary ambiguity at or above which a usage example is regarded as ambiguous, so that the higher the document quality index, the less likely each usage example is to be regarded as ambiguous. In this case, the ambiguity determination unit 60 determines, under the changed analysis conditions, that the ambiguous words whose usage example has an ambiguity at or above the ambiguity determination threshold are truly ambiguous.
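The threshold-raising variant can be sketched as follows, again assuming a document quality index normalized to the range 0 to 1. The linear form and the `base` and `span` values are assumptions, and the tuple layout of a finding is hypothetical.

```python
def ambiguity_threshold(quality, base=0.5, span=0.4):
    """Ambiguity determination threshold that rises with document quality."""
    q = min(max(quality, 0.0), 1.0)  # clamp to [0, 1]
    return base + span * q


def truly_ambiguous(findings, quality):
    """Keep findings whose ambiguity is at or above the threshold.

    findings is a list of (word, position, ambiguity) tuples.
    """
    threshold = ambiguity_threshold(quality)
    return [f for f in findings if f[2] >= threshold]


findings = [("about", 3, 0.6), ("etc.", 9, 1.0)]
```

For a low-quality document (`quality=0.0`) both findings are reported, while for a high-quality document (`quality=1.0`) only the finding with ambiguity 1.0 remains.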

For each ambiguous word whose ambiguity the ambiguity determination unit 60 has determined to be at or above the threshold, the ambiguous information output unit 70 outputs the corresponding ambiguity and position in the document as ambiguity information. The output may take whatever form is required.

For example, a suitable form is for the ambiguous information output unit 70 to output the entire document with each ambiguous word marked by color coding, bold emphasis, enlarged characters, or the like. The output may also take the form of a table listing the extracted ambiguous words. In addition, the color, the bold emphasis, or the character size may be varied according to the ambiguity.

The ambiguous information output unit 70 may also total the ambiguity of each ambiguous word over the entire document or over an arbitrary range and output the result in table form as an index of document quality. The ambiguous information output unit 70 may further allow the output form to be selected, so that the user can move from the base display form to a table as needed.
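Two of the output forms described above, marking ambiguous words in the text and totaling ambiguity per word, can be sketched as follows. The `**` marker stands in for bold emphasis and is an assumption, as are the function names and the tuple layout of a finding.

```python
def highlight(tokens, ambiguous_positions, marker="**"):
    """Return the text with the tokens at ambiguous positions marked."""
    parts = []
    for i, token in enumerate(tokens):
        if i in ambiguous_positions:
            parts.append(f"{marker}{token}{marker}")
        else:
            parts.append(token)
    return " ".join(parts)


def total_ambiguity(findings):
    """Sum ambiguity per word from (word, position, ambiguity) tuples."""
    totals = {}
    for word, _position, ambiguity in findings:
        totals[word] = totals.get(word, 0.0) + ambiguity
    return totals
```

For example, `highlight(["ship", "about", "10", "items"], {1})` returns `"ship **about** 10 items"`, and the per-word totals can be printed as rows of a table.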

Next, the overall operation of the document analysis system 100 according to the embodiment will be described in detail with reference to FIG. 1 and the sequence diagram of FIG. 2. The sequence diagram of FIG. 2 and the following description are an example of the processing; the order of the steps may be changed, and steps may be revisited or repeated, as the required processing demands.

The document input unit 10 receives input of a document or document group that may contain ambiguous words and is to be analyzed for highly ambiguous locations that most need correction (step A1 in FIG. 2).

文書解析部２０は、文書もしくは文書群を構成する各文章に形態素解析を適用することで、各文章に使用されている全単語の単語情報の抽出を行う（ステップＡ２）。 The document analysis unit 20 extracts word information of all words used in each sentence by applying morphological analysis to each sentence constituting the document or document group (step A2).

文書品質評価部３０は、所定の文書品質指標化ルールに基づき、入力された文書の特徴を利用して文書品質を評価し文書品質指標を算出する（ステップＡ３）。 The document quality evaluation unit 30 evaluates the document quality using the characteristics of the input document based on a predetermined document quality indexing rule, and calculates a document quality index (step A3).

曖昧用例データベース１１０は、曖昧性を持つ可能性のある曖昧語の文字列と、曖昧語毎の曖昧性の異なる用例の特徴と用例毎の曖昧さの程度である曖昧度を蓄積し、任意の単語と前記単語の用例の特徴に関する問い合わせに対し、問い合わせ対象の単語が文字列として蓄積された曖昧語と一致するか検索し、さらに曖昧語と一致した場合にこの曖昧語の用例に合った曖昧度を応答する（ステップＡ４）。 The ambiguous example database 110 accumulates character strings of ambiguous words that may have ambiguity, characteristics of examples having different ambiguities for each ambiguous word, and ambiguity that is the degree of ambiguity for each example. In response to an inquiry about a word and an example feature of the word, a search is made as to whether the query target word matches the ambiguous word stored as a character string, and if the word matches the ambiguous word, the ambiguous word matches the ambiguous word example. The degree is returned (step A4).

用例分析部４０は、文書解析部２０で抽出された各文章に使用されている全単語の単語情報に基づき、文書中の曖昧語の有無を曖昧用例データベース１１０に問合せ、曖昧語が有る場合は、所定の曖昧性分析ルールに基づき、単語情報から各曖昧語の用例の特徴を判別し、曖昧用例データベース１１０に問合せることで、曖昧語とその曖昧度、および文書内での存在位置を、それぞれ各曖昧語に関する曖昧用例情報として抽出する（ステップＡ５）。 The example analysis unit 40 queries the ambiguous example database 110 for the presence or absence of ambiguous words in the document based on the word information of all words used in each sentence extracted by the document analysis unit 20, and if there is an ambiguous word Based on a predetermined ambiguity analysis rule, the characteristics of each ambiguous word example are determined from the word information, and the ambiguous example database 110 is queried to determine the ambiguous word, its ambiguity, and the position in the document. Extracted as ambiguous example information regarding each ambiguous word (step A5).

分類精度データベース１２０は、曖昧性分析ルール毎に、曖昧性分析ルールを文書に適用した際の曖昧語の用例の分類精度を収集して蓄積し、特定の単語に関する曖昧性分析ルールの問い合わせに対し、分類精度を検索し、応答する（ステップＡ６）。 For each ambiguity analysis rule, the classification accuracy database 120 collects and accumulates classification accuracy of ambiguity word examples when the ambiguity analysis rule is applied to a document, and responds to an ambiguity analysis rule query for a specific word. The classification accuracy is searched and responded (step A6).

曖昧性分析条件変更部５０は、文書品質評価部３０で算出した文書品質指標、および特定の単語に関する曖昧性分析ルールの問い合わせに対して分類精度データベース１２０から得られる分類精度を利用し、所定の曖昧性分析条件変更ルールに基づき、各曖昧語について曖昧性を分析する条件を変更する（ステップＡ７）。 The ambiguity analysis condition changing unit 50 uses the document quality index calculated by the document quality evaluation unit 30 and the classification accuracy obtained from the classification accuracy database 120 for the query of the ambiguity analysis rule regarding a specific word, Based on the ambiguity analysis condition change rule, the condition for analyzing ambiguity is changed for each ambiguous word (step A7).

曖昧性判定部６０は、曖昧性分析条件変更部５０で曖昧性分析条件変更ルールによって変更した曖昧性の分析条件に基づき、各曖昧語の曖昧性を判定する（ステップＡ８）。 The ambiguity determination unit 60 determines the ambiguity of each ambiguous word based on the ambiguity analysis condition changed by the ambiguity analysis condition change unit 50 by the ambiguity analysis condition change unit 50 (step A8).

曖昧情報出力部７０は、曖昧性判定部６０で曖昧性が閾値以上と判定した各曖昧語について、対応する曖昧度および文書内での存在位置を曖昧性情報として出力する（ステップＡ９）。 The ambiguity information output unit 70 outputs the corresponding ambiguity and the position in the document as ambiguity information for each ambiguous word determined by the ambiguity determination unit 60 as being ambiguity or more (step A9).

Next, effects of the document analysis system 100 according to the embodiment of the present invention will be described.

In this embodiment, features of the document are used so that the better the document quality, the more the usage analysis is restricted to ambiguity analysis rules with high classification accuracy when the ambiguity degree is calculated. Consequently, when uncertain ambiguity analysis rules are applied to a document containing ambiguous words, the analysis conditions can be changed so that, the less ambiguous the document, the less readily its usages are classified as highly ambiguous. This reduces false detections when extracting highly ambiguous portions that require preferential correction.

The document analysis system 100 according to the embodiment of the present invention can also be realized as a document analysis method. It may also be realized by having a computer execute a document analysis program.

Next, the operation of the document analysis system 100 according to the embodiment of the present invention will be described using a specific example with reference to FIG. 3.

This example has the following purpose.

First, the document analysis system 100 targets a document D from which ambiguous portions should be eliminated, such as a proposal or specification for building an information system. For each ambiguous word Wa in document D, it calculates, per usage of the word, an ambiguity degree IA representing the ambiguity that the word contributes to the quality of document D, and estimates ambiguity information A describing the position of each ambiguous word Wa in the document and the degree of its ambiguity. The document analysis system 100 then outputs the estimated ambiguity information A. This makes it easier to identify the ambiguous portions of document D that should be corrected first and, when several documents are compared, to identify the low-quality ones, so that document improvement becomes more efficient.

In this example, the document analysis system 100 consists of a document analysis system Y and an intranet server Z, as shown in FIG. 3.

The document analysis system Y runs on a PC terminal belonging to the analyst B. Through the input unit and the output unit, it accepts the sentences of the document group for which analyst B wants to estimate the ambiguity degree A, and presents the ambiguity degree A.

The intranet server Z is connected via a communication network to the PC terminal of analyst B on which the document analysis system Y is installed. In response to a word query from the document analysis system Y, the intranet server Z can search whether the word is registered as an ambiguous word Wa. In response to a query about the usage examples of a word, it can search example information C related to those usages, such as their content and their number.

Next, the correspondence between FIG. 3 and FIG. 1 will be described.

The document input unit 10 operates as the input unit of the PC terminal. The document analysis unit 20, the document quality evaluation unit 30, the example analysis unit 40, the ambiguity analysis condition change unit 50, and the ambiguity determination unit 60 are included in the document analysis system Y. The ambiguous information output unit 70 operates as the output unit of the PC terminal. The ambiguous example database 110 and the classification accuracy database 120 are included in the intranet server Z.

The document analysis system Y and the intranet server Z, equipped with these means, operate as follows.

Through the input unit, the document analysis system Y receives a document D, such as a proposal or specification for building an information system, for which analyst B wants to obtain the ambiguity degree A in order to eliminate ambiguous portions. The document analysis system Y applies morphological analysis to each sentence of document D, decomposes the document into its constituent words W, and extracts, for every word Wi (i = 1, 2, ..., n) in document D, the word type and the connection relationships between words as word information. It also numbers the sentences serially in order of appearance and adds to the word information the sentence number of the sentence containing each word.

The document analysis system Y further counts the characters in every sentence L of document D, takes the proportion of sentences whose character count is below a fixed value as the document quality feature, and uses that proportion as the document quality index Nd of document D.
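The calculation of the document quality index Nd can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the punctuation-based sentence splitter, and the 40-character limit are all assumptions, since the text only says "a fixed value".

```python
import re
from typing import List

# Assumption: a sentence ends at Japanese or Western sentence-final punctuation.
# Real documents would need a proper sentence splitter.
SENTENCE_PATTERN = re.compile(r"[^。．！？.!?]+[。．！？.!?]?")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences (simplified, punctuation-based)."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


def document_quality_index(text: str, max_chars: int = 40) -> float:
    """Return Nd: the proportion of sentences shorter than max_chars characters.

    max_chars is a hypothetical value; the source does not specify it.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    short = sum(1 for s in sentences if len(s) < max_chars)
    return short / len(sentences)
```

Under this definition a document made up mostly of short sentences receives a high Nd, which the later steps treat as high document quality.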

The intranet server Z collects character strings of ambiguous words Wa that may carry ambiguity, together with, for each ambiguous word Wa, the usage examples Fa in which its ambiguity differs and the ambiguity degree Aaf of each usage, and stores them as ambiguous example information Ca. The intranet server Z also provides functions such as a search engine for extracting information on arbitrary words and expressions. In response to a query from the document analysis system Y, it determines whether the queried word exists in the ambiguous example information Ca and, if so, presents the ambiguity degree Aaf for each usage example of that ambiguous word. The ambiguous words to be extracted may be any expressions that allow readers to arrive at multiple interpretations. Examples include elliptical expressions such as 「等」 and 「など」 ("etc."), demonstratives such as 「あれ」 ("that") and 「この」 ("this"), and qualitative expressions such as 「大きい」 ("large") and 「速い」 ("fast").

The document analysis system Y further queries the intranet server Z, for every word Wi in document D, as to whether the word corresponds to an ambiguous word in the ambiguous example information Ca. Each word Wj (j = 1, 2, ..., m) judged to be an ambiguous word is treated as an ambiguous word Waj (j = 1, 2, ..., m), and for each Waj the system retrieves all usage examples Fa in which the ambiguity degree of Waj differs. The document analysis system Y then classifies the usage of each ambiguous word Waj according to the ambiguity analysis rules Raf prepared for each kind of ambiguous word, determining the matching usage example Fa and ambiguity degree Aaf. It extracts the ambiguous word Waj, its usage example Faj, its ambiguity degree Aafj, and the sentence number of the sentence containing the word as ambiguous example information Caj. When the same ambiguous word is used more than once in document D, each occurrence is extracted separately.

For example, suppose the ambiguous words Waj are 「原則」 ("in principle"), 「等」 ("etc."), 「あれ」 ("that"), 「位」 ("rank"), 「以下」 ("below" or "or less"), 「以外」 ("other than"), and 「大きい」 ("large"), and that document D contains the passages 「原則、上書きするが、読み取り専用のファイルはコピーを作成。」 ("In principle, overwrite, but make a copy of read-only files."), 「均等に配分」 ("distribute equally"), 「値があれば・・」 ("if there is a value..."), 「５位」 ("5th place"), 「以下の処理」 ("the following processing"), 「１０％以下の場合は、・・」 ("if 10% or less, ..."), 「ＡかつＢ以外」 ("other than A and B"), and 「所定より大きい値がある場合は、・・」 ("if there is a value larger than the predetermined value, ..."). In this case, the ambiguous words Waj, the usage examples Faj, the corresponding ambiguity analysis rules Rafj, and the ambiguity degrees Aafj are as illustrated in FIG. 4. The ambiguity degree Aafj is a value determined by any suitable method, such as a questionnaire. In the example of FIG. 4 it ranges from 0 to 2, and a larger value means higher ambiguity. A value of 1 indicates a usage with the standard ambiguity of that ambiguous word, a value below 1 indicates a usage whose ambiguity is weaker than standard, and a value above 1 indicates a usage whose ambiguity is stronger than standard.
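Rule-based usage classification of this kind can be sketched as follows. The three rules and their ambiguity degrees (1.5, 1.2, 0.5) are the ones the description mentions for 「以外」, 「大きい」, and 「位」; everything else is an assumption made for illustration, including the `Token` and `Rule` types, the part-of-speech tag `"number"`, the word list, and the hand-written token sequences that stand in for real morphological analysis output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Token:
    surface: str  # word as written
    pos: str      # part of speech (hypothetical tag set)


@dataclass
class Rule:
    word: str                                     # ambiguous word the rule applies to
    matches: Callable[[List[Token], int], bool]   # usage-feature test
    ambiguity: float                              # ambiguity degree assigned on match


def has_before(surface: str) -> Callable[[List[Token], int], bool]:
    return lambda tokens, i: any(t.surface == surface for t in tokens[:i])


def has_after(surface: str) -> Callable[[List[Token], int], bool]:
    return lambda tokens, i: any(t.surface == surface for t in tokens[i + 1:])


def prev_is_number(tokens: List[Token], i: int) -> bool:
    return i > 0 and tokens[i - 1].pos == "number"


AMBIGUOUS_WORDS = {"以外", "大きい", "位", "等"}  # illustrative subset
STANDARD_AMBIGUITY = 1.0                          # "standard" usage per the description

RULES = [
    Rule("以外", has_before("かつ"), 1.5),
    Rule("大きい", has_after("場合"), 1.2),
    Rule("位", prev_is_number, 0.5),
]


def classify_usages(tokens: List[Token]) -> List[Tuple[int, str, float]]:
    """Return (position, word, ambiguity degree) for each ambiguous word in one sentence."""
    results = []
    for i, tok in enumerate(tokens):
        if tok.surface not in AMBIGUOUS_WORDS:
            continue
        degree = STANDARD_AMBIGUITY
        for rule in RULES:
            if rule.word == tok.surface and rule.matches(tokens, i):
                degree = rule.ambiguity
                break
        results.append((i, tok.surface, degree))
    return results
```

Each occurrence is reported separately, in line with the requirement that repeated ambiguous words be extracted one by one.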

Next, for each ambiguity analysis rule Raf, the intranet server Z collects the classification accuracy Paf with which the rule classifies usage examples Fa of the ambiguous word Wa when applied to documents, and stores it in the classification accuracy database. The intranet server Z also provides functions such as a search engine for extracting the classification accuracy Paf for an arbitrary usage example Fa. In response to a query from the document analysis system Y, it looks up and presents the classification accuracy Paf of the ambiguity analysis rule Raf for the usage example Fa of the queried ambiguous word Wa.

For example, for some of the usage examples Fa of the ambiguous words Wa 「原則」, 「等」, 「あれ」, 「位」, 「以下」, 「以外」, and 「大きい」, the ambiguity analysis rules Raf and classification accuracies Paf form a table such as that shown in FIG. 5.

Further, the document analysis system Y uses the ambiguity degree Aafj of each ambiguous word Waj and the document quality index Nd to calculate a classification accuracy threshold Ps according to equation (1) below. Ps is the lower limit of the classification accuracy Paf used to decide which ambiguity analysis rules are applied as usage discrimination conditions.
Aafj ≥ 1: Ps = Nd
Aafj < 1: Ps = 1 − Nd ... (1)

Further, the document analysis system Y compares the classification accuracy Paf corresponding to the usage example Faj of each ambiguous word Waj with the classification accuracy threshold Ps. It changes the usage discrimination conditions so that only ambiguity analysis rules Raf whose classification accuracy Paf is at or above Ps are applied, then recalculates and updates the ambiguity degree Aafj for each usage example Faj of each ambiguous word Waj.

That is, when judging whether a usage has a higher-than-standard ambiguity degree (Aafj of 1 or more), only ambiguity analysis rules whose classification accuracy Paf is at or above a threshold proportional to the document quality index Nd are used. When judging whether a usage has a lower-than-standard ambiguity degree (Aafj below 1), only ambiguity analysis rules whose classification accuracy Paf is at or above a threshold that decreases as Nd increases are used.

For example, when the document quality index Nd is 60% in the examples of FIGS. 4 and 5, the classification accuracy threshold Ps is 60% for the ambiguity analysis rules Rafj that assign an ambiguity degree Aafj of 1 or more, as shown in FIG. 6. Rules with a classification accuracy Pafj of 50% are therefore changed to not applicable when usages are discriminated. Examples are the rule that sets Aafj to 1.5 when 「かつ」 appears earlier in the same sentence as 「以外」, and the rule that sets Aafj to 1.2 when 「場合」 appears later in the same sentence as 「大きい」.

Similarly, when the document quality index Nd is 60% in the examples of FIGS. 4 and 5, the classification accuracy threshold Ps is 40% for the ambiguity analysis rules Rafj that assign an ambiguity degree Aafj below 1, as shown in FIG. 7. Rules with a classification accuracy Pafj of 10%, such as the rule that sets Aafj to 0.5 when 「位」 is immediately preceded by a numeral, are therefore changed to not applicable when usages are discriminated.
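Equation (1) and the Nd = 60% example can be sketched in a few lines. The rule records below carry only the values stated in the description (ambiguity degrees 1.5, 1.2, and 0.5; classification accuracies 50% and 10%). The dictionary layout and function names are assumptions for illustration, and the 50% accuracy of the 「大きい」 rule is read from the way the text groups it with the 「以外」 rule.

```python
from typing import Dict, List


def classification_accuracy_threshold(ambiguity: float, nd: float) -> float:
    """Equation (1): Ps = Nd if Aafj >= 1, otherwise Ps = 1 - Nd."""
    if not 0.0 <= nd <= 1.0:
        raise ValueError("nd must be a ratio between 0 and 1")
    return nd if ambiguity >= 1.0 else 1.0 - nd


def applicable_rules(rules: List[Dict], nd: float) -> List[Dict]:
    """Keep only rules whose classification accuracy Paf is at or above Ps."""
    return [
        r for r in rules
        if r["accuracy"] >= classification_accuracy_threshold(r["ambiguity"], nd)
    ]


# Rules mentioned in the description (accuracy expressed as a ratio).
rules = [
    {"word": "以外", "ambiguity": 1.5, "accuracy": 0.5},
    {"word": "大きい", "ambiguity": 1.2, "accuracy": 0.5},
    {"word": "位", "ambiguity": 0.5, "accuracy": 0.1},
]

kept_at_60 = [r["word"] for r in applicable_rules(rules, 0.6)]
kept_at_30 = [r["word"] for r in applicable_rules(rules, 0.3)]
```

With Nd = 0.6 none of the three rules passes its threshold, matching the description. With a lower-quality document, for instance Nd = 0.3, the two high-ambiguity rules pass (0.5 ≥ 0.3) while the low-ambiguity rule still fails (0.1 < 0.7), which illustrates how the conditions flag ambiguity more readily in weaker documents.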

Further, based on each ambiguous word Wa, its ambiguity degree Aafj, and the sentence number of the ambiguous word Waj, the document analysis system Y colors each ambiguous word Wa in document D so that the ambiguous sentences to be corrected are easy to locate. The document analysis system Y also aggregates the ambiguity degrees Aafj of the ambiguous words Wa over the whole of document D and over units such as the chapters in the table of contents, and outputs the result as a table, graph, or similar form. This provides a metric of the quality of document D and information for deciding which ambiguous chapters should be corrected.
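The per-chapter aggregation can be sketched as below. The tuple layout of an occurrence, the mapping from sentence number to chapter, and the use of a simple sum are assumptions; the description does not specify how the totals are formed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (ambiguous word, sentence number, ambiguity degree): hypothetical record layout
Occurrence = Tuple[str, int, float]


def aggregate_by_chapter(
    occurrences: List[Occurrence],
    sentence_to_chapter: Dict[int, str],
) -> Dict[str, float]:
    """Sum ambiguity degrees per chapter, using the sentence number to find the chapter."""
    totals: Dict[str, float] = defaultdict(float)
    for _word, sentence_no, ambiguity in occurrences:
        totals[sentence_to_chapter[sentence_no]] += ambiguity
    return dict(totals)


def document_total(occurrences: List[Occurrence]) -> float:
    """Sum ambiguity degrees over the whole document."""
    return sum(ambiguity for _word, _sentence_no, ambiguity in occurrences)
```

Sorting the resulting dictionary by value in descending order gives a simple ranking of the chapters to revise first.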

The present invention has been described above with reference to the embodiment (and example), but the present invention is not limited to the embodiment (and example). Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

Part or all of the above embodiment can also be described as in the following supplementary notes, but is not limited to them.

(Supplementary note 1) A document analysis system for extracting ambiguous expressions from a document, comprising:
a document input unit that receives input of a target document or document group;
a document analysis unit that extracts word information about each word used in the sentences constituting the document or document group and the location where it is used;
a document quality evaluation unit that, based on a predetermined document quality indexing rule, evaluates document quality using features of the input document and calculates a document quality index;
an ambiguous example database that stores character strings of ambiguous words that may carry ambiguity, features of usage examples that differ in ambiguity for each ambiguous word, and an ambiguity degree that is the degree of ambiguity of each usage example;
an example analysis unit that, based on the word information, queries the ambiguous example database for the presence of ambiguous words in the document and, when ambiguous words are present, determines the features of the usage example of each ambiguous word from the word information based on predetermined ambiguity analysis rules and queries the ambiguous example database, thereby extracting each ambiguous word, its ambiguity degree, and its position in the document as ambiguous example information for that word;
a classification accuracy database that collects and stores, for each ambiguity analysis rule, the classification accuracy of usage examples of ambiguous words obtained when the ambiguity analysis rule is applied to documents;
an ambiguity analysis condition change unit that changes the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, using the document quality index and the classification accuracy obtained from the classification accuracy database in response to a query about the ambiguity analysis rules for a specific word, wherein, under the change rule, if an ambiguity analysis rule determines whether a usage example has a higher ambiguity degree, ambiguity analysis rules with poor classification accuracy are increasingly excluded from application as the document quality index becomes higher, and if an ambiguity analysis rule determines whether a usage example has a lower ambiguity degree, ambiguity analysis rules with poor classification accuracy are increasingly applied as the document quality index becomes higher;
an ambiguity determination unit that determines the ambiguity of each ambiguous word based on the changed analysis conditions; and
an ambiguous information output unit that, for each ambiguous word whose ambiguity the ambiguity determination unit has determined to be at or above a threshold, outputs the corresponding ambiguity degree and position in the document as ambiguity information.

(Supplementary note 2) The document analysis system according to supplementary note 1, wherein the ambiguity analysis rule of the example analysis unit is a rule that identifies the usage of a word from the character string of the word, the part of speech of each word, and the relative positions of words within a sentence.

(Supplementary note 3) The document analysis system according to supplementary note 1 or 2, wherein the document quality indexing rule of the document quality evaluation unit is a method of indexing how effectively the document conveys its content to a reader.

(Supplementary note 4) The document analysis system according to any one of supplementary notes 1 to 3, wherein the classification accuracy database collects and stores the classification accuracy obtained, for usage examples of the ambiguous words, from a document group on an intranet or from a document group in the same domain as the document to be analyzed.

(Supplementary note 5) The document analysis system according to any one of supplementary notes 1 to 4, wherein the ambiguity analysis condition change rule of the ambiguity analysis condition change unit is a rule that, for discriminating usage examples with different ambiguity degrees, changes the discrimination conditions for usage examples of ambiguous words so that the higher the document quality index, the less readily a usage is classified as a usage with a high ambiguity degree.

(Supplementary note 6) The document analysis system according to supplementary note 5, wherein the ambiguity determination unit performs the usage analysis of the ambiguous words again according to the changed analysis conditions, updates the ambiguity degree of each usage example, and then determines the ambiguous words whose usage examples have an ambiguity degree at or above a predetermined value to be truly ambiguous words.

(Supplementary note 7) The document analysis system according to any one of supplementary notes 1 to 4, wherein the ambiguity analysis condition change rule of the ambiguity analysis condition change unit is a rule that, for judging whether each usage example of an ambiguous word is ambiguous, changes the ambiguity determination conditions by raising an ambiguity degree determination threshold, which is the boundary ambiguity degree above which a usage is regarded as ambiguous, so that the higher the document quality index, the less readily a usage is regarded as ambiguous.

(Supplementary note 8) The document analysis system according to supplementary note 7, wherein the ambiguity determination unit determines, according to the changed analysis conditions, the ambiguous words whose usage examples have an ambiguity degree at or above the ambiguity degree determination threshold to be truly ambiguous words.

(Supplementary note 9) A document analysis method for extracting ambiguous expressions from a document, comprising:
a document input step in which a document input unit receives input of a target document or document group;
a document analysis step in which a document analysis unit extracts word information about each word used in the sentences constituting the document or document group and the location where it is used;
a document quality evaluation step in which a document quality evaluation unit, based on a predetermined document quality indexing rule, evaluates document quality using features of the input document and calculates a document quality index;
an example analysis step in which an example analysis unit, based on the word information, queries an ambiguous example database for the presence of ambiguous words in the document, the ambiguous example database storing character strings of ambiguous words that may carry ambiguity, features of usage examples that differ in ambiguity for each ambiguous word, and an ambiguity degree that is the degree of ambiguity of each usage example, and, when ambiguous words are present, determines the features of the usage example of each ambiguous word from the word information based on predetermined ambiguity analysis rules and queries the ambiguous example database, thereby extracting each ambiguous word, its ambiguity degree, and its position in the document as ambiguous example information for that word;
an ambiguity analysis condition change step in which an ambiguity analysis condition change unit changes the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, using the document quality index and the classification accuracy obtained, in response to a query about the ambiguity analysis rules for a specific word, from a classification accuracy database that collects and stores, for each ambiguity analysis rule, the classification accuracy of usage examples of ambiguous words obtained when the ambiguity analysis rule is applied to documents, wherein, under the change rule, if an ambiguity analysis rule determines whether a usage example has a higher ambiguity degree, ambiguity analysis rules with poor classification accuracy are increasingly excluded from application as the document quality index becomes higher, and if an ambiguity analysis rule determines whether a usage example has a lower ambiguity degree, ambiguity analysis rules with poor classification accuracy are increasingly applied as the document quality index becomes higher;
an ambiguity determination step in which an ambiguity determination unit determines the ambiguity of each ambiguous word based on the changed analysis conditions; and
an ambiguous information output step in which an ambiguous information output unit, for each ambiguous word whose ambiguity has been determined in the ambiguity determination step to be at or above a threshold, outputs the corresponding ambiguity degree and position in the document as ambiguity information.

（付記１０）前記曖昧性分析ルールが、単語の文字列と単語毎の品詞、文内での単語間の相対的な位置関係から、単語の用例を把握するルールである、ことを特徴とする付記９に記載の文書分析方法。 (Additional remark 10) The said ambiguity analysis rule is a rule which grasps | ascertains the example of a word from the character string of a word, the part of speech for every word, and the relative positional relationship between the words in a sentence, The document analysis method according to attachment 9.

(Supplementary Note 11) The document analysis method according to Supplementary Note 9 or 10, wherein the document quality indexing rule is a method of indexing how effectively the content of a document is conveyed to its reader.

(Supplementary Note 12) The document analysis method according to any one of Supplementary Notes 9 to 11, wherein the ambiguity analysis condition change rule is a rule that, for discriminating between examples of differing degrees of ambiguity, changes the discrimination conditions for examples of an ambiguous word so that the higher the value of the document quality index, the less readily an example is classified as an example with a high degree of ambiguity.

(Supplementary Note 13) The document analysis method according to Supplementary Note 12, wherein, in the ambiguity determination step, the ambiguity determination unit performs the example analysis of the ambiguous words again under the changed analysis conditions, updates the degree of ambiguity of each example, and then determines an ambiguous word whose example has a degree of ambiguity equal to or greater than a predetermined value to be a truly ambiguous word.

(Supplementary Note 14) The document analysis method according to any one of Supplementary Notes 9 to 11, wherein the ambiguity analysis condition change rule is a rule that, for determining whether each example of an ambiguous word is ambiguous, changes the ambiguity determination condition by raising the ambiguity determination threshold, which is the boundary degree of ambiguity at or above which an example is regarded as ambiguous, so that the higher the value of the document quality index, the less readily an example is regarded as ambiguous.

(Supplementary Note 15) The document analysis method according to Supplementary Note 14, wherein, in the ambiguity determination step, the ambiguity determination unit determines, under the changed analysis conditions, an ambiguous word whose example has a degree of ambiguity equal to or greater than the ambiguity determination threshold to be a truly ambiguous word.
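
The threshold-based variant in Supplementary Notes 14 and 15 can be sketched in Python as follows. The sketch assumes a document quality index and degrees of ambiguity both normalized to 0 to 1; the function names, the dictionary keys, and the linear form with `base` and `max_raise` are hypothetical, since this document only requires that the threshold rise as the quality index rises.

```python
from typing import Dict, List


def ambiguity_threshold(quality_index: float, base: float = 0.5,
                        max_raise: float = 0.3) -> float:
    """Raise the ambiguity determination threshold as document quality rises."""
    q = min(max(quality_index, 0.0), 1.0)  # clamp to the assumed 0..1 range
    return base + max_raise * q


def truly_ambiguous(examples: List[Dict], quality_index: float) -> List[Dict]:
    """Keep the examples whose degree of ambiguity is at or above the threshold."""
    threshold = ambiguity_threshold(quality_index)
    return [e for e in examples if e["ambiguity"] >= threshold]
```

With these assumed parameters, an example with degree of ambiguity 0.6 is reported for a low-quality document (threshold 0.5) but suppressed for a high-quality one (threshold 0.8), so only the more serious ambiguities are surfaced in documents that already read well.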

(Supplementary Note 16) A document analysis program that causes a computer to extract ambiguous expressions from a document, the program causing the computer to execute:
a document input procedure of receiving input of a target document or document group;
a document analysis procedure of extracting word information on each word used in the sentences constituting the document or document group and on where each word is used;
a document quality evaluation procedure of evaluating document quality using features of the input document based on a predetermined document quality indexing rule and calculating a document quality index;
an example analysis procedure of, based on the word information, querying an ambiguous example database for the presence or absence of ambiguous words in the document, the database accumulating character strings of ambiguous words that may be ambiguous, features of examples of differing ambiguity for each ambiguous word, and a degree of ambiguity indicating how ambiguous each example is, and, when an ambiguous word is present, determining the features of the example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and, by querying the ambiguous example database, extracting each ambiguous word, its degree of ambiguity, and its position in the document as ambiguous example information for that word;
an ambiguity analysis condition changing procedure of changing the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, the rule using the document quality index and the classification accuracy obtained, in response to a query about the ambiguity analysis rules for a specific word, from a classification accuracy database that collects and accumulates, for each ambiguity analysis rule, the classification accuracy of ambiguous-word examples obtained when that rule is applied to documents, such that, if an ambiguity analysis rule determines whether an example has a higher degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule with poor classification accuracy is excluded from application, and, if an ambiguity analysis rule determines whether an example has a lower degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule is applied even when its classification accuracy is poor;
an ambiguity determination procedure of determining the ambiguity of each ambiguous word based on the changed ambiguity analysis conditions; and
an ambiguous information output procedure of outputting, for each ambiguous word whose ambiguity was determined in the ambiguity determination procedure to be equal to or greater than a threshold, the corresponding degree of ambiguity and the position in the document as ambiguity information.

(Supplementary Note 17) The document analysis program according to Supplementary Note 16, wherein the ambiguity analysis rule is a rule that identifies the example of a word from the character string of the word, the part of speech of each word, and the relative positions of words within a sentence.

(Supplementary Note 18) The document analysis program according to Supplementary Note 16 or 17, wherein the document quality indexing rule is a method of indexing how effectively the content of a document is conveyed to its reader.

(Supplementary Note 19) The document analysis program according to any one of Supplementary Notes 16 to 18, wherein the ambiguity analysis condition change rule is a rule that, for discriminating between examples of differing degrees of ambiguity, changes the discrimination conditions for examples of an ambiguous word so that the higher the value of the document quality index, the less readily an example is classified as an example with a high degree of ambiguity.

(Supplementary Note 20) The document analysis program according to Supplementary Note 19, wherein the ambiguity determination procedure performs the example analysis of the ambiguous words again under the changed analysis conditions, updates the degree of ambiguity of each example, and then determines an ambiguous word whose example has a degree of ambiguity equal to or greater than a predetermined value to be a truly ambiguous word.

(Supplementary Note 21) The document analysis program according to any one of Supplementary Notes 16 to 18, wherein the ambiguity analysis condition change rule is a rule that, for determining whether each example of an ambiguous word is ambiguous, changes the ambiguity determination condition by raising the ambiguity determination threshold, which is the boundary degree of ambiguity at or above which an example is regarded as ambiguous, so that the higher the value of the document quality index, the less readily an example is regarded as ambiguous.

(Supplementary Note 22) The document analysis program according to Supplementary Note 21, wherein the ambiguity determination procedure determines, under the changed analysis conditions, an ambiguous word whose example has a degree of ambiguity equal to or greater than the ambiguity determination threshold to be a truly ambiguous word.

According to the present invention, for the various documents exchanged in tasks such as requirements definition in software and system development, ambiguity can be corrected preferentially, starting from the places where it is most problematic. This makes document creation and document review more efficient and reduces situations in which multiple readers interpret a document differently, so the invention is applicable to uses that improve the efficiency of system development, such as reducing rework and improving customer satisfaction.

DESCRIPTION OF SYMBOLS
10 Document input unit
20 Document analysis unit
30 Document quality evaluation unit
40 Example analysis unit
50 Ambiguity analysis condition changing unit
60 Ambiguity determination unit
70 Ambiguous information output unit
110 Ambiguous example database
120 Classification accuracy database
D Document
A Ambiguous word
Y Document analysis system
Z Intranet server

Claims

1. A document analysis system that extracts ambiguous expressions from a document, comprising:
a document input unit that receives input of a target document or document group;
a document analysis unit that extracts word information on each word used in the sentences constituting the document or document group and on where each word is used;
a document quality evaluation unit that evaluates document quality using features of the input document based on a predetermined document quality indexing rule and calculates a document quality index;
an ambiguous example database that accumulates character strings of ambiguous words that may be ambiguous, features of examples of differing ambiguity for each ambiguous word, and a degree of ambiguity indicating how ambiguous each example is;
an example analysis unit that, based on the word information, queries the ambiguous example database for the presence or absence of ambiguous words in the document and, when an ambiguous word is present, determines the features of the example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and, by querying the ambiguous example database, extracts each ambiguous word, its degree of ambiguity, and its position in the document as ambiguous example information for that word;
a classification accuracy database that collects and accumulates, for each ambiguity analysis rule, the classification accuracy of ambiguous-word examples obtained when that rule is applied to documents;
an ambiguity analysis condition changing unit that changes the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, the rule using the document quality index and the classification accuracy obtained from the classification accuracy database in response to a query about the ambiguity analysis rules for a specific word, such that, if an ambiguity analysis rule determines whether an example has a higher degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule with poor classification accuracy is excluded from application, and, if an ambiguity analysis rule determines whether an example has a lower degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule is applied even when its classification accuracy is poor;
an ambiguity determination unit that determines the ambiguity of each ambiguous word based on the changed ambiguity analysis conditions; and
an ambiguous information output unit that outputs, for each ambiguous word whose ambiguity was determined by the ambiguity determination unit to be equal to or greater than a threshold, the corresponding degree of ambiguity and the position in the document as ambiguity information.

2. The document analysis system according to claim 1, wherein the ambiguity analysis rule of the example analysis unit is a rule that identifies the example of a word from the character string of the word, the part of speech of each word, and the relative positions of words within a sentence.

3. The document analysis system according to claim 1, wherein the document quality indexing rule of the document quality evaluation unit is a method of indexing how effectively the content of a document is conveyed to its reader.

4. The document analysis system according to any one of claims 1 to 3, wherein the classification accuracy database collects and accumulates the classification accuracy of examples of the ambiguous words for a group of documents on an intranet or a group of documents in the same domain as the document to be analyzed.

5. The document analysis system according to claim 1, wherein the ambiguity analysis condition change rule of the ambiguity analysis condition changing unit is a rule that, for discriminating between examples of differing degrees of ambiguity, changes the discrimination conditions for examples of an ambiguous word so that the higher the value of the document quality index, the less readily an example is classified as an example with a high degree of ambiguity.

6. The document analysis system according to claim 5, wherein the ambiguity determination unit performs the example analysis of the ambiguous words again under the changed analysis conditions, updates the degree of ambiguity of each example, and then determines an ambiguous word whose example has a degree of ambiguity equal to or greater than a predetermined value to be a truly ambiguous word.

7. The document analysis system according to claim 1, wherein the ambiguity analysis condition change rule of the ambiguity analysis condition changing unit is a rule that, for determining whether each example of an ambiguous word is ambiguous, changes the ambiguity determination condition by raising the ambiguity determination threshold, which is the boundary degree of ambiguity at or above which an example is regarded as ambiguous, so that the higher the value of the document quality index, the less readily an example is regarded as ambiguous.

8. The document analysis system according to claim 7, wherein the ambiguity determination unit determines, under the changed analysis conditions, an ambiguous word whose example has a degree of ambiguity equal to or greater than the ambiguity determination threshold to be a truly ambiguous word.

9. A document analysis method for extracting ambiguous expressions from a document, comprising:
a document input step in which a document input unit receives input of a target document or document group;
a document analysis step in which a document analysis unit extracts word information on each word used in the sentences constituting the document or document group and on where each word is used;
a document quality evaluation step in which a document quality evaluation unit evaluates document quality using features of the input document based on a predetermined document quality indexing rule and calculates a document quality index;
an example analysis step in which an example analysis unit, based on the word information, queries an ambiguous example database for the presence or absence of ambiguous words in the document, the database accumulating character strings of ambiguous words that may be ambiguous, features of examples of differing ambiguity for each ambiguous word, and a degree of ambiguity indicating how ambiguous each example is, and, when an ambiguous word is present, determines the features of the example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and, by querying the ambiguous example database, extracts each ambiguous word, its degree of ambiguity, and its position in the document as ambiguous example information for that word;
an ambiguity analysis condition changing step in which an ambiguity analysis condition changing unit changes the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, the rule using the document quality index and the classification accuracy obtained, in response to a query about the ambiguity analysis rules for a specific word, from a classification accuracy database that collects and accumulates, for each ambiguity analysis rule, the classification accuracy of ambiguous-word examples obtained when that rule is applied to documents, such that, if an ambiguity analysis rule determines whether an example has a higher degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule with poor classification accuracy is excluded from application, and, if an ambiguity analysis rule determines whether an example has a lower degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule is applied even when its classification accuracy is poor;
an ambiguity determination step in which an ambiguity determination unit determines the ambiguity of each ambiguous word based on the changed ambiguity analysis conditions; and
an ambiguous information output step in which an ambiguous information output unit outputs, for each ambiguous word whose ambiguity was determined in the ambiguity determination step to be equal to or greater than a threshold, the corresponding degree of ambiguity and the position in the document as ambiguity information.

10. A document analysis program that causes a computer to extract ambiguous expressions from a document, the program causing the computer to execute:
a document input procedure of receiving input of a target document or document group;
a document analysis procedure of extracting word information on each word used in the sentences constituting the document or document group and on where each word is used;
a document quality evaluation procedure of evaluating document quality using features of the input document based on a predetermined document quality indexing rule and calculating a document quality index;
an example analysis procedure of, based on the word information, querying an ambiguous example database for the presence or absence of ambiguous words in the document, the database accumulating character strings of ambiguous words that may be ambiguous, features of examples of differing ambiguity for each ambiguous word, and a degree of ambiguity indicating how ambiguous each example is, and, when an ambiguous word is present, determining the features of the example of each ambiguous word from the word information based on a predetermined ambiguity analysis rule and, by querying the ambiguous example database, extracting each ambiguous word, its degree of ambiguity, and its position in the document as ambiguous example information for that word;
an ambiguity analysis condition changing procedure of changing the conditions for analyzing ambiguity for each ambiguous word based on a predetermined ambiguity analysis condition change rule, the rule using the document quality index and the classification accuracy obtained, in response to a query about the ambiguity analysis rules for a specific word, from a classification accuracy database that collects and accumulates, for each ambiguity analysis rule, the classification accuracy of ambiguous-word examples obtained when that rule is applied to documents, such that, if an ambiguity analysis rule determines whether an example has a higher degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule with poor classification accuracy is excluded from application, and, if an ambiguity analysis rule determines whether an example has a lower degree of ambiguity, the higher the value of the document quality index, the more an ambiguity analysis rule is applied even when its classification accuracy is poor;
an ambiguity determination procedure of determining the ambiguity of each ambiguous word based on the changed ambiguity analysis conditions; and
an ambiguous information output procedure of outputting, for each ambiguous word whose ambiguity was determined in the ambiguity determination procedure to be equal to or greater than a threshold, the corresponding degree of ambiguity and the position in the document as ambiguity information.