JP5214985B2

JP5214985B2 - Text segmentation apparatus and method, program, and computer-readable recording medium

Info

Publication number: JP5214985B2
Application number: JP2008012026A
Authority: JP
Inventors: 直人阿部; 俊郎内山; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2008-01-22
Filing date: 2008-01-22
Publication date: 2013-06-19
Anticipated expiration: 2028-01-22
Also published as: JP2009175895A

Description

The present invention relates to a text segmentation apparatus, method, and program using Web search, and to a computer-readable recording medium. In particular, in fields where text is processed on a computer, it relates to a text segmentation apparatus, method, program, and computer-readable recording medium using Web search that estimate the semantic content of each sentence in a text (prose such as articles and stories) and divide the text into semantically coherent units.

In recent years, rapid improvements in computer performance have made it possible to accumulate enormous amounts of text (here, a set of sentences consisting only of character strings) and to build databases from it. However, organizing and managing the stored text by hand has generally become difficult. A technique called text segmentation has therefore been developed, which analyzes an accumulated text database and divides each text according to its semantic content (into units called semantic paragraphs). It is increasingly being applied to classifying and organizing text databases automatically by computer. For example, there is a technique that performs text segmentation using information called a concept base (see, for example, Patent Document 1).

In this technique, multiple concept vectors, each a numerical vector representation of a word and the patterns that co-occur with it, are created from learning data accumulated in advance. Text segmentation is then performed using a concept base, which is a collection of such concept vectors. The learning data contains a large number of texts on a single field (for example, only the field of "politics").

Conventional text segmentation also relies mainly on methods that evaluate the semantic continuity between sentences based on a degree of connectivity computed across multiple sentences (see, for example, Non-Patent Document 1). In this conventional technique, when the number of sentences considered in computing the connectivity is small, the method follows local changes in semantic content easily but is more likely to over-segment, estimating too many semantic paragraphs. When the number of sentences considered is large, the method can capture global changes in semantic content but has difficulty handling text whose content shifts gradually.
Patent Document 1: JP 2002-342324 A
Non-Patent Document 1: Hearst, M.A.: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16 (1994)

However, improving the accuracy of conventional text segmentation methods requires large-scale learning data. When the learning data is small, the concept base cannot be built properly and the accuracy of text segmentation falls. In addition, although such methods can handle the fields covered by the learning data prepared in advance, they cannot segment texts from other fields. For example, when the learning data contains only information on "politics" and "economy", text segmentation becomes difficult for texts in the field of "sports".

The present invention has been made in view of the above points, and its object is to provide a text segmentation apparatus, method, and program using Web search, and a computer-readable recording medium, that can perform text segmentation without requiring learning data.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (Claim 1) is a text segmentation apparatus that divides text according to its content, comprising:
text decomposition means 201 for dividing input text into sentences;
search word extraction means 211 for performing morphological analysis on the sentences divided by the text decomposition means 201, extracting the resulting nouns, adverbs, verbs, adjectives, and adjectival verbs as search words, converting any of the verbs, adjectives, and adjectival verbs that are not in the terminal form into the terminal form, and storing them in search word storage means 212;
related word acquisition means 221 for performing a web search based on the search words, performing morphological analysis on the retrieved text, acquiring the nouns, adverbs, verbs, adjectives, and adjectival verbs among the analyzed morphemes as related words, converting any of the verbs, adjectives, and adjectival verbs that are not in the terminal form into the terminal form, and storing them in related word storage means 222; and
connectivity determination means 231 for acquiring the search words from the search word storage means 212 and the related words from the related word storage means 222, determining the connectivity between the sentences into which the input text was divided by using keyword sets, each a combination of the search words and the related words, and dividing the input text by extracting semantic paragraphs, each consisting of the sentences lying between two valleys of the connectivity,
wherein the connectivity determination means 231 includes:
means for creating blocks B1 and B2 that gather keyword sets, and obtaining the connectivity of the i-th and (i+1)-th sentences, using the occurrence frequency of each word t, by

[equation not reproduced in the source text]

(where w_t^B1 is the frequency of word t in block B1 and w_t^B2 is the frequency of word t in block B2; the connectivity takes a value from 0 to 1, and the closer it is to 1, the more the words contained in block B1 and block B2 coincide);
means for varying i = {1, 2, ..., N} and computing this connectivity, setting M values of the block size parameter b = (b_1, b_2, ..., b_M), computing the connectivity for each block width, and obtaining the mean of those values as the average connectivity C_i of the i-th and (i+1)-th sentences by

[equation not reproduced in the source text]

; and
means for extracting, from the average connectivity C_i (where i = (1, 2, ..., N)), the valleys of the average connectivity that form the boundaries of semantic paragraphs, based on the condition

[equation not reproduced in the source text]

and obtaining the semantic paragraphs based on those valleys.
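The formulas of Claim 1 are not reproduced in this text, so the following is only a minimal sketch under stated assumptions. It assumes the block connectivity is the cosine similarity of the two blocks' word-frequency vectors, as in the method of Non-Patent Document 1, which matches the stated properties (a value from 0 to 1, equal to 1 when both blocks contain the same words). It also assumes that block B1 gathers the keyword sets of the b sentences ending at sentence i and block B2 those of the b sentences starting at sentence i+1. The function names and the 0-based indexing are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def block_connectivity(block1, block2):
    """Cosine similarity of two blocks' word-frequency vectors (assumed form)."""
    w1, w2 = Counter(block1), Counter(block2)
    # Only words present in both blocks contribute to the numerator.
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values()) *
                     sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0


def average_connectivity(keyword_sets, i, block_sizes):
    """Mean connectivity across the gap after sentence i (0-based),
    taken over every block width b in block_sizes."""
    scores = []
    for b in block_sizes:
        # Assumption: B1 = b sentences ending at i, B2 = b sentences from i+1.
        left = [w for ks in keyword_sets[max(0, i - b + 1): i + 1] for w in ks]
        right = [w for ks in keyword_sets[i + 1: i + 1 + b] for w in ks]
        scores.append(block_connectivity(left, right))
    return sum(scores) / len(scores)
```

Averaging over several block widths is what lets a single score reflect both local and global changes in content, which is the trade-off raised for the conventional technique.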

Further, in the present invention (Claim 2), the connectivity determination means 231 includes means for calculating the average connectivity C_i using a plurality of thresholds C_T and extracting the semantic paragraphs.
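The valley condition itself is not reproduced in this text, so the sketch below substitutes a simple stand-in and should not be read as the claimed condition. It assumes a gap is a valley when its average connectivity is a local minimum and lies below a threshold C_T. It then repeats the segmentation for several thresholds, giving one segmentation per threshold. All names are hypothetical.

```python
def find_valleys(avg_conn, threshold):
    """Gap indices whose average connectivity is a local minimum
    below the threshold (an assumed condition, not the patent's)."""
    valleys = []
    for i, c in enumerate(avg_conn):
        left = avg_conn[i - 1] if i > 0 else float("inf")
        right = avg_conn[i + 1] if i + 1 < len(avg_conn) else float("inf")
        if c < threshold and c <= left and c <= right:
            valleys.append(i)
    return valleys


def split_at_valleys(num_sentences, valleys):
    """Group sentence indices into semantic paragraphs.
    Gap i lies between sentence i and sentence i + 1 (0-based)."""
    paragraphs, start = [], 0
    for v in valleys:
        paragraphs.append(list(range(start, v + 1)))
        start = v + 1
    paragraphs.append(list(range(start, num_sentences)))
    return paragraphs


def segment_for_thresholds(avg_conn, thresholds):
    """One segmentation per threshold C_T."""
    n = len(avg_conn) + 1  # N gaps separate N + 1 sentences
    return {t: split_at_valleys(n, find_valleys(avg_conn, t))
            for t in thresholds}
```

A lower threshold produces fewer, larger semantic paragraphs, so keeping one result per threshold lets the caller choose the granularity.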

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (Claim 3) is a text segmentation method in an apparatus that divides text according to its content, the method performing:
a text decomposition step (Step 1) in which text decomposition means divides input text into sentences;
a search word extraction step (Step 2) in which search word extraction means performs morphological analysis on the sentences divided in the text decomposition step, extracts the resulting nouns, adverbs, verbs, adjectives, and adjectival verbs as search words, converts any of the verbs, adjectives, and adjectival verbs that are not in the terminal form into the terminal form, and stores them in search word storage means;
a related word acquisition step (Step 3) in which related word acquisition means performs a web search based on the search words, performs morphological analysis on the retrieved text, acquires the resulting nouns, adverbs, verbs, adjectives, and adjectival verbs as related words, converts any of the verbs, adjectives, and adjectival verbs that are not in the terminal form into the terminal form, and stores them in related word storage means; and
a connectivity determination step in which connectivity determination means acquires the search words from the search word storage means and the related words from the related word storage means, determines the connectivity between the sentences into which the input text was divided by using keyword sets, each a combination of the search words and the related words (Step 4), and, when a valley of the average connectivity is detected (Step 5), divides the input text by extracting semantic paragraphs, each consisting of the sentences lying between two valleys of the connectivity (Step 6),
wherein the connectivity determination step includes:
a step of creating blocks B1 and B2 that gather keyword sets, and obtaining the connectivity of the i-th and (i+1)-th sentences, using the occurrence frequency of each word t, by

[equation not reproduced in the source text]

(where w_t^B1 is the frequency of word t in block B1 and w_t^B2 is the frequency of word t in block B2; the connectivity takes a value from 0 to 1, and the closer it is to 1, the more the words contained in block B1 and block B2 coincide);
a step of varying i = {1, 2, ..., N} and computing this connectivity, setting M values of the block size parameter b = (b_1, b_2, ..., b_M), computing the connectivity for each block width, and obtaining the mean of those values as the average connectivity C_i of the i-th and (i+1)-th sentences by

[equation not reproduced in the source text]

; and
a step of extracting, from the average connectivity C_i (where i = (1, 2, ..., N)), the valleys of the average connectivity that form the boundaries of semantic paragraphs, based on the condition

[equation not reproduced in the source text]

and obtaining the semantic paragraphs based on those valleys.

Further, in the present invention (Claim 4), the connectivity determination step (Step 3) calculates the average connectivity C_i using a plurality of thresholds C_T and extracts the semantic paragraphs.

The present invention (Claim 5) is a text segmentation program that causes a computer to function as each of the means constituting the text segmentation apparatus according to Claim 1 or 2.

The present invention (Claim 6) is a computer-readable recording medium storing the text segmentation program according to Claim 5.

As described above, by drawing on the idea of searching the Web, the present invention promises a text segmentation technique that does not require learning data to be prepared in advance. This technique can be applied as an aid to automatically organizing and updating databases in fields that handle enormous amounts of text data and in fields that distribute news articles. It also has the advantage of placing few restrictions on the subject matter or creation date of the text to be analyzed, because related words from a wide range of fields can be collected by Web search without using learning data.

Furthermore, because the text is divided into units of coherent content, the invention can be used as a technique for collecting only those content-related passages that contain a given keyword.

In the present invention, using nouns, adverbs, adjectives, adjectival verbs, and the terminal forms of verbs as search words makes it possible to handle a wide range of text content and writing styles, such as news articles and blog posts. The invention can also use every part of speech obtained by morphological analysis as search words. A word either inflects or does not, so words without inflection are used as they are and words with inflection are all converted to the terminal form, which lets words of every part of speech serve as search words.

Embodiments of the present invention are described below with reference to the drawings.

The present invention uses a new text segmentation technique that focuses on the idea of searching the Web. An enormous amount of information is now accumulated on the Web, and the latest topics are constantly being added. The Web can therefore be regarded as a collection of articles carrying all kinds of information. Indeed, when we want to look something up, we enter search words at a search site, search the Web, and learn the meaning of a word or the details of a subject. From this viewpoint, if the information on the Web is used appropriately, concepts such as "sports" and "ball" can be obtained as corresponding to "soccer" and "baseball" without using learning data. In other words, a text can be divided into semantic paragraphs by acquiring words that fit its content from the varied information on the Web and tracking the relatedness of its sentences through the changes in those words.

[First Embodiment]
FIG. 3 shows the configuration of a text segmentation system using web search according to the first embodiment of the present invention.

The system shown in the figure consists of a computer 251, a network 252, a web 253, and a display unit 256. The computer 251 is connected to the network 252 and can access the web 253. The web 253 holds a plurality of articles 254 written in structured languages such as HTML and XML.

The computer 251 consists of a text decomposition processing unit 201, a decomposed sentence storage unit 202, a search word extraction processing unit 211, a search word storage unit 212, a related word acquisition processing unit 221, a related word storage unit 222, a connectivity determination processing unit 231, a semantic paragraph storage unit 232, an input unit 241, and an output unit 242. Text 255 is the text input to the input unit 241 of the computer 251. The display unit 256 is a device for displaying the results output from the control unit 240 through the output unit 242.

FIG. 4 is a flowchart outlining the processing procedure in the first embodiment of the present invention.

First, text is input to the input unit 241 of the computer 251 (step 101). The text decomposition processing unit 201 divides the input text into sentences (step 102), the search word extraction processing unit 211 extracts search words from each sentence (step 103), and the related word acquisition processing unit 221 searches the web with the search words and acquires related words from the search results (step 104). The connectivity determination processing unit 231 extracts semantic paragraphs from the keyword sets, each of which pairs search words with related words (step 105). Finally, the output unit 242 outputs the text segmentation result (step 106).

FIG. 5 shows an example of text in the first embodiment of the present invention.

In the figure, text 255 is an example of the text input to the input unit 241.

FIG. 6 shows an example of the sentences stored in the decomposed sentence storage unit in the first embodiment of the present invention. When the text 255 is input from the input unit 241, the text decomposition processing unit 201 decomposes it into a plurality of sentences, which are stored in the decomposed sentence storage unit 202.

FIG. 7 shows examples of the general words registered in the general word list in the first embodiment of the present invention. The general word list is stored in a memory (not shown). In the figure, k denotes a word registered in the general word list. The list is referenced by the search word extraction processing unit 211 and the related word acquisition processing unit 221.

FIG. 8 shows an example of the search words stored in the search word storage unit in the first embodiment of the present invention. In the figure, for example, the search word extraction processing unit 211 extracts "drive highway" from the first decomposed sentence of FIG. 6 and stores it in the search word storage unit 212.

FIG. 9 shows an example of the related words stored in the related word storage unit in the first embodiment of the present invention. The related word acquisition processing unit 221 extracts the related words corresponding to the search words acquired from the search word storage unit 212, drawing them from the words that remain after removing those registered in the general word list in the memory (not shown), and stores them in the related word storage unit 222 as the related words for those search words.

FIG. 10 shows an example of the keyword sets created by the connectivity determination processing unit in the first embodiment of the present invention. The keyword sets shown in the figure are stored in a memory (not shown) within the connectivity determination processing unit 231. The connectivity determination processing unit 231 generates a keyword set from each pairing of search words and related words and stores it in the semantic paragraph storage unit 232. In the example of the figure, the search words of FIG. 8, "golf, shot, work, driving range, trajectory, practice", are combined with the corresponding related words of FIG. 9, "hit, ball, score, swing, flight distance, buy, round, driver, beginner, receive, purchase, feel, iron, apply", to give the keyword set of FIG. 10, "golf, shot, work, driving range, trajectory, practice, hit, ball, score, swing, flight distance, buy, round, driver, beginner, receive, purchase, feel, iron, apply", which is stored in the memory (not shown).

FIG. 11 shows an example of the semantic paragraphs stored in the semantic paragraph storage unit in the first embodiment of the present invention. In the figure, for each semantic paragraph number under a given threshold, the numbers of the sentences belonging to the semantic paragraph detected by the connectivity determination processing unit 231 are stored.

The text segmentation procedure in the above configuration is described concretely below.

FIG. 12 is a flowchart of the processing of the related word acquisition processing unit in the first embodiment of the present invention.

First, when the text 255 is input via the input unit 241 (step 101), the control unit 240 calls the text decomposition processing unit 201. The text decomposition processing unit 201 reads the text 255 one character at a time, cuts it into sentences to obtain N sentences, and stores them in the decomposed sentence storage unit 202 via the control unit 240. Here, a sentence means a single sentence terminated by the full stop "。". For example, running the text decomposition processing unit 201 on the text 255 shown in FIG. 5 produces the nine sentences shown in FIG. 6, which are stored in the decomposed sentence storage unit 202 (step 102). The number of sentences produced by the text decomposition processing unit 201 depends on the input text. When sentences run on semantically, or when a full stop "。" has been omitted by mistake, several sentences are treated as one sentence.
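The sentence splitting of step 102 can be sketched as follows. This is a minimal illustration that reads the text character by character and cuts after each full stop "。". The function name is hypothetical. Keeping an unterminated trailing fragment as a final sentence is an assumption, since the text does not say how such a fragment is treated.

```python
def split_sentences(text):
    """Split text into sentences, each ending with the full stop "。"."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch == "。":
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    # Assumption: text after the last full stop is kept as a sentence.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Because the split relies only on "。", a missing full stop merges neighboring sentences into one, which matches the behavior described above.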

Next, the control unit 240 activates the search word extraction processing unit 211 for each sentence stored in the decomposed sentence storage unit 202. Search words are the one or more words entered when searching the web. The search word extraction processing unit 211 first performs morphological analysis on the input sentence. It then takes the words that the analysis classifies as nouns, adverbs, adjectives, adjectival verbs, or verbs as search words and stores them in the search word storage unit 212 via the control unit 240. For adjectives, adjectival verbs, and verbs, every inflected form is converted to the terminal form before use.

The extracted words include words in common use, such as "年" (year) and "ある" (to exist), referred to below as general words. A general word list such as that shown in FIG. 7 is therefore created in advance and stored in a memory (not shown), and only words not registered in the list are treated as search words. The list of general words is as shown in FIG. 7. The search words stored in the search word storage unit 212 vary with the general word list.

In addition, it is preferable to run the web AND search with an appropriate number of words. When the number of extracted words is less than a predetermined threshold S_T, the search word extraction processing unit 211 extracts no search words and stores nothing in the search word storage unit 212. Conversely, when the number S of extracted words is equal to or greater than the threshold T, T search words are chosen at random from the S words and stored in the search word storage unit 212. With T = 20 and S_T = 2, running the search word extraction processing unit 211 stores search words such as those in FIG. 8 in the search word storage unit 212.
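The selection rule above can be sketched as follows, assuming morphological analysis and conversion to the terminal form have already produced the candidate words. The function and parameter names are hypothetical. Removing duplicate candidates is an added assumption that the text does not state.

```python
import random


def select_search_words(candidates, general_words, T=20, S_T=2, rng=None):
    """Apply the general word list and the thresholds S_T and T."""
    rng = rng or random.Random()
    words = []
    for w in candidates:
        # Drop general words; de-duplication is an assumption.
        if w not in general_words and w not in words:
            words.append(w)
    if len(words) < S_T:
        return []                    # too few words: store nothing
    if len(words) >= T:
        return rng.sample(words, T)  # choose T of the S words at random
    return words
```

Passing a seeded `random.Random` makes the random choice reproducible when the same text is segmented again.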

After the search words shown in FIG. 8 have been created from the sentences of FIG. 6, the control unit 240 activates the related word acquisition processing unit 221 (step 201). The related word acquisition processing unit 221 first receives the search words extracted by the search word extraction processing unit 211, which are read from the search word storage unit 212 via the control unit 240 (step 202). It then uses these search words to run an AND search on the web 253, which is reached over the network 252 (step 204). An AND search makes the results independent of the order in which the search words are entered. From the pages of the web 253 referenced in the search results, P articles 254 are acquired in ascending order of the difference between their creation time and that of the text 255 (step 205). Articles on the web 253 generally record their creation date, so the time difference from the text 255 can be measured. Using this time difference, articles strongly related to both the creation time and the content of the text 255 can be collected from the web 253.
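The date-based selection of step 205 can be sketched as follows. The sketch assumes each search result is available as a (URL, creation date) pair. That data shape and the function name are illustrative only, and the search request itself is omitted.

```python
from datetime import date


def closest_articles(articles, text_date, P):
    """Return the P articles whose creation date is nearest to text_date.

    articles: list of (url, created) tuples, where created is a datetime.date.
    """
    ranked = sorted(articles, key=lambda a: abs((a[1] - text_date).days))
    return ranked[:P]
```

Since `sorted` is stable, articles with the same time difference keep the order in which the search returned them.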

Collecting texts in ascending order of their time difference from the input text, as described above, is the most preferable approach because it gathers words that are highly relevant to the content of the input text. However, related words can also be collected with sufficient accuracy by using the P texts 254 referenced in the search results without considering the time difference.

If the search word storage unit 212 holds no search words for the sentence, the related word acquisition processing unit 221 performs no web search and stores nothing in the related word storage unit 222 (step 203, No). Likewise, when the number of search words S equals T (step 203, No), no web search is performed and no related words are stored in the related word storage unit 222. To obtain appropriate related words, the number of articles 254 returned by the web search should be as large as possible. Therefore, when the number P of texts obtained by the web search is less than P_T (step 206, Yes), the search words are revised (step 207) and texts are collected again by an AND search on the web (steps 204, 205). Specifically, W of the S search words are dropped, choosing the S - W search words that yield the largest number of hits, and the search is repeated to collect P articles 254.
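The revision of step 207 can be sketched as an exhaustive choice over subsets. `hit_count` is a hypothetical callable that takes a tuple of words and returns the number of AND-search hits. No real search API is assumed, and each call would correspond to one search request.

```python
from itertools import combinations


def relax_query(search_words, W, hit_count):
    """Drop W words, keeping the (S - W)-word subset with the most hits."""
    keep = len(search_words) - W
    if keep <= 0:
        return []
    # Try every subset of size S - W and keep the one with the most hits.
    return list(max(combinations(search_words, keep), key=hit_count))
```

With W = 1 this costs only S requests. For larger W the number of subsets grows combinatorially, so a real implementation would probably need a cheaper heuristic.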

Next, text is extracted from the P articles 254, which were collected in chronological order (step 208). The articles 254 are written in a structured language such as HTML or XML, so the text is obtained by analyzing the tags, which are character strings enclosed in "<" and ">". The related word acquisition processing unit 221 then performs morphological analysis on the extracted text and extracts nouns, adverbs, adjectives, adjective verbs, and verbs (step 209). As in the search word extraction processing unit 211, inflected forms of adjectives, adjective verbs, and verbs are all converted to the terminal form before extraction. The number of related words obtained depends on the search words used for the web search and on the number of articles 254 collected. If the extracted words were used directly as related words, general words could be treated as related words, just as in the search word extraction processing unit 211. The related word acquisition processing unit 221 therefore removes general words by referring to a general word list, such as the one shown in FIG. 7, held in a memory (not shown), in the same way as the search word extraction processing unit 211 (step 210). Then, when there are S search words, the T - S most frequent words among those extracted from the P body texts are taken as related words and stored in the related word storage unit 222 via the control unit 240 (step 211). In other words, the total number of search words and related words extracted for each sentence is kept equal to the predetermined value T.
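Steps 208 to 211 can be sketched as follows. This is a simplified illustration under stated assumptions: the names `extract_text` and `select_related_words` are hypothetical, whitespace splitting stands in for morphological analysis (which for Japanese would require a real analyzer), and the general word list is passed in as a plain set.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Iterable, List, Sequence, Set


class _TextExtractor(HTMLParser):
    """Collects character data, i.e. everything outside '<...>' tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(article: str) -> str:
    """Strip the tags from one structured-language article (step 208)."""
    parser = _TextExtractor()
    parser.feed(article)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())


def select_related_words(
    articles: Iterable[str],
    search_words: Sequence[str],
    general_words: Set[str],
    t: int,
) -> List[str]:
    """Return the T - S most frequent non-general words (steps 209 to 211)."""
    counts: Counter = Counter()
    for article in articles:
        # Assumption: whitespace tokenization replaces morphological analysis.
        for word in extract_text(article).split():
            if word not in general_words:  # step 210: drop general words
                counts[word] += 1
    quota = max(t - len(search_words), 0)  # search + related words total T
    return [word for word, _ in counts.most_common(quota)]
```

Because the quota is T - S, a sentence that already has T search words receives no related words, which matches the S = T case described earlier.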

The revision of the search words and the web search are repeated until the number P of articles 254 reaches P_T or more, and once P ≥ P_T holds, T - S related words are extracted from the P articles 254. If, on the other hand, the number of collected articles 254 never reaches P_T even after the search words are revised, the original S search words are left in the search word storage unit 212 and nothing is stored in the related word storage unit 222 as a related word. As an example, for the search words stored in the search word storage unit 212 shown in FIG. 8, the related words obtained by executing the related word acquisition processing unit 221 with T = 20 and P_T = 20 are as shown in FIG. 9.
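The repetition described above can be sketched as a loop. The names `collect_articles`, `search`, and `revise` are assumptions made for illustration: `search` stands in for the AND search on the web and `revise` for the search word revision of step 207.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Words = Tuple[str, ...]


def collect_articles(
    search_words: Sequence[str],
    p_t: int,
    search: Callable[[Words], List[str]],
    revise: Callable[[Words], Words],
) -> Optional[Tuple[Words, List[str]]]:
    """Repeat search and revision until at least p_t articles are found.

    Returns the words that succeeded and the collected articles, or None when
    no revision yields enough articles (no related words are stored then).
    """
    current: Words = tuple(search_words)
    while current:
        articles = search(current)        # AND search with the current words
        if len(articles) >= p_t:          # P >= P_T: enough articles
            return current, articles
        revised = revise(current)         # hypothetical revision (step 207)
        if len(revised) >= len(current):  # guard: revision must shrink the set
            break
        current = revised
    return None
```

Returning `None` leaves the caller free to keep the original S search words unchanged, as the description requires when the search fails.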

When the number P of texts (articles 254) obtained by the web search is less than P_T, revising the search words and searching again is most preferable, because a keyword set consisting of search words and related words can then be assigned to each sentence. However, in the present invention, when P is less than P_T, a keyword set can also be created from the search words alone, without searching again. In this case, because multiple keyword sets before and after the reference sentence are taken into account, the content of the body text can still be analyzed and text segmentation performed with practical accuracy in a short computation time.

Finally, when the search word extraction processing unit 211 and the related word acquisition processing unit 221 have finished processing all the sentences stored in the decomposed sentence storage unit 202, the control unit 240 activates the connectivity determination processing unit 231. The connectivity determination processing unit 231 first reads, via the control unit 240, the search words and related words stored in the search word storage unit 212 and the related word storage unit 222 and combines them to create keyword sets. FIG. 10 shows an example of the keyword sets created from the search words of FIG. 8 and the related words of FIG. 9. When a sentence has no search words, no corresponding related words exist either, so the connectivity determination processing unit 231 creates no keyword set for it. When search words exist but related words do not, the keyword set is created from the search words alone.
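The rules for assembling keyword sets can be sketched as follows. The function name and the dictionary layout (sentence number mapped to a word list) are assumptions for illustration; an empty list represents a sentence for which no keyword set is created.

```python
from typing import Dict, List


def build_keyword_sets(
    search_words: Dict[int, List[str]],
    related_words: Dict[int, List[str]],
    n_sentences: int,
) -> List[List[str]]:
    """Combine search words and related words per sentence (1-based numbers)."""
    keyword_sets: List[List[str]] = []
    for i in range(1, n_sentences + 1):
        searched = search_words.get(i, [])
        if not searched:
            # No search words means no related words either: no keyword set.
            keyword_sets.append([])
            continue
        # Related words may be missing; the search words alone are then used.
        keyword_sets.append(searched + related_words.get(i, []))
    return keyword_sets
```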

Because a keyword set consists of words that reflect the content of the text, changes in the content of text 255 can be captured by examining how the words in the keyword sets change. The connectivity determination processing unit 231 therefore compares the generated keyword sets and finds semantic paragraphs, each consisting of one or more sentences that cohere in content. The extracted semantic paragraphs are stored in the semantic paragraph storage unit 232 via the control unit 240. Because text is generally written in order from the beginning, the comparison creates blocks, each grouping several keyword sets, in order from the beginning of the text and compares them. Specifically, letting b be the block size, a block B1 containing the (i + 1 - b)-th through i-th keyword sets and a block B2 containing the (i + 1)-th through (i + b)-th keyword sets are determined, and the words in the keyword sets contained in the two blocks are compared. A keyword set that contains no words is not included when blocks B1 and B2 are created; instead, block B1 takes a non-empty keyword set from a sentence preceding the sentence in question, and block B2 takes a non-empty keyword set from a sentence following it. For example, when the keyword set for the j-th sentence is empty, a non-empty keyword set is searched for in the order j - 1, j - 2, ..., 1 when block B1 is created and is included in block B1. When block B2 is created, a non-empty keyword set is searched for in the order j + 1, j + 2, ..., N and is included in block B2.
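The block construction can be sketched as follows. The sketch reads the substitution rule as "block B1 takes the b nearest non-empty keyword sets at or before sentence i, and block B2 the b nearest non-empty keyword sets after it"; that reading, like the function name `build_blocks`, is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple


def build_blocks(
    keyword_sets: Sequence[List[str]],
    i: int,
    b: int,
) -> Tuple[List[str], List[str]]:
    """Return the words of blocks B1 and B2 around the gap after sentence i.

    keyword_sets[k - 1] is the keyword set of sentence k (i is 1-based).
    Empty keyword sets are skipped, so the search continues outward
    (toward sentence 1 for B1, toward sentence N for B2).
    """
    before = [ks for ks in reversed(keyword_sets[:i]) if ks][:b]
    after = [ks for ks in keyword_sets[i:] if ks][:b]
    block1 = [word for ks in before for word in ks]
    block2 = [word for ks in after for word in ks]
    return block1, block2
```

Near the start or end of the text a block may hold fewer than b keyword sets, or none at all; the connectivity for an empty block is defined as 0 in the description.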

After blocks B1 and B2 have been created as described above, the frequency w_t of each word t contained in each block is calculated. The connectivity between the i-th and (i + 1)-th sentences is then evaluated from these frequencies by the following expression.

[Expression for the connectivity: given as an image in the original publication and not reproduced in this text.]

Here, w_t^B1 denotes the frequency of word t in block B1, and w_t^B2 denotes the frequency of word t in block B2. The connectivity takes a value from 0 to 1, and the closer it is to 1, the more the words contained in block B1 and block B2 coincide. When block B1 or block B2 contains no words at all, the connectivity is calculated as 0.

The connectivity determination processing unit 231 calculates the connectivity while varying i over {1, 2, ..., N}. Furthermore, M values of the block size are set as b = (b_1, b_2, ..., b_M), the connectivity is calculated for each block width, and the mean of these values is taken as the average connectivity C_i between the i-th and (i + 1)-th sentences.
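The connectivity expression itself is not reproduced in this text. One measure that has the stated properties (a value from 0 to 1 that approaches 1 as the two blocks share more words, and 0 when either block is empty) is the cosine similarity of the word frequency vectors, as commonly used in block-comparison segmentation. The sketch below assumes that measure; it should be read as one plausible instance, not as the expression of the embodiment. The helper `_block` applies the assumed reading of the empty-set rule described above.

```python
import math
from collections import Counter
from typing import List, Sequence


def connectivity(block1: Sequence[str], block2: Sequence[str]) -> float:
    """Cosine similarity of word frequencies (an assumed stand-in measure)."""
    w1, w2 = Counter(block1), Counter(block2)
    if not w1 or not w2:
        return 0.0  # an empty block gives connectivity 0
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(
        sum(v * v for v in w1.values()) * sum(v * v for v in w2.values())
    )
    return dot / norm


def _block(keyword_sets: Sequence[List[str]], b: int) -> List[str]:
    """Concatenate the first b non-empty keyword sets into one word list."""
    non_empty = [ks for ks in keyword_sets if ks][:b]
    return [word for ks in non_empty for word in ks]


def average_connectivity(
    keyword_sets: Sequence[List[str]], i: int, block_sizes: Sequence[int]
) -> float:
    """Mean connectivity over the M block sizes for the gap after sentence i."""
    scores = []
    for b in block_sizes:
        block1 = _block(keyword_sets[:i][::-1], b)  # sentences i, i-1, ...
        block2 = _block(keyword_sets[i:], b)        # sentences i+1, i+2, ...
        scores.append(connectivity(block1, block2))
    return sum(scores) / len(scores)
```

Averaging over several block sizes means that a small b reacts to local changes while a large b reflects broader shifts in topic.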

Finally, using the average connectivity C_i (where i = (1, 2, ..., N)), the valleys of the average connectivity, which are the boundaries of semantic paragraphs, are extracted and the change in the content of the text is analyzed. A "valley of the average connectivity" is a boundary between semantic paragraphs and appears where the content of the given text changes. A span delimited by such changes forms a semantic unit and is therefore called a "semantic paragraph"; text segmentation is performed by finding the boundaries of the semantic paragraphs. A valley of the average connectivity satisfies the following condition.

[Condition for a valley: given as an image in the original publication and not reproduced in this text.]

The connectivity determination processing unit 231 extracts semantic paragraphs according to the detected valleys. Specifically, when a valley is detected at the i-th position, the first semantic paragraph runs from the beginning of the text, and each subsequent semantic paragraph runs from the position where the previous valley was detected, up to the (i - 1)-th sentence. FIG. 11 shows the result of processing the keyword sets shown in FIG. 10 with the connectivity determination processing unit 231, using the threshold C_T = 0.1. In the example of the semantic paragraph storage unit 232 in FIG. 11, a valley was detected at i = 6 and the text was divided into two semantic paragraphs.
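The valley condition is not reproduced in this text, so the following sketch assumes one: position i is a valley when C_i is below the threshold C_T and is a local minimum with neighbors on both sides. The paragraph extraction follows the rule stated above (a valley at i closes the current paragraph at sentence i - 1). The function names are hypothetical.

```python
from typing import List, Sequence


def find_valleys(avg: Sequence[float], c_t: float) -> List[int]:
    """Return 1-based positions i that are valleys of the average connectivity.

    Assumed condition: C_i < C_T, C_i < C_(i-1), and C_i <= C_(i+1).
    Positions without both neighbors are not considered.
    """
    valleys = []
    for k in range(1, len(avg) - 1):
        if avg[k] < c_t and avg[k] < avg[k - 1] and avg[k] <= avg[k + 1]:
            valleys.append(k + 1)
    return valleys


def split_into_paragraphs(n_sentences: int, valleys: Sequence[int]) -> List[List[int]]:
    """Group sentence numbers so that a valley at i ends a paragraph at i - 1."""
    paragraphs: List[List[int]] = []
    start = 1
    for i in valleys:
        if i - 1 >= start:
            paragraphs.append(list(range(start, i)))
            start = i
    paragraphs.append(list(range(start, n_sentences + 1)))
    return paragraphs
```

With ten sentences and a single valley at i = 6, this yields two paragraphs, sentences 1 to 5 and sentences 6 to 10.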

When the processing of the connectivity determination processing unit 231 ends, the control unit 240 activates the output unit 242, which displays the result of the text segmentation of text 255 on the display unit 256. Specifically, the semantic paragraph storage unit 232 is referred to and the stored sentence numbers are displayed, for example in the manner shown in FIG. 11.

As described above, computation time and accuracy can be adjusted through the parameters T, S_T, P_T, C_T, and b. Adjusting T, S_T, and P_T tunes both the extraction accuracy of the related words obtained by the web search and the computation time. For example, when articles are collected by web search, the condition P ≥ P_T is satisfied as soon as the number P of collected articles reaches P_T, so ending the collection at that point reduces the computation time. The parameters b and C_T adjust the sensitivity with which changes in the content of the body text are captured. Furthermore, using the average connectivity C_i over multiple values of b takes local and global content changes into account at the same time, which resolves two problems of conventional methods: extracting too many semantic paragraphs, and failing to extract semantic paragraphs when the content changes gradually.

[Second Embodiment]
In the present embodiment, a case where a plurality of thresholds C_T are used is described as an application of the first embodiment.

FIG. 13 shows the configuration of a text segmentation system using web search according to the second embodiment of the present invention. In the figure, components identical to those in FIG. 3 are given the same reference numerals and their description is omitted.

A keyboard 1162 for entering a threshold is connected to the computer 1161 shown in FIG. 13. The computer 1161 has the configuration shown in FIG. 3 with a division result selection processing unit 1141 and a generation block storage unit 1132 added.

FIG. 14 is a flowchart outlining the processing procedure in the second embodiment of the present invention.

Text 255 is input to the input unit 241 of the computer 1161 (step 1001). The text division processing unit 201 divides the input text into sentences (step 1002), the search word extraction processing unit 211 extracts search words from each sentence (step 1003), and the related word acquisition processing unit 221 searches the web using the search words and acquires related words from the search results (step 1004). The connectivity determination processing unit 231 extracts semantic paragraphs from the keyword sets, each of which combines search words and related words (step 1005). The division result selection processing unit 1141 determines whether there is more than one segmentation result (step 1006); if there is, it presents the results to the user and has the user enter a threshold from the keyboard 1162 (step 1007). The division result selection processing unit 1141 then outputs the text segmentation result corresponding to the entered threshold from the output unit 242 (step 1008).

In the following, the examples of data stored in the text 255, the decomposed sentence storage unit 202, the search word storage unit 212, and the related word storage unit 222 in FIG. 13 are the same as in the first embodiment, and only the differences from the first embodiment are described.

The connectivity determination processing unit 231 is executed for a plurality of thresholds C_T. Specifically, after the keyword sets are created, the average connectivity C_i is calculated and semantic paragraphs are extracted while the thresholds C_T are used one after another. After the semantic paragraphs are extracted, the semantic paragraph storage unit 232 is referred to via the control unit 240, and if the division result differs from those already stored, the threshold and the semantic paragraphs at that time are stored in the semantic paragraph storage unit 232 via the control unit 240. FIG. 15 shows an example in which a plurality of division results are stored in the semantic paragraph storage unit 232. When the same division result is obtained for several thresholds, only one threshold and one division result are stored in the semantic paragraph storage unit 232.
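Storing only the distinct division results can be sketched as follows. The function name `distinct_divisions` is hypothetical, and `segment` is an assumed callable that runs the segmentation for one threshold C_T and returns the semantic paragraphs as lists of sentence numbers.

```python
from typing import Callable, List, Sequence, Tuple

Division = List[List[int]]


def distinct_divisions(
    thresholds: Sequence[float],
    segment: Callable[[float], Division],
) -> List[Tuple[float, Division]]:
    """Run the segmentation per threshold and keep each distinct result once."""
    stored: List[Tuple[float, Division]] = []
    for c_t in thresholds:
        division = segment(c_t)
        # Store the threshold only if this division has not been seen before.
        if all(division != previous for _, previous in stored):
            stored.append((c_t, division))
    return stored
```

When several thresholds give the same division, the first threshold tried is the one kept; the description states only that one threshold and one division result are stored, so that choice is an assumption.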

After the processing of the connectivity determination processing unit 231, the control unit 240 activates the division result selection processing unit 1141. The division result selection processing unit 1141 refers to the semantic paragraph storage unit 1132 via the control unit 240. When a plurality of thresholds are stored, it displays the division results stored in the semantic paragraph storage unit 232 on the display unit 256 via the output unit 242 and asks the user to enter a threshold from the keyboard 1162. The user consults the division results displayed on the display unit 256 and enters one threshold. When the threshold is entered, the control unit 240 passes it to the division result selection processing unit 1141. The division result selection processing unit 1141 then refers to the semantic paragraph storage unit 232 via the control unit 240, keeps only the division result corresponding to the entered threshold, and deletes the other division results. When only one division result is stored in the semantic paragraph storage unit 232, the division result selection processing unit 1141 does not ask the user for a threshold and outputs the stored division result to the display unit 256 through the output unit 242.
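The selection logic can be sketched as follows. The names are hypothetical: `ask_threshold` stands in for displaying the stored results and reading a threshold from the keyboard, and the stored results are assumed to be (threshold, division) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Division = List[List[int]]
Stored = Sequence[Tuple[float, Division]]


def choose_division(
    stored: Stored,
    ask_threshold: Callable[[Stored], float],
) -> Division:
    """Return the single stored division, or the one the user selects."""
    if not stored:
        raise ValueError("no division result is stored")
    if len(stored) == 1:
        return stored[0][1]  # one result: the user is not asked
    chosen = ask_threshold(stored)  # hypothetical prompt for a threshold
    for threshold, division in stored:
        if threshold == chosen:
            return division
    raise ValueError(f"no division result stored for threshold {chosen}")
```

The exact comparison of thresholds works only if the user enters one of the displayed values; a real interface might instead let the user pick an entry from a list.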

In the second embodiment described above, when processing is performed for a plurality of thresholds C_T, the resulting division results are presented and the user selects a segmentation result by entering a threshold. This lets the user perform text segmentation while adjusting whether to divide only at large changes in content or also at fine ones. On the other hand, when input texts are to be divided in a batch, processing with a single threshold C_T divides the input text automatically.

The components of the text segmentation apparatus (computer) in the first and second embodiments described above can be built as a program, which can be installed and executed on a computer used as the text segmentation apparatus or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for dividing, on a computer, the sentences of texts such as articles and stories into semantic groups.

FIG. 1 is a diagram of the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the text segmentation apparatus using web search in the first embodiment of the present invention.
FIG. 4 is a flowchart outlining the processing procedure in the first embodiment of the present invention.
FIG. 5 is an example of text in the first embodiment of the present invention.
FIG. 6 is an example of sentences stored in the decomposed sentence storage unit in the first embodiment of the present invention.
FIG. 7 is an example of general words registered in the general word list in the first embodiment of the present invention.
FIG. 8 is an example of search words stored in the search word storage unit in the first embodiment of the present invention.
FIG. 9 is an example of related words stored in the related word storage unit in the first embodiment of the present invention.
FIG. 10 is an example of keyword sets created by the connectivity determination processing unit in the first embodiment of the present invention.
FIG. 11 is an example of sentence numbers stored in the semantic paragraph storage unit in the first embodiment of the present invention.
FIG. 12 is a flowchart of the processing of the related word acquisition processing unit in the first embodiment of the present invention.
FIG. 13 is a configuration diagram of the text segmentation system using web search in the second embodiment of the present invention.
FIG. 14 is a flowchart outlining the processing procedure in the second embodiment of the present invention.
FIG. 15 is an example of sentence numbers stored in the semantic paragraph storage unit in the second embodiment of the present invention.

Explanation of symbols

201 Text decomposition means, text decomposition processing unit
202 Decomposed sentence storage unit
211 Search word extraction means, search word extraction processing unit
212 Search word storage means, search word storage unit
221 Related word extraction means, related word acquisition processing unit
222 Related word storage means, related word storage unit
231 Connectivity determination means, connectivity determination processing unit
232 Semantic paragraph storage unit
241 Input unit
242 Output unit
251 Computer
252 Network
253 Web
254 Article written in a structured language
255 Text
256 Display unit
1132 Generation block storage unit
1141 Division result selection processing unit
1162 Keyboard

Claims

1. A text segmentation device that divides text according to its content, comprising:
text decomposition means for dividing input text into sentences;
search word extraction means for performing morphological analysis on each sentence divided by the text decomposition means, extracting the nouns, adverbs, verbs, adjectives, and adjective verbs among the analyzed morphemes as search words, converting any of the verbs, adjectives, and adjective verbs that is not in the terminal form into the terminal form, and storing the search words in search word storage means;
related word acquisition means for performing a web search based on the search words, performing morphological analysis on the retrieved text, acquiring the nouns, adverbs, verbs, adjectives, and adjective verbs among the analyzed morphemes as related words, converting any of the verbs, adjectives, and adjective verbs that is not in the terminal form into the terminal form, and storing the related words in related word storage means; and
connectivity determination means for acquiring the search words from the search word storage means and the related words from the related word storage means, determining, using keyword sets each combining the search words and the related words, the connectivity between the plurality of sentences obtained by dividing the input text, and dividing the input text by extracting semantic paragraphs, each being the sentences between valleys of the connectivity,
wherein the connectivity determination means includes:
means for creating blocks B1 and B2, each grouping keyword sets, and obtaining the connectivity between the i-th and (i + 1)-th sentences from the appearance frequency of each word t by the following expression
[expression not reproduced in this text]
(where w_t^B1 denotes the frequency of word t in block B1, w_t^B2 denotes the frequency of word t in block B2, and the connectivity takes a value from 0 to 1, a value closer to 1 indicating that blocks B1 and B2 contain more of the same words);
means for calculating the connectivity while varying i over {1, 2, ..., N}, setting M values of the block size as b = (b_1, b_2, ..., b_M), calculating the connectivity for each block width, and obtaining the mean of these values as the average connectivity C_i between the i-th and (i + 1)-th sentences; and
means for extracting, using the average connectivity C_i (where i = (1, 2, ..., N)), valleys of the average connectivity, which are boundaries of semantic paragraphs, under the following condition
[condition not reproduced in this text]
and acquiring the semantic paragraphs based on the valleys.

2. The text segmentation device according to claim 1, wherein the connectivity determination means further includes means for calculating the average connectivity C_i and extracting the semantic paragraphs while using a plurality of thresholds C_T.

3. A text segmentation method in a device that divides text according to its content, comprising:
a text decomposition step in which text decomposition means divides input text into sentences;
a search word extraction step in which search word extraction means performs morphological analysis on each sentence divided in the text decomposition step, extracts the nouns, adverbs, verbs, adjectives, and adjective verbs among the analyzed morphemes as search words, converts any of the verbs, adjectives, and adjective verbs that is not in the terminal form into the terminal form, and stores the search words in search word storage means;
a related word acquisition step in which related word acquisition means performs a web search based on the search words, performs morphological analysis on the retrieved text, acquires the nouns, adverbs, verbs, adjectives, and adjective verbs among the analyzed morphemes as related words, converts any of the verbs, adjectives, and adjective verbs that is not in the terminal form into the terminal form, and stores the related words in related word storage means; and
a connectivity determination step in which connectivity determination means acquires the search words from the search word storage means and the related words from the related word storage means, determines, using keyword sets each combining the search words and the related words, the connectivity between the plurality of sentences obtained by dividing the input text, and divides the input text by extracting semantic paragraphs, each being the sentences between valleys of the connectivity,
wherein the connectivity determination step includes:
a step of creating blocks B1 and B2, each grouping keyword sets, and obtaining the connectivity between the i-th and (i + 1)-th sentences from the appearance frequency of each word t by the following expression
[expression not reproduced in this text]
(where w_t^B1 denotes the frequency of word t in block B1, w_t^B2 denotes the frequency of word t in block B2, and the connectivity takes a value from 0 to 1, a value closer to 1 indicating that blocks B1 and B2 contain more of the same words);
a step of calculating the connectivity while varying i over {1, 2, ..., N}, setting M values of the block size as b = (b_1, b_2, ..., b_M), calculating the connectivity for each block width, and obtaining the mean of these values as the average connectivity C_i between the i-th and (i + 1)-th sentences; and
a step of extracting, using the average connectivity C_i (where i = (1, 2, ..., N)), valleys of the average connectivity, which are boundaries of semantic paragraphs, under the following condition
[condition not reproduced in this text]
and extracting the semantic paragraphs based on the valleys.

4. The text segmentation method according to claim 3, wherein, in the connectivity determination step, the average connectivity C_i is calculated and the semantic paragraphs are extracted while using a plurality of thresholds C_T.

5. A text segmentation program for causing a computer to function as each means constituting the text segmentation device according to claim 1.

6. A computer-readable recording medium storing the text segmentation program according to claim 5.