JP5379749B2

JP5379749B2 - Document classification apparatus, document classification method, program thereof, and recording medium

Info

Publication number: JP5379749B2
Application number: JP2010135350A
Authority: JP
Inventors: 真詞田本; 理吉岡; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2010-06-14
Filing date: 2010-06-14
Publication date: 2013-12-25
Anticipated expiration: 2030-06-14
Also published as: JP2012003334A

Description

The present invention relates to a document classification device, a document classification method, a program therefor, and a recording medium that provide statistically reliable classification criteria for classifying documents and the like.

When a plurality of documents or the like are classified, the classification exploits differences in the feature values of the documents and groups them by the similarity of those feature values. The feature value generally used is a document vector whose elements are the relative frequencies of the words contained in the document, and the documents are classified into a plurality of clusters based on the distances between their document vectors. The distance between document vectors is defined based on the similarity of the relative frequencies of the words contained in each document. Specifically, for example, it is defined as the distance between the maximum-likelihood values of the posterior probability density distributions of the true values of the relative word frequencies in each document. Clustering is then performed so that document vectors that are close under this distance belong to the same cluster.

Patent Document 1: JP 2008-234482 A

When the distance between document vectors is defined solely by the similarity of the relative frequencies of the words contained in each document, two equal relative frequencies may still rest on different sample sizes, that is, on different absolute occurrence counts. In that case their probability density distributions differ, and so does their statistical reliability. This difference is particularly problematic when the sample size is small: words with low occurrence counts have a relatively large influence, which impairs statistical reliability. As a result, for example, documents that should not belong to the same cluster may be grouped into a single cluster. One conceivable countermeasure is to remove low-frequency words manually (for example, Patent Document 1), but this introduces new problems of efficiency and cost.
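The sample-size effect described above can be illustrated with a small sketch. It assumes a uniform prior on the true relative frequency, so that the posterior after observing k occurrences among n words is Beta(k + 1, n - k + 1); that prior and the function name `beta_posterior_mean_var` are assumptions of this illustration, not something the text specifies.

```python
def beta_posterior_mean_var(k, n):
    """Mean and variance of Beta(k + 1, n - k + 1).

    This is the posterior of a proportion after k hits in n trials
    under a uniform prior (an assumption of this sketch).
    """
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Same relative frequency (10%), very different sample sizes.
small = beta_posterior_mean_var(k=1, n=10)
large = beta_posterior_mean_var(k=100, n=1000)
print(f"n=10:   mean={small[0]:.4f}, variance={small[1]:.6f}")
print(f"n=1000: mean={large[0]:.4f}, variance={large[1]:.6f}")
```

Under these assumptions the short document's estimate has a far larger posterior variance, which is the difference in reliability that a distance based on relative frequency alone ignores.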

An object of the present invention is to provide a document classification device, a document classification method, a program therefor, and a recording medium that, when clustering a plurality of documents, can classify the documents accurately without manual intervention even when the documents are short.

The document classification device of the present invention clusters a plurality of documents and includes a document vector generation unit, a distance calculation unit, and a clustering unit. The document vector generation unit generates, for each document, a document vector whose elements are the relative frequencies of the words contained in the document. The distance calculation unit calculates the distance between document vectors using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies. The clustering unit classifies the documents into a plurality of clusters based on the distances between the document vectors.

According to the document classification device, document classification method, program, and recording medium of the present invention, the distance between document vectors is calculated using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies. Because this also takes the distance between the posterior probability density distributions into account, it improves the statistical reliability of the classification criteria, which would otherwise be degraded by low-frequency words. As a result, when a plurality of documents are clustered, they can be classified accurately without manual intervention even when the documents are short.

FIG. 1 is a block diagram showing a configuration example of the document classification device 100.
FIG. 2 is a diagram showing an example of the processing flow of the document classification device 100.
FIG. 3 is a diagram schematically showing the distance between the posterior probability density distributions of two documents.
FIG. 4 is a diagram showing an example of the processing flow in the clustering unit 130.

FIG. 1 is a block diagram showing a configuration example of the document classification device 100 of the present invention, and FIG. 2 shows an example of its processing flow (S1 to S4). The document classification device 100 clusters a plurality of documents and includes a document vector generation unit 110, a distance calculation unit 120, a clustering unit 130, and a result display unit 140.

The document vector generation unit 110 generates, for each document to be classified, a document vector whose elements are the relative frequencies of the words contained in the document (S1). The words included in the document vector need not be all the words in the document; the words in the document may be analyzed in advance and the vector limited to characteristic key words.
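Step S1 can be sketched as follows. The function name `document_vector`, the optional `vocabulary` argument, and the choice to normalize over the retained words only are assumptions of this sketch; the input is taken to be an already tokenized word list, since the text does not specify how words are extracted.

```python
from collections import Counter


def document_vector(words, vocabulary=None):
    """Map each word to its relative frequency in the document.

    If `vocabulary` is given, only those words are kept, and the
    frequencies are normalized over the kept words (an assumption).
    """
    counts = Counter(words)
    if vocabulary is not None:
        counts = Counter({w: counts[w] for w in vocabulary if counts[w] > 0})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


tokens = "cluster the document and cluster the vector".split()
print(document_vector(tokens))
print(document_vector(tokens, vocabulary={"cluster", "document", "vector"}))
```

A distance that accounts for sample size also needs the raw counts, so a practical implementation would likely keep the `Counter` alongside the normalized vector.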

The distance calculation unit 120 calculates the distance between document vectors using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies (S2). Because this also takes the distance between the posterior probability density distributions into account, it improves the statistical reliability of the classification criteria, which would otherwise be degraded by low-frequency words. Specifically, the distance is calculated, for example, as follows.

Suppose the document vector of a document h shows that n words occur in h, of which the word w occurs k times. The posterior probability of the true value p of the conditional probability P(w|h) can then be expressed as

[Equation (1)]

Cancelling common factors reduces Equation (1) to the binomial posterior distribution

[Equation (2)]
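The two equations referred to above are not reproduced in this text. One plausible reading, assuming a binomial likelihood and a uniform prior on p over [0, 1] (the prior is an assumption of this sketch and is not stated in the surrounding text), is the following:

```latex
% Assumed form of Equation (1): Bayes' rule with a binomial likelihood
% and a uniform prior on p.
P(p \mid n, k) =
  \frac{\binom{n}{k}\, p^{k} (1-p)^{n-k}}
       {\int_{0}^{1} \binom{n}{k}\, q^{k} (1-q)^{n-k} \, dq}

% The binomial coefficient cancels, and the remaining integral is a Beta function:
\int_{0}^{1} q^{k} (1-q)^{n-k} \, dq = \frac{k!\,(n-k)!}{(n+1)!}

% Assumed form of Equation (2): the density of Beta(k+1, n-k+1).
f(p) = \frac{(n+1)!}{k!\,(n-k)!}\, p^{k} (1-p)^{n-k}
```

Under this reading the mode of f(p) is k/n, the observed relative frequency, while its spread shrinks as n grows.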

When samples are drawn independently from the occurrence probability distributions of the word w in two documents i and j, the statistical expected value of the square of their difference can be expressed as a distance measure of the following form.

[Equation (3)]

This distance measure d_{i,j}^{(w)} reflects not only the distance between the maximum-likelihood values of the posterior probability density distributions of the relative frequency of the word w in the two documents i and j, but also the reliability of the probability estimates. FIG. 3 schematically shows the distance, obtained by Equation (3), between the probability density functions of x and y corresponding to the two documents i and j. A point is drawn at random from each of the two binomial posterior probability density distributions in FIG. 3, each point following its own distribution, and the distance measure d_{i,j}^{(w)} is the statistical expected value of the distance between the two drawn points.
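The per-word measure can be sketched numerically. For independent x and y, the expected squared difference E[(x - y)^2] equals (E[x] - E[y])^2 + Var[x] + Var[y], which is a general identity. The sketch below assumes both posteriors are Beta(k + 1, n - k + 1), that is, a uniform prior; that assumption and the function names are illustrative and not taken from the text.

```python
import random


def beta_moments(k, n):
    """Mean and variance of Beta(k + 1, n - k + 1) (uniform-prior assumption)."""
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def word_distance(k_i, n_i, k_j, n_j):
    """Closed form of E[(x - y)^2] for independent x, y drawn from the posteriors."""
    mean_i, var_i = beta_moments(k_i, n_i)
    mean_j, var_j = beta_moments(k_j, n_j)
    return (mean_i - mean_j) ** 2 + var_i + var_j


def word_distance_mc(k_i, n_i, k_j, n_j, samples=100_000, seed=0):
    """Monte Carlo estimate of the same expectation, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.betavariate(k_i + 1, n_i - k_i + 1)
        y = rng.betavariate(k_j + 1, n_j - k_j + 1)
        total += (x - y) ** 2
    return total / samples


print(word_distance(1, 10, 100, 1000))
print(word_distance_mc(1, 10, 100, 1000))
```

In this form the two variance terms carry the reliability of the estimates: two short documents with identical counts still have a positive distance, and that distance shrinks as the documents grow.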

A distance measure d_{i,j} that expresses the similarity of two documents i and j from the co-occurrence of multiple words can then be defined by the following equation, using the binomial posterior distribution defined for each word by Equations (2) and (3) (see Reference 1).

[Equation (4)]

[Reference 1] 田本真詞, 川端豪, "Clustering of Words Focusing on Concatenated Co-occurrence" (連接共起に注目した単語のクラスタリング), IEICE Technical Report, 1993, SP, Speech 93-125, pp. 55-62
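The way the per-word measures combine into a document distance is not reproduced in this text. The sketch below assumes the simplest combination, an unweighted sum over the union of the two vocabularies; the actual definition may weight or normalize the terms differently. It also assumes Beta(k + 1, n - k + 1) posteriors (uniform prior) and takes raw word counts as input, and all function names are illustrative.

```python
def beta_moments(k, n):
    """Mean and variance of Beta(k + 1, n - k + 1) (uniform-prior assumption)."""
    a, b = k + 1, n - k + 1
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))


def word_distance(k_i, n_i, k_j, n_j):
    """Expected squared difference between independent posterior samples."""
    mean_i, var_i = beta_moments(k_i, n_i)
    mean_j, var_j = beta_moments(k_j, n_j)
    return (mean_i - mean_j) ** 2 + var_i + var_j


def document_distance(counts_i, counts_j):
    """Sum of per-word distances over all words in either document.

    The unweighted sum is an assumption of this sketch.
    """
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())
    vocabulary = set(counts_i) | set(counts_j)
    return sum(
        word_distance(counts_i.get(w, 0), n_i, counts_j.get(w, 0), n_j)
        for w in vocabulary
    )


doc_a = {"network": 3, "router": 1}
doc_b = {"network": 2, "router": 2}
doc_c = {"recipe": 4}
print(document_distance(doc_a, doc_b))
print(document_distance(doc_a, doc_c))
```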

The clustering unit 130 classifies the documents into a plurality of clusters based on the distances between document vectors obtained by the distance calculation unit 120 (S3). Various clustering methods can be applied. For example, when an agglomerative hierarchical clustering method is used, clustering can follow the procedure shown in FIG. 4 (S11 to S13). Here, {D_p | 1 ≤ p ≤ n} denotes the set of documents, {C_p | 1 ≤ p ≤ h} denotes the set of classes (clusters), and {K_pu | 1 ≤ u ≤ n_p} denotes the set of documents associated with each C_p. In the initial state, C_p = {D_p}.

First, the distance d_c(C_p, C_q) between clusters C_p and C_q is calculated, and the two closest clusters are selected for processing (S11). The inter-cluster distance d_c(C_p, C_q) is taken to be the distance d_v(D_p, D_q) between the document vectors of the representative documents D_p and D_q of the two clusters. The representative document D_p is the document K_pt among the documents of C_p that minimizes Σ (1 ≤ s, t ≤ n_p, s ≠ t) d_v(K_ps, K_pt), that is, the document whose total distance to the other documents in the cluster is smallest. Next, the clusters C_p and C_q are merged to generate a new cluster C_r (S12). If the number of clusters h has not reached the predetermined number, the process returns to S11; if it has, the clustering process ends (S13).
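Steps S11 to S13 can be sketched as below, assuming the pairwise document distances d_v have already been computed into a square matrix `dist`. The function names, the tie-breaking (the first minimum wins), and the representation of clusters as lists of document indices are assumptions of this sketch.

```python
from itertools import combinations


def representative(cluster, dist):
    """Document in the cluster with the smallest total distance to the others."""
    return min(cluster, key=lambda t: sum(dist[t][s] for s in cluster if s != t))


def agglomerate(dist, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in range(len(dist))]  # initial state: C_p = {D_p}
    target = max(1, target_clusters)
    while len(clusters) > target:               # S13: stop at the preset number
        reps = [representative(c, dist) for c in clusters]
        # S11: closest pair, measured between representative documents.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: dist[reps[pq[0]]][reps[pq[1]]],
        )
        # S12: merge C_p and C_q into a new cluster C_r.
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return clusters


# Four documents placed on a line at 0, 1, 10, 11 (toy distances).
positions = [0, 1, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
print(agglomerate(dist, 2))
```

Recomputing every representative on each pass keeps the sketch short; a larger corpus would call for caching them.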

The result display unit 140 then displays the clustering result (S4). Any display method may be used; for example, the unit displays each cluster produced by the clustering unit 130 together with a list of the labels of the documents belonging to it.
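As one possible rendering of step S4, the sketch below formats each cluster with the labels of its documents. The function name `format_clusters` and the output layout are purely illustrative; the text leaves the display method open.

```python
def format_clusters(clusters, labels):
    """Return one line per cluster listing the labels of its documents."""
    lines = []
    for number, members in enumerate(clusters, start=1):
        names = ", ".join(labels[m] for m in members)
        lines.append(f"Cluster {number} ({len(members)} documents): {names}")
    return "\n".join(lines)


labels = ["report-a", "report-b", "memo-c"]
print(format_clusters([[0, 1], [2]], labels))
```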

As described above, according to the document classification device, document classification method, program, and recording medium of the present invention, the distance between document vectors is calculated using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies. Because this also takes the distance between the posterior probability density distributions into account, it improves the statistical reliability of the classification criteria, which would otherwise be degraded by low-frequency words. As a result, when a plurality of documents are clustered, they can be classified accurately without manual intervention even when the documents are short.

The processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other modifications may be made as appropriate without departing from the spirit of the present invention.

When the document classification device of the present invention is implemented by a computer, the processing performed by the functions of each unit of the device is described by a program. The program is stored, for example, on a hard disk device, and at run time the necessary program and data are loaded into RAM (Random Access Memory). The CPU executes the loaded program, which realizes each process on the computer. At least part of the processing may instead be implemented in hardware.

The program describing the above processing functions can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

As a mode of execution different from the embodiment described above, the computer may read the program directly from the portable recording medium and execute processing according to the program. The computer may also execute processing according to the received program sequentially, each time the program is transferred to it from the server computer. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Claims

A document classification device for clustering a plurality of documents, comprising:
a document vector generation unit that generates, for each document, a document vector whose elements are the relative frequencies of occurrence of the words contained in the document;
a distance calculation unit that calculates the distance between document vectors using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies; and
a clustering unit that classifies the documents into a plurality of clusters based on the distances between the document vectors.

The document classification device according to claim 1, wherein the distance calculation unit:
takes the posterior probability of the true value p of the conditional probability P(w|h), given that the document vector shows n words occurring in a document h, of which the word w occurs k times,

[Equation]

reduces it to the binomial posterior distribution

[Equation]

obtains from it the statistical expected value of the square of the difference between the probability distributions of the word w in two documents i and j,

[Equation]

and obtains the distance d_{i,j} between the two documents i and j using the distance measure defined by

[Equation]

A document classification method for clustering a plurality of documents, comprising:
a document vector generation step of generating, for each document, a document vector whose elements are the relative frequencies of occurrence of the words contained in the document;
a distance calculation step of calculating the distance between document vectors using a distance measure that adds the reliability of the probability estimate to the distance between the maximum-likelihood values of the posterior probability density distributions of the relative word frequencies; and
a clustering step of classifying the documents into a plurality of clusters based on the distances between the document vectors.

The document classification method according to claim 3, wherein the distance calculation step:
takes the posterior probability of the true value p of the conditional probability P(w|h), given that the document vector shows n words occurring in a document h, of which the word w occurs k times,

[Equation]

reduces it to the binomial posterior distribution

[Equation]

and obtains the distance d_{i,j} between two documents i and j using the distance measure defined by

[Equation]

A program for causing a computer to function as the document classification device according to claim 1.

A computer-readable recording medium in which a program for causing a computer to function as the document classification device according to claim 1 is recorded.