JP5648008B2

JP5648008B2 - Document classification method, apparatus, and program

Info

Publication number: JP5648008B2
Application number: JP2012062818A
Authority: JP
Inventors: 川場 真理子; 小林 のぞみ; 平野 徹; 牧野 俊朗
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2012-03-19
Filing date: 2012-03-19
Publication date: 2015-01-07
Anticipated expiration: 2032-03-19
Also published as: JP2013196382A

Description

The present invention relates to a document classification method, apparatus, and program, and more particularly to a document classification method, apparatus, and program for classifying which category a page document of a website belongs to.

A classification method that classifies documents using n-grams of the words in a document is known (for example, Patent Document 1). This method builds a model by learning word n-grams that tend to appear in a specific category.

平野耕一, 古林紀哉, 高橋淳一, "Automatic Classification of Japanese-Language Blogs" (日本語圏ブログの自動分類), IPSJ Research Report, 2005
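As a rough illustration of the word n-grams used by the known method, the following minimal sketch lists the n-grams of an already segmented word sequence. The function name `word_ngrams` and the sample words are assumptions made for illustration and do not come from the cited document.

```python
def word_ngrams(words, n):
    """Return all contiguous n-word tuples of `words`, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical, already segmented body text.
words = ["ワンピース", "に", "合う", "ネックレス"]
bigrams = word_ngrams(words, 2)
# bigrams == [("ワンピース", "に"), ("に", "合う"), ("合う", "ネックレス")]
```

Every word of the body text contributes n-grams here, including words such as "ネックレス" that name a different category.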

However, when a document is classified using the n-grams of every word in it as clues, a large number of words unrelated to the category of the body text are included and act as noise. As a result, classification performance degrades.

For example, the text shown in FIG. 5 is body text from a document in the "dress" (ワンピース) category, yet it contains many words such as "necklace", "belt", "accessory", and "shorts" that are related to dresses but are themselves the names of other categories, or that are strongly associated with other categories. Because many such words appear in body text, they cause classification errors and degrade classification performance.

The present invention has been made in view of the above circumstances, and its object is to provide a document classification method, apparatus, and program that can classify the page documents of a website accurately.

To achieve the above object, a document classification method according to the present invention includes:
a step in which site structure extraction means extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
a step in which morphological analysis means performs morphological analysis on the site structure information extracted by the site structure extraction means;
a step in which feature extraction means extracts, from the plurality of morphemes obtained from the morphological analysis result, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
a step in which classification means classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.

A document classification apparatus according to the present invention includes:
site structure extraction means that extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
morphological analysis means that performs morphological analysis on the site structure information extracted by the site structure extraction means;
feature extraction means that extracts, from the plurality of morphemes obtained from the morphological analysis result, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
classification means that classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.

A program according to the present invention causes a computer to function as:
site structure extraction means that extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
morphological analysis means that performs morphological analysis on the site structure information extracted by the site structure extraction means;
feature extraction means that extracts, from the plurality of morphemes obtained from the morphological analysis result, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
classification means that classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.

As described above, the document classification method, apparatus, and program of the present invention extract features from the site structure information extracted from a page document of a website and use them to classify which of a plurality of categories the page document belongs to, and can thereby classify the page documents of the website accurately.

FIG. 1 is a schematic diagram showing the configuration of the page document classification device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an input page document.
FIG. 3 is a diagram showing an example of a breadcrumb list.
FIG. 4 is a flowchart showing the document classification processing routine of the page document classification device according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of text.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<System configuration>
The page document classification device 100 according to the embodiment of the present invention receives a page document, which is the HTML text of a website page, and outputs a classification result indicating which of a plurality of manually prepared categories the page document belongs to. The page document classification device 100 is a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the document classification processing routine described later, and is functionally configured as follows. As shown in FIG. 1, the page document classification device 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 receives a page document input as a classification target, namely the HTML text of one page of a website composed of a plurality of pages. For example, HTML text such as that shown in FIG. 2 is input as the page document to be classified.

The calculation unit 20 includes a text analysis unit 21, a site structure extraction unit 22, a morphological analysis unit 23, a feature extraction unit 24, a model storage unit 25, and a classification unit 26.

The text analysis unit 21 analyzes the HTML text of the input page document and cuts out the site structure, body text, title, and so on.

The site structure extraction unit 22 extracts the site structure information in the HTML text of the input page document based on the analysis result of the text analysis unit 21. Site structure information represents information about each page passed through before reaching the classification target page and is obtained, for example, from the breadcrumb list portion of the HTML text.

Pages such as product pages carry site structure information, and this embodiment uses it to obtain information about the pages passed through on the way from the top page of the website to the product page. Because the path from the top page to a product page passes through many pages related to that product, keywords closely related to the product's category can be obtained from the site structure information. This embodiment is described using a breadcrumb list as the site structure information.

For example, the site structure extraction unit 22 extracts a breadcrumb list such as that shown in FIG. 3 as the site structure information. As in the example of FIG. 3, a breadcrumb list records information about each page passed through from the top page of the site to the current page.
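The breadcrumb extraction described above might be sketched as follows using only the Python standard library. The specification does not say how the breadcrumb list is marked up, so the `class="breadcrumb"` convention, the `BreadcrumbParser` class, and the `extract_breadcrumb` function are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BreadcrumbParser(HTMLParser):
    """Collects the text of the first element whose class list has 'breadcrumb'."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside the breadcrumb element (0 = outside)
        self.done = False  # True once the breadcrumb element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif "breadcrumb" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_breadcrumb(html):
    """Return the breadcrumb text of a page, or '' if no breadcrumb is found."""
    parser = BreadcrumbParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical page fragment modeled on the example site structure in the text.
page = ('<div class="breadcrumb"><a href="/">サイトトップ</a>＞'
        '<a href="/acc">婦人装飾小物</a>＞<span>ハンドバッグ</span></div>'
        '<p>本文</p>')
site_structure = extract_breadcrumb(page)
```

For this fragment, `site_structure` is the string "サイトトップ＞婦人装飾小物＞ハンドバッグ", and the body paragraph is ignored.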

The morphological analysis unit 23 uses morphological analysis, an existing technique, to split the text representing the site structure information extracted by the site structure extraction unit 22 into words, assigns a part of speech to each word, and outputs the result. For example, when the site structure text "トップ＞鞄・財布＞トートバッグ" ("Top > Bags / Wallets > Tote bags") is input, the morphological analysis result "トップ (noun) / (space) / ＞ (symbol) / (space) / 鞄 (noun) / ・ (symbol) / 財布 (noun) ..." is obtained.
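To make the shape of this output concrete, the sketch below produces (surface form, part of speech) pairs for a breadcrumb string. It is a toy stand-in, not real morphological analysis: it assumes every run of non-separator characters is a single noun, whereas an actual system would use a Japanese morphological analyzer. The names `toy_analyze` and `SEPARATORS` are hypothetical.

```python
import re

# Symbols assumed to connect the noun phrases of a breadcrumb (an assumption).
SEPARATORS = "＞>・/"

# A token is a whitespace run, a single separator, or a run of anything else.
TOKEN_RE = re.compile(
    r"\s+|[" + re.escape(SEPARATORS) + r"]|[^\s" + re.escape(SEPARATORS) + r"]+"
)


def toy_analyze(text):
    """Split breadcrumb text into (surface, part_of_speech) pairs."""
    morphemes = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if token.isspace():
            pos = "space"
        elif token in SEPARATORS:
            pos = "symbol"
        else:
            pos = "noun"  # toy assumption: everything else is one noun
        morphemes.append((token, pos))
    return morphemes


morphemes = toy_analyze("トップ ＞ 鞄・財布")
# [("トップ", "noun"), (" ", "space"), ("＞", "symbol"), (" ", "space"),
#  ("鞄", "noun"), ("・", "symbol"), ("財布", "noun")]
```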

The feature extraction unit 24 takes the plurality of morphemes obtained by morphologically analyzing the site structure information (breadcrumb list) extracted by the site structure extraction unit 22, removes the symbol morphemes that connect noun phrases, and extracts a feature representing the surface forms of the remaining morphemes and the order in which they appear.

For example, from the site structure "サイトトップ＞婦人装飾小物＞ハンドバッグ" ("Site top > Women's accessories > Handbags"), the feature "（サイトトップ（婦人装飾小物（ハンドバッグ）））" ("(Site top (Women's accessories (Handbags)))") is extracted. For a site structure such as "Home appliances > Vacuum cleaners > Product A", any other feature representation may be used as long as it shows that the items appeared in the order "Home appliances → Vacuum cleaners → Product A".
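One way to build such an order-preserving feature is sketched below. The input format, a list of (surface form, part of speech) pairs with the labels "symbol" and "space", and the function name `nested_feature` are assumptions made for illustration; ASCII parentheses are used in place of the full-width ones in the example above.

```python
def nested_feature(morphemes):
    """Drop symbol and space morphemes, then nest the remaining surface forms
    so that the result encodes the order in which they appeared."""
    kept = [surface for surface, pos in morphemes if pos not in ("symbol", "space")]
    feature = ""
    # Wrap from the innermost (last) morpheme outward.
    for surface in reversed(kept):
        feature = "(" + surface + feature + ")"
    return feature


# Hypothetical analysis result for "サイトトップ＞婦人装飾小物＞ハンドバッグ".
analyzed = [
    ("サイトトップ", "noun"), ("＞", "symbol"),
    ("婦人装飾小物", "noun"), ("＞", "symbol"),
    ("ハンドバッグ", "noun"),
]
feature = nested_feature(analyzed)
# feature == "(サイトトップ(婦人装飾小物(ハンドバッグ)))"
```

Because the nesting depends on position, the reversed path yields a different feature string, which is what lets the feature carry the appearance order.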

The model storage unit 25 stores a classification model for each of a plurality of categories prepared in advance. The classification models stored in the model storage unit 25 are learned for each category by a learning unit (not shown).

For each category, the learning unit uses positive training examples (features obtained from page documents that belong to the category) and negative training examples (features obtained from page documents that do not belong to the category), both obtained from a group of page documents serving as learning documents, to create by machine learning a classification model that classifies whether an input page document belongs to the category, and stores the model in the model storage unit 25. For example, to classify into 10 categories, a classification model is created for each of the 10 categories. As the machine learning algorithm, for example, a support vector machine (SVM) or a Markov Logic Network (MLN) can be used.
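The per-category split into positive and negative examples can be sketched as below. Only the data preparation is shown; training the SVM or MLN itself is out of scope here. The function name `one_vs_rest_datasets`, the +1/-1 label encoding, and the sample data are assumptions made for illustration.

```python
def one_vs_rest_datasets(labeled_features, categories):
    """Build one binary training set per category.

    `labeled_features` is a list of (feature, category) pairs taken from the
    learning documents. For each category, a feature is labeled +1 if its page
    belongs to that category (positive example) and -1 otherwise (negative).
    """
    datasets = {}
    for category in categories:
        datasets[category] = [
            (feature, 1 if label == category else -1)
            for feature, label in labeled_features
        ]
    return datasets


# Hypothetical learning documents, already reduced to features.
training = [
    ("(サイトトップ(婦人装飾小物(ハンドバッグ)))", "handbag"),
    ("(サイトトップ(婦人服(ワンピース)))", "dress"),
]
datasets = one_vs_rest_datasets(training, ["handbag", "dress"])
```

Each entry of `datasets` would then be passed to a binary learner to obtain that category's classification model.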

Following the well-known one-versus-rest technique, the classification unit 26 outputs, for each input page document, the category to which the page document belongs, based on the per-category classification models created by the learning unit and the feature extracted by the feature extraction unit 24. Specifically, a score is calculated with each of the 10 classification models created for the 10 categories, and the category whose model gives the highest score is taken as the category to which the page document belongs.
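The one-versus-rest decision can be sketched as follows. As a simplification, each category model is represented here by a hypothetical linear scorer (token weights plus a bias) over the morphemes of the feature, standing in for the decision value of a trained SVM; the weights, category names, and function names are invented for illustration.

```python
def linear_score(weights, bias, tokens):
    """Score of one category model: bias plus the weights of the tokens seen."""
    return bias + sum(weights.get(token, 0.0) for token in tokens)


def classify_one_vs_rest(models, tokens):
    """Score the tokens with every category model and return the best category."""
    scores = {
        category: linear_score(weights, bias, tokens)
        for category, (weights, bias) in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical models for two categories.
models = {
    "handbag": ({"ハンドバッグ": 2.0, "婦人装飾小物": 0.5}, -1.0),
    "dress": ({"ワンピース": 2.0}, -1.0),
}
tokens = ["サイトトップ", "婦人装飾小物", "ハンドバッグ"]
category, scores = classify_one_vs_rest(models, tokens)
# category == "handbag"; scores == {"handbag": 1.5, "dress": -1.0}
```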

Alternatively, the learning unit may create a classification model for multi-class classification, and multi-class classification may be performed for each input page document based on the feature extracted by the feature extraction unit 24.

As the classification result of the classification unit 26, the pair of the input page document and its category is output from the output unit 30.

<Operation of the page document classification device>
Next, the operation of the page document classification device 100 according to this embodiment will be described. First, when a document group consisting of a plurality of page documents obtained from websites as learning documents, together with teacher information indicating which of the plurality of categories prepared in advance each of those page documents belongs to, is input to the page document classification device 100, the learning unit of the page document classification device 100 creates a classification model for each category and stores it in the model storage unit 25.

Then, when a page document to be classified, belonging to a website composed of a plurality of pages, is input to the page document classification device 100, the page document classification device 100 executes the document classification processing routine shown in FIG. 4.

First, in step S101, the page document input through the input unit 10 is received. In step S102, the text analysis unit 21 performs text analysis on the page document input in step S101. In step S103, the site structure extraction unit 22 extracts the site structure from the page document based on the analysis result of step S102.

In step S104, the morphological analysis unit 23 performs morphological analysis on the site structure text extracted in step S103. In the next step S105, the feature extraction unit 24 extracts a feature based on the morphological analysis result for the site structure extracted in step S103. In step S106, the classification unit 26 classifies which category the page document belongs to, based on the feature extracted in step S105 and the per-category classification models stored in the model storage unit 25. In step S107, the output unit 30 outputs the classification result of step S106 together with the page document, and the document classification processing routine ends.
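The routine of steps S101 to S107 amounts to a simple pipeline, sketched below. Each processing unit is passed in as a callable so that the sketch stays self-contained; the function name `classify_page`, its parameter names, and the stub callables in the usage are assumptions made for illustration.

```python
def classify_page(page_document, analyze_text, extract_site_structure,
                  analyze_morphemes, extract_feature, classify):
    """Run the document classification routine on one page document (S101)."""
    parsed = analyze_text(page_document)         # S102: text analysis
    structure = extract_site_structure(parsed)   # S103: site structure extraction
    morphemes = analyze_morphemes(structure)     # S104: morphological analysis
    feature = extract_feature(morphemes)         # S105: feature extraction
    category = classify(feature)                 # S106: classification
    return page_document, category               # S107: output the pair


# Usage with trivial stand-in callables (not the real units).
result = classify_page(
    "家電>掃除機",
    analyze_text=lambda html: html,
    extract_site_structure=lambda parsed: parsed,
    analyze_morphemes=lambda text: text.split(">"),
    extract_feature=lambda ms: "(" + "(".join(ms) + ")" * len(ms),
    classify=lambda feature: "appliance" if "掃除機" in feature else "other",
)
# result == ("家電>掃除機", "appliance")
```

Passing the units as callables mirrors the functional block structure of the device, where each unit could be replaced independently.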

As described above, the page document classification device according to this embodiment extracts features from the site structure information extracted from a page document of a website and uses them to classify which of a plurality of categories the page document belongs to, and can thereby classify the page documents of the website accurately.

In the conventional technique, a feature such as "（バーバリーブロッサム（の（ランウェア（にも（登場（し（た（バッグ（の（ミニサイズ（が（登場（。））））））））））））…" was extracted from the body text (DESCRIPTION) of the HTML text of the page document. In contrast, this embodiment focuses on the site structure (TOPIC_PASS) and extracts a feature such as "（サイトトップ（婦人装飾小物（ハンドバッグ）））" ("(Site top (Women's accessories (Handbags)))"), so learning is performed with less noise and classification performance can be improved.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the site structure information is not limited to information about each page passed through before reaching the classification target page; site structure information representing information about each page reached via the classification target page may be extracted instead. For HTML text without a breadcrumb list, the link structure contained in the HTML text of the page document may be extracted, site structure information may be created with the title of each linked page as a topic, and features may be extracted from the created site structure information.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of symbols
10 Input unit
20 Calculation unit
21 Text analysis unit
22 Site structure extraction unit
23 Morphological analysis unit
24 Feature extraction unit
25 Model storage unit
26 Classification unit
30 Output unit
100 Page document classification device

Claims

A document classification method comprising:
a step in which site structure extraction means extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
a step in which morphological analysis means performs morphological analysis on the site structure information extracted by the site structure extraction means;
a step in which feature extraction means extracts, from the plurality of morphemes obtained from the morphological analysis result of the morphological analysis means, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
a step in which classification means classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.

A document classification apparatus comprising:
site structure extraction means that extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
morphological analysis means that performs morphological analysis on the site structure information extracted by the site structure extraction means;
feature extraction means that extracts, from the plurality of morphemes obtained from the morphological analysis result of the morphological analysis means, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
classification means that classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.

A program for causing a computer to function as:
site structure extraction means that extracts, from the page document of a classification target page of a website containing a plurality of pages, site structure information representing information about each page passed through before reaching the classification target page, or information about each page reached via the classification target page;
morphological analysis means that performs morphological analysis on the site structure information extracted by the site structure extraction means;
feature extraction means that extracts, from the plurality of morphemes obtained from the morphological analysis result of the morphological analysis means, a feature representing the surface forms of the morphemes that remain after removing the symbol morphemes that connect noun phrases, together with the order in which those morphemes appear; and
classification means that classifies which of a plurality of categories the page document of the classification target page belongs to, based on a classification model that has been machine-learned in advance using features from learning documents and that classifies which of the plurality of categories a page document belongs to, and on the feature extracted from the site structure information by the feature extraction means, that is, the feature representing the surface forms of the morphemes excluding the symbol morphemes that connect noun phrases and the order in which those morphemes appear.