JP6480377B2

JP6480377B2 - Classifier learning apparatus, table type classification apparatus, method, and program

Info

Publication number: JP6480377B2
Application number: JP2016093270A
Authority: JP
Inventors: 西田 京介; 松尾 義博; 東中 竜一郎; 貞光 九月
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2016-05-06
Filing date: 2016-05-06
Publication date: 2019-03-06
Anticipated expiration: 2036-05-06
Also published as: JP2017201482A

Description

The present invention relates to a classifier learning device, a table type classification device, a method, and a program for classifying the table type of tabular data.

With the development of computer technology, large amounts of table data now exist, both as tables written in HTML on the Web and as tables in spreadsheets created with spreadsheet software. Table data comes in several types, such as vertical or horizontal list-type tables, vertical or horizontal attribute-type tables, vertical or horizontal enumeration-type tables, matrix-type tables, and other tables used for layout. If the table type can be understood correctly, knowledge applicable to a wide range of services, such as information retrieval and question answering, can be acquired.

[Non-Patent Document 1] Crestan, Eric and Patrick Pantel (2011). “Web-scale Table Census and Classification”. In: Proceedings of the 4th ACM International Conference on Web Search and Data Mining. ACM, pp. 545-554.

Non-Patent Document 1 and others have been proposed as conventional methods for classifying table types. For the text written in the cells of a table, these conventional methods use features such as text length, text type (number, date, monetary amount, etc.), and text pattern (whether the text contains specific symbols). However, they cannot distinguish between list-type and matrix-type tables that have similar structure and text length, such as those shown in FIGS. 7(A) and 7(B).

As shown in FIG. 7(A), a list-type table is a table in which attributes (e.g., monthly fee) are arranged in the header part and attribute values (e.g., 1800 yen) are arranged in the data part. As shown in FIG. 7(B), a matrix-type table is a table in which entities (line XX, option A, etc.) are written in the row and column header parts. From these definitions, a list-type table has high semantic similarity between its header part and its data part, whereas a matrix-type table does not have this characteristic. If the semantic similarity between the header part and the data part can be measured, the accuracy of distinguishing list-type tables from matrix-type tables can be greatly improved.

The present invention has been made in view of the above circumstances, and its object is to provide a classifier learning device, a table type classification device, a method, and a program that can accurately classify the table type of tabular data.

To achieve the above object, a classifier learning device according to the present invention includes: a feature extraction unit that, for each item of tabular data included in a training data set, which is a set of pairs of tabular data and a correct label representing a table type, extracts a feature set used for classification, including the semantic similarity of the text of the cells in the same row and the semantic similarity of the text of the cells in the same column; and a classifier learning unit that learns a classifier for classifying the table type of tabular data, based on the feature sets extracted by the feature extraction unit for each item of tabular data included in the training data set and the correct labels included in the training data set.

In a classifier learning method according to the present invention, a feature extraction unit, for each item of tabular data included in a training data set, which is a set of pairs of tabular data and a correct label representing a table type, extracts a feature set used for classification, including the semantic similarity of the text of the cells in the same row and the semantic similarity of the text of the cells in the same column; and a classifier learning unit learns a classifier for classifying the table type of tabular data, based on the feature sets extracted by the feature extraction unit for each item of tabular data included in the training data set and the correct labels included in the training data set.

A table type classification device according to the present invention includes: a feature extraction unit that, for input tabular data, extracts a feature set used for classification, including the semantic similarity of the in-cell text of the cells in the same row and the semantic similarity of the in-cell text of the cells in the same column; and a table type classification unit that classifies the table type of the input tabular data, based on the feature set extracted by the feature extraction unit and a classifier, learned in advance, for classifying the table type of tabular data.

In a table type classification method according to the present invention, a feature extraction unit, for input tabular data, extracts a feature set used for classification, including the semantic similarity of the in-cell text of the cells in the same row and the semantic similarity of the in-cell text of the cells in the same column; and a table type classification unit classifies the table type of the input tabular data, based on the feature set extracted by the feature extraction unit and a classifier, learned in advance, for classifying the table type of tabular data.

The program of the present invention is a program for causing a computer to function as each unit constituting the above classifier learning device or table type classification device.

As described above, according to the classifier learning device, method, and program of the present invention, a classifier that can improve the accuracy of table type classification can be learned by extracting, for each item of tabular data included in the training data set, a feature set used for classification that includes the semantic similarity of the text of the cells in the same row and the semantic similarity of the text of the cells in the same column.

Further, according to the table type classification device, method, and program of the present invention, the table type can be classified accurately by extracting, for tabular data, a feature set used for classification that includes the semantic similarity of the text of the cells in the same row and the semantic similarity of the text of the cells in the same column.

FIG. 1 is a block diagram showing the functional configuration of the classifier learning device according to the embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the feature extraction unit.
FIG. 3 is a diagram showing an example of tabular data.
FIG. 4 is a block diagram showing the functional configuration of the table type classification device according to the embodiment of the present invention.
FIG. 5 is a flowchart of the classifier learning processing routine in the classifier learning device according to the embodiment of the present invention.
FIG. 6 is a flowchart of the table type classification processing routine in the table type classification device according to the embodiment of the present invention.
FIG. 7(A) is a diagram showing an example of horizontal list-type tabular data, and FIG. 7(B) is a diagram showing an example of matrix-type tabular data.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Configuration of the Classifier Learning Device According to the Embodiment of the Present Invention>
Next, the configuration of the classifier learning device according to the embodiment of the present invention will be described. As shown in FIG. 1, the classifier learning device 100 according to the present embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the classifier learning processing routine described later. Functionally, as shown in FIG. 1, the classifier learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 40.

The input unit 10 accepts a training data set, which is a set of pairs of tabular data and a correct label representing a table type.

Tabular data is a set of cells arranged in rows and columns. Each cell consists of a row number, a column number, a text string, and a header/data flag (1 indicates a header, 0 indicates data). The header/data flag is given, in HTML, by designation with the th tag, or from information estimated by a method such as the one described in Non-Patent Document 2.
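The cell structure described above can be sketched as a small data type. This is only an illustrative sketch: the names `Cell` and `make_table`, the 1-based numbering, and the sample table are assumptions and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Cell:
    row: int        # row number (1-based here; the numbering base is an assumption)
    col: int        # column number
    text: str       # text string written in the cell
    is_header: int  # header/data flag: 1 = header, 0 = data


def make_table(texts: Sequence[Sequence[str]],
               header_positions: Set[Tuple[int, int]]) -> List[Cell]:
    """Build a flat list of cells from a grid of texts.

    header_positions holds the (row, col) pairs flagged as headers,
    for example positions that were marked with th tags in HTML.
    """
    return [
        Cell(r, c, text, 1 if (r, c) in header_positions else 0)
        for r, row in enumerate(texts, start=1)
        for c, text in enumerate(row, start=1)
    ]


# Hypothetical 2x2 table whose first row is a header row.
table = make_table(
    [["plan", "monthly fee"], ["basic", "1800 yen"]],
    header_positions={(1, 1), (1, 2)},
)
```

A flat list of cells keeps the representation independent of whether the headers run along rows, columns, or both.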

[Non-Patent Document 2] Jing Fang, Prasenjit Mitra, Zhi Tang, C. Lee Giles: Table Header Detection and Classification. AAAI 2012

The table types are, for example, a vertical or horizontal list-type table, a vertical or horizontal attribute-type table, a vertical or horizontal enumeration-type table, a matrix-type table, and other layout tables.

As shown in FIG. 1, the calculation unit 20 includes a feature extraction unit 22 and a classifier learning unit 24.

For each item of tabular data included in the training data set, the feature extraction unit 22 extracts a feature set used for classification, including semantic similarity features that represent, for each row, the semantic similarity of the text of the cells in that row and, for each column, the semantic similarity of the text of the cells in that column.

As shown in FIG. 2, the feature extraction unit 22 includes a normal feature extraction unit 30 and a semantic similarity extraction unit 32.

For each item of tabular data included in the training data set, the normal feature extraction unit 30 extracts normal features representing the characteristics of the tabular data. The features described in Non-Patent Document 1 and similar works can be used as the normal features.

For each item of tabular data included in the training data set, the semantic similarity extraction unit 32 extracts semantic similarity features that represent, for each row, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that row and, for each column, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that column.

Specifically, first, for each row (row number i), the set of cells in that row is divided into a header cell set, consisting of the cells whose header/data flag is 1, and a data cell set, consisting of the cells whose header/data flag is 0.
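The split into header and data cell sets can be sketched as follows. The function name `split_lines`, the tuple layout, and the `row1`/`col1` keys are assumptions made for illustration; the sketch also groups by column, since the description applies the same split to columns.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CellTuple = Tuple[int, int, str, int]  # (row, col, text, header/data flag)


def split_lines(cells: Iterable[CellTuple]) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each row and column to (header cell texts, data cell texts)."""
    lines: Dict[str, Tuple[List[str], List[str]]] = defaultdict(lambda: ([], []))
    for row, col, text, flag in cells:
        # Every cell belongs to exactly one row and one column.
        for key in (f"row{row}", f"col{col}"):
            headers, data = lines[key]
            (headers if flag == 1 else data).append(text)
    return dict(lines)


# Hypothetical table: a header row above one data row.
cells = [
    (1, 1, "item", 1), (1, 2, "price", 1),
    (2, 1, "apple", 0), (2, 2, "100 yen", 0),
]
lines = split_lines(cells)
```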

Next, for the in-cell text of each header cell in the header cell set, the similarity to the in-cell text of each cell in the data cell set is calculated, and the mean and the variance of these similarities are calculated.
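Written out, the per-header statistics might take the following form. The symbols are introduced here for illustration only: $h$ is a header cell, $D$ is the data cell set of the same row, and $s(h, d)$ is the similarity between the texts of $h$ and $d$. The specification does not say whether the variance is a population or a sample variance; the population form is assumed.

```latex
\mu_h = \frac{1}{|D|} \sum_{d \in D} s(h, d),
\qquad
\sigma_h^2 = \frac{1}{|D|} \sum_{d \in D} \bigl( s(h, d) - \mu_h \bigr)^2
```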

Finally, for each row, the maximum, minimum, variance, and mean across the header cells are calculated for both the similarity mean and the similarity variance computed for each header cell, and these are used as the feature values of that row for table type classification. The feature name is row(row number i)_semantic-similarity_(name of the header-data similarity statistic)_(name of the inter-header similarity statistic). A method such as word2vec, described in Non-Patent Document 3, can be used to calculate the similarity between header cells and data cells. If the word similarity cannot be measured by word2vec, the similarity is set to 0. If the row contains no header cell, all of these feature values are set to 0. Likewise, if the row contains no data cell, all of these feature values are set to 0.
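The two-level aggregation for one row (or column) can be sketched as below. This is a minimal sketch, not the patented implementation: the function name `line_features`, the English statistic names, and the use of the population variance are assumptions, and the similarity function is passed in by the caller.

```python
from statistics import mean, pvariance
from typing import Callable, Dict, List

# Statistics taken across header cells (the second level of aggregation).
OUTER = {"max": max, "min": min, "variance": pvariance, "mean": mean}
# Statistics taken over the data cells for one header cell (the first level).
INNER = ("mean", "variance")


def line_features(prefix: str,
                  header_texts: List[str],
                  data_texts: List[str],
                  similarity: Callable[[str, str], float]) -> Dict[str, float]:
    """Return the 8 semantic similarity features of one row or column."""
    names = [f"{prefix}_semantic-similarity_{i}_{o}" for i in INNER for o in OUTER]
    # No header cell or no data cell: every feature value is 0.
    if not header_texts or not data_texts:
        return {name: 0.0 for name in names}
    per_header: Dict[str, List[float]] = {"mean": [], "variance": []}
    for h in header_texts:
        sims = [similarity(h, d) for d in data_texts]
        per_header["mean"].append(mean(sims))
        per_header["variance"].append(pvariance(sims))
    return {
        f"{prefix}_semantic-similarity_{i}_{o}": float(fn(per_header[i]))
        for i in INNER
        for o, fn in OUTER.items()
    }
```

Each row or column therefore contributes 2 x 4 = 8 feature values under this reading of the description.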

[Non-Patent Document 3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
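One way to obtain a similarity from word vectors is cosine similarity, with the fallback to 0 that the description specifies when a similarity cannot be measured. The sketch below is an assumption-laden illustration: the toy vectors stand in for vectors that would come from a trained model such as word2vec, each cell text is treated as a single vocabulary entry, and the names `cosine` and `make_similarity` are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_similarity(vectors: Dict[str, Sequence[float]]) -> Callable[[str, str], float]:
    """Build a text similarity function backed by a word vector table."""
    def similarity(a: str, b: str) -> float:
        # A text missing from the vocabulary cannot be measured: similarity is 0.
        if a not in vectors or b not in vectors:
            return 0.0
        return cosine(vectors[a], vectors[b])
    return similarity


# Toy vectors (assumed values, not from a real model).
toy_vectors = {"cost": [1.0, 0.0], "fee": [1.0, 0.0], "option": [0.0, 1.0]}
sim = make_similarity(toy_vectors)
```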

For each column (column number j), the same calculation as for the cell set of each row is performed on the set of cells in that column, to calculate the feature values of that column for table type classification. The feature name is column(column number j)_semantic-similarity_(name of the header-data similarity statistic)_(name of the inter-header similarity statistic), or the like. Taking the tabular data shown in FIG. 3 as an example, the calculation of the feature values for which the header-data similarity statistic is the mean and the inter-header similarity statistic is the mean is shown below.

row1_semantic-similarity_mean_mean = 0
row2_semantic-similarity_mean_mean
= mean { mean { similarity(cost, 1800 yen), similarity(cost, 2000 yen) }, mean { similarity(annual, 1800 yen), similarity(annual, 2000 yen) } }
= mean { mean { 0.85, 0.83 }, mean { 0.83, 0.87 } } = mean { 0.84, 0.85 } = 0.845
row3_semantic-similarity_mean_mean
= mean { mean { similarity(cost, 21600 yen), similarity(cost, 24000 yen) }, mean { similarity(monthly, 21600 yen), similarity(monthly, 24000 yen) } }
= mean { mean { 0.8, 0.8 }, mean { 0.75, 0.77 } } = mean { 0.8, 0.76 } = 0.78
column1_semantic-similarity_mean_mean = 0
column2_semantic-similarity_mean_mean = 0
column3_semantic-similarity_mean_mean
= mean { similarity(Option A, 1800 yen), similarity(Option A, 21600 yen) }
= mean { 0.18, 0.12 } = 0.15
column4_semantic-similarity_mean_mean
= mean { similarity(Option B, 2000 yen), similarity(Option B, 24000 yen) }
= mean { 0.15, 0.15 } = 0.15
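The arithmetic of this example can be checked with a short script. The similarity values are the ones quoted in the example; the lookup table `SIM`, the helper `mean_mean`, and the English cell texts are assumptions made for illustration, and the layout of FIG. 3 is inferred from the example rather than taken from the figure.

```python
from statistics import mean
from typing import List

# Similarity values quoted in the example, keyed by (header text, data text).
SIM = {
    ("cost", "1800 yen"): 0.85, ("cost", "2000 yen"): 0.83,
    ("annual", "1800 yen"): 0.83, ("annual", "2000 yen"): 0.87,
    ("cost", "21600 yen"): 0.8, ("cost", "24000 yen"): 0.8,
    ("monthly", "21600 yen"): 0.75, ("monthly", "24000 yen"): 0.77,
    ("Option A", "1800 yen"): 0.18, ("Option A", "21600 yen"): 0.12,
    ("Option B", "2000 yen"): 0.15, ("Option B", "24000 yen"): 0.15,
}


def mean_mean(headers: List[str], data: List[str]) -> float:
    """Mean over header cells of the mean similarity to the data cells."""
    if not headers or not data:
        return 0.0  # no header cell or no data cell in the line
    return mean([mean([SIM[(h, d)] for d in data]) for h in headers])


row2 = mean_mean(["cost", "annual"], ["1800 yen", "2000 yen"])
row3 = mean_mean(["cost", "monthly"], ["21600 yen", "24000 yen"])
col3 = mean_mean(["Option A"], ["1800 yen", "21600 yen"])
col4 = mean_mean(["Option B"], ["2000 yen", "24000 yen"])
```

Rows with headers close in meaning to their data cells score near 0.8, while the columns headed by entities score 0.15, which is the contrast the classifier is meant to exploit.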

The classifier learning unit 24 learns a classifier for classifying the table type of tabular data, based on the feature sets extracted by the feature extraction unit 22 for each item of tabular data included in the training data set and the correct labels included in the training data set.

Specifically, the classifier learning unit 24 uses the feature sets extracted by the feature extraction unit 22, which include the normal features and the semantic similarity features, to learn from the pairs of feature set and correct label, and the output unit 40 outputs the classifier. Any classification algorithm, such as a support vector machine or a random forest, can be used as the learning algorithm.
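The learning step can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier, which is not one of the algorithms named in the description; it is swapped in only to keep the example dependency-free, and a support vector machine or random forest would take its place in practice. The names `train` and `classify`, the feature values, and the labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FeatureSet = Dict[str, float]
Model = Tuple[List[str], Dict[str, List[float]]]


def train(examples: List[Tuple[FeatureSet, str]]) -> Model:
    """Learn one centroid per table type from (feature set, correct label) pairs."""
    # Tables differ in size, so feature names differ; missing features count as 0.
    names = sorted({name for feats, _ in examples for name in feats})
    sums: Dict[str, List[float]] = defaultdict(lambda: [0.0] * len(names))
    counts: Dict[str, int] = defaultdict(int)
    for feats, label in examples:
        counts[label] += 1
        for k, name in enumerate(names):
            sums[label][k] += feats.get(name, 0.0)
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in counts}
    return names, centroids


def classify(model: Model, feats: FeatureSet) -> str:
    """Return the label whose centroid is closest to the feature vector."""
    names, centroids = model
    x = [feats.get(name, 0.0) for name in names]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


# Hypothetical training data with a single semantic similarity feature.
KEY = "row2_semantic-similarity_mean_mean"
model = train([
    ({KEY: 0.845}, "horizontal list"),
    ({KEY: 0.78}, "horizontal list"),
    ({KEY: 0.15}, "matrix"),
])
```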

<Configuration of the Table Type Classification Device According to the Embodiment of the Present Invention>
Next, the configuration of the table type classification device according to the embodiment of the present invention will be described. As shown in FIG. 4, the table type classification device 150 according to the present embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the table type classification processing routine described later. Functionally, as shown in FIG. 4, the table type classification device 150 includes an input unit 60, a calculation unit 70, and an output unit 80.

The input unit 60 accepts the tabular data to be classified.

As shown in FIG. 4, the calculation unit 70 includes a feature extraction unit 72 and a table type classification unit 74.

For the input tabular data, the feature extraction unit 72 extracts a feature set used for classification, including semantic similarity features, in the same way as the feature extraction unit 22.

Like the feature extraction unit 22 of the classifier learning device 100, the feature extraction unit 72 includes a normal feature extraction unit 30 and a semantic similarity extraction unit 32.

The table type classification unit 74 classifies the table type of the input tabular data, based on the feature set extracted by the feature extraction unit 72 and the classifier for classifying the table type of tabular data learned by the classifier learning device 100.

<Operation of the Classifier Learning Device According to the Embodiment of the Present Invention>
Next, the operation of the classifier learning device 100 according to the embodiment of the present invention will be described. When the input unit 10 accepts a training data set, the classifier learning device 100 executes the classifier learning processing routine shown in FIG. 5.

First, in step S100, normal features are extracted for each item of tabular data included in the input training data set.

In step S102, for each item of tabular data included in the input training data set, semantic similarity features are extracted that represent, for each row, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that row and, for each column, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that column.

In step S104, a classifier for classifying the table type of tabular data is learned, based on the feature sets, which include the normal features and the semantic similarity features extracted in steps S100 and S102 for each item of tabular data included in the training data set, and on the correct labels included in the training data set. The classifier is output by the output unit 40, and the classifier learning processing routine ends.
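Steps S100 to S104 can be sketched as a small driver function. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, the extractors and the learner are supplied by the caller, and the loop extracts both feature groups table by table, which yields the same feature sets as running S100 and then S102 over the whole training set.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

FeatureSet = Dict[str, float]


def classifier_learning_routine(
    training_set: Iterable[Tuple[Any, str]],
    extract_normal: Callable[[Any], FeatureSet],
    extract_semantic: Callable[[Any], FeatureSet],
    learn: Callable[[List[Tuple[FeatureSet, str]]], Any],
) -> Any:
    """Turn (tabular data, correct label) pairs into a learned classifier."""
    examples: List[Tuple[FeatureSet, str]] = []
    for table, label in training_set:
        features: FeatureSet = {}
        features.update(extract_normal(table))    # step S100: normal features
        features.update(extract_semantic(table))  # step S102: semantic similarity features
        examples.append((features, label))
    return learn(examples)                        # step S104: learn the classifier


# Stub components, used only to show the data flow.
result = classifier_learning_routine(
    [("table-1", "horizontal list")],
    extract_normal=lambda t: {"cell_count": 4.0},
    extract_semantic=lambda t: {"row2_semantic-similarity_mean_mean": 0.845},
    learn=lambda examples: examples,  # identity "learner" that returns its input
)
```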

<Operation of the Table Type Classification Device According to the Embodiment of the Present Invention>
Next, the operation of the table type classification device 150 according to the embodiment of the present invention will be described. When the input unit 60 accepts tabular data, the table type classification device 150 executes the table type classification processing routine shown in FIG. 6.

First, in step S150, normal features are extracted for the input tabular data.

In step S152, for the input tabular data, semantic similarity features are extracted that represent, for each row, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that row and, for each column, the semantic similarity between the in-cell text of the header cells and the in-cell text of the data cells in that column.

In step S154, the table type of the tabular data is classified, based on the feature set, which includes the normal features and the semantic similarity features extracted in steps S150 and S152, and on the classifier learned by the classifier learning device 100. The result is output by the output unit 80, and the table type classification processing routine ends.
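Steps S150 to S154 reduce to extracting both feature groups and handing the merged feature set to the learned classifier. In the sketch below, the function name, the stub extractors, and the threshold rule standing in for a learned classifier are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict

FeatureSet = Dict[str, float]


def table_type_classification_routine(
    table: Any,
    extract_normal: Callable[[Any], FeatureSet],
    extract_semantic: Callable[[Any], FeatureSet],
    classifier: Callable[[FeatureSet], str],
) -> str:
    """Classify one item of tabular data with a previously learned classifier."""
    features: FeatureSet = {}
    features.update(extract_normal(table))    # step S150: normal features
    features.update(extract_semantic(table))  # step S152: semantic similarity features
    return classifier(features)               # step S154: classify the table type


KEY = "row2_semantic-similarity_mean_mean"


def stub_classifier(features: FeatureSet) -> str:
    # Hypothetical rule in place of a learned model: high header-data
    # similarity suggests a list-type table, low similarity a matrix-type one.
    return "horizontal list" if features.get(KEY, 0.0) >= 0.5 else "matrix"


label = table_type_classification_routine(
    "table-1",
    extract_normal=lambda t: {"cell_count": 12.0},
    extract_semantic=lambda t: {KEY: 0.845},
    classifier=stub_classifier,
)
```

Using the same extractors at learning time and at classification time keeps the feature names aligned between the two routines.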

As described above, according to the table type classification device of the embodiment of the present invention, the table type can be classified accurately by extracting, for tabular data, a feature set used for classification that includes, for each row, the semantic similarity of the text of the cells in that row and, for each column, the semantic similarity of the text of the cells in that column.

In addition, because row-wise and column-wise semantic similarity can be calculated from the input tabular data, list-type tables and matrix-type tables, which could not be distinguished with conventional features, can be classified.

Further, according to the classifier learning device of the embodiment of the present invention, a classifier that can improve the accuracy of table type classification can be learned by extracting, for each item of tabular data included in the training data set, a feature set used for classification that includes, for each row, the semantic similarity of the text of the cells in that row and, for each column, the semantic similarity of the text of the cells in that column.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the above embodiment describes the case where the classifier learning device and the table type classification device are provided separately, but the classifier learning device and the table type classification device may be realized as a single device.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium, or provided via a network.

The present invention is applicable to tasks such as knowledge acquisition from tabular data and information retrieval, whose accuracy can be improved by understanding table types.

10, 60 Input unit
20, 70 Calculation unit
22, 72 Feature extraction unit
24 Classifier learning unit
30 Normal feature extraction unit
32 Semantic similarity extraction unit
40, 80 Output unit
74 Table type classification unit
100 Classifier learning device
150 Table type classification device

Claims

A classifier learning device comprising:
a feature extraction unit that, for each item of tabular data included in a training data set, which is a set of pairs of tabular data and a correct label representing a table type, extracts a feature set used for classification, including the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same row, and the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same column; and
a classifier learning unit that, based on the feature sets extracted by the feature extraction unit for each item of tabular data included in the training data set and the correct labels included in the training data set, learns a classifier for classifying the table type of tabular data into any one of a vertical list-type table, a horizontal list-type table, a vertical attribute-type table, a horizontal attribute-type table, a vertical enumeration-type table, a horizontal enumeration-type table, a matrix-type table, and other layout tables.

A table type classification device comprising:
a feature extraction unit that, for input tabular data, extracts a feature set used for classification, including the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same row, and the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same column; and
a table type classification unit that, based on the feature set extracted by the feature extraction unit and a classifier, learned in advance, for classifying the table type of tabular data, classifies the table type of the input tabular data into any one of a vertical list-type table, a horizontal list-type table, a vertical attribute-type table, a horizontal attribute-type table, a vertical enumeration-type table, a horizontal enumeration-type table, a matrix-type table, and other layout tables.

A classifier learning method in which a feature extraction unit, for each item of tabular data included in a training data set, which is a set of pairs of tabular data and a correct label representing a table type, extracts a feature set used for classification, including the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same row, and the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same column; and
a classifier learning unit, based on the feature sets extracted by the feature extraction unit for each item of tabular data included in the training data set and the correct labels included in the training data set, learns a classifier for classifying the table type of tabular data into any one of a vertical list-type table, a horizontal list-type table, a vertical attribute-type table, a horizontal attribute-type table, a vertical enumeration-type table, a horizontal enumeration-type table, a matrix-type table, and other layout tables.

A table type classification method in which a feature extraction unit, for input tabular data, extracts a feature set used for classification, including the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same row, and the semantic similarity between the in-cell text of a header cell and the in-cell text of a data cell in the same column; and
a table type classification unit, based on the feature set extracted by the feature extraction unit and a classifier, learned in advance, for classifying the table type of tabular data, classifies the table type of the input tabular data into any one of a vertical list-type table, a horizontal list-type table, a vertical attribute-type table, a horizontal attribute-type table, a vertical enumeration-type table, a horizontal enumeration-type table, a matrix-type table, and other layout tables.

A program for causing a computer to function as each unit of the classifier learning device according to claim 1 or the table type classification device according to claim 2.