JP6976910B2

JP6976910B2 - Data classification system, data classification method, and data classification device

Info

Publication number: JP6976910B2
Application number: JP2018127516A
Authority: JP
Inventors: 雅文露木; 洋司小澤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-07-04
Filing date: 2018-07-04
Publication date: 2021-12-08
Anticipated expiration: 2038-07-04
Also published as: JP2020008992A

Description

The present invention relates to a data classification system, a data classification method, and a data classification device.

To manage large volumes of data such as documents, images, and numerical values efficiently, a search system that narrows the data down by the category to which it belongs and identifies the target data is useful.
Realizing such a search system requires a mechanism that sorts this large volume of data into predefined categories. A single data item, however, may belong to several categories. Techniques therefore exist that classify data by assigning multiple category labels to a single data item (the multi-label classification problem).

Even with such techniques, manual classification is impractical when the volume of data to classify is large. Techniques therefore also exist that build a multi-label classifier through supervised machine learning and perform the classification automatically.

A classifier is any program that takes the feature values of the data to be classified as input and computes and outputs a classification probability for each label to be assigned (it can also be written by hand, for example with IF statements, rather than by machine learning). The multi-label classifier mentioned above is one kind of classifier: one that classifies a single data item by assigning it multiple labels.
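As a minimal sketch of a classifier written by hand with IF statements rather than learned, the following Python function maps a description text to per-label classification probabilities. The label names, keywords, probability values, and the use of 0.5 to mean "no opinion" are all hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical label set; the source does not fix any particular labels.
LABELS = ("welding", "failure_analysis")


def keyword_classifier(text: str) -> Dict[str, float]:
    """Hand-written multi-label classifier built from IF statements.

    Returns one classification probability per label. A value of 0.5 is
    used here to mean "this classifier has no opinion" (an assumption).
    """
    probs = {label: 0.5 for label in LABELS}
    lowered = text.lower()
    if "weld" in lowered:
        probs["welding"] = 0.9
    if "failure" in lowered or "defect" in lowered:
        probs["failure_analysis"] = 0.8
    return probs


result = keyword_classifier("Welding failure cause analysis data")
print(result)
```

Because each label gets its own probability, a single description can receive several labels at once, which is what distinguishes a multi-label classifier from a single-label one.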

As noted above, building a multi-label classifier by supervised learning requires feeding the learner with training data consisting of the feature values of the data to be classified and the true classification results (labels) corresponding to those feature values. It is difficult for a single classifier to learn every combination of diverse feature values and diverse labels correctly; in many cases a single classifier classifies correctly only for a limited set of feature values or a limited set of labels.

For this reason, an accurate multi-label classifier is commonly built by combining multiple classifiers and training an integrated classifier (ensemble learning). There are many ways to combine classifiers; taking a majority vote or the average of the individual classification results is well known.
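The two well-known combination rules mentioned here, averaging and majority voting, can be sketched as follows. This is only an illustration of those generic rules, not the integrated classifier of the embodiment; the function names, the 0.5 cutoff, and the sample outputs are assumptions.

```python
from typing import Dict, List


def average_ensemble(outputs: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine classifiers by averaging their per-label probabilities."""
    labels = outputs[0].keys()
    return {label: sum(o[label] for o in outputs) / len(outputs) for label in labels}


def majority_vote(outputs: List[Dict[str, float]], cutoff: float = 0.5) -> Dict[str, bool]:
    """Assign a label when more than half of the classifiers vote for it."""
    labels = outputs[0].keys()
    return {
        label: sum(o[label] > cutoff for o in outputs) > len(outputs) / 2
        for label in labels
    }


# Outputs of three hypothetical classifiers for the same data item.
outputs = [
    {"welding": 0.9, "painting": 0.2},
    {"welding": 0.7, "painting": 0.6},
    {"welding": 0.8, "painting": 0.1},
]
averaged = average_ensemble(outputs)
voted = majority_vote(outputs)
print(averaged, voted)
```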

Because of the nature of ensemble learning, training the integrated classifier requires selecting only those classifiers that help improve accuracy. One proposed method (see Patent Document 1) improves the classification accuracy of the integrated classifier by including even a classifier that misclassifies, provided its output is consistent with the true classification results.

Such ensemble learning, however, presupposes a large amount of training data covering diverse combinations of feature values and labels. In practice, such a large amount of training data often does not exist, and only a small, biased set is available. Creating a large amount of training data by hand is also unrealistic.

Without training data, the classifiers that help improve the classification accuracy of the integrated classifier cannot be selected automatically. This is especially serious in the multi-label classification problem, where the number of combinations of feature values and labels is enormous.

As a way to train a classifier from a small amount of training data, there is the concept of so-called semi-supervised learning, which increases the training data by using data that has no labels (unlabeled data). As related prior art, a method has been proposed (see Non-Patent Document 1) in which the user builds classifiers from domain knowledge rather than from training data: many simple classifiers that the user can easily write are combined to train an integrated classifier (ensemble learning), and the classification results of the integrated classifier are used in place of true labels to make up for the shortage of training data.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2015-11686

Non-Patent Document 1: Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher Ré, "Data programming: Creating large training sets, quickly," Advances in Neural Information Processing Systems 29, pp. 3567-3575, 2016.

When the method of the above prior art (Non-Patent Document 1) is used to create training data for a multi-label classification problem, the user must hand-write a large number of classifiers to cover the diverse combinations of feature values and labels.

The user therefore has to create and select a large number of classifiers by trial and error while visually checking the classification results of the integrated classifier. As a result, creating an integrated classifier with high classification accuracy becomes difficult.
An object of the present invention is therefore to provide a technique that makes the creation of classifiers for an integrated classifier efficient and enables the classification accuracy of the integrated classifier to be improved.

A data classification system of the present invention that solves the above problem includes: a storage device that stores training data in which the classification results of a predetermined data set by each of a plurality of classifiers are integrated and which defines the correspondence between the feature values of the data set and the labels that are the classification results; and an arithmetic device that executes a process of reading the training data from the storage device and calculating, for each unlabeled data set in the training data, based on the feature values and on the classification probabilities output by an integrated classifier trained from the classification results of the individual classifiers, a simultaneous unclassified rate, which is the probability that multiple labels are left unclassified for one data set, and a simultaneous classification probability, which is the probability that multiple labels are assigned to one data set; a process of calculating a recommendation score by aggregating, for each label, the product of the simultaneous unclassified rate and the simultaneous classification probability; and a process of identifying labels in descending order of recommendation score as labels for which a classifier should be additionally created, and outputting recommendation information for those labels to a predetermined device.

In a data classification method of the present invention, an information processing system provided with a storage device that stores training data, in which the classification results of a predetermined data set by each of a plurality of classifiers are integrated and which defines the correspondence between the feature values of the data set and the labels that are the classification results, executes: a process of reading the training data from the storage device and calculating, for each unlabeled data set in the training data, based on the feature values and on the classification probabilities output by an integrated classifier trained from the classification results of the individual classifiers, a simultaneous unclassified rate, which is the probability that multiple labels are left unclassified for one data set, and a simultaneous classification probability, which is the probability that multiple labels are assigned to one data set; a process of calculating a recommendation score by aggregating, for each label, the product of the simultaneous unclassified rate and the simultaneous classification probability; and a process of identifying labels in descending order of recommendation score as labels for which a classifier should be additionally created, and outputting recommendation information for those labels to a predetermined device.

A data classification device of the present invention includes: a communication device that communicates with other devices over a predetermined network; and an arithmetic device that executes a process of accessing a predetermined device through the communication device and acquiring training data held by the predetermined device, the training data integrating the classification results of a predetermined data set by each of a plurality of classifiers and defining the correspondence between the feature values of the data set and the labels that are the classification results; a process of calculating, for each unlabeled data set in the training data, based on the feature values and on the classification probabilities output by an integrated classifier trained from the classification results of the individual classifiers, a simultaneous unclassified rate, which is the probability that multiple labels are left unclassified for one data set, and a simultaneous classification probability, which is the probability that multiple labels are assigned to one data set; a process of calculating a recommendation score by aggregating, for each label, the product of the simultaneous unclassified rate and the simultaneous classification probability; and a process of identifying labels in descending order of recommendation score as labels for which a classifier should be additionally created, and outputting recommendation information for those labels to a predetermined device.

The present invention makes the creation of classifiers for an integrated classifier efficient and improves the classification accuracy of the integrated classifier.

A diagram showing an example network configuration that includes the data classification system of the present embodiment.
A diagram showing an example hardware configuration of the classifier creation recommendation server of the present embodiment.
A diagram showing an example structure of the unlabeled data of the present embodiment.
A diagram showing an example structure of the classifier management information of the present embodiment.
A diagram showing an example structure of the classification result information of the present embodiment.
A diagram showing an example structure of the training data information of the present embodiment.
A diagram showing an example structure of the unclassified rate information of the present embodiment.
A diagram showing an example structure of the simultaneous unclassified rate information of the present embodiment.
A diagram showing an example structure of the document group information of the present embodiment.
A diagram showing an example structure of the simultaneous classification probability information of the present embodiment.
A diagram showing an example structure of the recommendation score information of the present embodiment.
A diagram showing an example structure of the user information document group of the present embodiment.
A diagram showing an example structure of the user information classification result of the present embodiment.
A diagram showing an example flow of the training data generation method in the present embodiment.
A diagram showing an example flow of the classifier creation recommendation method in the present embodiment.
A diagram showing an example of output in the present embodiment.

--- Network configuration ---
FIG. 1 is a network configuration diagram that includes the data classification system 100 of the present embodiment. The data classification system 100 shown in FIG. 1 is an information processing system that makes the creation of classifiers for an integrated classifier efficient and improves the classification accuracy of that integrated classifier.

As illustrated by the network configuration in FIG. 1, the data classification system 100 acquires classifiers (for example, ones created by the operator of the user terminal 101) from the user terminal 101 and creates the integrated classifier 303 based on them. The data classification system 100 has the effect of making multi-label classification of the unlabeled data 201 by the integrated classifier 303 efficient. One purpose of classifying the unlabeled data 201 is to create the training data 204 used to build a multi-label classifier.

As one example, the data classification system 100 shown in FIG. 1 can be assumed to consist of the training data generation server 102 and the classifier creation recommendation server 103. As a minimum configuration, however, the necessary functions may instead be implemented in a single device, such as the classifier creation recommendation server 103 alone.

As shown in FIG. 1, the data classification system 100 of the present embodiment can communicate over a suitable network 406 with external devices such as the user terminal 101, the unlabeled data management server 104, and the document group management server 105, and can acquire and read the unlabeled data 201 and the document group 207 as needed.

An organization that operates the data classification system 100 of the present embodiment might be, for example, a business entity that analyzes production efficiency and works to reduce the number of defective products at a factory.

In the factory managed by this business entity, a large number of descriptions written by data creators (for example, "This is welding failure cause analysis data") are assumed to have accumulated as the unlabeled data 201, one for each of the diverse IoT data sets generated by sensors, machine tools, and other equipment installed in the factory. The unlabeled data 201 is a data set that has not yet been classified, that is, labeled, by the integrated classifier 303.

So that even a data analyst who is unaware of this IoT data can discover it and run analyses with it, it is desirable to classify the large accumulation of unlabeled data 201 into categories automatically by machine learning, so that the IoT data can be narrowed down by category.

However, the feature values of the unlabeled data 201 vary with the diversity of the IoT data, and flexible narrowing requires labels that express diverse categories. Moreover, because a single IoT data set may belong to several categories, training data for building a multi-label classifier is needed.

Accordingly, as already described, the training data generation server 102 of the data classification system 100 runs a classifier receiving unit 301 that receives classifier definitions from the user terminal 101, a classifier execution unit 302 that contains classifiers 302-1 to 302-N (N being any natural number), and the integrated classifier 303. The integrated classifier 303 is trained with the classification result information 203 produced by the classifiers 302-1 to 302-N for the unlabeled data 201 as input, and the training data 204 is generated as the result of classifying the unlabeled data 201 with the integrated classifier 303.

Specifically, the integrated classifier 303 and the training data 204 can be created by having a human data classifier follow the method disclosed in Non-Patent Document 1, repeatedly adding classifiers while visually checking the resulting training data 204.

The user terminal 101 is a terminal operated by a human data classifier who uses the data classification system 100. The data classifier views the display of the user terminal 101 and operates it to define a new classifier for the unlabeled data 201 and send it to the training data generation server 102.

To make this additional creation of classifiers by the data classifier efficient, in the present embodiment the classifier creation recommendation server 103 notifies the data classifier's user terminal 101 of recommendation information on the labels for which classifiers should be additionally created.

The classifier creation recommendation server 103 of the present embodiment consists of: an unclassified rate calculation unit 304 that calculates the unclassified rate information 205 from the training data 204; a simultaneous unclassified rate calculation unit 305 that calculates the simultaneous unclassified rate information 206 from the unclassified rate information 205; a simultaneous classification probability calculation unit 306 that reads the document group 207 from the document group management server 105 and calculates the simultaneous classification probability information 208; a recommendation score calculation unit 307 that calculates the recommendation score information 209 from the simultaneous unclassified rate information 206 and the simultaneous classification probability information 208; a classifier generation unit 308 that generates a classifier from the simultaneous classification probability information 208 and sends it to the training data generation server 102; and a recommendation execution unit 309 that sends the recommendation score information 209 to the user terminal 101 selected according to the user information classification result 211.

Of these, the unclassified rate calculation unit 304, on receiving the training data 204, obtains the classification probability of each label for each data set (hereinafter simply "data") contained in the training data 204, and treats a label as unclassified for the data when its classification probability falls within a predetermined threshold range.

The unclassified rate calculation unit 304 preferably also counts the unclassified data for every label, calculates each label's unclassified rate by dividing its unclassified count by the number of data to be classified (the number of data contained in the training data), and stores the result in the unclassified rate information 205.
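The per-label unclassified rate described above can be sketched in Python as follows. The representation of the training data as a list of label-to-probability dictionaries and the threshold range of 0.4 to 0.6 are assumptions made for illustration; the source only states that a predetermined threshold range is used.

```python
from typing import Dict, List, Tuple

# One dict of label -> classification probability per data item (assumed layout).
TrainingData = List[Dict[str, float]]


def is_unclassified(prob: float, band: Tuple[float, float] = (0.4, 0.6)) -> bool:
    """Treat a label as unclassified when its probability lies inside the band."""
    low, high = band
    return low <= prob <= high


def unclassified_rates(data: TrainingData, labels: List[str]) -> Dict[str, float]:
    """Per label: number of unclassified items divided by the number of items."""
    total = len(data)
    return {
        label: sum(is_unclassified(item[label]) for item in data) / total
        for label in labels
    }


training_data = [
    {"welding": 0.9, "painting": 0.5},
    {"welding": 0.5, "painting": 0.5},
    {"welding": 0.1, "painting": 0.2},
    {"welding": 0.45, "painting": 0.55},
]
rates = unclassified_rates(training_data, ["welding", "painting"])
print(rates)
```

In this sample, two of four items are undecided for "welding" and three of four for "painting", so "painting" is the label that the current classifiers cover less well.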

If the unclassified rate calculated by the unclassified rate calculation unit 304 is below a predetermined threshold, specified in advance by the data classifier, for every label, the integrated classifier 303 has classified a sufficient amount of the unlabeled data 201 and no further classifiers need to be added. The classifier creation recommendation server 103 may therefore end processing at this point.

If, on the other hand, the unclassified rate is not below the predetermined threshold for at least one label, the unclassified rate calculation unit 304 sends the unclassified rate information 205 to the simultaneous unclassified rate calculation unit 305.

On receiving the unclassified rate information 205, the simultaneous unclassified rate calculation unit 305 preferably calculates from it, for example, the simultaneous unclassified rate $u_{ij}$ as the proportion of data for which both the i-th label (synonymous with "label i"; the same applies below) and the j-th label (synonymous with "label j"; the same applies below) are unclassified, and stores $u_{ij}$ in the simultaneous unclassified rate information 206.
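A minimal sketch of the simultaneous unclassified rate for every pair of labels follows. It assumes the same hypothetical data layout (one label-to-probability dictionary per data item) and the same illustrative threshold range of 0.4 to 0.6 as before; the function and variable names are not from the source.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def joint_unclassified_rates(
    data: List[Dict[str, float]],
    labels: List[str],
    band: Tuple[float, float] = (0.4, 0.6),
) -> Dict[Tuple[str, str], float]:
    """u[(i, j)]: share of items for which labels i and j are both unclassified."""
    low, high = band
    total = len(data)
    rates: Dict[Tuple[str, str], float] = {}
    for i, j in combinations(labels, 2):
        both = sum(low <= item[i] <= high and low <= item[j] <= high for item in data)
        # The rate is symmetric, so store it under both orderings.
        rates[(i, j)] = rates[(j, i)] = both / total
    return rates


data = [
    {"welding": 0.9, "painting": 0.5, "inspection": 0.5},
    {"welding": 0.5, "painting": 0.5, "inspection": 0.9},
    {"welding": 0.1, "painting": 0.2, "inspection": 0.5},
    {"welding": 0.45, "painting": 0.55, "inspection": 0.5},
]
u = joint_unclassified_rates(data, ["welding", "painting", "inspection"])
print(u[("welding", "painting")])
```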

The simultaneous classification probability calculation unit 306 preferably reads the document group 207 from the document group management server 105, calculates the co-occurrence probability of the words contained in the i-th label and the j-th label, and stores the result in the simultaneous classification probability information 208 as the simultaneous classification probability $p_{ij}$. This word co-occurrence probability can be calculated by a known method (for example, Yutaro Fujii, Takuya Yoshimura, Takayuki Ito, and Tetsushi Ando, "Proposal of an automatic harmful document classification method using co-occurrence information between multiple words," Proceedings of the 10th Forum on Information Technology (FIT2011), 2011).
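As a rough illustration of a word co-occurrence estimate, the sketch below takes the fraction of documents in which both label words appear. This is a deliberately simple stand-in, not the cited known method; the function name, the substring matching, and the sample documents are assumptions.

```python
from typing import List


def cooccurrence_probability(documents: List[str], word_i: str, word_j: str) -> float:
    """Fraction of documents that contain both words (case-insensitive substring match)."""
    if not documents:
        return 0.0
    hits = sum(
        word_i.lower() in doc.lower() and word_j.lower() in doc.lower()
        for doc in documents
    )
    return hits / len(documents)


documents = [
    "Welding defects found during inspection of line A",
    "Painting schedule for line B",
    "Inspection report: welding quality is stable",
    "Welding robot maintenance log",
]
p_welding_inspection = cooccurrence_probability(documents, "welding", "inspection")
p_welding_painting = cooccurrence_probability(documents, "welding", "painting")
print(p_welding_inspection, p_welding_painting)
```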

The present embodiment describes calculating the simultaneous classification probability $p_{ij}$ on the assumption that the unlabeled data 201 contains no labels at all, but in practice the unlabeled data 201 may contain a small number of manually assigned label classification results. In that case, the simultaneous classification probability may be calculated from these label classification results, or with them in combination, as the probability that the same data is classified into both labels at once.

The document group 207 held by the document group management server 105 may include internal documents such as the business entity's data analysis reports, documents published on the Internet, and the documents of the unlabeled data 201.

The classifier generation unit 308 automatically creates a classifier that, when data has been classified into the i-th label by another classifier, classifies it into the j-th label with the simultaneous classification probability $p_{ij}$ indicated by the simultaneous classification probability information 208, and sends this classifier to the classifier receiving unit 301 of the training data generation server 102. Specifically, such a classifier can be generated automatically as an IF statement.
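One way to picture such an automatically generated IF-statement classifier is a factory function that closes over a label pair and its probability. The sketch assumes that "classified into label i" means the other classifier's probability for label i exceeds a cutoff of 0.5 and that the generated rule abstains otherwise; these choices and all names are hypothetical.

```python
from typing import Callable, Dict

Classifier = Callable[[Dict[str, float]], Dict[str, float]]


def make_cooccurrence_classifier(
    label_i: str, label_j: str, p_ij: float, cutoff: float = 0.5
) -> Classifier:
    """Build a rule: IF another classifier assigned label_i THEN emit p_ij for label_j."""

    def classifier(other_output: Dict[str, float]) -> Dict[str, float]:
        if other_output.get(label_i, 0.0) > cutoff:
            return {label_j: p_ij}
        return {}  # abstain when label_i was not assigned

    return classifier


rule = make_cooccurrence_classifier("welding", "inspection", p_ij=0.7)
print(rule({"welding": 0.9}))
print(rule({"welding": 0.2}))
```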

The recommendation score calculation unit 307 preferably reads the simultaneous unclassified rate $u_{ij}$ from the simultaneous unclassified rate information 206 and the simultaneous classification probability $p_{ij}$ from the simultaneous classification probability information 208, calculates the recommendation score for the i-th label as $\sum_{j} u_{ij} p_{ij}$, and stores it in the recommendation score information 209.
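The recommendation score, the sum over j of the simultaneous unclassified rate times the simultaneous classification probability, can be sketched as below. The pair tables are plain dictionaries keyed by label pairs, the sum is assumed to skip j = i (the source does not say whether that term is included), and the numeric values are made up for illustration.

```python
from typing import Dict, List, Tuple

PairTable = Dict[Tuple[str, str], float]


def symmetric(pairs: PairTable) -> PairTable:
    """Expand {(i, j): v} so that (j, i) maps to the same value."""
    table: PairTable = {}
    for (i, j), value in pairs.items():
        table[(i, j)] = table[(j, i)] = value
    return table


def recommendation_scores(u: PairTable, p: PairTable, labels: List[str]) -> Dict[str, float]:
    """score[i] = sum over j != i of u[(i, j)] * p[(i, j)]."""
    return {
        i: sum(u.get((i, j), 0.0) * p.get((i, j), 0.0) for j in labels if j != i)
        for i in labels
    }


def rank_labels(scores: Dict[str, float]) -> List[str]:
    """Labels in descending order of recommendation score."""
    return sorted(scores, key=scores.get, reverse=True)


labels = ["welding", "painting", "inspection"]
u = symmetric({("welding", "painting"): 0.5, ("welding", "inspection"): 0.25,
               ("painting", "inspection"): 0.125})
p = symmetric({("welding", "painting"): 0.5, ("welding", "inspection"): 0.5,
               ("painting", "inspection"): 0.5})
scores = recommendation_scores(u, p, labels)
ranking = rank_labels(scores)
print(scores, ranking)
```

A label scores highly when it is often left unclassified together with other labels that it is also likely to co-occur with, which is why a new classifier for that label is expected to resolve many undecided items at once.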

The recommendation execution unit 309 sends the recommendation score information 209 to the user terminal 101 for display, recommending that the data classifier additionally create classifiers for the labels with high recommendation scores. The data classifier, in turn, creates additional classifiers starting with ones that correctly classify the labels with high recommendation scores. The user terminal 101 delivers the classifiers created by the data classifier to the classifier receiving unit 301 of the training data generation server 102. Adding suitable classifiers in this way allows the integrated classifier 303 to classify more data with fewer classifiers.

When an unclassified rate has already been calculated previously, the unclassified rate calculation unit 304 in the present embodiment calculates, for each label, the amount by which the unclassified rate changed as a result of the additional classifiers, and determines that adding classifiers for a label is not effective when the change is at or below a predetermined threshold. Based on this determination, the unclassified rate calculation unit 304 may send the label name to the recommendation score calculation unit 307 so that the label's recommendation score is multiplied by a predetermined coefficient (such as 0.8) when it is calculated, thereby lowering it.
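The score reduction for labels whose unclassified rate barely moved can be sketched as follows. The coefficient 0.8 comes from the text; the minimum-change threshold of 0.05, the function name, and the sample values are assumptions.

```python
from typing import Dict


def apply_stagnation_penalty(
    scores: Dict[str, float],
    previous_rates: Dict[str, float],
    current_rates: Dict[str, float],
    min_change: float = 0.05,
    coefficient: float = 0.8,
) -> Dict[str, float]:
    """Scale down scores of labels whose unclassified rate barely changed."""
    adjusted = {}
    for label, score in scores.items():
        change = abs(previous_rates[label] - current_rates[label])
        # A change at or below the threshold means the added classifiers did not help.
        adjusted[label] = score * coefficient if change <= min_change else score
    return adjusted


scores = {"welding": 0.5, "painting": 0.25}
previous_rates = {"welding": 0.5, "painting": 0.75}
current_rates = {"welding": 0.5, "painting": 0.25}
adjusted = apply_stagnation_penalty(scores, previous_rates, current_rates)
print(adjusted)
```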

There may be multiple data classifiers, and a user information document group 210 (held by the unlabeled data management server 104) containing descriptions of the data classifiers (affiliation, skills, and so on) may be available as part of the unlabeled data 201. In that case, these descriptions can be classified by the integrated classifier 303 in the same way as the rest of the unlabeled data 201 to obtain the user information classification result 211.

Further, when the user information classification result 211 includes a data classifier who has been classified into the same label as the label to be recommended, the recommendation execution unit 309 sends information requesting the addition of a classifier for that label, that is, outputs the recommendation information, to that data classifier's user terminal 101. This makes it possible to ask a data classifier who is knowledgeable about the label to create the classifier.

When the unlabeled data 201 consists of diverse descriptive texts, the data set combining the unlabeled data 201 and the document group 207 may be divided into K clusters, with the unclassified rate calculated for each cluster, in order to improve processing efficiency. A method such as K-means may be used to divide the data set into clusters.

When the unclassified rate is calculated for each cluster as described above, the simultaneous unclassified rate calculation unit 305 calculates the simultaneous unclassified rate (described later) for each cluster and stores the simultaneous unclassified rate u_{ijk} for the k-th cluster in the simultaneous unclassified rate information 206. Likewise, the simultaneous classification probability calculation unit 306 calculates the simultaneous classification probability for each cluster and stores the simultaneous classification probability p_{ijk} for the k-th cluster in the simultaneous classification probability information 208. The recommendation score calculation unit 307 then calculates the recommendation score for label i as Σ_k Σ_j u_{ijk} p_{ijk}, summing over all clusters.
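As a rough sketch of the cluster-wise sum Σ_k Σ_j u_{ijk} p_{ijk}, the following assumes that the per-cluster values are held in nested dictionaries keyed by cluster index and by (label i, label j) pair. The layout, the function name, and the sample numbers are hypothetical.

```python
def cluster_recommendation_score(u, p, label_i):
    """Sum u_ijk * p_ijk over all clusters k and all partner labels j.

    u[k][(i, j)] is the simultaneous unclassified rate in cluster k and
    p[k][(i, j)] the simultaneous classification probability in cluster k
    (both layouts are assumptions made for illustration).
    """
    total = 0.0
    for k, rates in u.items():
        for (i, j), rate in rates.items():
            if i == label_i:
                total += rate * p[k].get((i, j), 0.0)
    return total


pair = ("crack", "defective product")
u = {0: {pair: 0.30}, 1: {pair: 0.10}}
p = {0: {pair: 0.8}, 1: {pair: 0.5}}
score = cluster_recommendation_score(u, p, "crack")  # 0.30*0.8 + 0.10*0.5
```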

--- Hardware configuration ---
FIG. 2 shows the hardware configuration of the classifier creation recommendation server 103, the main component of the data classification system 100 of the present embodiment.

The classifier creation recommendation server 103 of the present embodiment includes: a storage device 401 composed of a suitable non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive; a memory 404 composed of a volatile storage element such as RAM; an arithmetic unit 403 such as a CPU, which executes the program 402 held in the storage device 401 (for example, by reading it into the memory 404) to perform overall control of the device as well as various determination, calculation, and control processes; and a communication device 405 that connects to a network 406 and handles communication with other devices (the learning data generation server 102, the user terminal 101, the document group management server 105, and so on).

In addition to the program 402, which implements the functions required of the classifier creation recommendation server in the data classification system 100 of the present embodiment, the storage device 401 stores the unclassified rate information 205, the simultaneous unclassified rate information 206, the simultaneous classification probability information 208, the recommendation score information 209, and the user information classification result 211. Details of this information are described later.

When the arithmetic unit 403 executes the program 402, the unclassified rate calculation unit 304, the simultaneous unclassified rate calculation unit 305, the simultaneous classification probability calculation unit 306, the recommendation score calculation unit 307, the classifier generation unit 308, and the recommendation execution unit 309 are implemented. Details of how these functional units operate are also described later.

--- Data structure example ---
Next, the data used by the classifier creation recommendation server 103 and the learning data generation server 102, which constitute the data classification system 100 of the present embodiment, will be described.

FIG. 3 is a diagram showing a configuration example of the unlabeled data 201 in the present embodiment. The unlabeled data 201 is a collection of records in which the value of a feature amount 201b is associated with a data ID 201a, a numerical value or character string that uniquely identifies each piece of unlabeled data and serves as the key.

The feature amount 201b is a value representing the characteristics of the data to be classified, or of data derived from it, and may take any form, such as a character string or a numerical value.

FIG. 4 shows a configuration example of the classifier management information 202 in the present embodiment. The classifier management information 202 is a collection of records keyed by a classifier ID 202a, a numerical value or character string that uniquely identifies a classifier. Each record associates the key with a classifier execution method 202b, which indicates how to run classification with that classifier, and a classification target label 202c, which indicates the label that the classifier execution method 202b classifies.

The classification target label 202c may take multiple values depending on the nature of the classifier.
Records of the classifier management information 202 accumulate either as values created by a data classifier are acquired from the user terminal 101 through the classifier receiving unit 301, or as they are generated automatically by the classifier generation unit 308.

FIG. 5 shows a configuration example of the classification result information 203 of the present embodiment. The classification result information 203 is a collection of records keyed by the data ID 201a, which is identical to the data ID in the unlabeled data 201 (meaning that the record holds the classification result for that same unlabeled data 201). Each record associates the key with the feature amount 201b of the unlabeled data 201 and the classification probabilities 203-1 to 203-N calculated by each of N classifiers (N is a natural number).

Each of the classification probabilities 203-1 to 203-N is the probability, calculated by the corresponding classifier, that a record of the unlabeled data 201 is classified into the classification target label 202c (held in the classifier management information 202). An entry may hold multiple probability values, according to the number of values in the classification target label 202c.

FIG. 6 shows a configuration example of the learning data 204 of the present embodiment. The learning data 204 is a collection of records keyed by the data ID 201a, each associating the key with the feature amount 201b of the unlabeled data 201 and a classification probability 204c, calculated by the integrated classifier 303, of that unlabeled data 201 for each label. The classification probability 204c is a vector whose elements are the classification probabilities for the individual labels.

FIG. 7 shows a configuration example of the unclassified rate information 205 of the present embodiment. The unclassified rate information 205 is a collection of records keyed by a label name 205a, each associating the key with a number of classification target data 205b, a number of unclassified data 205c, and an unclassified rate 205d.

The number of classification target data 205b is the number of data, among all data in the learning data 204, for which it must be determined whether they are classified into the label named by the label name 205a.

The number of unclassified data 205c is the number of data, among the classification target data counted in 205b, for which that determination has not yet been made (that is, unclassified data).

The unclassified rate 205d is calculated by the unclassified rate calculation unit 304 as the number of unclassified data 205c divided by the number of classification target data 205b, and indicates the proportion of unclassified data.

FIG. 8 shows a configuration example of the simultaneous unclassified rate information 206 of the present embodiment. The simultaneous unclassified rate information 206 is a collection of records that associate a label i 206a, a label j 206b, and a simultaneous unclassified rate 206c.

The label i 206a and the label j 206b are both character strings indicating label names, and the simultaneous unclassified rate information 206 holds a record for every combination of two labels.

The simultaneous unclassified rate 206c is the proportion of data for which both the label i 206a and the label j 206b are unclassified, and is calculated by the simultaneous unclassified rate calculation unit 305.

In the present embodiment, the simultaneous unclassified rate 206c is calculated for every combination of two labels (label i and label j), but it may instead be calculated for every combination of three or more labels, for example label i, label j, and label k.

FIG. 9 shows a configuration example of the document group 207 of the present embodiment. The document group 207 is a collection of records containing a document ID 207a and document content 207b.
The document ID 207a is a numerical value or character string that uniquely identifies the document. The document content 207b is a character string holding the content of the document.

As already mentioned, the document group 207 and its document content 207b may include in-house documents such as the data analysis reports of the business entity described above, documents published on the Internet, and documents from the unlabeled data.

FIG. 10 shows a configuration example of the simultaneous classification probability information 208 of the present embodiment. The simultaneous classification probability information 208 is a collection of records keyed by the label i 206a and the label j 206b, each associating the key with a simultaneous classification probability 208c.

The simultaneous classification probability 208c is the probability that the label i 206a and the label j 206b are both assigned to the same data, and is calculated by the simultaneous classification probability calculation unit 306.

FIG. 11 shows a configuration example of the recommendation score information 209 of the present embodiment. The recommendation score information 209 is a collection of records keyed by a label name 209a, each associating the key with a recommendation score 209b and a recommendation rank 209c.
The recommendation score 209b is the recommendation score that the recommendation score calculation unit 307 calculated for the label given in the label name 209a.

The recommendation rank 209c is the order in which classifier creation is recommended, determined in descending order of the recommendation score 209b. Through the user terminal 101, classifier creation is recommended to the data classifier starting from the label with the smallest recommendation rank 209c.
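Deriving the recommendation rank from the scores amounts to a descending sort. The following is a minimal sketch under the assumption that scores are held in a dictionary keyed by label name; the function name, record layout, and sample scores are hypothetical.

```python
def rank_labels(scores):
    """Return records of (label, score, rank), with rank 1 for the highest score."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"label": label, "score": score, "rank": rank}
        for rank, (label, score) in enumerate(ordered, start=1)
    ]


scores = {"crack": 0.29, "defective product": 0.41, "scratch": 0.05}
ranking = rank_labels(scores)
# Classifier creation would be recommended starting from ranking[0].
```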

FIG. 12 shows a configuration example of the user information document group 210 of the present embodiment. The user information document group 210 is a collection of records keyed by a user ID 210a, each associating the key with a user information document 210b and a user contact 210c.
The user ID 210a is a numerical value or character string that uniquely identifies the user information document 210b.
The user information document 210b is a document describing the user's work experience and skills, and is either free-form natural-language text or formatted character string data.

The user contact 210c is a character string or numerical value representing the contact information of the user described in the user information document 210b; specifically, it consists of the user's e-mail address, telephone number, or the like.

FIG. 13 shows a configuration example of the user information classification result 211 of the present embodiment. The user information classification result 211 is a collection of records keyed by the user ID 210a, each associating the key with the user contact 210c and a label 211c.
The label 211c is the label output by the integrated classifier 303 when it classifies the user information document group 210 as input.

--- Flow example 1 ---
The actual procedure of the data classification method in the present embodiment is described below with reference to the drawings. The operations corresponding to the data classification method are realized by programs that the learning data generation server 102 and the classifier creation recommendation server 103, which constitute the data classification system 100, each read into memory or the like and execute. These programs consist of code for performing the operations described below.
FIG. 14 is a diagram showing flow example 1 of the learning data generation method in the present embodiment; specifically, it is a flowchart showing the operation of the learning data generation server 102.
In this flow, the learning data generation server 102 starts processing upon receiving a classifier addition request from the user terminal 101 or the classifier creation recommendation server 103.

In this case, the classifier receiving unit 301 of the learning data generation server 102 receives information on the classifier to be added from the user terminal 101 or the classifier creation recommendation server 103, generates a record containing that information, and stores the record in the classifier management information 202 (S101).

As described with reference to FIG. 4, the classifier information in this record includes the classifier execution method 202b and the classification target label 202c. The classifier ID 202a is assigned by incrementing its value each time a record is added.
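Step S101 can be pictured as appending a record with an auto-incremented ID. The sketch below is illustrative only: the class names, field names, and the sample execution-method strings are assumptions rather than anything defined in the specification.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class ClassifierRecord:
    classifier_id: int        # corresponds to classifier ID 202a
    execution_method: str     # corresponds to classifier execution method 202b
    target_labels: list       # corresponds to classification target label 202c


class ClassifierRegistry:
    """Hypothetical in-memory stand-in for the classifier management information 202."""

    def __init__(self):
        self._next_id = count(1)  # IDs are incremented on every added record
        self.records = []

    def add(self, execution_method, target_labels):
        record = ClassifierRecord(next(self._next_id), execution_method, list(target_labels))
        self.records.append(record)
        return record


registry = ClassifierRegistry()
first = registry.add("python crack_rule.py", ["crack"])
second = registry.add("python defect_model.py", ["defective product", "scratch"])
```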

Next, the classifier execution unit 302 of the learning data generation server 102 reads the value of the classifier execution method 202b from the record newly stored in the classifier management information 202 in S101, runs the classifier according to the method described there to classify the unlabeled data 201, and stores the result in the classification result information 203 (S102).

In terms of the classification result information 203 illustrated in FIG. 5, a column for the classifier to be added (the one newly added to the classifier management information 202 in S101) is appended to the classifier columns 203-1 to 203-N, and the classification result produced by that classifier (for example, "0.80" for data ID "1") is added to the record for each piece of unlabeled data 201.

The learning data generation server 102 then trains the integrated classifier 303 from the classification result information 203 (S103). The training takes the feature amount 201b in the classification result information 203 as input and uses the values of the classification probabilities 203-1 to 203-N as teacher data; any suitable existing training method may be adopted.

Next, the learning data generation server 102 classifies the unlabeled data 201 with the integrated classifier 303 trained in S103, stores the classification result as the learning data 204 (S104), and ends the process.

In step S104, the learning data generation server 102 also, for example, uses the integrated classifier 303 to classify the descriptions of data classifiers (affiliation, skills, and so on) contained in the unlabeled data 201 in the same way as other unlabeled data 201, and obtains and stores the user information classification result 211.

--- Flow example 2 ---
FIG. 15 is a diagram showing a flow example of the classifier creation recommendation method in the present embodiment; specifically, it is a flowchart showing the operation of the classifier creation recommendation server 103.

Next, the flow that the classifier creation recommendation server 103 executes when the learning data 204 has been updated by the learning data generation server 102 is described.

In this case, the unclassified rate calculation unit 304 of the classifier creation recommendation server 103 acquires the learning data 204 from the learning data generation server 102, calculates the unclassified rate for each label in the learning data 204, and stores the result in the unclassified rate information 205 (S201).

Here, the unclassified rate calculation unit 304 calculates the number of unclassified data for each label and divides it by the number of classification target data (the number of data in the learning data) to obtain the unclassified rate for that label.

For example, to obtain the number of unclassified data for the "crack" label, the unit identifies the records of the learning data 204 whose "crack" value in the classification probability 204c vector meets a predetermined criterion (for example, 0.6 or less), and counts those records, that is, the data for which the "crack" label is unclassified, arriving at a count such as "30".

The unclassified rate for the "crack" label is then calculated by dividing this unclassified count of "30" by the number of classification target data (the total number of data in the learning data 204, for example "100"), giving a value such as "0.3".
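The worked example above (30 unclassified records out of 100) can be reproduced with a short sketch. The record layout and function name are assumptions, and the criterion "probability of 0.6 or less" simply follows the example value in the text; the synthetic data is fabricated only to match the counts in the example.

```python
def unclassified_stats(learning_data, label, is_unclassified=lambda prob: prob <= 0.6):
    """Return (unclassified count, target count, unclassified rate) for one label.

    learning_data is assumed to be a list of records whose "probs" entry maps
    label names to classification probabilities (a stand-in for vector 204c).
    """
    total = len(learning_data)
    unclassified = sum(1 for record in learning_data if is_unclassified(record["probs"][label]))
    rate = unclassified / total if total else 0.0
    return unclassified, total, rate


# Synthetic data: 30 records at 0.5 (unclassified) and 70 records at 0.9.
learning_data = [{"id": n, "probs": {"crack": 0.5 if n < 30 else 0.9}} for n in range(100)]
count_unclassified, count_total, rate = unclassified_stats(learning_data, "crack")
```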

Next, the learning data generation server 102 determines whether the unclassified rate calculated in S201 exceeds, for any label, a predetermined threshold specified in advance by the data classifier (for example, 0.2) (S202).

If the determination finds that the unclassified rate is not at or above the threshold (S202: n), that is, the unclassified rate calculated in S201 is below the threshold for every label, the server concludes that the integrated classifier 303 has classified a sufficient amount of the unlabeled data 201 and that no classifier needs to be added, and ends the process.

If, on the other hand, the unclassified rate is at or above the threshold (S202: y), that is, the unclassified rate exceeds the threshold for at least one label, the unclassified rate calculation unit 304 sends the unclassified rate information 205 to the simultaneous unclassified rate calculation unit 305, which calculates the simultaneous unclassified rate u_{ij} (S203).
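The branch at S202 reduces to checking whether any label's unclassified rate is over the threshold. The sketch below uses a strict comparison, which is an assumption (the text uses both "exceeds" and "at or above"); the function name and sample rates are hypothetical, while the threshold 0.2 follows the example in the text.

```python
def labels_over_threshold(unclassified_rates, threshold=0.2):
    """Return the labels whose unclassified rate exceeds the threshold."""
    return [label for label, rate in unclassified_rates.items() if rate > threshold]


rates = {"crack": 0.30, "defective product": 0.15}
pending = labels_over_threshold(rates)
# S202: y when at least one label is over the threshold, otherwise the flow ends.
needs_more_classifiers = bool(pending)
```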

Upon receiving the unclassified rate information 205, the simultaneous unclassified rate calculation unit 305 calculates the simultaneous unclassified rate u_{ij} as the proportion of data for which both, for example, the i-th label (synonymous with label i; the same applies below) and the j-th label (synonymous with label j; the same applies below) are unclassified, and stores u_{ij} in the simultaneous unclassified rate information 206.

For example, data for which both the "crack" label and the "defective product" label are unclassified are identified as the records of the learning data 204 whose "crack" and "defective product" values in the classification probability 204c vector both meet the predetermined criterion (for example, 0.6 or less). The number of such records, that is, the number of data for which both labels are unclassified, is counted, giving a value such as "28".

The simultaneous unclassified rate, at which both the "crack" and "defective product" labels are unclassified, is then calculated by dividing this count of "28" by the number of classification target data (the total number of data in the learning data 204, for example "100"), giving a value such as "0.28".
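The same counting idea extends to pairs of labels. This sketch reproduces the 28-out-of-100 example; the record layout, function name, and the "0.6 or less" criterion are assumptions based on the example values in the text, and the synthetic data is fabricated only to match those counts.

```python
def simultaneous_unclassified_rate(learning_data, label_i, label_j,
                                   is_unclassified=lambda prob: prob <= 0.6):
    """Proportion of records in which both label_i and label_j are unclassified."""
    total = len(learning_data)
    both = sum(
        1 for record in learning_data
        if is_unclassified(record["probs"][label_i])
        and is_unclassified(record["probs"][label_j])
    )
    return both / total if total else 0.0


def make_record(crack, defective):
    return {"probs": {"crack": crack, "defective product": defective}}


# 28 records with both labels unclassified, 2 with only "crack" unclassified, 70 classified.
learning_data = (
    [make_record(0.5, 0.5) for _ in range(28)]
    + [make_record(0.5, 0.9) for _ in range(2)]
    + [make_record(0.9, 0.9) for _ in range(70)]
)
u_ij = simultaneous_unclassified_rate(learning_data, "crack", "defective product")
```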

In response to the same determination result (S202: y), the simultaneous classification probability calculation unit 306 of the classifier creation recommendation server 103 reads the document group 207 from the document group management server 105, calculates the co-occurrence probability of the words contained in the i-th label ("crack" in the example above) and the j-th label ("defective product" in the example above) (S204), and stores the result in the simultaneous classification probability information 208 as the simultaneous classification probability p_{ij}.
The processes S203 and S204 are assumed to be executed asynchronously.
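The specification does not state how the co-occurrence probability is defined. One simple reading, used here purely as an assumption, is the fraction of documents in the document group that contain both label words. The function name and sample documents are hypothetical, and plain substring matching stands in for real tokenization.

```python
def cooccurrence_probability(documents, word_i, word_j):
    """Fraction of documents containing both words (an assumed definition)."""
    if not documents:
        return 0.0
    both = sum(1 for text in documents if word_i in text and word_j in text)
    return both / len(documents)


documents = [
    "crack found near the weld, judged a defective product",
    "hairline crack on the surface",
    "defective product returned by customer",
    "routine inspection, no issue",
]
p_ij = cooccurrence_probability(documents, "crack", "defective product")
```

Other definitions, such as a conditional probability of word j given word i, would fit the text equally well; the choice affects the scale of the resulting scores.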

Next, the recommendation score calculation unit 307 of the classifier creation recommendation server 103 reads the simultaneous unclassified rate u_{ij} from the simultaneous unclassified rate information 206 and the simultaneous classification probability p_{ij} from the simultaneous classification probability information 208, calculates the recommendation score for every label i as Σ_j u_{ij} p_{ij} (S205), and stores the recommendation score in the recommendation score information 209.

For example, for the combination in which label i is "crack" and label j is "defective product", the simultaneous unclassified rate 206c of "0.30" in the corresponding record of the simultaneous unclassified rate information 206 is multiplied by the simultaneous classification probability 208c of "0.8" in the simultaneous classification probability information 208 to obtain "0.24". This calculation is performed for every combination in which label i is "crack", and the resulting products are summed to give the recommendation score.
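The score Σ_j u_{ij} p_{ij} can be sketched for all labels at once as follows. The pair-keyed dictionaries and the function name are assumptions; the values 0.30 and 0.8 come from the example in the text, while the "scratch" entries are hypothetical values added to show the summation.

```python
from collections import defaultdict


def recommendation_scores(u, p):
    """Compute sum_j u_ij * p_ij for every label i.

    u and p map (label_i, label_j) pairs to the simultaneous unclassified
    rate and the simultaneous classification probability, respectively.
    """
    scores = defaultdict(float)
    for (label_i, label_j), rate in u.items():
        scores[label_i] += rate * p.get((label_i, label_j), 0.0)
    return dict(scores)


u = {
    ("crack", "defective product"): 0.30,  # from the example in the text
    ("crack", "scratch"): 0.10,            # hypothetical
    ("scratch", "crack"): 0.10,            # hypothetical
}
p = {
    ("crack", "defective product"): 0.8,   # from the example in the text
    ("crack", "scratch"): 0.5,             # hypothetical
    ("scratch", "crack"): 0.5,             # hypothetical
}
scores = recommendation_scores(u, p)  # "crack": 0.30*0.8 + 0.10*0.5
```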

Next, the classifier generation unit 308 of the classifier creation recommendation server 103 automatically creates a classifier that, for label i, classifies data into the j-th label with the probability given by the simultaneous classification probability p_{ij} in the simultaneous classification probability information 208 (S206), and sends this classifier to the classifier receiving unit 301 of the learning data generation server 102.

Further, if the user information classification result 211 includes a user classified into the same label as a recommendation target label listed in the recommendation score information 209, the recommendation execution unit 309 sends recommendation information (screen 1000 in FIG. 16) recommending the additional creation of a classifier to that user's contact (S207), and ends the process.
S206 and S207 are assumed to be executed asynchronously.

The data classifier (the person who classifies data) views the recommendation information on the user terminal 101 and creates a classifier that correctly classifies the label indicated by the recommendation information (a label with a high score). The user terminal 101 delivers the classifier additionally created by the data classifier to the classifier receiving unit 301 of the learning data generation server 102. As appropriate classifiers are added in this way, the integrated classifier 303 becomes able to classify more data with a small number of classifiers.

Although the best mode for carrying out the present invention has been described specifically above, the present invention is not limited to it and can be modified in various ways without departing from its gist.
According to the present embodiment, classifiers for the integrated classifier can be created efficiently, and the classification accuracy of the integrated classifier is improved.

The description in this specification makes at least the following clear. In the data classification system of the present embodiment, the arithmetic unit may, in calculating the simultaneous unclassified rate, determine that a label is unclassified for a data set when the classification probability of that label included in the learning data falls within a predetermined threshold range, and calculate the simultaneous unclassified rate u_ij as the proportion of data in the learning data for which both a given label i and a given label j are unclassified; in calculating the simultaneous classification probability, calculate the simultaneous classification probability p_ij as the co-occurrence probability, within a predetermined document group, of the words corresponding to the label names of label i and label j; and in calculating the recommendation score, calculate the recommendation score for label i as Σ_j u_ij p_ij.
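
A minimal sketch of the two quantities defined above, under stated assumptions. The threshold range (0.3 to 0.7), the per-data-set dictionary of classification probabilities, and the reading of "co-occurrence probability" as the fraction of documents containing both label words are all assumptions chosen for illustration. The description does not fix these details in this passage.

```python
from itertools import permutations


def is_unclassified(prob, low=0.3, high=0.7):
    """Treat a label as unclassified when its probability lies in [low, high]."""
    return low <= prob <= high


def simultaneous_unclassified_rate(rows, labels, low=0.3, high=0.7):
    """u[(i, j)]: fraction of data sets where both label i and label j are unclassified."""
    n = len(rows)
    rates = {}
    for i, j in permutations(labels, 2):
        both = sum(
            1 for row in rows
            if is_unclassified(row[i], low, high) and is_unclassified(row[j], low, high)
        )
        rates[(i, j)] = both / n if n else 0.0
    return rates


def cooccurrence_probability(documents, labels):
    """p[(i, j)]: fraction of documents containing the words for both labels."""
    n = len(documents)
    probs = {}
    for i, j in permutations(labels, 2):
        both = sum(1 for doc in documents if i in doc and j in doc)
        probs[(i, j)] = both / n if n else 0.0
    return probs


labels = ["crack", "defective"]
# Hypothetical classification probabilities output for four data sets.
rows = [
    {"crack": 0.50, "defective": 0.60},
    {"crack": 0.90, "defective": 0.50},
    {"crack": 0.40, "defective": 0.10},
    {"crack": 0.55, "defective": 0.45},
]
# Hypothetical document group.
documents = [
    "crack found, unit marked defective",
    "surface crack observed",
    "defective solder joint",
    "no issue",
]

u = simultaneous_unclassified_rate(rows, labels)
p = cooccurrence_probability(documents, labels)
```

Simple substring matching stands in for whatever word matching the actual system uses. A real document group would normally need tokenization.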

This makes the calculation of the simultaneous unclassified rate and the simultaneous classification probability efficient, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the arithmetic unit may further execute a process of acquiring, from an input device, a classifier additionally created by a predetermined user based on the recommendation information, adding that classifier to the plurality of classifiers to form a classifier group, and re-creating both new learning data that integrates the classification results of each classifier in the classifier group and an integrated classifier trained on that learning data. For each unlabeled data set in the new learning data, the arithmetic unit calculates, based on the feature amount and the classification probability output by the integrated classifier trained on the new learning data, an unclassified rate as the proportion of data sets for which a label included in the new learning data is unclassified. The arithmetic unit repeats the output of the recommendation information, the re-creation of the integrated classifier and the learning data, and the calculation of the unclassified rate until the unclassified rate falls below a predetermined criterion.
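
The repeat-until loop described above can be outlined as follows. This is a sketch only: `measure_rate`, `recommend`, and `add_classifier` are hypothetical callables standing in for the unclassified rate calculation, the recommendation output, and the re-creation of the learning data and integrated classifier. The `max_rounds` guard is an added safety assumption, not part of the description.

```python
def refine_until_covered(measure_rate, recommend, add_classifier,
                         threshold, max_rounds=10):
    """Repeat recommend -> add classifier -> re-measure until rate < threshold.

    Returns the list of unclassified rates observed, starting with the
    initial measurement.
    """
    history = [measure_rate()]
    rounds = 0
    while history[-1] >= threshold and rounds < max_rounds:
        label = recommend()            # label whose classifier should be created
        add_classifier(label)          # user-created classifier joins the group
        history.append(measure_rate()) # rate after re-creating data and model
        rounds += 1
    return history


# Toy stand-ins: preset rates and a fixed queue of recommended labels.
rates = iter([0.4, 0.25, 0.1])
queue = ["crack", "scratch", "dent"]
added = []

history = refine_until_covered(
    measure_rate=lambda: next(rates),
    recommend=lambda: queue.pop(0),
    add_classifier=added.append,
    threshold=0.2,
)
```

In this toy run the loop stops after two added classifiers, once the measured rate drops below the threshold of 0.2.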

This makes the labeling of each data set efficient and free of omissions, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the arithmetic unit may further execute a process of multiplying the recommendation score of the label targeted by the recommendation information by a predetermined coefficient that lowers that label's rank among the recommendation scores, when the value of the unclassified rate is unchanged or has increased across the re-creation of the new learning data and the integrated classifier that accompanies the addition of the classifier.
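
The score adjustment above might look like the following sketch. The coefficient 0.5, the function name, and the example scores are assumptions. The description only requires a predetermined coefficient that lowers the label's rank.

```python
def penalize_if_no_gain(scores, label, rate_before, rate_after, coefficient=0.5):
    """Scale down the score of `label` when adding its classifier did not help.

    The score is multiplied by `coefficient` only if the unclassified rate
    stayed the same or increased; otherwise scores are returned unchanged.
    """
    updated = dict(scores)
    if rate_after >= rate_before:
        updated[label] = scores[label] * coefficient
    return updated


scores = {"crack": 0.29, "scratch": 0.20}

# Unclassified rate did not improve after adding a "crack" classifier.
stalled = penalize_if_no_gain(scores, "crack", rate_before=0.25, rate_after=0.25)
# Unclassified rate improved, so no penalty is applied.
improved = penalize_if_no_gain(scores, "crack", rate_before=0.25, rate_after=0.10)
```

After the penalty, "scratch" outranks "crack", so the next recommendation moves to a different label instead of repeating one that brought no improvement.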

This makes it possible to appropriately eliminate any adverse effects of adding a classifier, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the arithmetic unit may further execute a process of referring to the simultaneous classification probability for a given label i identified as the target of the recommendation information, identifying each label j whose simultaneous classification probability with label i is equal to or greater than a predetermined criterion, automatically generating a classifier that, when label i is assigned, classifies the data into label j with that simultaneous classification probability, and adding the automatically generated classifier to the plurality of classifiers.
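
One possible reading of this automatic generation is sketched below. It assumes that a generated classifier outputs the classification probability p_ij for label j whenever label i is already assigned, and nothing otherwise. The function names, the threshold 0.5, and the "scratch" pair are hypothetical.

```python
def make_cooccurrence_classifier(label_i, label_j, p_ij):
    """Build a classifier that proposes label_j with probability p_ij
    whenever label_i is among the labels already assigned."""
    def classify(assigned_labels):
        return {label_j: p_ij} if label_i in assigned_labels else {}
    return classify


def generate_classifiers(label_i, p, threshold):
    """Create one classifier per label j with p[(label_i, j)] >= threshold."""
    return [
        make_cooccurrence_classifier(i, j, prob)
        for (i, j), prob in p.items()
        if i == label_i and prob >= threshold
    ]


# 0.8 is the example value from the description; the second pair is hypothetical.
p = {("crack", "defective"): 0.8, ("crack", "scratch"): 0.2}
classifiers = generate_classifiers("crack", p, threshold=0.5)

hit = classifiers[0]({"crack"})   # data already labeled "crack"
miss = classifiers[0]({"dent"})   # label i absent, so nothing is proposed
```

Only the ("crack", "defective") pair passes the threshold here, so a single classifier is generated.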

This enables the automatic generation of classifiers with few omissions, based on keywords with a high co-occurrence probability, that is, on the relationships between labels, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the arithmetic unit may, in calculating the simultaneous classification probability, calculate it from the sentences included in the feature amounts of the learning data and the plurality of labels assigned to the learning data, in addition to the document group.

This makes it possible to calculate the simultaneous classification probability from the existing learning data even when no document group (for example, in-house technical documents) has been prepared in advance, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the storage device may further store a user information document group that records, for each user who may additionally create a classifier, a description indicating that the user is involved in a predetermined event, together with contact information. The arithmetic unit may match the word corresponding to the label name of the identified label against the user information document group, identify a user whose description contains that word, and output the recommendation information to that user's contact address.
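
The user lookup above can be illustrated with a short sketch. The record fields `description` and `contact`, the sample users, and the use of plain substring matching are all assumptions made for the example.

```python
def find_recipients(label_word, user_documents):
    """Return the contacts of users whose description mentions label_word."""
    return [
        doc["contact"]
        for doc in user_documents
        if label_word in doc["description"]
    ]


# Hypothetical user information document group.
user_documents = [
    {"description": "Handled the crack inspection on line 2",
     "contact": "a@example.com"},
    {"description": "Paint booth maintenance",
     "contact": "b@example.com"},
]

recipients = find_recipients("crack", user_documents)
```

The recommendation information for the label "crack" would then be sent only to the users returned in `recipients`.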

This makes it possible to send the recommendation information to a suitable user who should be encouraged to create a classifier, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

Further, in the data classification system of the present embodiment, the arithmetic unit may cluster the feature amounts of the data sets included in the learning data and the feature amounts of the document group; in calculating the simultaneous unclassified rate, calculate the simultaneous unclassified rate u_ijk as the proportion of data sets belonging to a given cluster k for which both a given label i and a given label j are unclassified; in calculating the simultaneous classification probability, calculate the simultaneous classification probability p_ijk as the co-occurrence probability of the words corresponding to the label names of label i and label j within the documents of the document group that belong to the given cluster k; and in calculating the recommendation score, calculate the recommendation score for label i as Σ_k Σ_j u_ijk p_ijk.
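
The clustered score Σ_k Σ_j u_ijk p_ijk can be sketched as a sum over (i, j, k) triples. The cluster names "press" and "paint" and all numeric values are hypothetical. The clustering step itself is assumed to have been done beforehand.

```python
def clustered_recommendation_scores(u, p):
    """Return {label_i: sum over k and j of u[(i, j, k)] * p[(i, j, k)]}.

    u maps (label_i, label_j, cluster_k) to the per-cluster simultaneous
    unclassified rate; p maps the same key to the per-cluster simultaneous
    classification probability. Triples missing from p contribute 0.
    """
    scores = {}
    for (i, j, k), rate in u.items():
        scores[i] = scores.get(i, 0.0) + rate * p.get((i, j, k), 0.0)
    return scores


# Hypothetical per-cluster values for two production-line process clusters.
u = {
    ("crack", "defective", "press"): 0.30,
    ("crack", "defective", "paint"): 0.10,
}
p = {
    ("crack", "defective", "press"): 0.8,
    ("crack", "defective", "paint"): 0.5,
}

scores = clustered_recommendation_scores(u, p)
```

Keeping the per-cluster products before summing also makes it possible to see which cluster contributes most to a label's score.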

This makes it possible to recommend classifiers according to the labels that should be added for each cluster, such as each type of production line process, which in turn makes the creation of classifiers for the integrated classifier more efficient and improves the classification accuracy of the integrated classifier.

100 Data classification system
101 User terminal
102 Learning data generation server (predetermined device)
103 Classifier creation recommendation server (data classification device)
104 Unlabeled data management server
105 Document group management server
201 Unlabeled data
202 Classifier management information
203 Classification result
204 Learning data
205 Unclassified rate information
206 Simultaneous unclassified rate information
207 Document group
208 Simultaneous classification probability information
209 Recommendation score information
210 User information document group
211 User information classification result
301 Classifier receiving unit
302 Classifier execution unit
303 Integrated classifier
304 Unclassified rate calculation unit
305 Simultaneous unclassified rate calculation unit
306 Simultaneous classification probability calculation unit
307 Recommendation score calculation unit
308 Classifier generation unit
309 Recommendation execution unit
401 Storage device
402 Program
403 Arithmetic unit
404 Memory
405 Communication device

Claims

A data classification system comprising:
a storage device that stores learning data which integrates the classification results of a predetermined data set by each of a plurality of classifiers and defines the correspondence between the feature amounts of the data set and the labels that are the classification results; and
an arithmetic unit that executes a process of calculating a simultaneous unclassified rate, which is the probability that a plurality of labels are unclassified for one data set, and a simultaneous classification probability, which is the probability that one data set is classified into a plurality of labels, a process of calculating a recommendation score by summing, for each label, the products of the simultaneous unclassified rate and the simultaneous classification probability, and a process of identifying labels in descending order of the recommendation score as labels for which a classifier should be additionally created and outputting recommendation information for those labels to a predetermined device.

The data classification system according to claim 1, wherein the arithmetic unit:
in calculating the simultaneous unclassified rate, determines that a label is unclassified for a data set when the classification probability of that label included in the learning data falls within a predetermined threshold range, and calculates the simultaneous unclassified rate u_ij as the proportion of data in the learning data for which both a given label i and a given label j are unclassified;
in calculating the simultaneous classification probability, calculates the simultaneous classification probability p_ij as the co-occurrence probability, within a predetermined document group, of the words corresponding to the label names of label i and label j; and
in calculating the recommendation score, calculates the recommendation score for label i as Σ_j u_ij p_ij.

The data classification system according to claim 1, wherein the arithmetic unit:
further executes a process of acquiring, from an input device, a classifier additionally created by a predetermined user based on the recommendation information, adding that classifier to the plurality of classifiers to form a classifier group, and re-creating new learning data that integrates the classification results of each classifier in the classifier group and an integrated classifier trained on that learning data; and
for each unlabeled data set in the new learning data, calculates, based on the feature amount and the classification probability output by the integrated classifier trained on the new learning data, an unclassified rate as the proportion of data sets for which a label included in the new learning data is unclassified, and repeats the output of the recommendation information, the re-creation of the integrated classifier and the learning data, and the calculation of the unclassified rate until the unclassified rate falls below a predetermined criterion.

The data classification system according to claim 3, wherein the arithmetic unit:
further executes a process of multiplying the recommendation score of the label targeted by the recommendation information by a predetermined coefficient that lowers that label's rank among the recommendation scores, when the value of the unclassified rate is unchanged or has increased across the re-creation of the new learning data and the integrated classifier that accompanies the addition of the classifier.

The data classification system according to claim 1, wherein the arithmetic unit:
further executes a process of referring to the simultaneous classification probability for a given label i identified as the target of the recommendation information, identifying a label j whose simultaneous classification probability with label i is equal to or greater than a predetermined criterion, automatically generating a classifier that, when label i is assigned, classifies the data into label j with that simultaneous classification probability, and adding the automatically generated classifier to the plurality of classifiers.

The data classification system according to claim 2, wherein the arithmetic unit:
in calculating the simultaneous classification probability, calculates the simultaneous classification probability from the sentences included in the feature amounts of the learning data and the plurality of labels assigned to the learning data, in addition to the document group.

The data classification system according to claim 1, wherein:
the storage device further stores a user information document group that records, for each user who may additionally create a classifier, a description indicating that the user is involved in a predetermined event, together with contact information; and
the arithmetic unit matches the word corresponding to the label name of the identified label against the user information document group, identifies a user whose description contains that word, and outputs the recommendation information to that user's contact address.

The data classification system according to claim 2, wherein the arithmetic unit:
clusters the feature amounts of the data sets included in the learning data and the feature amounts of the document group;
in calculating the simultaneous unclassified rate, calculates the simultaneous unclassified rate u_ijk as the proportion of data sets belonging to a given cluster k for which both a given label i and a given label j are unclassified;
in calculating the simultaneous classification probability, calculates the simultaneous classification probability p_ijk as the co-occurrence probability of the words corresponding to the label names of label i and label j within the documents of the document group that belong to the given cluster k; and
in calculating the recommendation score, calculates the recommendation score for label i as Σ_k Σ_j u_ijk p_ijk.

A data classification method in which an information processing system, equipped with a storage device that stores learning data which integrates the classification results of a predetermined data set by each of a plurality of classifiers and defines the correspondence between the feature amounts of the data set and the labels that are the classification results, executes:
a process of calculating a simultaneous unclassified rate, which is the probability that a plurality of labels are unclassified for one data set, and a simultaneous classification probability, which is the probability that one data set is classified into a plurality of labels;
a process of calculating a recommendation score by summing, for each label, the products of the simultaneous unclassified rate and the simultaneous classification probability; and
a process of identifying labels in descending order of the recommendation score as labels for which a classifier should be additionally created and outputting recommendation information for those labels to a predetermined device.

A data classification device comprising:
a communication device that performs communication processing with other devices via a predetermined network; and
an arithmetic unit that executes a process of accessing a predetermined device through the communication device and acquiring learning data which integrates the classification results of a predetermined data set by each of a plurality of classifiers provided in the predetermined device and defines the correspondence between the feature amounts of the data set and the labels that are the classification results, a process of calculating, based on the feature amount of each unlabeled data set in the learning data and the classification probability output by an integrated classifier trained on the classification results of the classifiers, a simultaneous unclassified rate, which is the probability that a plurality of labels are unclassified for one data set, and a simultaneous classification probability, which is the probability that one data set is classified into a plurality of labels, a process of calculating a recommendation score by summing, for each label, the products of the simultaneous unclassified rate and the simultaneous classification probability, and a process of identifying labels in descending order of the recommendation score as labels for which a classifier should be additionally created and outputting recommendation information for those labels to a predetermined device.