JP6401617B2

JP6401617B2 - Data processing apparatus, data processing method, and large-scale data processing program

Info

Publication number: JP6401617B2
Application number: JP2015001514A
Authority: JP
Inventors: 和広斉藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2015-01-07
Filing date: 2015-01-07
Publication date: 2018-10-10
Anticipated expiration: 2035-01-07
Also published as: JP2016126646A

Description

本発明は、複数のデータソースの論理的な統合データモデルを提供するデータ仮想化システムにおいて、限られたリソースの環境下で、処理対象データのヒストグラムを用いて大規模データを確実に処理可能なクエリを生成するデータ処理装置、データ処理方法及び大規模データ処理プログラムに関する。 The present invention is a data virtualization system that provides a logical integrated data model of a plurality of data sources, and can reliably process large-scale data using a histogram of data to be processed in a limited resource environment. The present invention relates to a data processing apparatus, a data processing method, and a large-scale data processing program that generate a query.

データ仮想化システム（又はマルチデータベースシステム）は、インタフェースやデータ管理方式が異なる複数のデータソースを仮想的に一つのデータベースシステムに見せるために、各データソースが持つデータを論理的に統合して管理し、ユーザのクエリに対応するデータソースにクエリを投稿する。
代表的なデータ仮想化システムは、例えば特許文献１に示されるように、複数の階層的なデータベースシステムを、データマッピングにより仮想スキーマ（実際の物理テーブルをユーザに提供する論理テーブルに変換する処理を定義したもの）に統合し、クエリ実行時において処理対象となるデータを保持するデータベースシステムにクエリを分配するよう構成されている。各データベースシステムで実行されたクエリの結果は中央に収集され、仮想スキーマに従って一つに統合して結果を出力するシステムとなっている。 A data virtualization system (or multi-database system) manages the data of each data source by logical integration so that multiple data sources with different interfaces and data management methods can be virtually viewed as one database system. And post the query to the data source corresponding to the user's query.
A typical data virtualization system, for example, as shown in Patent Document 1, converts a plurality of hierarchical database systems into a virtual schema (process that converts an actual physical table to a logical table provided to a user by data mapping). The query is distributed to a database system that holds data to be processed at the time of query execution. The results of queries executed in each database system are collected in the center, and are integrated into one according to a virtual schema to output the results.

特許文献１のように、複数のデータソースを跨がるクエリの統合処理は、データ仮想化システム上で実行する必要がある。このとき、一つ以上のデータベースシステムから得られるデータと処理後の結果データサイズが、物理メモリサイズを超えるほど大規模であった場合、実行できずにエラー終了してしまう可能性がある。
このような場合、ＯＳのスワップ機構によって対応可能であると考えられるが、クエリ処理に最適化されておらず、遅延は非常に大きい。また、サイズがスワップ領域を超えてしまった場合にも、同様にエラー終了してしまうという問題がある。 As in Patent Literature 1, it is necessary to execute a query integration process across a plurality of data sources on a data virtualization system. At this time, if the data obtained from one or more database systems and the result data size after the processing are large enough to exceed the physical memory size, there is a possibility that the processing cannot be executed and an error ends.
In such a case, it can be considered that it can be handled by the OS swap mechanism, but it is not optimized for query processing and the delay is very large. Also, when the size exceeds the swap area, there is a problem that the process ends in an error.

データ仮想化システムは、対象の複数のデータソースが持つデータの規模や想定されるクエリの種類から、適切な物理メモリサイズを想定することで構築される。しかし事業環境の変化などから生成されるデータ量やクエリの種類が変化することで、想定した物理メモリサイズを超えるメモリが必要となる場合がある。そのような変化に対応するためにリソースを増設することは、多くの時間を必要とし、事業分析等のスピードが要求される用途で利用するユーザであったとしても、増設が完了するまで必要なクエリを実行することができない。 The data virtualization system is constructed by assuming an appropriate physical memory size from the scale of data possessed by a plurality of target data sources and the type of query assumed. However, as the amount of data generated and the type of query change due to changes in the business environment, memory exceeding the assumed physical memory size may be required. Adding resources to cope with such changes requires a lot of time, and even if the user is used for applications that require speed of business analysis, etc., it is necessary until the addition is completed. The query cannot be executed.

そこで本発明者は、データ仮想化システム上の限られたリソース環境において、大規模なデータに対しても破綻することなく確実にクエリ実行するために、実行可能なデータサイズとなるようにクエリを分割して実行する手法に関する提案を行った（特許文献２）。この手法によれば、処理前後のデータサイズとデータ仮想化システムのメモリサイズを基準にクエリ分割数を計算し、分割数に応じて入力テーブルを均等に分割するようなクエリを生成している。 Therefore, the present inventor, in a limited resource environment on the data virtualization system, in order to execute a query without fail even for large-scale data, the query is executed so as to have an executable data size. The proposal regarding the method to divide and perform was performed (patent document 2). According to this technique, the number of query divisions is calculated based on the data size before and after the process and the memory size of the data virtualization system, and a query that evenly divides the input table according to the number of divisions is generated.

また、非特許文献１には、データベースシステムにおけるクエリの最適化のために、データの属性毎のヒストグラムを利用して、クエリのコストをより正確に推定するための手法として、ヒストグラムの構成手法と、クエリ処理後の出力データのヒストグラムを算出する手法が提案されている。ヒストグラムは、属性毎の特定範囲（バウンド）における値の数（カウント）と、値の種類の数（ドメイン）を、バケットと呼ばれる単位で集計することで、データの全体の統計情報だけでは分からないデータの偏りを表現することが可能となる。 Further, Non-Patent Document 1 discloses a histogram construction method as a method for more accurately estimating the cost of a query by using a histogram for each attribute of the data in order to optimize a query in a database system. A method of calculating a histogram of output data after query processing has been proposed. Histograms do not know only the overall statistical information of data by counting the number of values (count) in a specific range (bound) for each attribute and the number of value types (domain) in units called buckets. Data bias can be expressed.

特開平０７−１４１３９９号公報Japanese Patent Application Laid-Open No. 07-141399 特願２０１４−２００５９６Japanese Patent Application 2014-200596

Proceedings of the 2002 ACM SIGMOD international conference on Management of data, P263-P274Proceedings of the 2002 ACM SIGMOD international conference on Management of data, P263-P274

上述した特許文献２の手法によれば、クエリの分割数を計算し、入力データの特定の属性における範囲をその分割数で均等に分割することで、クエリを分割している。これは入力データが均等に分割されることを前提としているだけでなく、分割して取得する入力データに対する処理後のデータが、入力データと同様に均等に分割されることを前提としている。 According to the method of Patent Document 2 described above, the query is divided by calculating the number of divisions of the query and equally dividing the range of the specific attribute of the input data by the number of divisions. This is based not only on the premise that the input data is equally divided, but also on the premise that the processed data for the input data obtained by being divided is equally divided similarly to the input data.

しかしながら、実際にはデータの分散に応じて偏りが発生し、必ずしも分割対象の属性に対する分割が均等にデーブルを分割するとは限らない。処理後のデータに関しても同様であり、実際の出力データには、処理特性やデータの分散に応じて偏りが発生する。これにより、分割したクエリのうちの一部が、データ仮想化システム上のリソースを超えたデータサイズとなり、クエリを実行できなくなるという現象が生じる可能性がある。 However, in reality, bias occurs according to the distribution of data, and the division for the attribute to be divided does not always divide the table equally. The same applies to the processed data, and the actual output data is biased depending on the processing characteristics and data distribution. As a result, there is a possibility that a part of the divided queries has a data size exceeding the resources on the data virtualization system and the query cannot be executed.

本発明は上記実情に鑑みて提案されたものであり、データ仮想化システムにおいて、限られたリソース環境においても確実にクエリを実行するとともに、クエリ実行時の信頼性の向上を図るデータ処理装置、データ処理方法及び大規模データ処理プログラムを提供することを目的としている。 The present invention has been proposed in view of the above circumstances, and in a data virtualization system, a data processing device that reliably executes a query even in a limited resource environment and improves reliability during query execution, An object is to provide a data processing method and a large-scale data processing program.

上記目的を達成するため本発明のデータ処理装置は、データ仮想化システムにおいて、処理対象のデータにおける属性毎のヒストグラムを基に、ユーザが投稿したクエリの処理毎の出力データのヒストグラムを作成し、入力のバケットに対する出力のバケットの利用メモリサイズを計算することで、出力データの値の偏りを考慮した分割範囲を設定し、精度の高い分割クエリの生成を実現する。 In order to achieve the above object, the data processing apparatus of the present invention creates a histogram of output data for each processing of a query posted by a user based on a histogram for each attribute in data to be processed in a data virtualization system, By calculating the used memory size of the output bucket with respect to the input bucket, a division range is set in consideration of the deviation of the value of the output data, and a highly accurate divided query is generated.

すなわち、請求項１のデータ処理装置は、クエリ処理要求と結果受信を行うクライアントに対して、１つ以上のデータソースを利用してクエリ処理を行うデータ処理装置において、
前記各データソースに存在する複数のデータに基づいてヒストグラムを生成するヒストグラム生成部と、
前記ヒストグラムを格納する統計情報部と、
ユーザが投稿したクエリを実行するためのクエリプランを生成するクエリ評価部と、
前記クエリプランを基に、前記統計情報部に格納されたヒストグラム、又は、自身が以前に生成した中間ヒストグラムを利用して、処理毎の出力データのヒストグラムである中間ヒストグラムを生成する中間ヒストグラム生成部と、
前記統計情報部に格納されたヒストグラムと、前記中間ヒストグラム生成部が生成した中間ヒストグラムとを利用して、前記クエリプランの処理毎にバケット単位で入出力サイズを計算する利用メモリサイズ計算部と、
前記利用メモリサイズ計算部が計算したバケット単位での入出力サイズを基に、クエリの分割範囲を決定する分割範囲生成部と、
を備えてクエリの分割を行うことを特徴としている。 That is, a data processing device according to claim 1 is a data processing device that performs query processing using one or more data sources for a client that performs a query processing request and results reception.
A histogram generator that generates a histogram based on a plurality of data existing in each of the data sources;
A statistical information section for storing the histogram;
A query evaluator that generates a query plan for executing a query posted by the user;
Based on the query plan, an intermediate histogram generation unit that generates an intermediate histogram that is a histogram of output data for each process using a histogram stored in the statistical information unit or an intermediate histogram generated by itself. When,
Utilizing the histogram stored in the statistical information section and the intermediate histogram generated by the intermediate histogram generation section, a use memory size calculation section that calculates the input / output size in bucket units for each processing of the query plan,
Based on the input / output size in bucket units calculated by the used memory size calculation unit, a division range generation unit that determines a query division range;
And dividing the query.

請求項２は、請求項１のデータ処理装置において、
前記各データソースに存在する複数のデータからバケット情報を収集するバケット情報収集部を備え、
前記ヒストグラムは、前記バケット情報からデータの属性別に生成することを特徴としている。 Claim 2 is the data processing device of claim 1,
A bucket information collecting unit for collecting bucket information from a plurality of data existing in each data source;
The histogram is generated for each data attribute from the bucket information.

請求項３は、クエリ処理要求と結果受信を行うクライアントに対して、１つ以上のデータソースを利用してクエリ処理を行うためのデータ処理方法において、
バケット情報からデータの属性別にヒストグラムを生成する手順と、
当該ヒストグラムから前記クエリの処理毎の中間ヒストグラムを作成する手順と，
クエリの分割が必要な場合に前記クエリの分割範囲を指定する分割基準属性を決定する手順と、
前記分割基準属性のヒストグラムにおけるバケットを一つ選択し、当該バケットに対して処理を実行した場合の出力バケットも含めた当該バケットの利用メモリサイズを計算する手順と、
上記処理をリソースの上限となるサイズまで繰り返し、上限を超えた段階で直前に計算したバケットの上限値を、分割範囲として分割範囲リストに登録する手順と、
上記各処理について、全バケット分繰り返すことで、前記分割範囲リストから分割数を決定することを特徴としている。 A third aspect of the present invention relates to a data processing method for performing query processing using one or more data sources for a client that performs query processing request and result reception.
To generate a histogram for each data attribute from bucket information,
Creating an intermediate histogram for each processing of the query from the histogram;
Determining a split criteria attribute that specifies a split range of the query when query splitting is required;
A procedure for selecting one bucket in the histogram of the division criterion attribute and calculating the used memory size of the bucket including the output bucket when processing is performed on the bucket;
Repeat the above process up to the size of the upper limit of the resource, register the bucket upper limit value calculated immediately before the upper limit as a split range in the split range list,
About each said process, the number of division | segmentation is determined from the said division | segmentation range list | wrist by repeating for every bucket.

請求項４は、請求項３のデータ処理方法において、
前記バケットの利用メモリサイズを計算する手順は、
前記クエリの処理対象が単項演算による場合、
前記バケットのバウンドを利用して、対象の入力バケットの処理後のデータが含まれる出力バケットを取得し、
前記出力バケットにおいて、実際に入力バケットの処理後のデータが含まれている割合を前記バウンドから計算し、
前記割合を利用して、前記入力バケットに対する前記出力バケットのサイズを計算し、
これらの処理を入力バケットの値が含まれる全出力バケット分繰り返すことで、入力バケットを基準とした利用メモリサイズを計算することを特徴としている。 Claim 4 is the data processing method of claim 3,
The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is a unary operation,
Use the bound of the bucket to obtain an output bucket that contains the processed data of the target input bucket,
In the output bucket, the ratio of the actual processed data of the input bucket is calculated from the bound,
Using the percentage, calculate the size of the output bucket relative to the input bucket,
By repeating these processes for all the output buckets including the value of the input bucket, the use memory size based on the input bucket is calculated.

請求項５は、請求項３のデータ処理方法において、
前記バケットの利用メモリサイズを計算する手順は、
前記クエリの処理対象が二項演算の直積処理による場合、
処理対象の二つのデータの内、サイズの大きい方のデータの分割基準属性にある一つのバケットを基準として入力サイズを計算して利用メモリサイズの初期値とし、
もう一方の分割基準属性からバケットを取得して入力サイズを計算するとともに、二つの入力バケットから直積処理を行った場合の出力サイズを計算し、
前記利用メモリサイズに対して、前記したもう一方の分割基準属性の入力サイズと前記出力サイズを加算して新たな利用メモリサイズとし、
これらの処理を全バケット分繰り返すことで、入力バケットを基準とした利用メモリサイズを計算することを特徴としている。 Claim 5 is the data processing method of claim 3,
The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is a direct product process of binary operation,
Of the two data to be processed, calculate the input size based on one bucket in the division criterion attribute of the larger data, and use it as the initial value of the memory size used.
While obtaining the bucket from the other split criterion attribute and calculating the input size, calculate the output size when the direct product processing is performed from the two input buckets,
With respect to the used memory size, the input size of the other division criterion attribute and the output size are added to obtain a new used memory size,
By repeating these processes for all buckets, the memory size used is calculated based on the input bucket.

請求項６は、請求項３のデータ処理方法において、
前記バケットの利用メモリサイズを計算する手順は、
前記クエリの処理対象が二項演算の内部結合処理による場合、
入力データにおける基準となる入力バケットのサイズを計算し、前記入力バケットが処理後に出力される値を含む出力バケットを検索し、前記出力バケットにおいて、前記入力バケットの処理後のデータが含まれている割合をバウンドから計算し、前記割合を利用して前記入力バケットに対する出力バケットのサイズの計算について、前記入力バケットの値が含まれる全出力バケット分繰り返すことで、出力結果のサイズを計算する一方、
もう一つの入力データが、基準となる入力バケットと結合される値を含むバケットを検索し、前記出力バケットと同様に当該入力データのサイズを計算し、
基準の入力バケットのサイズと、もう一つの入力データのサイズと、出力結果のサイズを合算することで、基準となる入力データのバケットにおける利用メモリサイズを計算することを特徴としている。 Claim 6 is the data processing method of claim 3,
The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is binary join inner processing,
The size of the reference input bucket in the input data is calculated, the output bucket including the value output after the input bucket is processed is searched, and the processed data of the input bucket is included in the output bucket While calculating the ratio from the bound and calculating the size of the output bucket with respect to the input bucket using the ratio, the size of the output result is calculated by repeating for all output buckets including the value of the input bucket,
Another input data is searched for a bucket including a value combined with a reference input bucket, and the size of the input data is calculated in the same manner as the output bucket.
It is characterized in that the size of memory used in the bucket of the reference input data is calculated by adding the size of the reference input bucket, the size of the other input data, and the size of the output result.

請求項７は、請求項３のデータ処理方法において、
前記バケットの利用メモリサイズを計算する手順は、
前記クエリの処理対象が二項演算の外部結合処理又は統合処理による場合、
請求項６に記載のデータ処理方法におけるバケットの利用メモリサイズを計算する手順によって利用メモリサイズを計算し、
基準となるデータにおいて、分割基準属性が持たない値を、もう一方のデータの分割基準属性が含んでいる場合に、
当該基準となるデータの分割基準属性が持たない値のサイズを計算するために、基準となるデータの分割基準属性の最初のバケットと最後のバケットにおいて、当該バケットのバウンドの範囲外を計算するに際して、
最初のバケットにおいては、下限値より前の値に該当するもう一つの分割基準属性のバケットと出力バケットのサイズを計算して利用メモリサイズに加算するとともに、
最後のバケットにおいては、上限値より先の値に該当するもう一つの分割基準属性のバケットと出力バケットのサイズを計算して利用メモリサイズに加算することを特徴としている。 Claim 7 is the data processing method of claim 3 ,
The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is binary join outer processing or integration processing,
The use memory size calculated by the procedure for calculating the utilization memory size of the bucket in the data processing method according to 請 Motomeko 6,
In the data as a reference, a value dividing the reference attribute does not have, when dividing the reference attributes of the other data is Nde contains,
In order to calculate the size of the value that the division criterion attribute of the reference data does not have , in calculating the outside of the bound range of the bucket in the first bucket and the last bucket of the division criterion attribute of the reference data ,
In the first bucket, calculate the size of another division criterion attribute bucket and output bucket corresponding to the value before the lower limit value and add it to the used memory size,
The last bucket is characterized in that the size of another division criterion attribute bucket and output bucket corresponding to a value before the upper limit value is calculated and added to the used memory size.

請求項８は、請求項３のデータ処理方法において、
前記バケットの利用メモリサイズを計算する手順は、
前記クエリの処理対象が二項演算の全統合処理による場合、
基準となる入力データ及びそれ以外の入力データについて、
一つずつバケットを選択し、両入力バケットの合計サイズを利用メモリサイズとして計算することを特徴としている。 Claim 8 is the data processing method of claim 3,
The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is based on the total integration processing of binary operation,
About the standard input data and other input data,
It is characterized by selecting buckets one by one and calculating the total size of both input buckets as the used memory size.

請求項９の大規模データ処理プログラムは、請求項３から請求項８のいずれか１項に記載の各手順をコンピュータに実行させることを特徴としている。 A large-scale data processing program according to a ninth aspect is characterized by causing a computer to execute each procedure according to any one of the third to eighth aspects.

本発明によれば、データ処理装置が保持するリソース容量で実行可能なデータサイズとなるようにクエリ分割部でクエリを分割し、分割クエリが実行されるので、データ処理装置において、一つのクエリで利用するデータ仮想化装置上のリソース量を制限することが可能となるとともに、分割処理することで、リソース容量を超える大規模データ処理に対応するクエリ処理を可能とすることができる。
その際、処理対象データの属性別に生成されたヒストグラムによってクエリの分割範囲を計算することで、データの偏りや処理特性を考慮することが可能となり、データ処理装置のリソースを考慮し、クエリ分割範囲のデータサイズをより正確に把握してクエリ分割を行うことができる。
その結果、実行できないクエリを減らすことで分割したクエリの実行の確実性を担保し、データ処理装置を備えたデータ仮想化システムの信頼性をより高めることができる。
また、ヒストグラムをデータの属性別に生成することで、前記属性を基準にデータを分割することができる。 According to the present invention, the query is divided by the query dividing unit so that the data size can be executed with the resource capacity held by the data processing device, and the divided query is executed. It is possible to limit the amount of resources on the data virtualization apparatus to be used, and to perform query processing corresponding to large-scale data processing exceeding the resource capacity by performing division processing.
At that time, it is possible to take into account data bias and processing characteristics by calculating the query division range based on the histogram generated for each attribute of the processing target data. Considering the resources of the data processing device, the query division range It is possible to more accurately grasp the data size of and perform query partitioning.
As a result, by reducing the number of unexecutable queries, it is possible to secure the certainty of executing the divided queries, and to further increase the reliability of the data virtualization system including the data processing device.
Further, by generating a histogram for each data attribute, the data can be divided based on the attribute.

本発明のデータ処理装置が含まれるデータ仮想化システムの構成を示すブロック図である。1 is a block diagram showing a configuration of a data virtualization system including a data processing device of the present invention. データ処理装置のクエリ分割部における処理手順を示すフローチャート図である。It is a flowchart figure which shows the process sequence in the query division part of a data processor. 分割基準属性の優先順を示す表である。It is a table | surface which shows the priority order of a division | segmentation reference | standard attribute. 処理種別に対するキー属性を示す表である。It is a table | surface which shows the key attribute with respect to a process type. データ処理装置のクエリ分割部における単項演算の場合の利用メモリサイズの計算手順を示すフローチャート図である。It is a flowchart figure which shows the calculation procedure of the utilization memory size in the case of the unary operation in the query division part of a data processor. データ処理装置のクエリ分割部における二項演算の直積処理による利用メモリサイズの計算手順を示すフローチャート図である。It is a flowchart figure which shows the calculation procedure of the utilization memory size by the direct product process of the binomial operation in the query division part of a data processor. データ処理装置のクエリ分割部における二項演算の内部結合処理による利用メモリサイズの計算手順を示すフローチャート図である。It is a flowchart figure which shows the calculation procedure of the utilization memory size by the internal joint process of a binomial operation in the query division part of a data processor. データ処理装置のクエリ分割部における二項演算の外部結合処理及び統合処理による利用メモリサイズの計算手順を示すフローチャート図である。It is a flowchart figure which shows the calculation procedure of the utilization memory size by the outer joining process of a binary operation in the query division part of a data processor, and an integration process. データ処理装置のクエリ分割部における二項演算の全統合処理による利用メモリサイズの計算手順を示すフローチャート図である。It is a flowchart figure which shows the calculation procedure of the utilization memory size by the total integration process of the binomial operation in the query division part of a data processor.

本発明のデータ処理装置の実施形態の一例について、図１を参照して説明する。
本発明のデータ処理装置は、限られたリソースの範囲内で実行可能なデータサイズとなるようにクエリ処理を分割して実行することで、大規模データに対するクエリ処理を確実に実行するようデータ仮想化システムにおいて構成されるものである。データ処理装置２は、図１に示すように、システムを利用するクライアント１と、システムが利用する１つ以上のデータソース３に対して、ネットワークを介して接続されている。 An example of an embodiment of the data processing apparatus of the present invention will be described with reference to FIG.
The data processing apparatus according to the present invention divides and executes the query processing so that the data size can be executed within a limited resource range, thereby ensuring that the query processing for large-scale data is executed. Configured in a computerized system. As shown in FIG. 1, the data processing apparatus 2 is connected to a client 1 that uses the system and one or more data sources 3 that are used by the system via a network.

クライアント１、データ処理装置２及び各データソース３は、それぞれ、基本プログラムや各種の基本デバイスが記憶されたＲＯＭと、各種のプログラムやデータが記憶されるハードディスクドライブ装置（ＨＤＤ）と、ＣＲ−ＲＯＭやＤＶＤ等の記憶媒体からプログラムやデータを読み出すメディアドライブ装置と、プログラムを実行するＣＰＵと、このＣＰＵにワークエリアを提供するＲＡＭと、外部装置と通信するパラレル／シリアルＩ／Ｆとを主要部分とする一般的な構成を備えたコンピュータ上に構築されている。
例えば、上述した構成を有する各コンピュータにおいて、クエリ処理を実行するための大規模データ処理プログラムがメディアドライブ装置を介してＨＤＤにインストールされることでデータ仮想化装置が構築される。 The client 1, the data processing device 2, and each data source 3 are respectively a ROM that stores basic programs and various basic devices, a hard disk drive (HDD) that stores various programs and data, and a CR-ROM. Main parts are a media drive device that reads programs and data from a storage medium such as DVD and DVD, a CPU that executes the program, a RAM that provides a work area for the CPU, and a parallel / serial I / F that communicates with an external device It is constructed on a computer having a general configuration.
For example, in each computer having the above-described configuration, a data virtualization apparatus is constructed by installing a large-scale data processing program for executing query processing in an HDD via a media drive apparatus.

クライアント１は、データ処理装置２が提供する統合データモデルを利用したクエリを投稿することで、透過的に複数のデータソース３に対するクエリ処理結果を取得する。 The client 1 posts the query using the integrated data model provided by the data processing device 2 to transparently acquire the query processing results for the plurality of data sources 3.

データソース３は、実際にデータを保管するストレージ３１と、データに対する処理を提供するエンジン３２からなり、エンジン３２が持つインタフェース経由でストレージ３１上のデータに対する処理を提供する。 The data source 3 includes a storage 31 that actually stores data and an engine 32 that provides processing for the data. The data source 3 provides processing for data on the storage 31 via an interface of the engine 32.

ヒストグラムを利用したクエリ分割を実現するデータ処理装置２は、統計情報部２０と、バケット情報収集部２１と、ヒストグラム生成部２２と、クエリ評価部２３と、中間ヒストグラム生成部２４と、利用メモリサイズ計算部２５と、分割範囲生成部２６と、クエリ分割部２７とを備えて構成されている。 The data processing device 2 that realizes query division using a histogram includes a statistical information unit 20, a bucket information collection unit 21, a histogram generation unit 22, a query evaluation unit 23, an intermediate histogram generation unit 24, and a memory size used. The calculation unit 25, the division range generation unit 26, and the query division unit 27 are configured.

統計情報部２０は、各データソース３に存在するデータの統計情報として、後述するヒストグラムを構成するために必要なバケットのバウンド、カウント及びドメインを保持する。バケットとは、ヒストグラムを集計するための単位であり、カウントは、属性毎の特定範囲（バウンド）における値の数、ドメインは値の種類の数である。 The statistical information unit 20 holds, as statistical information of data existing in each data source 3, bucket bounds, counts, and domains necessary for constructing a histogram described later. A bucket is a unit for aggregating histograms, a count is the number of values in a specific range (bound) for each attribute, and a domain is the number of types of values.

また、統計情報部２０には、データサイズ、データ属性、データ範囲を含む前記データのサイズ推定に利用可能な統計情報が保持されている。統計情報には、属性ごとに出現するデータの種類の数（ヒストグラム）や、各属性の平均や分散値等の属性の特徴を表示する値が含まれる。また、各データソース３及びデータ処理装置２のリソース容量も記憶されている。
これらの統計情報は、後述するクエリの利用メモリサイズの計算及びクエリ分割に際して必要な情報となる。 The statistical information unit 20 holds statistical information that can be used for estimating the size of the data including the data size, data attribute, and data range. The statistical information includes the number of types of data that appear for each attribute (histogram), and a value that displays the characteristics of the attribute such as the average and variance value of each attribute. In addition, the resource capacity of each data source 3 and data processing device 2 is also stored.
These pieces of statistical information become information necessary for calculation of a query memory size to be used later and query division.

バケット情報収集部２１は、統計情報部２０にヒストグラムとして登録されていないデータがデータソース３にある場合、又は、データソース３に更新されたデータがある場合に、各データソース３からバケットの情報を収集する。 The bucket information collection unit 21 receives information on the bucket from each data source 3 when there is data not registered as a histogram in the statistical information unit 20 in the data source 3 or when there is updated data in the data source 3. To collect.

ヒストグラム生成部２２は、バケット情報収集部２１が収集したバケットの情報から、データの属性別に統計情報としてヒストグラムを生成し、統計情報部２０にデータを格納する。ヒストグラムは、属性毎の特定範囲（バウンド）における値の数（カウント）と、値の種類の数（ドメイン）を、バケットと呼ばれる単位で集計する。このヒストグラムにより、属性の特徴となるデータの偏りを表現することが可能とし、特定の属性を基準にデータを分割することとなる。ヒストグラムの構成手法や算出手法については、上述した非特許文献１の手法が取られる。 The histogram generation unit 22 generates a histogram as statistical information for each data attribute from the bucket information collected by the bucket information collection unit 21 and stores the data in the statistical information unit 20. The histogram counts the number of values (count) in a specific range (bound) for each attribute and the number of types of values (domain) in units called buckets. With this histogram, it is possible to express the bias of the data that is the feature of the attribute, and the data is divided based on a specific attribute. Regarding the histogram construction method and calculation method, the method described in Non-Patent Document 1 is used.

クエリ評価部２３は、ユーザが投稿したクエリを実行するためのクエリプランを生成する。
中間ヒストグラム生成部２４は、クエリ評価部２３が生成したクエリプランを基に、入力となるデータの統計情報上のヒストグラム又は自身が以前に生成した中間ヒストグラム２８を利用して、処理毎の出力データのヒストグラムである中間ヒストグラムを生成する。 The query evaluation unit 23 generates a query plan for executing the query posted by the user.
Based on the query plan generated by the query evaluation unit 23, the intermediate histogram generation unit 24 uses the histogram on the statistical information of the input data or the intermediate histogram 28 previously generated by itself to output output data for each process. An intermediate histogram that is a histogram of is generated.

利用メモリサイズ計算部２５は、クエリプランを基に入力となるデータに対する統計情報部２０のヒストグラムと、中間ヒストグラム生成部２４が生成した中間ヒストグラム２８を利用して、単項演算、内部結合処理（Inner Join）、外部結合（Outer Join）／統合（Union）処理、直積処理（Cross Join）、全統合（Union all）処理の５通りの方法でクエリプランの処理毎にバケット単位で入出力サイズを計算する。 The use memory size calculation unit 25 uses the histogram of the statistical information unit 20 for the input data based on the query plan and the intermediate histogram 28 generated by the intermediate histogram generation unit 24 to perform unary operations and inner join processing (Inner I / O size is calculated in units of buckets for each query plan processing in five ways: Join), Outer Join / Union processing, Cross product processing, Cross all processing, Union all processing. To do.

分割範囲生成部２６は、特定処理における入力データの分割基準となる属性（分割基準属性）を選択し、利用メモリサイズ計算部２５が計算した当該属性のバケット単位での入出力サイズを基に、データソース３及びデータ処理装置２のリソース容量を考慮して当該属性における分割範囲を設定し、分割範囲リスト２９を作成する。 The division range generation unit 26 selects an attribute (division standard attribute) that serves as a division reference of input data in the specific process, and based on the input / output size in units of buckets of the attribute calculated by the use memory size calculation unit 25, In consideration of the resource capacities of the data source 3 and the data processing device 2, a division range in the attribute is set, and a division range list 29 is created.

クエリ分割部２７は、クエリ評価部２３が生成したクエリプランの処理毎にクエリ分割の有無を判断し、分割が必要な場合は分割範囲生成部２６が生成した分割範囲リスト２９を利用してクエリプランに分割範囲と分割対象属性を付与する。
そして、分割範囲と分割対象属性が付与されたクエリプランを基に再構築されたクエリのデータソ−ス３への投稿が実行される。 The query division unit 27 determines the presence or absence of query division for each processing of the query plan generated by the query evaluation unit 23. If division is necessary, the query division unit 27 uses the division range list 29 generated by the division range generation unit 26 to execute the query. A division range and a division target attribute are given to the plan.
And the posting to the data source 3 of the query reconstructed based on the query plan to which the division range and the division target attribute are given is executed.

次に、分割範囲生成部２６における詳細処理について、図２〜図４を参照しながら説明する。
クエリ分割部２７にてクエリの分割が必要と判断された処理において、当該処理の入力データ（入力データセット）を基準としてクエリを分割するために、分割範囲を指定する分割基準属性を決定する（ステップ４０）。
データ同士を結合処理（Join）等する二項演算の場合は、もう一方の入力データに関しても同様に分割基準属性を決定する。 Next, the detailed process in the division | segmentation range production | generation part 26 is demonstrated, referring FIGS.
In the process in which the query division unit 27 determines that the query needs to be divided, a division criterion attribute for designating a division range is determined in order to divide the query based on the input data (input data set) of the process (see FIG. Step 40).
In the case of a binary operation in which data is joined (Join) or the like, the division reference attribute is determined in the same manner for the other input data.

各データにおける分割基準属性は、図３に示す数字の小さい順に優先する属性（分割基準属性）が選択される。すなわち、「１．処理毎のキー属性」、「２．主キーの属性」、「３．索引付き属性」、「４．一意性制約付き属性」、「５．被ＮＵＬL制約付き属性」、「６．その他」、の優先順位で選択される。
なお、属性選択の前提として、出力後にも対象の属性が存在している必要がある。最も優先される属性は処理毎に異なっており、優先順位１の「処理毎のキー属性」が存在している場合には、これが必ず最優先に選択される。優先順位２〜６の優先順は、適宜設定可能とすることができる。 As the division criterion attribute in each data, an attribute (division criterion attribute) that is prioritized in ascending order of the numbers shown in FIG. 3 is selected. That is, “1. Key attribute for each process”, “2. Primary key attribute”, “3. Indexed attribute”, “4. Unique constraint attribute”, “5. NULL constrained attribute”, “ 6. Other ”is selected in the priority order.
As a premise for selecting an attribute, the target attribute must exist even after output. The attribute with the highest priority is different for each process, and when a “key attribute for each process” with priority 1 exists, this is always selected with the highest priority. The priority order of the priority orders 2 to 6 can be set as appropriate.

処理毎のキー属性は、図４に示すように、値を選択する選択処理、値を一つにまとめる集合処理、値を整列する整列処理、値同士を結合させる内部結合処理・外部結合処理において存在する。すなわち、選択処理では、選択の条件となるフィルター属性が優先的に選択される。集合処理では、集計基準となるGroup by指定属性が優先的に選択される。整列処理では並び替え基準となるOrder by指定属性が優先的に選択される。内部結合処理及び外部結合処理では、結合条件となるJoinキー属性が優先的に選択される。
キー属性がない射影処理、直積処理、統合処理、全統合処理は、図３の主キー属性が最優先される。 As shown in FIG. 4, the key attribute for each process includes selection processing for selecting values, set processing for collecting values into one, alignment processing for aligning values, and internal and external combining processing for combining values. Exists. That is, in the selection process, a filter attribute serving as a selection condition is preferentially selected. In the collective processing, the Group by designation attribute that is the aggregation criterion is preferentially selected. In the sorting process, the Order by designation attribute which is the sorting reference is preferentially selected. In the inner join process and the outer join process, the Join key attribute serving as a join condition is preferentially selected.
In the projection process, the direct product process, the integration process, and the total integration process without key attributes, the primary key attribute in FIG. 3 has the highest priority.

次に、この分割基準属性のヒストグラムにおけるバケットを一つ選択し、当該バケットに対して処理を実行した場合の出力バケットも含めて、当該バケットの利用メモリサイズを計算する（ステップ４１〜４８）。
すなわち、先ず、データ処理装置２において、データを記憶できる上限のメモリサイズ（リソースの上限となる物理メモリサイズ）を計算する（ステップ４１）。
次に、利用メモリサイズと合計メモリサイズの初期化をそれぞれ行う（ステップ４２、ステップ４３）。 Next, one bucket in the histogram of the division criterion attribute is selected, and the used memory size of the bucket is calculated including the output bucket when processing is performed on the bucket (steps 41 to 48).
That is, first, the data processing device 2 calculates the upper limit memory size (physical memory size that becomes the upper limit of resources) that can store data (step 41).
Next, the used memory size and the total memory size are respectively initialized (steps 42 and 43).

ヒストグラムの一つのバケットに対する利用メモリサイズを合計メモリサイズに加算する（ステップ４４）。
ヒストグラムにおける残バケットの有無を判断する（ステップ４５）。
残バケットが有れば、当該バケットに関するヒストグラムを利用して利用メモリサイズを計算する（ステップ４６）。 The used memory size for one bucket of the histogram is added to the total memory size (step 44).
It is determined whether or not there is a remaining bucket in the histogram (step 45).
If there is a remaining bucket, the used memory size is calculated using the histogram for the bucket (step 46).

ステップ４４の合計メモリサイズとステップ４６で計算した利用メモリサイズとを合計した値が上限メモリサイズより小さいかを判断し（ステップ４７）、小さい場合は、ステップ４４に戻って、前回の合計メモリサイズに利用メモリサイズを加算する（ステップ４４）。
これをリソースの上限となるサイズまで繰り返し（ステップ４７）、合計メモリサイズが上限メモリサイズを超えた段階で直前に計算したバケットの上限値を分割範囲とし、分割範囲リストに登録する（ステップ４８）。
バケット単位で加算を行う合計メモリサイズの上限を上限メモリサイズ（リソースの上限となるサイズ）としたのは、分割基準属性毎に行われる分割数をなるべく減らすためである。分割数を減らすことで、クエリの分割を効率良く行うことができる。 It is determined whether the sum of the total memory size in step 44 and the used memory size calculated in step 46 is smaller than the upper limit memory size (step 47). If so, the process returns to step 44 to return to the previous total memory size. The used memory size is added to (step 44).
This is repeated until the size becomes the upper limit of the resource (step 47), and the upper limit value of the bucket calculated immediately before the total memory size exceeds the upper limit memory size is set as the division range and registered in the division range list (step 48). .
The reason why the upper limit of the total memory size to be added in units of buckets is set to the upper limit memory size (the size that becomes the upper limit of the resource) is to reduce the number of divisions performed for each division criterion attribute as much as possible. By reducing the number of divisions, the query can be divided efficiently.

以上の処理を全バケット分繰り返すことで（ステップ４５）、分割範囲リストを完成させ、分割範囲リストの登録数に「１」を加えた数を分割数とし（ステップ４９）、分割範囲設定を完了する（ステップ５０）。 By repeating the above processing for all buckets (step 45), the divided range list is completed, and the number obtained by adding “1” to the registered number of the divided range list is set as the divided number (step 49), and the divided range setting is completed. (Step 50).

なお、二項演算の場合、選択した二つの分割基準属性のうち、サイズの大きい方のデータのみを基準に上記の処理を行い、当該データの全バケットを対象として上記の処理を行うことで、分割範囲リストの作成が完了する。
ただし、外部結合処理に関しては、左外部結合（Left Outer Join）の場合は左側のデータを、右外部結合（Right Outer Join）の場合は右側のデータを上記の処理の対象とする。 In the case of binomial operation, by performing the above processing based on only the data having the larger size among the selected two division criterion attributes, and performing the above processing for all buckets of the data, Creation of the split range list is complete.
However, regarding the outer join processing, the left side data is the target of the left outer join (Left Outer Join), and the right data is the target of the right outer join (Right Outer Join).

また、分割範囲リストは、二項演算の場合、二つの入力データそれぞれに必要となることから、二つの分割範囲リストが作成されることとなるが、直積処理と全統合処理を除き、二つともに同一の値が記録される。
直積処理に関しては、基準となるデータ側の分割範囲リストにのみ記録され、もう一方のデータ側には設定されない。
全統合処理は、それぞれのデータで異なる分割範囲リストが作成される。 In addition, since a split range list is required for each of the two input data in the case of binary operation, two split range lists will be created, except for direct product processing and total integration processing. The same value is recorded for both.
Regarding the direct product processing, it is recorded only in the division range list on the reference data side, and is not set on the other data side.
In the total integration process, a different divided range list is created for each data.

次に、分割範囲生成部２６で分割数設定に必要な情報となる利用メモリサイズの計算方法について説明する。
利用メモリサイズ計算部２５では、バケット単位におけるヒストグラムを考慮した入出力の利用メモリサイズを計算するが、（Ａ）単項演算、（Ｂ）直積処理、（Ｃ）内部結合処理、（Ｄ）外部結合処理／統合処理、（Ｅ）全統合処理の５通りの計算方法に分類することができる（図４）。
利用メモリサイズ計算に際しての基本的な方針は、ある一つの分割基準属性における一つのバケットを基準として、当該バケットの処理後に該当する出力範囲と、二項演算の場合は同様に出力範囲を構成するために必要なもう一つの入力データの範囲も併せて算出し、それらのサイズを合算することで、利用するメモリサイズを算出する。 Next, a method of calculating the used memory size, which is information necessary for setting the number of divisions by the division range generation unit 26, will be described.
The used memory size calculation unit 25 calculates the input / output used memory size in consideration of the histogram in bucket units. (A) Unary operation, (B) Direct product processing, (C) Inner join processing, (D) Outer join Processing / integration processing and (E) total integration processing can be classified into five calculation methods (FIG. 4).
The basic policy for calculating the used memory size is to configure the output range corresponding to the processing of the bucket, and the output range in the case of binary operation, based on one bucket in one division criterion attribute. Accordingly, another range of input data necessary for the calculation is also calculated, and the size of the memory to be used is calculated by adding up the sizes thereof.

このとき、例えばハッシュテーブルのような処理上の管理データが必要になる場合には、入力バケットを基準とした対象データ範囲で必要な管理データのサイズも併せて計算する。
なお、以下の処理フローの説明では、ヒストグラムを構成するバケットが属性単位でバウンド順となっていることを前提としている。
また、以下の説明において、バウンドのうち、下限値をlower、上限値をupperと表現する。 At this time, when processing management data such as a hash table is required, the size of the management data required in the target data range based on the input bucket is also calculated.
In the following description of the processing flow, it is assumed that the buckets constituting the histogram are in the bound order in attribute units.
Further, in the following description, of the bounds, the lower limit value is expressed as lower and the upper limit value is expressed as upper.

（Ａ）単項演算
単項演算における利用メモリサイズの計算処理について、図５を参照して説明する。
単項演算では、入力バケットのサイズを計算後、処理後の当該属性のヒストグラムを利用して、処理後の出力サイズを計算する。 (A) Unary Operation Calculation processing of the used memory size in unary operation will be described with reference to FIG.
In the unary operation, after calculating the size of the input bucket, the output size after processing is calculated using the histogram of the attribute after processing.

分割基準属性in1のヒストグラムにおける入力バケットibucを一つ取得し（ステップ６０）、入力バケットibucからデータレコードに基づく入力サイズinsizeを計算する（ステップ６１）。
処理後の出力バケットobucの情報を取得するため、出力サイズoutsizeの初期化を行う（ステップ６２）。
続いて、バケットのバウンドを利用して、対象の入力バケットibucの処理後のデータが含まれる１（i=1）番目の出力バケットobucを取得する（ステップ６３）。
次に、入力バケットibucの下限値lowerと出力バケットobucの上限値upperを比較する（ステップ６４）。 One input bucket ibuc in the histogram of the division criterion attribute in1 is acquired (step 60), and the input size insize based on the data record is calculated from the input bucket ibuc (step 61).
In order to acquire information on the output bucket obuc after processing, the output size outsize is initialized (step 62).
Subsequently, the first (i = 1) -th output bucket obuc including the processed data of the target input bucket ibuc is acquired using the bound of the bucket (step 63).
Next, the lower limit value lower of the input bucket ibuc and the upper limit value upper of the output bucket obuc are compared (step 64).

入力バケットibucの下限値lowerが出力バケットobucの上限値upperより小さい場合、入力バケットの処理結果が出力バケットの出力結果に含まれているので、出力バケットにおいて、実際に入力バケットの処理後のデータが含まれている割合を、バウンドから計算する（ステップ６５）。
この割合を利用して、入力バケットに対する出力バケットのサイズを計算し、出力サイズに加算する（ステップ６６）。
次のバケットについて上述の処理を繰り返すために、iの数を「１」増加する（ステップ６７）。 If the lower limit value lower of the input bucket ibuc is smaller than the upper limit value upper of the output bucket obuc, the processing result of the input bucket is included in the output result of the output bucket. The ratio in which is included is calculated from the bound (step 65).
Using this ratio, the size of the output bucket relative to the input bucket is calculated and added to the output size (step 66).
In order to repeat the above processing for the next bucket, the number of i is increased by “1” (step 67).

入力バケットibucの上限値upperが出力バケットobucの上限値upperより大きい場合、又は、iが処理後属性outの全バケット数bucketsと等しいか大きくなった場合（ステップ６８）、入力サイズと出力サイズとの加算を利用メモリサイズとする（ステップ６９）。
これらの処理を入力バケットの値が含まれる全出力バケット分繰り返すことで、入力バケットを基準とした利用メモリサイズを計算することができる。 When the upper limit upper of the input bucket ibuc is larger than the upper limit upper of the output bucket obuc, or when i is equal to or larger than the total number of buckets buckets of the processed attribute out (step 68), the input size and the output size Is used memory size (step 69).
By repeating these processes for all output buckets including the value of the input bucket, it is possible to calculate the used memory size based on the input bucket.

（Ｂ）直積処理
二項演算のうち、直積処理（Cross Join）における利用メモリサイズの計算処理について、図６を参照して説明する。
二項演算の直積処理では、処理対象の二つのデータの内、サイズの大きい方のデータの分割基準属性にあるバケットを基準として、もう一方のデータ全体と直積処理を行った場合の利用メモリサイズを計算する。 (B) Direct Product Processing The calculation processing of the used memory size in the direct product processing (Cross Join) among the binary operations will be described with reference to FIG.
In binary product direct product processing, the size of the memory used when performing direct product processing with the other data as a reference, using the bucket in the division criteria attribute of the larger data of the two data to be processed Calculate

先ず一方（サイズの大きい方）の分割基準属性in1のヒストグラムにおける入力バケットibuc1を一つ取得し（ステップ７０）、入力バケットibuc1からデータレコードに基づく入力サイズinsize1を計算する（ステップ７１）。
この入力サイズinsize1を利用メモリサイズの初期値として利用する（ステップ７２）。
次に、もう一方の分割基準属性in2からバケットibuc2を取得し（ステップ７３）、入力サイズinsize2を計算する（ステップ７4）。 First, one input bucket ibuc1 in the histogram of one (larger size) division criterion attribute in1 is acquired (step 70), and the input size insize1 based on the data record is calculated from the input bucket ibuc1 (step 71).
This input size insize1 is used as the initial value of the used memory size (step 72).
Next, the bucket ibuc2 is acquired from the other division criterion attribute in2 (step 73), and the input size insize2 is calculated (step 74).

二つの入力バケットibuc1、ibuc2から直積後の出力サイズoutsize（レコード数の乗算）を計算する（ステップ７５）。
ステップ７４で計算した入力サイズinsize2と、ステップ７５で計算した出力サイズoutsizeとを更に加算し利用メモリサイズとする（ステップ７６）。
もう一方の分割基準属性in2に未計算のバケットがあるかを判断し、ある場合には、ステップ７３〜７６の処理を繰り返す（ステップ７７）。
これらの処理を全出力バケット分繰り返すことで、入力バケットibuc1を基準とした利用メモリサイズを計算することができる（ステップ７８）。 The output size outsize (multiplication of the number of records) after direct product is calculated from the two input buckets ibuc1 and ibuc2 (step 75).
The input size insize2 calculated in step 74 and the output size outsize calculated in step 75 are further added to obtain the used memory size (step 76).
It is determined whether or not there is an uncalculated bucket in the other division criterion attribute in2, and if there is, the processes in steps 73 to 76 are repeated (step 77).
By repeating these processes for all output buckets, it is possible to calculate the used memory size based on the input bucket ibuc1 (step 78).

（Ｃ）内部結合処理
二項演算のうち、内部結合処理（Inner Join）における利用メモリサイズの計算処理について、図７を参照して説明する。
ここでは、処理対象の二つのデータの内、サイズの大きい方のデータの分割基準属性in1にあるバケットを基準として、当該バケットと結合する入力データ及びその結果（出力データ）を対象にサイズ計算を行うことで、利用メモリサイズを計算する。 (C) Inner Join Processing Of the binary operations, a calculation process of the used memory size in inner join processing (Inner Join) will be described with reference to FIG.
Here, the size calculation is performed based on the input data combined with the bucket and the result (output data) based on the bucket in the division criterion attribute in1 of the larger data among the two data to be processed. By doing so, the used memory size is calculated.

先ず、基準となる入力バケットのサイズを計算する。
すなわち、分割基準属性in1のヒストグラムにおける入力バケットibucを一つ取得し（ステップ８０）、入力バケットibucからデータレコードに基づく入力サイズinsizeを計算する（ステップ８１）。 First, the size of the reference input bucket is calculated.
That is, one input bucket ibuc in the histogram of the division criterion attribute in1 is acquired (step 80), and the input size insize based on the data record is calculated from the input bucket ibuc (step 81).

次に、入力バケットが処理後に出力される値を含む出力バケットを検索する。
すなわち、処理後の出力バケットobucの情報を取得するため、出力サイズoutsizeの初期化を行う（ステップ８２）。
続いて、バケットのバウンドを利用して、対象の入力バケットibucの処理後のデータが含まれる１（i=1）番目の出力バケットobucを取得する（ステップ８３）。 Next, an output bucket including a value output after the input bucket is processed is searched.
In other words, the output size outsize is initialized in order to obtain information on the processed output bucket obuc (step 82).
Subsequently, the first (i = 1) -th output bucket obuc including the processed data of the target input bucket ibuc is acquired using the bound of the bucket (step 83).

次に、出力バケットにおいて、実際に入力バケットの処理後のデータが含まれている割合をバウンドから計算し、この割合を利用して、入力バケットに対する出力バケットのサイズを計算する。
すなわち、入力バケットibucの下限値lowerと出力バケットobucの上限値upperを比較する（ステップ８４）。
入力バケットibuc1の下限値lowerが出力バケットobucの上限値upperより小さい場合、入力バケットの処理結果が出力バケットの出力結果に含まれているので、出力バケットにおいて、実際に入力バケットの処理後のデータが含まれている割合を、バウンドから計算する（ステップ８５）。
この割合を利用して、入力バケットibuc1に対する出力バケットobucのサイズを計算し、出力サイズoutsizeに加算する（ステップ８６）。
次のバケットについて上述の処理を繰り返すために、iの数を「１」増加する（ステップ８７）。 Next, in the output bucket, the ratio at which the processed data of the input bucket is actually included is calculated from the bound, and the size of the output bucket with respect to the input bucket is calculated using this ratio.
That is, the lower limit value lower of the input bucket ibuc and the upper limit value upper of the output bucket obuc are compared (step 84).
When the lower limit value lower of the input bucket ibuc1 is smaller than the upper limit value upper of the output bucket obuc, the processing result of the input bucket is included in the output result of the output bucket. The ratio in which the is included is calculated from the bound (step 85).
Using this ratio, the size of the output bucket obuc with respect to the input bucket ibuc1 is calculated and added to the output size outsize (step 86).
In order to repeat the above-described processing for the next bucket, the number of i is increased by “1” (step 87).

これらの処理を入力バケットの値が含まれる全出力バケット分繰り返すことで、入力バケットを基準とした出力結果のサイズを計算し、入力バケットibuc1の上限値upperが出力バケットobucの上限値upperより大きい場合、又は、iが処理後属性outの全バケット数bucketsと等しいか大きくなった場合（ステップ８８）、一方の入力データに関する出力結果サイズの計算を終了し、もう一方の入力データに関する出力結果サイズの計算に移る。 By repeating these processes for all output buckets including the input bucket value, the size of the output result based on the input bucket is calculated, and the upper limit upper of the input bucket ibuc1 is larger than the upper limit upper of the output bucket obuc If i is equal to or larger than the total number of buckets buckets of the processed attribute out (step 88), the calculation of the output result size for one input data is terminated, and the output result size for the other input data Move on to the calculation.

もう一つの入力データが、基準となる入力データのバケットと結合される値を含むバケットを検索し、前記した出力バケットと同様に当該入力データのサイズを計算する。 Another input data is searched for a bucket including a value combined with a reference input data bucket, and the size of the input data is calculated in the same manner as the output bucket described above.

先ず、もう一つの入力における入力サイズInsize2の初期化を行う（ステップ８９）。
続いて、分割基準属性in2の１（j=1）番目の入力バケットibuc2を取得する（ステップ９０）。
次に、入力バケットibucの下限値lowerと出力バケットobucの上限値upperを比較する（ステップ９１）。 First, the input size Insize2 in another input is initialized (step 89).
Subsequently, the 1 (j = 1) th input bucket ibuc2 of the division criterion attribute in2 is acquired (step 90).
Next, the lower limit value lower of the input bucket ibuc and the upper limit value upper of the output bucket obuc are compared (step 91).

入力バケットibuc1の下限値lowerが入力バケットibuc2の上限値upperより小さい場合、入力バケットibuc1の処理結果が入力バケットibuc2の処理結果に含まれているので、入力バケットibuc2において、実際に入力バケットibuc1の処理後のデータが含まれている割合を、バウンドから計算する（ステップ９２）。
この割合を利用して、入力バケットibuc1に対する入力バケットibuc2のサイズを計算し、入力サイズibuc2に加算する（ステップ９３）。
次のバケットについて上述の処理を繰り返すために、jの数を「１」増加する（ステップ９４）。
これらの処理を入力バケットibuc1の値が含まれる全入力バケットibuc2分繰り返すことで、入力バケットibuc1を基準とした出力結果サイズを計算することができる。 When the lower limit value lower of the input bucket ibuc1 is smaller than the upper limit upper of the input bucket ibuc2, the processing result of the input bucket ibuc1 is included in the processing result of the input bucket ibuc2. The ratio that the processed data is included is calculated from the bound (step 92).
Using this ratio, the size of the input bucket ibuc2 with respect to the input bucket ibuc1 is calculated and added to the input size ibuc2 (step 93).
In order to repeat the above processing for the next bucket, the number of j is increased by “1” (step 94).
By repeating these processes for all input buckets ibuc2 including the value of the input bucket ibuc1, the output result size based on the input bucket ibuc1 can be calculated.

入力バケットibuc1の上限値upperが入力バケットibuc2の上限値upperより大きい場合、又は、jが分割基準属性in2の全バケット数bucketsと等しいか大きくなった場合（ステップ９５）、以上の処理で計算した基準となる入力バケットibuc1のサイズと、もう一つの入力データibuc2のサイズと、出力結果のサイズを合算することで、基準となる入力データのバケットにおける利用メモリサイズの計算が完了する（ステップ９６）。 When the upper limit upper value of the input bucket ibuc1 is larger than the upper limit value upper of the input bucket ibuc2, or when j is equal to or larger than the total number of buckets buckets of the division criterion attribute in2 (step 95), the above calculation is performed. By calculating the size of the reference input bucket ibuc1, the size of the other input data ibuc2, and the size of the output result, the calculation of the used memory size in the reference input data bucket is completed (step 96). .

（Ｄ）外部結合／統合処理、
二項演算のうち、外部結合処理（Outer Join）と統合処理（Union）における利用メモリサイズの計算処理について、図８を参照して説明する。
これらの処理においては、基本的に上述した内部結合処理（図６）と同じ計算方法で利用メモリサイズを計算する。 (D) Outer join / integration processing,
Of the binary operations, the calculation processing of the used memory size in the outer join process (Outer Join) and the integration process (Union) will be described with reference to FIG.
In these processes, the used memory size is basically calculated by the same calculation method as the above-described inner coupling process (FIG. 6).

外部結合処理（Outer Join）と統合処理（Union）が内部結合処理と異なる点として、基準となるデータにおいて、分割基準属性が持たない値を、もう一方のデータの分割基準属性が含んでいる場合がある。
このような場合、入力データのサイズを計算するために、基準となるデータの分割基準属性の最初のバケットと最後のバケットにおいて、当該バケットのバウンドの範囲外を計算する必要がある。 The difference between the outer join processing (Outer Join) and the integration processing (Union) from the inner join processing is that the split data attribute of the other data contains a value that the split data attribute does not have in the reference data There is.
In such a case, in order to calculate the size of the input data, it is necessary to calculate out of bounds of the bucket in the first bucket and the last bucket of the division criterion attribute of the reference data.

すなわち、分割基準属性in1のヒストグラムにおける入力バケットibuc1を取得し（ステップ１００）、基準となるデータの分割基準属性の最初のバケットと最後のバケットを確認するため、ibuc1のテーブル順であるインデックスindexを確認して（ステップ１０１）、最初のバケット（index＝0）、最後のバケット（index＝in.buckets-1）、それ以外のバケットで場合分けを行う（ステップ１０２）。 That is, the input bucket ibuc1 in the histogram of the division criterion attribute in1 is acquired (step 100), and the index index which is the table order of ibuc1 is determined in order to confirm the first bucket and the last bucket of the division criterion attribute of the reference data. After confirmation (step 101), the first bucket (index = 0), the last bucket (index = in.buckets-1), and the other buckets are classified (step 102).

最初のバケットibuc1（index＝0）においては、下限値lowerより前の値に該当するもう一つの分割基準属性のバケットibuc2及び出力バケットobucのサイズをそれぞれ計算し（ステップ１０３）、内部結合処理によるサイズ計算を行い（ステップ１０４）、利用メモリサイズに加算する（ステップ１０５）。 In the first bucket ibuc1 (index = 0), the sizes of the bucket ibuc2 and the output bucket obuc of another division criterion attribute corresponding to the value before the lower limit value lower are respectively calculated (step 103), and by the internal joining process The size is calculated (step 104) and added to the used memory size (step 105).

最後のバケット（index＝in.buckets-1）においては、内部結合処理によるサイズ計算を行い（ステップ１０６）、上限値upperより先の値に該当するもう一つの分割基準属性のバケットibuc2及び出力バケットobucのサイズをそれぞれ計算し（ステップ１０７）、利用メモリサイズに加算する（ステップ１０８）。 In the last bucket (index = in.buckets-1), the size is calculated by the inner join process (step 106), and the other bucket reference attribute bucket ibuc2 and output bucket corresponding to the value before the upper limit upper Each size of obuc is calculated (step 107) and added to the used memory size (step 108).

それ以外のバケットについては、上述した内部結合処理によるサイズ計算のみで利用メモリサイズを計算する（ステップ１０９）。
そして、ステップ１０５、ステップ１０８、ステップ１０９で計算した利用メモリサイズを合計することで、入力バケットibuc1の利用メモリサイズの計算が完了する。 For the other buckets, the used memory size is calculated only by the size calculation by the above-described inner coupling process (step 109).
Then, by summing up the used memory sizes calculated in Step 105, Step 108, and Step 109, the calculation of the used memory size of the input bucket ibuc1 is completed.

（Ｅ）全統合処理
二項演算のうち、全統合処理（Union all）における利用メモリサイズの計算処理について、図９を参照して説明する。
全統合処理では、基準となる入力データと、それ以外の入力データでのバケット選択方法に違いはなく、両データともに一つずつバケットを選択し、両入力バケットの合計サイズを利用メモリサイズとして計算する。
すなわち、分割基準属性in1のヒストグラムにおける入力バケットibuc1を一つ取得する（ステップ２００）。
未計算の分割基準属性in2の入力バケットibuc2を選択する（ステップ２０１）。
入力バケットibuc1と入力バケットibuc2の合計サイズinsizeを計算する（ステップ２０２）。
この合計サイズinsizeを利用メモリサイズとする（ステップ２０３）。 (E) All Integration Processing Of the binomial operations, the calculation processing of the used memory size in the all integration processing (Union all) will be described with reference to FIG.
In the all integration process, there is no difference in the bucket selection method between the reference input data and the other input data. Select one bucket for each data and calculate the total size of both input buckets as the used memory size. To do.
That is, one input bucket ibuc1 in the histogram of the division criterion attribute in1 is acquired (step 200).
An input bucket ibuc2 having an uncalculated division criterion attribute in2 is selected (step 201).
The total size insize of the input bucket ibuc1 and the input bucket ibuc2 is calculated (step 202).
This total size insize is used as the use memory size (step 203).

上述したデータ仮想化システムにおけるデータ処理装置によれば、バケット情報収集部２１を介して統計情報部２０で記録された処理対象のデータにおける属性毎のヒストグラムを基に、ユーザが投稿したクエリの処理毎の出力データのヒストグラムをヒストグラム生成部２２で作成し、入力のバケットに対する出力のバケットの利用メモリサイズを利用メモリサイズ計算部２５が計算することで、出力データの値の偏りを考慮した分割範囲を分割範囲生成部２６で設定し、クエリ分割部２７で精度の高い分割クエリの生成を実現することができる。
その結果、一つのクエリで利用するデータ処理装置２上のリソース量を制限することが可能となるので、リソース容量を超える大規模データ処理に対応することができる。 According to the data processing device in the data virtualization system described above, the processing of the query posted by the user based on the histogram for each attribute in the processing target data recorded in the statistical information unit 20 via the bucket information collection unit 21 A histogram of the output data for each output is created by the histogram generator 22 and the used memory size calculator 25 calculates the used memory size of the output bucket relative to the input bucket, so that the divided range in consideration of the deviation of the value of the output data Can be set by the division range generation unit 26, and the query division unit 27 can realize generation of a divided query with high accuracy.
As a result, it is possible to limit the amount of resources on the data processing device 2 used in one query, so that it is possible to cope with large-scale data processing exceeding the resource capacity.

１…クライアント、２…データ処理装置、３…データソース、２０…統計情報部、２１…バケット情報収集部、２２…ヒストグラム生成部、２３…クエリ評価部、２４…中間ヒストグラム生成部、２５…利用メモリサイズ計算部、２６…分割範囲生成部、２７…クエリ分割部、２８…中間ヒストグラム、２９…分割範囲リスト、３１…ストレージ、３２…エンジン。 DESCRIPTION OF SYMBOLS 1 ... Client, 2 ... Data processing apparatus, 3 ... Data source, 20 ... Statistical information part, 21 ... Bucket information collection part, 22 ... Histogram generation part, 23 ... Query evaluation part, 24 ... Intermediate histogram generation part, 25 ... Use Memory size calculation unit, 26 ... division range generation unit, 27 ... query division unit, 28 ... intermediate histogram, 29 ... division range list, 31 ... storage, 32 ... engine.

Claims

In a data processing apparatus that performs query processing using one or more data sources for a client that performs query processing request and result reception,
A histogram generator that generates a histogram based on a plurality of data existing in each of the data sources;
A statistical information section for storing the histogram;
A query evaluator that generates a query plan for executing a query posted by the user;
Based on the query plan, an intermediate histogram generation unit that generates an intermediate histogram that is a histogram of output data for each process using a histogram stored in the statistical information unit or an intermediate histogram generated by itself. When,
Utilizing the histogram stored in the statistical information section and the intermediate histogram generated by the intermediate histogram generation section, a use memory size calculation section that calculates the input / output size in bucket units for each processing of the query plan,
Based on the input / output size in bucket units calculated by the used memory size calculation unit, a division range generation unit that determines a query division range;
A data processing apparatus characterized by comprising:

A bucket information collecting unit for collecting bucket information from a plurality of data existing in each data source;
The data processing apparatus according to claim 1, wherein the histogram is generated for each data attribute from the bucket information.

In a data processing method for performing query processing using one or more data sources for a client that performs query processing request and result reception,
To generate a histogram for each data attribute from bucket information,
Creating an intermediate histogram for each processing of the query from the histogram;
Determining a split criteria attribute that specifies a split range of the query when query splitting is required;
A procedure for selecting one bucket in the histogram of the division criterion attribute and calculating the used memory size of the bucket including the output bucket when processing is performed on the bucket;
Repeat the above process up to the size of the upper limit of the resource, register the bucket upper limit value calculated immediately before the upper limit as a split range in the split range list,
A data processing method, wherein the number of divisions is determined from the division range list by repeating the above processes for all buckets.

The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is a unary operation,
Use the bound of the bucket to obtain an output bucket that contains the processed data of the target input bucket,
In the output bucket, the ratio of the actual processed data of the input bucket is calculated from the bound,
Using the percentage, calculate the size of the output bucket relative to the input bucket,
The data processing method according to claim 3, wherein the use memory size is calculated based on the input bucket by repeating these processes for all output buckets including the value of the input bucket.

The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is a direct product process of binary operation,
Of the two data to be processed, calculate the input size based on one bucket in the division criterion attribute of the larger data, and use it as the initial value of the memory size used.
While obtaining the bucket from the other split criterion attribute and calculating the input size, calculate the output size when the direct product processing is performed from the two input buckets,
With respect to the used memory size, the input size of the other division criterion attribute and the output size are added to obtain a new used memory size,
The data processing method according to claim 3, wherein the used memory size is calculated based on the input bucket by repeating these processes for all buckets.

The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is binary join inner processing,
The size of the reference input bucket in the input data is calculated, the output bucket including the value output after the input bucket is processed is searched, and the processed data of the input bucket is included in the output bucket While calculating the ratio from the bound and calculating the size of the output bucket with respect to the input bucket using the ratio, the size of the output result is calculated by repeating for all output buckets including the value of the input bucket,
Another input data is searched for a bucket including a value combined with a reference input bucket, and the size of the input data is calculated in the same manner as the output bucket.
The data processing method according to claim 3, wherein the memory size used in the bucket of the reference input data is calculated by adding the size of the reference input bucket, the size of the other input data, and the size of the output result. .

The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is binary join outer processing or integration processing,
The use memory size calculated by the procedure for calculating the utilization memory size of the bucket in the data processing method according to 請 Motomeko 6,
In the data as a reference, a value dividing the reference attribute does not have, when dividing the reference attributes of the other data is Nde contains,
In order to calculate the size of the value that the division criterion attribute of the reference data does not have , in calculating the outside of the bound range of the bucket in the first bucket and the last bucket of the division criterion attribute of the reference data ,
In the first bucket, calculate the size of another division criterion attribute bucket and output bucket corresponding to the value before the lower limit value and add it to the used memory size,
4. The data processing method according to claim 3, wherein in the last bucket, the size of another division criterion attribute bucket and output bucket corresponding to a value before the upper limit value is calculated and added to the used memory size.

The procedure for calculating the used memory size of the bucket is as follows:
When the processing target of the query is based on the total integration processing of binary operation,
About the standard input data and other input data,
The data processing method according to claim 3, wherein buckets are selected one by one, and the total size of both input buckets is calculated as the used memory size.

A large-scale data processing program for causing a computer to execute each procedure according to any one of claims 3 to 8.