JP4452234B2

JP4452234B2 - Data stream processing method, data stream processing program, storage medium, and data stream processing apparatus

Info

Publication number: JP4452234B2
Application number: JP2005339584A
Authority: JP
Inventors: 保志櫻井
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2005-11-25
Filing date: 2005-11-25
Publication date: 2010-04-21
Anticipated expiration: 2025-11-25
Also published as: JP2007150484A

Description

The present invention relates to a data stream processing method, a data stream processing program, a storage medium, and a data stream processing apparatus for use in stream mining.

A data stream is a large volume of data arriving from a network at high speed. Stream mining is a technique for quickly finding useful information in data streams represented as time series. Rather than analyzing large-scale data already stored in a database, it analyzes and monitors an ever-growing flow of data in real time.

For example, the stream mining field includes techniques for detecting pairs of correlated streams among a plurality of streams. The method of Non-Patent Document 1 uses the discrete Fourier transform to detect correlated stream pairs at high speed. However, most correlated streams exhibit some delay relative to one another, and the method of Non-Patent Document 1 cannot detect streams that are correlated only after a delay (hereinafter, delay correlation).

Detecting a pair of streams with delay correlation requires computing the cross-correlation function, a function widely used in statistics. Given two time series of length n, X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n), the cross-correlation function is given by (Equation 1).
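(Equation 1) is not reproduced in this text. As a reading aid, a standard form of the cross-correlation function is sketched below. It is a reconstruction chosen to agree with the windows x_{l+1}, ..., x_n and y_1, ..., y_{n-l} used later in the description, not the patent's verbatim formula; the symbols \bar{x} and \bar{y} (the means of those two windows) are introduced here for illustration.

```latex
R(l) = \frac{\sum_{t=l+1}^{n} (x_t - \bar{x})\,(y_{t-l} - \bar{y})}
            {\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\;
             \sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}}
```

Here l is the delay, and R(0) reduces to the ordinary correlation coefficient of X and Y.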

FIG. 7 shows an example of the cross-correlation function of two time series. The two time series in FIG. 7(a) and FIG. 7(b) are temperature sensor data, measured every 30 seconds. The two temperature sensors show similar trends with a slight offset; specifically, temperature sensor A lags slightly. FIG. 7(c) shows the cross-correlation function. The vertical axis is the correlation value, that is, how strongly the two time series are correlated, and the horizontal axis is the magnitude of the delay (delay time). In other words, the cross-correlation function shows the correlation value between the two series when one of them is shifted in time.

In FIG. 7, examining the correlation on the assumption that there is no delay between the two temperature sensors yields little correlation, but delaying temperature sensor B slightly and examining the correlation again yields a high correlation value. The correlation value peaks at a delay of 1300 seconds. In this case, it can be said that "temperature sensor A correlates with temperature sensor B with a delay of 1300 seconds".
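The idea of shifting one series and looking for the correlation peak can be sketched as follows. This is a minimal illustration of the naive, direct computation, assuming the usual normalized form of the cross-correlation; the function names and the synthetic sine data are hypothetical and are not taken from the patent or from FIG. 7.

```python
import math

def cross_correlation(x, y, lag):
    """Correlation of x_t against y_{t-lag} over the overlapping window."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]          # x_{lag+1..n} and y_{1..n-lag}
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in xs)
                    * sum((b - mean_y) ** 2 for b in ys))
    return num / den if den > 0 else 0.0

def peak_lag(x, y, max_lag):
    """Lag in 0..max_lag with the largest absolute correlation."""
    return max(range(max_lag + 1),
               key=lambda l: abs(cross_correlation(x, y, l)))

# Hypothetical data: x is y delayed by 5 samples.
y = [math.sin(0.3 * t) for t in range(200)]
x = [0.0] * 5 + y[:-5]
print(peak_lag(x, y, 10))
```

With sensors sampled every 30 seconds, a peak at lag index l would correspond to a delay of 30 * l seconds.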

Techniques for detecting stream pairs with delay correlation need to be efficient. For example, both high speed and low memory use are required in order to analyze ever-growing, large-scale data and to provide information to users in real time.

Two things are effective in achieving this efficiency: a method for calculating summary information of a data stream (hereinafter, summary information) and a query processing method that uses the summary information. Because the original data stream becomes very long, summary information that represents the stream compactly is effective for both speed and memory savings. By retaining the summary information, query processing that finds stream pairs with delay correlation can be executed while the original data stream is discarded.

The method of Non-Patent Document 2 has been proposed as a way to efficiently compute stream pairs with delay correlation from summary information of the data streams. To obtain the delay correlation, it holds statistics of the data streams (mean, variance, inner product) as summary information and updates them each time data is received.

In the method of Non-Patent Document 2, when a user or an application issues a request, the delay correlation is calculated from the summary information, and the correlated stream pairs and their delay values are output. The time and memory needed to calculate the delay correlation grow linearly with the length of the data stream. To achieve speed and memory savings, the method therefore approximates the cross-correlation function using smoothing and sampling.

The method of Non-Patent Document 2 reduces the time needed to calculate the delay correlation by approximating the cross-correlation function. However, it has the problem that, as the number of streams increases, the cost of updating the summary information grows in proportion to the square of the number of streams.
Non-Patent Document 1: Yunyue Zhu, Dennis Shasha, "StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time," in Proceedings of VLDB, pp. 358-369, August 2002.
Non-Patent Document 2: Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos, "BRAID: Stream Mining through Group Lag Correlations," in Proceedings of ACM SIGMOD, pp. 599-610, June 2005.

However, the conventional methods cannot efficiently compute stream pairs with delay correlation. Here, inefficient means that the computation consumes an excessive amount of computation and memory.

The memory used to calculate the delay correlation is as follows. Let m be the number of data streams. A method that computes (Equation 1) directly (hereinafter, the naive method) requires O(m^2) memory, where O denotes the order of complexity. This is because the naive method computes the sum and variance of every data stream and the inner product of every pair of data streams, and stores them in memory as summary information. That is, an inner product is stored for each pair of data streams and used to compute the correlation value, so O(m^2) memory is required.

The method of Non-Patent Document 1 cannot compute stream pairs with delay correlation in the first place. In the method of Non-Patent Document 2, the computational cost of updating the summary information is O(m^2), so the efficiency gain is insufficient.

The main object of the present invention is therefore to solve the above problems and to efficiently compute stream pairs with delay correlation.

To solve the above problems, the present invention provides a data stream processing method for obtaining delay correlations among a plurality of data streams measured in time series, in which a computer executes: a summary information calculation procedure that creates summary information based on a random projection, a time-series sum, and a time-series sum of squares of each input data stream, stores the summary information in storage means, and updates the summary information as the measured values of the data stream are updated; and a delay correlation calculation procedure that reads the summary information stored in the storage means, calculates the cross-correlation function of the data streams from the summary information, and obtains and outputs the delay correlation between the data streams from the cross-correlation function.

This reduces the computation time and memory usage for updating the summary information compared with the prior art.

In the present invention, the summary information calculation procedure updates the summary information based on the data stream updated at time intervals calculated by an exponential function.

Because this reduces the number of samples taken from the data stream, the computation time and memory usage for updating the summary information can be reduced compared with the prior art.

In the present invention, when a pair of data streams designated in advance as correlated is not detected by the delay correlation calculation procedure, the computer detects this as a failure.

This makes it possible to quickly detect a failure of a device that measures a data stream.

In the present invention, each of a plurality of computers calculates the summary information for the data streams assigned to it, in parallel with the other computers.

Because the computation over the data streams is parallelized, the computation time for updating the summary information can be reduced compared with the prior art.

The present invention is a data stream processing program for causing a computer to execute the data stream processing method.

The present invention is a computer-readable storage medium storing the data stream processing program.

The present invention is a data stream processing apparatus for obtaining delay correlations among a plurality of data streams measured in time series, comprising: a summary information calculation unit that creates summary information based on a random projection, a time-series sum, and a time-series sum of squares of each input data stream, stores the summary information in storage means, and updates the summary information as the measured values of the data stream are updated; and a delay correlation calculation unit that reads the summary information stored in the storage means, calculates the cross-correlation function of the data streams from the summary information, and obtains and outputs the delay correlation between the data streams from the cross-correlation function.

In the present invention, the summary information calculation unit updates the summary information based on the data stream updated at time intervals calculated by an exponential function.

In the present invention, when a pair of data streams designated in advance as correlated is not detected by the delay correlation calculation unit, this is detected as a failure.

In the present invention, each of a plurality of the data stream processing apparatuses calculates the summary information for the data streams assigned to it, in parallel with the other data stream processing apparatuses.

According to the present invention, lowering the update cost of the summary information to O(m) makes it possible to process data streams at high speed and with little memory while detecting delay correlations with high accuracy. This is because the method of the present invention stores the mean, variance, and random projection of each data stream as summary information and calculates correlation values from them, so only O(m) memory is needed.

An embodiment of a data stream processing system to which the present invention is applied is described in detail below with reference to the drawings. First, the configuration of the data stream processing system of this embodiment is described with reference to FIG. 1.

FIG. 1 is a configuration diagram of the data stream processing system. The system includes sensors 2 that measure data streams and a data stream processing device 1 that processes the data streams. The data stream processing device 1 is configured as a computer that includes at least a memory, which serves as the storage means used during arithmetic processing, and an arithmetic processing unit that performs the arithmetic processing. The memory is a RAM (Random Access Memory) or the like and stores, among other things, the summary information of the data streams. Arithmetic processing is realized by the arithmetic processing unit, a CPU (Central Processing Unit), executing a program held in the memory. The data stream processing device 1 has a summary information calculation unit 10 and a delay correlation calculation unit 20.

Each time it receives a data stream, the summary information calculation unit 10 calculates the summary information of each data stream and discards the original data stream. In the present embodiment, the summary information is calculated from a random projection, a time-series sum, and a time-series sum of squares.

Given a data stream X = (x_1, x_2, ..., x_n), the time-series sum Sx and the sum of squares Sxx of X are calculated as in (Equation 2); that is, Sx is the sum of x_t and Sxx is the sum of x_t^2 over t = 1, ..., n. The summary information calculation unit 10 notifies the delay correlation calculation unit 20 of the calculated Sx and Sxx.
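Because Sx and Sxx are plain sums, they can be maintained one value at a time without keeping the stream itself. The following is a minimal sketch of that bookkeeping; the class name `RunningSums` and the sample counter `n` are hypothetical helpers introduced here, not names from the patent.

```python
class RunningSums:
    """Running sum Sx and sum of squares Sxx of one stream."""

    def __init__(self):
        self.n = 0        # number of values seen (assumed helper field)
        self.sx = 0.0     # Sx: sum of x_t
        self.sxx = 0.0    # Sxx: sum of x_t squared

    def update(self, x_t):
        # One constant-time update per arriving value; x_t can then be dropped.
        self.n += 1
        self.sx += x_t
        self.sxx += x_t * x_t

    def mean(self):
        return self.sx / self.n

    def variance(self):
        # Population variance recovered from the two sums.
        return self.sxx / self.n - self.mean() ** 2

s = RunningSums()
for value in [1.0, 2.0, 3.0, 4.0]:
    s.update(value)
print(s.sx, s.sxx, s.mean(), s.variance())
```

The mean and variance that the correlation needs are derived from the two sums on demand, so nothing else has to be stored per stream for this part of the summary.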

For each pair of data streams, the summary information calculation unit 10 calculates not just a single correlation value but also the correlation values for various delays. It therefore calculates a plurality of sums and variances, as in (Equation 3) for data stream X and as in (Equation 4) for data stream Y, where l is the magnitude of the delay.

The random projection in the summary information is a dimensionality reduction technique used in computer science; a random function is used to project (map) a time series onto a space vector. Random projection is described in, for example, Dimitris Achlioptas, "Database-friendly random projections," in Proceedings of PODS, pp. 274-281, May 2001. Let Px = (p_1, p_2, ..., p_d) be the random projection vector of X, d the number of dimensions of Px, and U = (u_ti) the random function; the value of the i-th dimension of Px is then given by (Equation 5).
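A random projection of this kind can be sketched as below. The sketch assumes p_i is the sum over t of x_t * u_ti, which matches the incremental update p_i + x_t * u_ti described for S207. The choice of +1/-1 entries with equal probability and the hash-based function `u` are assumptions made for illustration; the patent text here does not fix the distribution or how u_ti is generated.

```python
import hashlib

def u(t, i, seed=0):
    """Hypothetical random function: a repeatable +1/-1 for each (t, i).

    Deriving the sign from a hash means every stream sees the same u_ti at
    time t without storing the random matrix.
    """
    byte = hashlib.blake2b(f"{seed}:{t}:{i}".encode(), digest_size=1).digest()[0]
    return 1.0 if byte & 1 else -1.0

def project(x, d, seed=0):
    """Project the series x onto d dimensions: p_i = sum_t x_t * u(t, i)."""
    p = [0.0] * d
    for t, x_t in enumerate(x, start=1):
        for i in range(d):
            p[i] += x_t * u(t, i, seed)
    return p

p = project([1.0, 2.0], d=3)
print(p)
```

Because the projection is linear in x, it can be updated as each value arrives, which is what allows the original stream to be discarded.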

When the delay l is taken into account, the summary information calculation unit 10 calculates a plurality of random projections Px(l+1, n), just as it does for the sums and variances. The same applies to data stream Y, for which the summary information calculation unit 10 calculates Py(1, n-l).

When it receives a query request from an application or a user, the delay correlation calculation unit 20 calculates the delay correlation using the summary information of the data streams calculated by the summary information calculation unit 10. When the summary information calculation unit 10 receives a single query request, it examines every pair of data streams input to the data stream processing device 1 and outputs, as a list, the pairs that have a delay correlation together with their delay values.

The delay correlation calculation unit 20 uses the summary information to calculate the cross-correlation function of each pair of data streams. First, it calculates the inner product using the random projections and the time-series sums of squares (Equation 6). It then uses the calculated inner product to calculate the cross-correlation function R(l) (Equation 7).
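(Equation 6) and (Equation 7) are not reproduced in this text, so the following is only one plausible way to carry out the two steps, under stated assumptions: the projections use +1/-1 entries shared by both streams, the inner product is recovered from the identity ||X - Y||^2 = Sxx + Syy - 2 X.Y, and the correlation uses the usual sufficient-statistics form. Both function names are hypothetical, and the sketch covers a single aligned window rather than every delay l.

```python
import math

def estimate_inner_product(sxx, syy, px, py):
    """Estimate X.Y from sums of squares and +1/-1 random projections."""
    d = len(px)
    # The mean squared difference of the projections estimates ||X - Y||^2.
    dist2 = sum((a - b) ** 2 for a, b in zip(px, py)) / d
    # ||X - Y||^2 = Sxx + Syy - 2 X.Y, solved for X.Y.
    return (sxx + syy - dist2) / 2.0

def correlation(n, sx, sxx, sy, syy, sxy):
    """Correlation coefficient from sums over n aligned samples."""
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    den = math.sqrt(max(var_x * var_y, 0.0))
    return cov / den if den > 0 else 0.0

# Exact sums for x = [1, 2, 3, 4] and y = [2, 4, 6, 8].
print(correlation(4, 10.0, 30.0, 20.0, 120.0, 60.0))
```

In a full pipeline the `sxy` argument would be the estimate returned by `estimate_inner_product`, so the accuracy of the correlation depends on the projection dimension d.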

FIG. 2 is a flowchart showing an outline of the data stream processing.

First, the data stream processing device 1 determines whether a data stream has been updated (S101). A data stream update means that new data of the data stream, not previously input to the summary information calculation unit 10, has been newly input to it. If no data stream has been updated (S101, No), the process proceeds to S103. If a data stream has been updated (S101, Yes), the subroutine "summary information update process" is called (S102), and the process proceeds to S103.

Next, the data stream processing device 1 determines whether there is a query request (S103). If there is no query request (S103, No), the process returns to S101. The processing when there is a query request (S103, Yes) is described below.

The data stream processing device 1 calls the subroutine "delay information calculation process" (S104) and outputs the result, namely the pairs that have a delay correlation and their delay values, as a list.

The data stream processing device 1 may further execute an application process on the delay information, taking the result calculated in S104 as input (S105). One such application is failure detection in a sensor network. Specifically, the data stream processing device 1 registers in advance the pairs of sensors 2 that should move together (that is, be correlated); if a registered pair is not among the pairs of sensors 2 with delay correlation calculated in S104, the device can indicate a possible failure.
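The failure check described above amounts to a set difference between the registered pairs and the detected pairs. A minimal sketch follows, assuming the query step returns (sensor_a, sensor_b, lag) tuples; the function name and that tuple layout are hypothetical.

```python
def suspected_failures(expected_pairs, detected):
    """Return registered pairs absent from the detected correlated pairs.

    expected_pairs: pairs of sensor ids registered in advance as correlated.
    detected: (sensor_a, sensor_b, lag) tuples produced by the query step.
    """
    # Pair order does not matter, so compare as unordered sets.
    found = {frozenset((a, b)) for a, b, _lag in detected}
    return [pair for pair in expected_pairs if frozenset(pair) not in found]

registered = [("A", "B"), ("C", "D")]
print(suspected_failures(registered, [("B", "A", 3)]))
```

A pair reported here only suggests a possible failure; the lag value is ignored in this sketch, although a changed lag could also be treated as a symptom.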

Another example of the delay information application process (S105) is anomaly detection in network traffic measured by a sensor 2. Suppose, for example, that network traffic is expected to show nearly the same measured values on the same day of each week. The measured values from the same sensor 2 for last week and for this week are then checked for correlation, and if there is no correlation, a network traffic anomaly is reported.

FIG. 3 is a flowchart showing the "summary information update process" (S102).

First, the summary information calculation unit 10 executes a loop (S201 to S209) that increments the loop variable k from 1 to m in steps of 1. It assigns the k-th data stream to the variable X (S202) and receives x_t at time t (S203). It then assigns Sx + x_t to Sx and Sxx + x_t^2 to Sxx (S204). The parameters in S204 are described in (Equations 2, 3, 6, and 7).

The summary information calculation unit 10 then executes a loop (S205 to S208) that increments the loop variable i from 1 to d in steps of 1. It generates u_ti from the random function (S206) and assigns p_i + x_t * u_ti to p_i (S207). The parameters in S207 are described in (Equation 5). This completes the description of the processing in FIG. 3.

The summary information calculation unit 10 described above needs only O(m) computation and memory, so both can be reduced. By contrast, when m data streams are handled, the method of Non-Patent Document 2 must calculate the inner product of every pair of data streams, so updating the summary information requires O(m^2) computation and memory, which is inefficient.

The reason the computational cost of the summary information calculation unit 10 is O(m) is as follows. FIG. 3 contains two loops. In the first loop (S201 to S209), the upper limit of the loop variable k is m, so its cost is O(m). In the second loop (S205 to S208), the upper limit of the loop variable i is d, where d is the number of dimensions of the random projection vector Px. Because d is an attribute of a single data stream, it does not depend on the number m of data streams input to the summary information calculation unit 10. The loop from 1 to d therefore does not affect the order of the cost, and the total computational cost of FIG. 3 is O(m).
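The two nested loops of FIG. 3 and their cost can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the `Summary` class, the constant `D`, the hash-based random function `u`, and the operation counter are all assumptions added to make the O(m) argument observable.

```python
import hashlib
from dataclasses import dataclass, field

D = 8  # projection dimension d; hypothetical value, independent of m

@dataclass
class Summary:
    sx: float = 0.0
    sxx: float = 0.0
    p: list = field(default_factory=lambda: [0.0] * D)

def u(t, i):
    """Hypothetical random function: a repeatable +1/-1 for each (t, i)."""
    byte = hashlib.blake2b(f"{t}:{i}".encode(), digest_size=1).digest()[0]
    return 1.0 if byte & 1 else -1.0

def update_tick(summaries, values, t):
    """Apply one new value per stream; return the number of projection updates."""
    ops = 0
    for s, x_t in zip(summaries, values):   # S201 to S209, k = 1..m
        s.sx += x_t                         # S204
        s.sxx += x_t * x_t
        for i in range(D):                  # S205 to S208, i = 1..d
            s.p[i] += x_t * u(t, i)         # S206, S207
            ops += 1
    return ops

small = [Summary() for _ in range(5)]
large = [Summary() for _ in range(10)]
print(update_tick(small, [1.0, 2.0, 3.0, 4.0, 5.0], t=1))
print(update_tick(large, [1.0] * 10, t=1))
```

Doubling the number of streams doubles the work per time step (m * d updates), whereas maintaining a pairwise inner product for every pair of streams would grow with m^2.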

FIG. 4 is a flowchart showing the "delay information calculation process" (S104).

First, the delay correlation calculation unit 20 executes a loop (S301 to S308) that increments the loop variable k from 1 to m in steps of 1. It assigns the k-th data stream to the variable X (S302). It then executes a loop (S303 to S307) that increments the loop variable j from k to m in steps of 1.

The delay correlation calculation unit 20 assigns the j-th data stream to the variable Y (S304). It calculates the cross-correlation function from the summary information of X (Sx, Sxx, Px) and the summary information of Y (Sy, Syy, Py) (S305), and detects delay information from the cross-correlation function calculated in S305 (S306).

The detection of delay information (S306) is, specifically, as follows. Given time series X and Y, X and Y are defined as having a delay correlation of l when the following conditions are satisfied. Whether delay information can be detected is therefore determined by whether these conditions are satisfied.
(1) Taking the absolute value of the correlation value, |R(l)|, as the score, the score of x_t and y_{t-l} is a local maximum that exceeds the threshold γ.
(2) When there are multiple local maxima, l refers to the first one.
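The two conditions above can be sketched as a scan over the correlation values. This is a minimal sketch: it assumes R is available as a list indexed by lag, and the handling of the end points and of ties (treated as maxima) is an assumption of this example, since the text does not specify it.

```python
def detect_lag(r, gamma):
    """Return the first lag whose score |R(l)| is a local maximum above gamma.

    r: correlation values, r[l] being the value at lag l.
    Returns None when no lag satisfies the conditions.
    """
    scores = [abs(v) for v in r]
    for l, s in enumerate(scores):
        left = scores[l - 1] if l > 0 else float("-inf")
        right = scores[l + 1] if l + 1 < len(scores) else float("-inf")
        if s > gamma and s >= left and s >= right:
            return l              # condition (2): the first maximum wins
    return None

r = [0.1, 0.3, 0.9, 0.4, 0.95, 0.2]
print(detect_lag(r, gamma=0.8))
```

Note that the first qualifying maximum is returned even when a later one is higher, which follows condition (2). Because the score is an absolute value, strong negative correlations are detected as well.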

In FIG. 4, the computational cost of the delay correlation calculation unit 20 is O(m^2). However, the processing in FIG. 4 is triggered by a query request, which occurs far less often than the trigger for the processing in FIG. 3 (a data stream update). The computational cost of FIG. 4 therefore has little impact compared with that of FIG. 3.

The data stream processing device 1 described above is characterized in that it updates the summary information incrementally. Incremental means that, in a repeated operation, an amount or value is added to a data item according to a fixed rule. Because the summary information calculation unit 10 uses a random projection for the summary information, it can update the summary information incrementally. The original data stream can therefore be discarded, which saves memory. The delay correlation calculation unit 20 then finds correlated pairs of data streams and their delay values using only the incrementally updated summary information, without using the original data streams.

The data stream processing device 1 may consist of a single computer or of a plurality of computers. For example, because a data stream is generated for each sensor 2, a data stream processing device 1 may be provided for each sensor 2 and the processing of FIG. 3 and FIG. 4 executed in parallel. In that case, the loop (S201 to S209) in FIG. 3 and the loop (S301 to S308) in FIG. 4 are distributed among the data stream processing devices 1 in charge of the respective sensors 2. A sensor 2 and a data stream processing device 1 may also be housed in the same casing.

The data stream processing device 1 described above is evaluated below on two measures: the accuracy of the output and the computation time. As shown below, the data stream processing device 1 of the present embodiment shortens the computation time without degrading the accuracy of the output.

FIG. 5 shows experimental results on the accuracy of the output. Optical sensor C in FIG. 5(a) and optical sensor D in FIG. 5(b) are each illumination sensor data. In the cross-correlation function of FIG. 5(c), the dotted line, labeled "naive method", is the result of computing the cross-correlation function directly. The solid line is the result computed approximately by the present embodiment. The two lines nearly coincide, which shows that the present embodiment approximates the function with very high accuracy. Note that the method of Non-Patent Document 2, like the present embodiment, yields an approximate solution. The result of the present embodiment was therefore compared with the naive method rather than with the result of Non-Patent Document 2, and this comparison demonstrated sufficient accuracy.

The "*" marks on the solid line in FIG. 5(c) indicate the calculated values, and the solid line is generated by connecting them smoothly. In FIG. 5(c), the calculated values are denser toward the left of the horizontal axis and sparser toward the right. This is because the sampling interval of the data stream is not constant but follows an exponential function; for example, sampling is performed at times t = 1, 2, 4, 8, 16, and so on. This reduces the number of calculated values without greatly reducing their accuracy, so the amount of calculation is reduced appropriately.
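The effect of exponentially spaced sampling points can be illustrated with the following Python sketch. It assumes that the horizontal axis of FIG. 5(c) is the delay and that the points are powers of two; the function names are hypothetical. The correlation is computed directly, in the manner of the naive method, as a Pearson correlation over the overlapping part of the two series, so the sketch shows only the spacing and not the approximation of the embodiment.

```python
import math


def geometric_lags(max_lag):
    """Return the delays 1, 2, 4, 8, ... that do not exceed max_lag."""
    lags = []
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag] (equal-length inputs)."""
    a = x[:len(x) - lag]
    b = y[lag:]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def best_lag(x, y, max_lag):
    """Return the exponentially spaced delay with the highest correlation."""
    return max(geometric_lags(max_lag),
               key=lambda lag: lagged_correlation(x, y, lag))
```

With this spacing, the number of delays evaluated grows with the logarithm of `max_lag` rather than with `max_lag` itself, at the price of coarser resolution for long delays.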

FIG. 6 compares the calculation times of the naive method and the method of the present embodiment. The horizontal axis shows the number of data streams and the vertical axis shows the calculation time. As the number of data streams grows, the calculation time of the naive method increases exponentially, whereas that of the present embodiment is almost unaffected and stays nearly constant. This shows that the method of the present embodiment achieves a substantial reduction in calculation cost.

A block diagram showing a data stream processing apparatus according to an embodiment of the present invention.
A flowchart showing an overview of data stream processing according to an embodiment of the present invention.
A flowchart showing the procedure for calculating the summary information according to an embodiment of the present invention.
A flowchart showing the procedure for calculating the delayed correlation according to an embodiment of the present invention.
A graph showing experimental results on the accuracy of the delayed correlation according to an embodiment of the present invention.
A graph showing experimental results on the calculation time required to calculate the delayed correlation according to an embodiment of the present invention.
A graph showing an example of delayed correlation in the prior art relating to an embodiment of the present invention.

Explanation of symbols

1 Data stream processing apparatus
2 Sensor
10 Summary information calculation unit
20 Delayed correlation calculation unit

Claims

1. A data stream processing method for obtaining a delayed correlation between a plurality of data streams measured in time series, wherein a computer executes:
a summary information calculation procedure of creating summary information based on a random projection of an input data stream, a sum of the time series, and a sum of squares of the time series, storing the summary information in storage means, and updating the summary information in accordance with updates of measurement values of the data stream; and
a delayed correlation calculation procedure of reading the summary information stored in the storage means, calculating a cross-correlation function of the data streams based on the summary information, and obtaining and outputting the delayed correlation between the data streams from the cross-correlation function.

2. The data stream processing method according to claim 1, wherein the summary information calculation procedure updates the summary information based on the data stream updated at a time interval calculated by an exponential function.

3. The data stream processing method according to claim 1, wherein, when a pair of data streams designated in advance as correlated is not detected by the delayed correlation calculation procedure, the computer detects the pair as a failure.

4. The data stream processing method according to any one of claims 1 to 3, wherein each of a plurality of computers calculates the summary information for the data stream it is in charge of, in parallel with the other computers.

5. A data stream processing program for causing a computer to execute the data stream processing method according to any one of claims 1 to 4.

6. A computer-readable storage medium storing the data stream processing program according to claim 5.

7. A data stream processing apparatus for obtaining a delayed correlation between a plurality of data streams measured in time series, comprising:
a summary information calculation unit that creates summary information based on a random projection of an input data stream, a sum of the time series, and a sum of squares of the time series, stores the summary information in storage means, and updates the summary information in accordance with updates of measurement values of the data stream; and
a delayed correlation calculation unit that reads the summary information stored in the storage means, calculates a cross-correlation function of the data streams based on the summary information, and obtains and outputs the delayed correlation between the data streams from the cross-correlation function.

8. The data stream processing apparatus according to claim 7, wherein the summary information calculation unit updates the summary information based on the data stream updated at time intervals calculated by an exponential function.

9. The data stream processing apparatus according to claim 7, wherein, when a pair of data streams designated in advance as correlated is not detected by the delayed correlation calculation unit, the pair is detected as a failure.

10. The data stream processing apparatus according to claim 7, wherein each of a plurality of the data stream processing apparatuses calculates the summary information for the data stream it is in charge of, in parallel with the other data stream processing apparatuses.