JP7823745B2

JP7823745B2 - Data generation device, data generation method, and program

Info

Publication number: JP7823745B2
Application number: JP2024530160A
Authority: JP
Inventors: テキリ; 晴久野末; 憲男山本
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2026-03-04
Anticipated expiration: 2042-06-29
Also published as: JPWO2024004083A1; WO2024004083A1

Description

One aspect of the present invention relates to a data generation device, a data generation method, and a program that generate training data for a machine learning model that detects anomalies such as failures.

Systems are ultimately developed for deployment in a target production environment. Before a product is released, however, verification in an environment that mimics production, such as a verification or staging environment, is essential. The same holds for AI (Artificial Intelligence) development, which is increasingly discussed in connection with MLOps (Machine Learning Operations).

In recent years, attempts have been made to operate networks using AI models trained by machine learning on data related to network operation. Training an AI model requires feeding the untrained model large amounts of both data from the network's normal operating state (normal state data) and data from when a failure occurs (failure data). However, reproducing failures in a production environment carries significant risk, so such data is difficult to collect in large quantities. A technique has therefore been proposed in which a mirror environment of the production environment is built and both the normal state data and the failure data needed for machine learning are generated on the mirror environment side (see Non-Patent Document 1).

D. Li, K. Akashi, H. Nozue and K. Tayama, "A Mirror Environment to Produce Artificial Intelligence Training Data," in IEEE Access, vol. 10, pp. 24578-24586, 2022, doi: 10.1109/ACCESS.2022.3154825.
L. Han, F. Gao, Z. Li and O. A. Dobre, "Low Complexity Automatic Modulation Classification Based on Order-Statistics," in IEEE Transactions on Wireless Communications, vol. 16, no. 1, pp. 400-411, Jan. 2017, doi: 10.1109/TWC.2016.2623716.

Log data collected from a mirror environment that mimics the target environment, and log data from the target environment itself, are in many cases text-based. By comparing the log data obtained in each environment and clustering or rewriting alarms that have the same meaning, events for the target environment can be generated from the learned events created in the mirror environment. Attribute information attached to those events, such as response methods, can also be reused as is.

However, the types and number of network logs collected from the target environment are enormous, and as the number of words grows, so do the clustering and learning times. Moreover, the volume of logs collected from the target network only ever increases as the operating period lengthens. As a result, the processing time for clustering between the target environment and the mirror environment grows cumulatively. Left unaddressed, log analysis takes an extremely long time, so a solution is desired.

The present invention was made in view of the above circumstances and aims to provide a technique that can shorten the time required for log analysis.

According to one aspect of the present invention, a data generation device includes a storage unit that stores text-based log data collected from a network of a target environment, and a processor. The processor includes a unique extraction processing unit, a word list generation unit, a word vector generation unit, a clustering processing unit, and a data generation unit. The unique extraction processing unit extracts unique logs from the log data to generate extracted log data. The word list generation unit divides the extracted log data into a plurality of words by morphological analysis to generate a word list. The word vector generation unit vectorizes the word list to generate a plurality of word vectors in a multi-dimensional vector space. The clustering processing unit clusters the word vectors based on a distance index between the word vectors in the vector space. The data generation unit generates, based on the clustering result, events for the target environment from learned events created in a mirror environment of the target environment.

According to one aspect of the present invention, a technique that can shorten the time required for log analysis can be provided.

FIG. 1 is a diagram showing an example of a system to which a data generation device according to an embodiment of the present invention is applied.
FIG. 2 is a block diagram showing an example of the data generation device 3 shown in FIG. 1.
FIG. 3 is a functional block diagram showing an example of the data generation device 3 shown in FIG. 2.
FIG. 4 is a flowchart showing an example of a clustering processing procedure according to the embodiment.
FIG. 5 is a flowchart showing an example of a processing procedure in the cleansing process.
FIG. 6 is a flowchart showing an example of a processing procedure in the cluster determination process.
FIG. 7 is a flowchart showing an example of the processing procedure of [File reading and preprocessing (step S72 in FIG. 6)].
FIG. 8 is a flowchart showing an example of the processing procedure of [Clustering determination process (step S74 in FIG. 6)].
FIG. 9 is a diagram showing clustering results with and without cleansing.
FIG. 10 is a graph showing an example of processing time when only the unique extraction process is applied.
FIG. 11 is a graph showing an example of the processing time required for clustering, including the unique extraction process and the clustering process.
FIG. 12 is a diagram showing another example of the result of clustering performed without cleansing.
FIG. 13 is a flowchart showing an example of an existing hierarchical clustering processing procedure.

Embodiments of the present invention are described below with reference to the drawings.

[One embodiment]
<Configuration>
FIG. 1 is a diagram showing an example of a system to which a data generation device according to an embodiment of the present invention is applied. The system shown in FIG. 1 operates a target system 1, such as a network system, by using an AI model 2. The system further includes a data generation device 3. The data generation device 3 generates training data for training the AI model 2.

FIG. 2 is a block diagram showing an example of the data generation device 3 shown in FIG. 1. The data generation device 3 is a computer comprising a processor 10, a memory 20, a storage 30, an input/output interface (I/F) 40, and a bus 45 that interconnects these components. The input/output I/F 40 establishes communication links between the data generation device 3 and both the target system 1 and the AI model 2, and exchanges various types of data over them. The processor 10 is a computing device such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit), and realizes the processing functions of the embodiment in accordance with a program loaded from the storage 30 into the memory 20.

FIG. 3 is a functional block diagram showing an example of the data generation device 3 shown in FIG. 2. In FIG. 3, the memory 20 is a semiconductor memory such as a ROM (Read Only Memory) or a RAM (Random Access Memory).

The storage 30 is a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). In addition to basic software such as an OS (Operating System), it stores a program 30g for implementing the processing according to the embodiment. In other words, the program 30g can be installed in the data generation device 3.

The storage 30 also stores log data 30a, extracted log data 30b, a word list 30c, a regenerated word list 30d, word vectors 30e, and a clustering result 30f. Of these, the log data 30a is text-based log data collected from the network of the target environment.

The processor 50 has, as processing functions according to one embodiment of the present invention, a unique extraction processing unit 50a, a word list generation unit 50b, a cleansing processing unit 50c, a word vector generation unit 50d, a clustering processing unit 50e, and a data generation unit 50f.

The unique extraction processing unit 50a, the word list generation unit 50b, the cleansing processing unit 50c, the word vector generation unit 50d, the clustering processing unit 50e, and the data generation unit 50f are realized by the processor 10 executing a program loaded into the memory 20. That is, the program 30g includes instructions that cause the processor 50 to function as the unique extraction processing unit 50a, as the word list generation unit 50b, as the cleansing processing unit 50c, as the word vector generation unit 50d, as the clustering processing unit 50e, and as the data generation unit 50f.

The unique extraction processing unit 50a extracts unique logs from the log data 30a and generates extracted log data. The extracted log data is sent to the storage 30 and stored as the extracted log data 30b.

The word list generation unit 50b divides the extracted log data 30b into a plurality of words, for example by morphological analysis, to generate a word list. The word list is sent to the storage 30 and stored as the word list 30c.

The cleansing processing unit 50c performs a cleansing process that removes unnecessary words from the word list 30c and generates a regenerated word list. The regenerated word list is sent to the storage 30 and stored as the regenerated word list 30d.

The word vector generation unit 50d vectorizes the regenerated word list 30d to generate a plurality of word vectors in a multi-dimensional vector space. The word vectors are sent to the storage 30 and stored as the word vectors 30e.

The clustering processing unit 50e clusters the word vectors based on a distance index between the word vectors in the vector space. The clustering result is sent to the storage 30 and stored as the clustering result 30f.

The data generation unit 50f generates, based on the clustering result 30f, events for the target environment from learned events created in the mirror environment of the target environment.

<Operation>
Next, the operation of the above configuration will be described.
FIG. 4 is a flowchart showing an example of a clustering processing procedure according to the embodiment. In FIG. 4, the processor 50 reads the log data 30a into the memory 20 (step S1) and preprocesses the data (step S2). The preprocessing in step S2 may be general processing, such as formatting and extraction, that prepares text data for use.

Next, the processor 50 performs the unique extraction process, for example by deleting duplicate lines on a line-by-line basis (step S3), and generates the extracted log data 30b. The processor 50 then applies word segmentation by morphological analysis to the extracted log data 30b (step S4). This generates the word list 30c.
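Steps S3 and S4 can be illustrated with a minimal sketch. The function names, the sample log lines, and the regular-expression split are assumptions made for illustration only. The text specifies morphological analysis for step S4, and the simple split below merely stands in for it.

```python
import re

def extract_unique_lines(lines):
    """Drop duplicate lines, keeping first-seen order (stand-in for step S3)."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def split_words(line):
    """Rough word split on non-alphanumeric characters.

    Assumption: a real system would call a morphological analyzer here (step S4).
    """
    return [tok for tok in re.split(r"[^0-9A-Za-z_]+", line) if tok]

# Hypothetical sample log lines.
raw_log = [
    "link down on port 12",
    "link down on port 12",
    "link up on port 12",
    "",
    "link down on port 12",
]
unique_log = extract_unique_lines(raw_log)
word_list = [split_words(line) for line in unique_log]
```

In this sketch five input lines collapse to two unique lines before any word-level work is done, which is the effect the unique extraction process aims for.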

Next, the processor 50 performs the cleansing process (step S5) to delete, from the list of segmented words (word list 30c), words that carry no meaning for clustering. Examples of words to be deleted are numbers that identify network devices and serial or sequence numbers used simply for counting. The processor 50 removes such words, which reflect characteristics specific to the network, from the word list 30c to obtain the regenerated word list 30d.
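A minimal sketch of the cleansing in step S5 follows. The rule used here, dropping tokens that consist only of digits, is an assumption chosen for illustration. The text does not spell out how device numbers and counters are recognized.

```python
def is_numeric_token(token):
    """Hypothetical rule: treat purely numeric tokens as IDs or counters."""
    return token.isdigit()

def cleanse(word_list):
    """Return a regenerated word list with numeric tokens removed (step S5)."""
    return [[w for w in words if not is_numeric_token(w)] for words in word_list]

# Hypothetical word list, one inner list per log line.
word_list = [
    ["link", "down", "on", "port", "12"],
    ["session", "4071", "closed"],
]
regenerated_word_list = cleanse(word_list)
```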

The processor 50 then generates the word vectors 30e from the regenerated word list 30d by vectorization (step S6). Finally, the processor 50 calculates, from the word vectors 30e, distance indices between vectors in the vector space using a method such as the shortest distance method, the longest distance method, or the group average method.

Clustering is a technique for grouping data based on the similarity between data items. Clustering methods are broadly divided into hierarchical methods and partition optimization methods.

For clustering network logs, one approach obtains a hierarchical structure by iteratively merging the two clusters that are closest to each other according to an inter-cluster distance function. There are also non-hierarchical approaches that group data without building a hierarchy.

The distance index in hierarchical clustering can be calculated by the shortest distance method, the longest distance method, the group average method, or the like. The underlying distances can be obtained with Euclidean distance, cosine similarity, or a similar measure, and clustering is performed by comparing the resulting values against a predetermined threshold.
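The three distance indices named above can be sketched as follows, assuming Euclidean distance between points. The function names and sample clusters are illustrative only. The shortest distance, longest distance, and group average methods correspond to the minimum, maximum, and mean of the pairwise distances between two clusters.

```python
import math
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(cluster_a, cluster_b):
    """All point-to-point distances between two clusters."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def shortest_distance(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def longest_distance(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

def group_average(cluster_a, cluster_b):
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical two-point clusters in a 2-D space.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (3.0, 1.0)]
d_min = shortest_distance(cluster_a, cluster_b)
d_max = longest_distance(cluster_a, cluster_b)
d_avg = group_average(cluster_a, cluster_b)
```

For any pair of clusters the group average lies between the shortest and the longest distance, so the choice of method changes how readily clusters fall within a given threshold.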

For example, the Euclidean distance between two n-dimensional vectors x and y is defined by equation (1).
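Equation (1) itself is not reproduced in this text. The standard definition of the Euclidean distance between n-dimensional vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n), which equation (1) presumably states, is shown here for reference. The component notation x_i, y_i is an assumption.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}
```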

Euclidean distance is the most commonly used distance measure between data items in clustering.

FIG. 5 is a flowchart showing an example of the processing procedure in the cleansing process (step S5 in FIG. 4). In FIG. 5, the processor 50 generates the word list 30c (step S51) and then applies data cleansing to the word list (step S52). The processor 50 then regenerates the word list from the remaining words to obtain the regenerated word list 30d (step S53).

FIG. 6 is a flowchart showing an example of the processing procedure in the cluster determination process (step S7 in FIG. 4). In FIG. 6, the processor 50 reads the cluster history (step S71) and performs file reading and preprocessing (step S72). Next, the processor 50 recalculates the center coordinates of the clusters (step S73) and performs the clustering determination process (step S74). The processor 50 then outputs the cluster history obtained by the clustering determination process (step S75).

FIG. 7 is a flowchart showing an example of the processing procedure in [File reading and preprocessing (step S72 in FIG. 6)]. In FIG. 7, the processor 50 reads the input file (step S21) and performs preprocessing (step S22). Next, the processor 50 combines the cluster history obtained in the procedure of FIG. 6 with the input file to create a word list (step S23), and creates one-hot data from the word list (step S24).
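The one-hot data of step S24 can be sketched as below. The encoding shown, one binary vector per log line with a 1 for each vocabulary word that the line contains, is an assumption about what the one-hot data looks like. The helper names and sample words are hypothetical.

```python
def build_vocabulary(word_list):
    """Assign each distinct word an index in first-seen order."""
    vocab = {}
    for words in word_list:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(words, vocab):
    """Binary vector over the vocabulary: 1 where the word occurs in the line."""
    vector = [0] * len(vocab)
    for word in words:
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

# Hypothetical regenerated word list, one inner list per log line.
word_list = [
    ["link", "down", "on", "port"],
    ["link", "up", "on", "port"],
]
vocab = build_vocabulary(word_list)
vectors = [encode(words, vocab) for words in word_list]
```

The vector length equals the vocabulary size, so every word removed by cleansing shortens every vector and every later distance calculation.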

FIG. 8 is a flowchart showing an example of the processing procedure in [Clustering determination process (step S74 in FIG. 6)]. In FIG. 8, the processor 50 reads the one-hot data one line at a time (step S41) and checks whether data is present (step S42). When no data remains (No), the process ends.

If data was read (Yes in step S42), the processor 50 calculates the Euclidean distance to the existing clusters (step S43) and determines whether any cluster lies within the threshold (step S44). If the Euclidean distance is less than or equal to the threshold (Yes in step S44), the processor 50 adds the word vector to the nearest cluster (step S45). If the Euclidean distance is greater than the threshold (No in step S44), the processor 50 registers the word vector as a new cluster (step S46). After step S45 or step S46, the processor 50 updates the center coordinates of the cluster (step S47), and the procedure returns to step S41.
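The loop of FIG. 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the cluster center is taken to be the mean of its member vectors, and the threshold value and sample vectors are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_incrementally(vectors, threshold):
    """Assign each vector to the nearest cluster within the threshold, else open a new one."""
    centers = []  # one center per cluster
    members = []  # member vectors per cluster
    for vec in vectors:                      # S41/S42: read rows until none remain
        best, best_dist = None, None
        for idx, center in enumerate(centers):
            dist = euclidean(vec, center)    # S43: distance to each existing cluster
            if best_dist is None or dist < best_dist:
                best, best_dist = idx, dist
        if best is not None and best_dist <= threshold:   # S44
            members[best].append(vec)        # S45: add to nearest cluster
        else:
            members.append([vec])            # S46: register a new cluster
            centers.append(list(vec))
            best = len(members) - 1
        group = members[best]                # S47: update the center (mean of members)
        centers[best] = [sum(col) / len(group) for col in zip(*group)]
    return members, centers

# Hypothetical one-hot rows and threshold.
rows = [[1, 1, 1, 1, 0], [1, 0, 1, 1, 1], [0, 0, 0, 0, 1]]
members, centers = cluster_incrementally(rows, threshold=1.5)
```

Each new row is compared against every existing cluster, so the cost per row grows with the number of clusters. Reducing the number of rows and the vector length beforehand is what keeps this loop tractable.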

FIG. 9 is a diagram showing clustering results with and without cleansing. For FIG. 9, seven log data files were merged as sample data to prepare 850,000 lines of log data. The data corresponding to messages was extracted from these lines and used as the clustering target. When processed on PC servers of the same specification, the case without cleansing took as long as 13 hours and 15 minutes. In contrast, the case with cleansing completed in only 35 minutes. Compared with the case without cleansing, the number of words was reduced by 45% and the execution time was shortened by 96%.
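The 96% figure follows directly from the two run times quoted for FIG. 9, as this short check shows. The variable names are illustrative.

```python
# Run times quoted for FIG. 9, converted to minutes.
without_cleansing_min = 13 * 60 + 15   # 13 h 15 min = 795 min
with_cleansing_min = 35

# Fraction of execution time saved, and the equivalent speed-up factor.
reduction = 1 - with_cleansing_min / without_cleansing_min
speedup = without_cleansing_min / with_cleansing_min

print(f"time reduction: {reduction:.1%}, speed-up: {speedup:.1f}x")
```

A 45% reduction in word count thus yields a far larger reduction in run time, which is consistent with the clustering cost growing faster than linearly in the amount of data.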

FIG. 10 is a graph showing an example of processing time when only the unique extraction process is applied. The horizontal axis is in seconds. The curve for the distance calculation against the clusters rises more or less steadily.

FIG. 11 is a graph showing an example of the processing time required for clustering, including the unique extraction process and the clustering process. The horizontal axis is in seconds. The time spent updating the results of the distance calculation against the clusters increases exponentially, which indicates that this update step is the bottleneck. For example, the graph shows that processing about 10,000 lines took 60,000 seconds (roughly 16 hours).

In this process, the Euclidean distance between every cluster and each new record (line) is calculated and the result is appended to a list, so the list grows as the number of clusters increases. Since the sample data contains about 850,000 lines before unique logs are extracted, processing it as is would take several days.

FIG. 12 is a diagram showing another example of the result of clustering performed without cleansing. The CPU specification was slightly lower than for the result in FIG. 9. That change alone pushed the execution time past 35 hours.

FIG. 13 is a flowchart showing an example of an existing hierarchical clustering processing procedure. Compared with the flowchart in FIG. 4, the existing technique performs neither the unique extraction process (step S3) nor the cleansing process (step S5). Consequently, the more log data is collected from the managed network, the longer the processing time for clustering the logs becomes, monotonically.

<Effects>
As described above, the embodiment reduces clustering time by deleting from the log data, in advance, data that does not affect the clustering process. Examples of data to be deleted include duplicate lines in the log and IDs that identify devices or information.

In other words, in the embodiment, before processing such as clustering or learning is applied to the log data collected from the target network, unnecessary data is kept from being passed to the later processing (cluster determination and learning). Specifically, unique extraction is applied to duplicate data, which consumes calculation time, and data cleansing is applied to words that would otherwise affect the message classification results. This greatly reduces redundant data, such as numbers that merely identify equipment, counts of equipment, and numbers that increment as time passes, and so shrinks the set of data to be processed. Because less data is processed, the overall learning time can also be greatly reduced.

The amount of collected log data grows with operating time, and subsequent processing and learning times grow with it. The collected log data contains data that carries no meaning for later processing, such as large quantities of numbers and duplicated text. Removing this data in preprocessing reduces the amount of data to be processed and thus substantially shortens processing time.

In particular, for processing that handles large amounts of text data sequentially, processing time increases with the amount of target data. Deleting duplicate data and unneeded feature data leaves only data with the necessary features as the processing target. This substantially reduces processing time.

Clustering of network logs can be approached as follows: the logs are split into words by morphological analysis to create a word list, the word list is vectorized by a method such as one-hot encoding, and Euclidean distances between the vectors are calculated to separate them into clusters. The processing time needed for the distance calculation increases as the number of words produced by morphological analysis increases.

Network logs are generated automatically by each device according to its configured specifications, so many duplicate logs exist. Clustering all logs, duplicates included, therefore wastes computing resources and processing time.

In the embodiment, therefore, unique logs are extracted and data cleansing is applied so that the amount of data passed to later stages is reduced as far as possible, which speeds up processing.

Logs collected from a network contain many numbers that serve only to tell items apart, such as numbers identifying the devices that make up the network and record numbers of the Syslog output by the OS. Including these numbers as clustering targets contributes nothing to clustering the message text and consumes considerable processing time. As described in the embodiment, cleansing the word list as a preprocessing step before clustering greatly reduces the amount of data to be clustered and shortens processing time.

According to the embodiment, therefore, the time required for log analysis can be shortened. This in turn shortens the time needed to train AI models for network operation and speeds up total processing. Furthermore, in network management work, the efficiency of analyzing the status of a managed network, by collecting its alarm information and comparing it with learned alarm information, can be improved.

なお、この発明は上記の実施形態に限定されるものではない。例えば図１において、対象システム１は、例えばサーバ装置や通信機器、オフィス機器、医療機器、車載機器、各種住宅設備機器または家電機器などでもよい。また、ＡＩモデル２およびデータ生成装置は、対象システム１に含まれてもよいし、通信回線を介して相互に接続されてもよい。データ生成装置はＡＩモデル２に含まれてもよいし、通信回線を介して相互に接続されてもよい。 Note that this invention is not limited to the above-described embodiments. For example, in FIG. 1, target system 1 may be, for example, a server device, communications equipment, office equipment, medical equipment, in-vehicle equipment, various types of residential equipment, or home appliances. Furthermore, AI model 2 and data generation device may be included in target system 1, or may be connected to each other via a communications line. The data generation device may be included in AI model 2, or may be connected to each other via a communications line.

またＡＩモデルとしてはＤＮＮ（Deep Neural Network）、ＣＮＮ（Convolutional Neural Network）、ＲＮＮ（Recurrent Neural Network）など、あらゆるタイプのモデルを適用して良い。 In addition, any type of AI model can be applied, such as DNN (Deep Neural Network), CNN (Convolutional Neural Network), or RNN (Recurrent Neural Network).

さらには、例えば、図４のフローチャートにおいてクレンジング処理（ステップＳ５）を省略しても、ユニーク抽出処理（ステップＳ３）が実行されれば、クラスタリングにかかるデータ数を削減して処理時間を短縮することはできるので、目的は達成される。この構成では、単語ベクトル生成部５０ｄは、単語リスト生成部５０ｂにより生成された単語リスト３０ｃをベクトル化して、複数の単語ベクトル３０ｅを生成する。 Furthermore, even if the cleansing process (step S5) in the flowchart of Figure 4 is omitted, the objective can be achieved by executing the unique extraction process (step S3), which reduces the amount of data required for clustering and shortens the processing time. In this configuration, the word vector generation unit 50d vectorizes the word list 30c generated by the word list generation unit 50b to generate multiple word vectors 30e.

In short, this invention is not limited to the above-described embodiment as it stands, and at the implementation stage the components can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the components disclosed in the above-described embodiment. For example, some components may be deleted from the full set of components shown in the embodiment. Furthermore, components from different embodiments may be combined as appropriate.

1...Target system
2...AI model
3...Data generation device
10...Processor
20...Memory
30...Storage
30a...Log data
30b...Extracted log data
30c...Word list
30d...Regenerated word list
30e...Word vector
30f...Clustering result
30g...Program
40...Input/output interface
45...Bus
50...Processor
50a...Unique extraction processing unit
50b...Word list generation unit
50c...Cleansing processing unit
50d...Word vector generation unit
50e...Clustering processing unit
50f...Data generation unit

Claims

A data generation device comprising:
a storage unit that stores text-based log data collected from a network of a target environment; and
a processor,
wherein the processor comprises:
a unique extraction processing unit that extracts a unique log from the log data by deleting at least duplicated lines in the log data and IDs for identifying devices and information, and generates extracted log data;
a word list generation unit that divides the extracted log data into a plurality of words by morphological analysis to generate a word list;
a word vector generation unit that vectorizes the word list to generate a plurality of word vectors in a multi-dimensional vector space;
a clustering processing unit that clusters the plurality of word vectors based on a distance metric between the plurality of word vectors in the vector space; and
a data generation unit that generates events for the target environment from learned events created in a mirror environment of the target environment based on the results of the clustering.

The data generation device according to claim 1, wherein
the processor further comprises a cleansing processing unit that removes unnecessary words from the word list by a cleansing process to generate a regenerated word list, and
the word vector generation unit generates the word vectors by vectorizing the regenerated word list.

The data generation device according to claim 2, wherein the cleansing processing unit generates the regenerated word list by removing, from the word list, words that reflect characteristics unique to the network.

The data generation device according to claim 1, wherein the distance metric is the Euclidean distance in the vector space.
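As an illustration of clustering by Euclidean distance, the following minimal sketch groups vectors whose distance to a cluster centroid is within a threshold. The greedy assignment scheme, the `threshold` parameter, and the function name are assumptions made for this example. The claims specify only that clustering is based on a distance between word vectors, not this particular algorithm.

```python
import math


def cluster_by_distance(vectors, threshold):
    """Greedy clustering sketch: assign each vector to the nearest existing
    centroid if its Euclidean distance is within `threshold`; otherwise
    start a new cluster. Returns one cluster label per input vector."""
    centroids, members, labels = [], [], []
    for v in vectors:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = math.dist(v, c)  # Euclidean distance in the vector space
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(list(v))
            members.append([v])
            labels.append(len(centroids) - 1)
        else:
            members[best].append(v)
            n = len(members[best])
            # Recompute the centroid as the mean of the cluster members.
            centroids[best] = [sum(col) / n for col in zip(*members[best])]
            labels.append(best)
    return labels
```

For example, `cluster_by_distance([[0, 0], [0, 1], [10, 10], [10, 11]], threshold=2)` places the first two vectors in one cluster and the last two in another.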

A data generation method performed by a computer including a storage unit that stores text-based log data collected from a network of a target environment, and a processor, the method comprising:
a step in which the processor extracts, from the log data, a unique log in which at least duplicated lines in the log data and IDs for identifying devices and information have been deleted, and generates extracted log data;
a step in which the processor divides the extracted log data into a plurality of words by morphological analysis to generate a word list;
a step in which the processor vectorizes the word list to generate a plurality of word vectors in a multi-dimensional vector space;
a step in which the processor clusters the plurality of word vectors based on a distance metric between the plurality of word vectors in the vector space; and
a step in which the processor generates events for the target environment from learned events created in a mirror environment of the target environment based on the results of the clustering.

A program installable on a computer having a processor and a storage unit that stores text-based log data collected from a network of a target environment, the program comprising instructions that cause the processor to function as:
a unique extraction processing unit that extracts a unique log from the log data by deleting at least duplicated lines in the log data and IDs for identifying devices and information, and generates extracted log data;
a word list generation unit that divides the extracted log data into a plurality of words by morphological analysis and generates a word list;
a word vector generation unit that vectorizes the word list and generates a plurality of word vectors in a multi-dimensional vector space;
a clustering processing unit that clusters the plurality of word vectors based on a distance metric between the plurality of word vectors in the vector space; and
a data generation unit that generates events for the target environment from learned events created in a mirror environment of the target environment based on the results of the clustering.