JP7800680B2

JP7800680B2 - Correction device, correction method, and correction program

Info

Publication number: JP7800680B2
Application number: JP2024526099A
Authority: JP
Inventors: 悠太岩城; 楊鐘本
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-06-07
Filing date: 2022-06-07
Publication date: 2026-01-16
Anticipated expiration: 2042-06-07
Also published as: JPWO2023238259A1; WO2023238259A1

Description

本発明は、修正装置、修正方法及び修正プログラムに関する。 The present invention relates to a correction device, a correction method, and a correction program.

実世界において正規表現は正規表現エンジンとして実装され、様々な場面で利用されている。例えば、正規表現エンジンは、emailアドレスを入力する画面を持つウェブアプリケーションにおいて、ユーザが入力した文字列がemailアドレスかどうかを確認するために用いられている。また、例えば、正規表現エンジンは、外部から送られてきたデータのサニタイズや要素の抽出、汎用的なプログラミング言語の標準ライブラリ等に採用されている。 In the real world, regular expressions are implemented as regular expression engines and used in a variety of situations. For example, regular expression engines are used in web applications that have a screen for entering email addresses to verify whether the string entered by the user is an email address. Regular expression engines are also used, for example, to sanitize data sent from external sources, extract elements, and in standard libraries of general-purpose programming languages.

ここで、多くの正規表現エンジンに採用されているバックトラッキング法に基づいた解析アルゴリズムは、解析対象のデータと正規表現の組み合わせによっては処理に膨大な時間を要するという欠点がある。そのような欠点を悪用したサイバー攻撃としてRegular Expression Denial of Service（ReDoS）が知られている（参考文献："Regular expression Denial of Service - ReDoS", https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS）。 However, the analysis algorithm based on backtracking, which is used in many regular expression engines, has the disadvantage that it can take an enormous amount of time to process depending on the combination of the data to be analyzed and the regular expression. A cyber attack that exploits this drawback is known as Regular Expression Denial of Service (ReDoS) (Reference: "Regular expression Denial of Service - ReDoS", https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS).

なお、マッチさせる文字列の長さに対して、正規表現エンジン上で線形時間で動作するような正規表現を脆弱でない正規表現と呼ぶものとする。逆に、マッチさせる文字列の長さに対して、正規表現エンジン上で例えば指数関数時間で動作するような正規表現を脆弱な正規表現と呼ぶものとする。 A regular expression that runs in linear time on a regular expression engine relative to the length of the string to be matched is called a non-fragile regular expression. Conversely, a regular expression that runs in exponential time on a regular expression engine relative to the length of the string to be matched is called a vulnerable regular expression.

従来、ReDoSの脅威を取り除くための技術として、正規表現の受理する言語の誤りを修正するRFixer（例えば、非特許文献１を参照）が知られている。また、純粋な正規表現を決定性有限オートマトン（Deterministic Finite Automaton）に一度変換して戻すことで脆弱でない正規表現を得る方法（例えば、非特許文献２を参照）が知られている。 Conventionally, one known technology for eliminating the threat of ReDoS is RFixer, which corrects errors in the language accepted by regular expressions (see, for example, Non-Patent Document 1). Another known method is to obtain non-vulnerable regular expressions by first converting them into deterministic finite automata and then converting them back (see, for example, Non-Patent Document 2).

Rong Pan, Qinheping Hu, Gaowei Xu, and Loris D’Antoni. 2019. Automatic Repair of Regular Expressions. Proc. ACM Program. Lang. 3, OOPSLA, Article 139 (Oct. 2019), 29 pages.Rong Pan, Qinheping Hu, Gaowei Xu, and Loris D’Antoni. 2019. Automatic Repair of Regular Expressions. Proc. ACM Program. Lang. 3, OOPSLA, Article 139 (Oct. 2019), 29 pages. Brink van der Merwe, Nicolaas Weideman, and Martin Berglund. 2017. Turning Evil Regexes Harmless. In Proceedings of the South African Institute of Computer Scientists and Information Technologists (SAICSIT’17). Association for Computing Machinery, New York, NY, USA, Article 38, 10 pages.Brink van der Merwe, Nicolaas Weideman, and Martin Berglund. 2017. Turning Evil Regexes Harmless. In Proceedings of the South African Institute of Computer Scientists and Information Technologists (SAICSIT’17). Association for Computing Machinery, New York, NY, USA, Article 38, 10 pages.

しかしながら、従来の技術には、修正後の正規表現の可読性が悪い場合があるという問題がある。 However, conventional techniques have the problem that the readability of the modified regular expressions can be poor.

例えば、脆弱性が修正された正規表現の候補は複数存在する場合がある。修正後の正規表現の候補は、正規表現としての機能は共通していても、人間から見た可読性が異なる場合があり、常に可読性が高い正規表現が得られるとは限らない。 For example, there may be multiple candidate regular expressions for which vulnerabilities have been fixed. Even if the candidate regular expressions after correction share the same functionality as regular expressions, they may differ in terms of human readability, and it is not always possible to obtain a highly readable regular expression.

上述した課題を解決し、目的を達成するために、修正装置は、第１の正規表現における範囲文字の少なくとも一部をプレースホルダに置換したテンプレートのそれぞれについて、前記テンプレートのそれぞれの正規表現としての処理コスト又は文字列としての性質の少なくともいずれかに応じた優先度を算出する算出部と、前記優先度が高い順に、前記テンプレートのそれぞれを、前記第１の正規表現の脆弱性が解消された第２の正規表現に変換する変換部と、を有することを特徴とする。 In order to solve the above-mentioned problems and achieve the objectives, the correction device is characterized by having a calculation unit that calculates a priority for each template in which at least a portion of the range characters in a first regular expression have been replaced with placeholders, based on at least one of the processing cost of each template as a regular expression or its properties as a string, and a conversion unit that converts each of the templates in descending order of priority into a second regular expression in which the vulnerability of the first regular expression has been eliminated.

本発明によれば、修正後の正規表現の可読性を改善することができる。 The present invention can improve the readability of modified regular expressions.

図１は、第１の実施形態に係る修正装置の構成例を示す図である。FIG. 1 is a diagram illustrating an example of the configuration of a correction device according to the first embodiment. 図２は、正規表現の構文の例を示す図である。FIG. 2 is a diagram showing an example of the syntax of a regular expression. 図３は、コスト情報の例を示す図である。FIG. 3 is a diagram illustrating an example of cost information. 図４は、Positive ExamplesとNegative Examplesの例を示す図である。FIG. 4 is a diagram showing examples of Positive Examples and Negative Examples. 図５は、文字列の集合の生成方法を説明する図である。FIG. 5 is a diagram for explaining a method for generating a set of character strings. 図６は、正規表現の合成方法を説明する図である。FIG. 6 is a diagram for explaining a method for synthesizing regular expressions. 図７は、コストに基づくスコアの算出方法を説明する図である。FIG. 7 is a diagram illustrating a method for calculating a score based on costs. 図８は、文字数に基づくスコアの算出方法を説明する図である。FIG. 8 is a diagram illustrating a method for calculating a score based on the number of characters. 図９は、コスト及び文字数に基づくスコアの算出方法を説明する図である。FIG. 9 is a diagram illustrating a method for calculating a score based on the cost and the number of characters. 図１０は、文字列としてのかい離度合いに基づくスコアの算出方法を説明する図である。FIG. 10 is a diagram illustrating a method for calculating a score based on the degree of deviation as a character string. 図１１は、第１の実施形態に係る修正装置の処理の流れを示すフローチャートである。FIG. 11 is a flowchart showing the flow of processing by the repair device according to the first embodiment. 図１２は、正規表現の修正処理の流れを示すフローチャートである。FIG. 12 is a flowchart showing the flow of the regular expression correction process. 図１３は、テンプレートの格納処理の流れを示すフローチャートである。FIG. 13 is a flowchart showing the flow of template storage processing. 図１４は、修正プログラムを実行するコンピュータの一例を示す図である。FIG. 14 illustrates an example of a computer that executes a correction program.

以下に、本願に係る修正装置、修正方法及び修正プログラムの実施形態を図面に基づいて詳細に説明する。なお、本発明は、以下に説明する実施形態により限定されるものではない。 The following describes in detail embodiments of the correction device, correction method, and correction program according to the present application, with reference to the accompanying drawings. Note that the present invention is not limited to the embodiments described below.

［第１の実施形態の構成］
まず、図１を用いて、第１の実施形態に係る修正装置の構成について説明する。図１は、第１の実施形態に係る修正装置の構成の一例を示す図である。図１に示すように、修正装置１０は、修正前の正規表現の入力を受け付け、入力された正規表現の修正を行い、修正後の正規表現を出力する。 [Configuration of the first embodiment]
First, the configuration of a correction device according to the first embodiment will be described with reference to Fig. 1. Fig. 1 is a diagram showing an example of the configuration of a correction device according to the first embodiment. As shown in Fig. 1, a correction device 10 receives an input of a regular expression before correction, corrects the input regular expression, and outputs the corrected regular expression.

ここで、本実施形態における正規表現は、実世界の拡張を施した正規表現であって、バッカスナウア記法（ＢＮＦ：Backus-Naur form）で定義された構文に従うものとする。図２は、正規表現の構文の例を示す図である。図２の正規表現ｒは、本実施形態における正規表現の一例である。なお、以降の説明において、正規表現中の「＼」は適宜バックスラッシュに置き換えられてもよい。 The regular expressions in this embodiment are real-world extended regular expressions that conform to the syntax defined in Backus-Naur form (BNF). Figure 2 is a diagram showing an example of regular expression syntax. Regular expression r in Figure 2 is an example of a regular expression in this embodiment. Note that in the following explanation, "\" in a regular expression may be replaced with a backslash as appropriate.

図２の「C」は文字の集合であり、「x」は文字列、「i」は自然数である。図２の構文は既存の正規表現エンジンで利用されているものである（参考文献:https://perldoc.perl.org/perlre.html）。 In Figure 2, "C" is a set of characters, "x" is a string, and "i" is a natural number. The syntax in Figure 2 is used in existing regular expression engines (Reference: https://perldoc.perl.org/perlre.html).

また、「.」は任意の１文字を表す記号である。つまり、「.」は図２の範囲文字「[C]」と糖衣構文である。また、範囲文字「[C]」にマッチしない文字の集合は、「[^C]」と書ける。また、空集合は「[]」と表記され、任意の文字にマッチしないことを意味する。 Also, "." is a symbol that represents any single character. In other words, "." is syntactic sugar for the range character "[C]" in Figure 2. Also, a set of characters that does not match the range character "[C]" can be written as "[^C]". Also, the empty set is written as "[]", which means that it does not match any characters.

図１に戻り、修正装置１０の各部について説明する。図１に示すように、修正装置１０は、インタフェース部１１、記憶部１２及び制御部１３を有する。Returning to Figure 1, we will explain each part of the correction device 10. As shown in Figure 1, the correction device 10 has an interface unit 11, a memory unit 12, and a control unit 13.

インタフェース部１１は、データの入出力及びデータの通信を行うためのインタフェースである。例えば、インタフェース部１１は、キーボード及びマウス等の入力装置からデータの入力を受け付ける。また、例えば、インタフェース部１１は、ディスプレイ及びスピーカ等の出力装置にデータを出力する。 The interface unit 11 is an interface for inputting, outputting, and communicating data. For example, the interface unit 11 accepts data input from input devices such as a keyboard and a mouse. Also, for example, the interface unit 11 outputs data to output devices such as a display and a speaker.

また、インタフェース部１１は、ネットワークを介して通信を行うための装置（例えばＮＩＣ（Network Interface Card））であってもよい。 The interface unit 11 may also be a device for communicating over a network (e.g., a NIC (Network Interface Card)).

記憶部１２は、ＨＤＤ（Hard Disk Drive）、ＳＳＤ（Solid State Drive）、光ディスク等の記憶装置である。なお、記憶部１２は、ＲＡＭ（Random Access Memory）、フラッシュメモリ、ＮＶＳＲＡＭ（Non Volatile Static Random Access Memory）等のデータを書き換え可能な半導体メモリであってもよい。記憶部１２は、修正装置１０で実行されるＯＳ（Operating System）や各種プログラムを記憶する。 The memory unit 12 is a storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or optical disk. The memory unit 12 may also be a data-rewritable semiconductor memory such as RAM (Random Access Memory), flash memory, or NVSRAM (Non-Volatile Static Random Access Memory). The memory unit 12 stores the OS (Operating System) and various programs executed by the correction device 10.

記憶部１２は、置換候補構文情報１２１を記憶する。置換候補構文情報１２１は、正規表現又はテンプレートの、範囲文字又はホールと置換される正規表現の構文の集合である。 The memory unit 12 stores replacement candidate syntax information 121. The replacement candidate syntax information 121 is a set of regular expression syntax to be replaced with range characters or holes in a regular expression or template.

例えば、置換候補構文情報１２１は、「□□, □|□, □*, (□), ＼i, (?=□), (?!□), (?<=□), (?<!□)」である。ただし、「□」はホールである。ホール及びテンプレートについては後述する。 For example, replacement candidate syntax information 121 is "□□, □|□, □*, (□), ＼i, (?=□), (?!□), (?<=□), (?<!□)". Here, "□" is a hole. Holes and templates will be explained later.

コスト情報１２２は、正規表現の演算子ごとの処理にかかるコストを定めた情報である。また、コスト情報１２２は、ホールのコストを含む。コストは後述する優先度の算出において使用される。図３は、コスト情報の例を示す図である。また、コストの詳細については後述する。 Cost information 122 is information that defines the cost of processing each regular expression operator. Cost information 122 also includes the cost of holes. The costs are used in calculating priorities, which will be described later. Figure 3 shows an example of cost information. Details of costs will be described later.

制御部１３は、修正装置１０全体を制御する。制御部１３は、例えば、ＣＰＵ（Central Processing Unit）、ＭＰＵ（Micro Processing Unit）、ＧＰＵ（Graphics Processing Unit）等の電子回路や、ＡＳＩＣ（Application Specific Integrated Circuit）、ＦＰＧＡ（Field Programmable Gate Array）等の集積回路である。 The control unit 13 controls the entire correction device 10. The control unit 13 is, for example, an electronic circuit such as a CPU (Central Processing Unit), MPU (Micro Processing Unit), or GPU (Graphics Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).

また、制御部１３は、各種の処理手順を規定したプログラムや制御データを格納するための内部メモリを有し、内部メモリを用いて各処理を実行する。また、制御部１３は、各種のプログラムが動作することにより各種の処理部として機能する。例えば、制御部１３は、生成部１３１及び合成部１３２を有する。 The control unit 13 also has an internal memory for storing programs and control data that define various processing procedures, and executes each process using the internal memory. The control unit 13 also functions as various processing units as various programs run. For example, the control unit 13 has a generation unit 131 and a synthesis unit 132.

生成部１３１は、修正前の正規表現によって受理される文字列の集合であるPositive Examples、及び、修正前の正規表現によって拒否される文字列の集合であるNegative Examplesを生成する。 The generation unit 131 generates Positive Examples, which are a set of strings that are accepted by the regular expression before modification, and Negative Examples, which are a set of strings that are rejected by the regular expression before modification.

なお、Positive Examplesは第１の集合の一例である。また、Negative Examplesは第２の集合の一例である。また、修正前の正規表現は第１の正規表現の一例である。 Note that the Positive Examples are an example of the first set, and the Negative Examples are an example of the second set. The regular expression before correction is an example of the first regular expression.

図４は、Positive ExamplesとNegative Examplesの例を示す図である。ここでは、修正前の正規表現が「.*.*=.*」であるものとする。このとき、Positive Examplesに含まれる「=」、「abcd==」、「==abcd」及び「ab=c」は正規表現「.*.*=.*」にマッチする（受理される）。一方、Negative Examplesに含まれる「abc」は正規表現「.*.*=.*」にマッチしない（拒否される）。 Figure 4 shows examples of Positive Examples and Negative Examples. Here, we assume that the regular expression before correction is ".*.*=.*". In this case, "=", "abcd==", "==abcd", and "ab=c" included in the Positive Examples match the regular expression ".*.*=.*" (are accepted). On the other hand, "abc" included in the Negative Examples does not match the regular expression ".*.*=.*" (is rejected).

生成部１３１は、特定の長さ以下の文字を組み合わせた文字列を全列挙し、各文字列が正規表現に受理されるならPositive Examplesに分類し、拒否されるならNegative Examplesに分類することができる。なお、生成部１３１は、非特許文献１に記載の方法を用いてPositive ExamplesとNegative Examplesを生成してもよい。The generation unit 131 can enumerate all strings that combine characters up to a certain length, and classify each string as a Positive Example if it is accepted by the regular expression, or as a Negative Example if it is rejected. The generation unit 131 may also generate Positive Examples and Negative Examples using the method described in Non-Patent Document 1.

ここで、文字列を素直に全列挙すると、爆発的に例が生成されてしまう。これを回避するために、生成部１３１は、修正前の正規表現の中に現れる文字のみからPositive Examplesの文字列及びNegative Examplesの文字列を生成してもよい。 If all strings were simply listed, an explosive number of examples would be generated. To avoid this, the generation unit 131 may generate Positive Examples and Negative Examples from only the characters that appear in the regular expression before modification.

例えば、正規表現が「ab[c-d]*」である場合、生成部１３１は、「a」及び「b」と、「[c,d]」からランダムに選択した１文字と、を組み合わせて候補の文字列を生成する。 For example, if the regular expression is "ab[c-d]*", the generation unit 131 generates a candidate string by combining "a" and "b" with one character randomly selected from "[c,d]".

図５は、文字列の集合の生成方法を説明する図である。図５の例では、修正前の正規表現は「.*.*@example[.]com」である。この場合、生成部１３１は、正規表現「.*.*@example[.]com」によって受理される文字列「@example.com」、「a@example.com」、「gc@example.com」をPositive Examplesに分類する。一方、生成部１３１は、正規表現「.*.*@example[.]com」によって拒否される文字列「example.com」、「@.com」、「@examplecom」、「@example.」等をNegative Examplesに分類する。 Figure 5 is a diagram explaining how to generate a set of strings. In the example of Figure 5, the regular expression before correction is ".*.*@example[.]com". In this case, the generation unit 131 classifies the strings "@example.com", "a@example.com", and "gc@example.com", which are accepted by the regular expression ".*.*@example[.]com", into Positive Examples. On the other hand, the generation unit 131 classifies the strings "example.com", "@.com", "@examplecom", "@example.", etc., which are rejected by the regular expression ".*.*@example[.]com", into Negative Examples.

合成部１３２は、修正前の正規表現における範囲文字を所定の構文に置換した正規表現であって、Positive Examplesの文字列を受理し、かつNegative Examplesの文字列を拒否するような正規表現である修正後の正規表現を合成する。なお、修正後の正規表現は第２の正規表現の一例である。 The synthesis unit 132 synthesizes a modified regular expression, which is a regular expression in which the range characters in the pre-modification regular expression are replaced with a predetermined syntax, and which accepts character strings in the Positive Examples and rejects character strings in the Negative Examples. Note that the modified regular expression is an example of a second regular expression.

合成部１３２による処理は大きく、テンプレートを作成するステップと、テンプレートへの割り当てを行うステップと、に分けられる。 The processing by the synthesis unit 132 can be broadly divided into a step of creating a template and a step of assigning to the template.

テンプレートを作成するステップでは、合成部１３２は、正規表現における範囲文字をプレースホルダを使って置換することによりテンプレートを作成する。 In the template creation step, the synthesis unit 132 creates the template by replacing range characters in the regular expression with placeholders.

テンプレートへの割り当てを行うステップでは、合成部１３２は、プレースホルダへ所定の構文を割り当てて、脆弱でない正規表現を合成する。以降、プレースホルダをホールと呼び、「□」と表記する。 In the template assignment step, the synthesis unit 132 assigns a predetermined syntax to the placeholder to synthesize a non-vulnerable regular expression. Hereafter, the placeholder will be referred to as a hole and represented by a "□".

合成部１３２は、優先度付きキューを保持しつつ処理を行う。キューに格納されるテンプレートには優先度が付与される。 The synthesis unit 132 performs processing while maintaining a prioritized queue. Priorities are assigned to templates stored in the queue.

合成部１３２は、キューから要素を取り出す際には、格納されているテンプレートのうち優先度が最も高いものを優先して取り出す。処理の開始時点では、合成部１３２は、修正前の正規表現をテンプレートとしてキューに格納する。なお、キューに格納された修正前の正規表現の優先度は必然的に最高になる。 When the synthesis unit 132 retrieves an element from the queue, it prioritizes retrieving the stored template with the highest priority. At the start of processing, the synthesis unit 132 stores the unmodified regular expression in the queue as a template. Note that the unmodified regular expression stored in the queue will inevitably have the highest priority.

まず、合成部１３２によって実行されるテンプレートを作成するステップについて説明する。合成部１３２は、キューから取り出したテンプレートが範囲文字を含む場合、当該テンプレートに含まれる当該範囲文字をホールに置換する。なお、範囲文字は、例えば「[C]」又は「.」のように表される。一方、合成部１３２は、キューから取り出したテンプレートがホールを含む場合、当該ホールのうちいずれか１つを、所定の構文に置換してもよい。 First, we will explain the steps of creating a template executed by the synthesis unit 132. If the template retrieved from the queue contains range characters, the synthesis unit 132 replaces the range characters contained in the template with holes. Note that range characters are represented, for example, as "[C]" or ".". On the other hand, if the template retrieved from the queue contains holes, the synthesis unit 132 may replace one of the holes with a specified syntax.

例えば、合成部１３２は、テンプレートとしてキューに格納されている修正前の正規表現「.*.*=.*」の範囲文字を置換したテンプレート「□*.*=.*」、「.*□*=.*」「.*.*=□*」を作成し、キューに格納する。なお、一度取り出されたテンプレートは破棄されるものとする。 For example, the synthesis unit 132 creates templates "□*.*=.*", ".*□*=.*", and ".*.*=□*" by replacing the range characters of the unmodified regular expression ".*.*=.*" stored in the queue as a template, and stores them in the queue. Note that once a template has been extracted, it is discarded.

このように、合成部１３２は、修正前の正規表現における範囲文字の少なくとも一部をホールに置換し、当該置換したホールをさらに所定の構文に置換したテンプレートを基に修正後の正規表現を合成する。 In this way, the synthesis unit 132 replaces at least some of the range characters in the pre-modification regular expression with holes, and synthesizes the modified regular expression based on a template in which the replaced holes are further replaced with a specified syntax.

さらに、合成部１３２は、ホールを、置換候補構文情報１２１に含まれる「□□」、「□|□」、「□*」、「(□)」、「＼i」、「(?=□)」、「(?!□)」、「(?<=□)」、「(?<!□)」といった構文に置換することができる。この場合、合成部１３２は、テンプレートに含まれるホールを、ホールを含む所定の構文である「□□」、「□|□」、「□*」、「(□)」、「＼i」、「(?=□)」、「(?!□)」、「(?<=□)」、「(?<!□)」のいずれかに置換したテンプレート（ただし、□はホール）を基に修正後の正規表現を合成する。 Furthermore, the synthesis unit 132 can replace holes with syntax such as "□□", "□|□", "□*", "(□)", "＼i", "(?=□)", "(?!□)", "(?<=□)", or "(?<!□)" contained in the replacement candidate syntax information 121. In this case, the synthesis unit 132 synthesizes a modified regular expression based on a template in which the hole contained in the template has been replaced with one of the predetermined syntaxes containing holes, "□□", "□|□", "□*", "(□)", "＼i", "(?=□)", "(?!□)", "(?<=□)", or "(?<!□)" (where □ represents a hole).

続いて、合成部１３２によって実行されるテンプレートへの割り当てを行うステップについて説明する。ここでは、合成部１３２がテンプレートを作成するステップを繰り返し、例えばテンプレート「□*□*=.*」を作成しキューに格納したものとする。例えば、合成部１３２は、テンプレート「□*.*=.*」の左辺の範囲文字「.」をホールに置換することでテンプレート「□*□*=.*」を得る。 Next, we will explain the steps of assigning to a template executed by the synthesis unit 132. Here, we assume that the synthesis unit 132 repeats the steps of creating a template, and creates, for example, the template "□*□*=.*" and stores it in the queue. For example, the synthesis unit 132 obtains the template "□*□*=.*" by replacing the range character "." on the left side of the template "□*.*=.*" with a hole.

合成部１３２は、テンプレートが含むホールに対する、条件を満たす範囲文字の割り当てを探索する。例えば、合成部１３２は、Satisfiability Modulo Theories（ＳＭＴ) solver（例えば、Z3 solver）等を用いて探索を行う。The synthesis unit 132 searches for a range character assignment that satisfies the conditions for the holes contained in the template. For example, the synthesis unit 132 performs the search using a Satisfiability Modulo Theories (SMT) solver (e.g., the Z3 solver).

合成部１３２は、テンプレートが「□*□*=.*」であり、Positive Examples及びNegative Examplesが図４の通りであれば、「[]*[^=]*=.*」という割り当てを探索により得ることができる。合成部１３２は、空集合である「[]」を取り除き、正規表現「[^=]*=.*」を得る。 If the template is "□*□*=.*" and the Positive Examples and Negative Examples are as shown in Figure 4, the synthesis unit 132 can obtain the assignment "[]*[^=]*=.*" by searching. The synthesis unit 132 removes the empty set "[]" and obtains the regular expression "[^=]*=.*".

ここでは、合成部１３２は変換部として機能する。すなわち、合成部１３２は、優先度が高い順に、テンプレートのそれぞれを、修正前の正規表現の脆弱性が解消された修正後の正規表現に変換する。例えば、合成部１３２は、テンプレート「□*□*=.*」を、正規表現「[^=]*=.*」に変換する。 Here, the synthesis unit 132 functions as a conversion unit. That is, the synthesis unit 132 converts each template, in descending order of priority, into a modified regular expression in which the vulnerability of the pre-modification regular expression has been eliminated. For example, the synthesis unit 132 converts the template "□*□*=.*" into the regular expression "[^=]*=.*".

正規表現「[^=]*=.*」は、図４のPositive Examplesを受理し、Negative Examplesを拒否する。また、正規表現「[^=]*=.*」は、同じ文字にマッチする箇所を高々１つしか含まないため、脆弱でない性質を持っているということができる。 The regular expression "[^=]*=.*" accepts the Positive Examples in Figure 4 and rejects the Negative Examples. Furthermore, the regular expression "[^=]*=.*" contains at most one match for the same character, so it can be said to be invulnerable.

これより、合成部１３２は、テンプレートにおけるプレースホルダを所定の構文に置換した正規表現であって、Positive Examplesを受理し、かつNegative Examplesを拒否するような正規表現である修正後の正規表現に、テンプレートを変換するということができる。 As a result, the synthesis unit 132 can convert the template into a modified regular expression, which is a regular expression in which the placeholders in the template are replaced with a specified syntax, and which accepts Positive Examples and rejects Negative Examples.

本実施形態では、前述の通り、マッチさせる文字列の長さに対して、正規表現エンジン上で線形時間で動作するような正規表現を脆弱でない正規表現と呼ぶ。逆に、マッチさせる文字列の長さに対して、正規表現エンジン上で例えば指数関数時間で動作するような正規表現を脆弱な正規表現と呼ぶ。 In this embodiment, as mentioned above, a regular expression that runs in linear time on a regular expression engine relative to the length of the string to be matched is called a non-fragile regular expression. Conversely, a regular expression that runs in, for example, exponential time on a regular expression engine relative to the length of the string to be matched is called a fragile regular expression.

合成部１３２による脆弱でない正規表現の合成は、KochとScherzingerらにより考案されたstrongly one-unambiguous（参考文献：Christoph Koch and Stefanie Scherzinger. 2007. Attribute Grammars for Scalable Query Processing on XML Streams. The VLDB Journal 16, 3 (July 2007), 317-342.）という性質を実世界の拡張にも合わせて改良した性質を用いたものである。 The synthesis of robust regular expressions by the synthesis unit 132 utilizes the strongly one-unambiguous property devised by Koch and Scherzinger et al. (Reference: Christoph Koch and Stefanie Scherzinger. 2007. Attribute Grammars for Scalable Query Processing on XML Streams. The VLDB Journal 16, 3 (July 2007), 317-342.) which has been improved to accommodate real-world extensions.

Strongly one-unambiguousとは、正規表現エンジンが次に処理する演算は現在解析中の文字が何か定まれば一意に定まるという性質である。 Strongly one-unambiguous means that the next operation that the regular expression engine processes is uniquely determined once the character currently being parsed is determined.

同様に、修正前の正規表現が「.*.*@example[.]com」である場合、図６に示すように、合成部１３２は、脆弱でない正規表現「[^@]*@example[.]com」を得ることができる。 Similarly, if the regular expression before modification is ".*.*@example[.]com", the synthesis unit 132 can obtain the non-vulnerable regular expression "[^@]*@example[.]com", as shown in Figure 6.

［優先度の算出］
合成部１３２は、正規表現又はテンプレートをキューに格納する際に、優先度を算出する。このとき、合成部１３２は、算出部として機能する。 [Priority calculation]
The synthesis unit 132 calculates the priority when storing a regular expression or a template in a queue. At this time, the synthesis unit 132 functions as a calculation unit.

合成部１３２は、第１の正規表現における範囲文字の少なくとも一部をプレースホルダに置換したテンプレートのそれぞれについて、テンプレートのそれぞれの正規表現としての処理コスト又は文字列としての性質の少なくともいずれかに応じた優先度を算出する。 For each template in which at least some of the range characters in the first regular expression have been replaced with placeholders, the synthesis unit 132 calculates a priority based on at least one of the processing cost of each template as a regular expression or its properties as a string.

コスト情報１２２は、正規表現としての処理コストの一例である。また、文字列としての性質は、例えば文字数（サイズ、長さ）、特定の文字列とのかい離度又は類似度等である。 Cost information 122 is an example of the processing cost for a regular expression. Furthermore, the properties of the string include, for example, the number of characters (size, length), the degree of deviation or similarity from a specific string, etc.

以降、スコアの算出方法の例を説明する。合成部１３２は、ここで説明する算出方法をそれぞれ単独で用いてもよいし、各算出方法を組み合わせて用いてもよい。また、スコアが小さいほど優先度は高くなるものとする。合成部１３２は、スコアの逆数又はスコアの符号を反転させた値を優先度としてもよい。 The following describes examples of score calculation methods. The synthesis unit 132 may use each of the calculation methods described here individually, or may use a combination of the calculation methods. The smaller the score, the higher the priority. The synthesis unit 132 may use the inverse of the score or the value obtained by inverting the sign of the score as the priority.

図７は、コストに基づくスコアの算出方法を説明する図である。図７に示すように、合成部１３２は、あらかじめ定められたテンプレートに含まれる正規表現の演算子ごとのコストに応じて優先度（スコア）を算出することができる。 Figure 7 is a diagram explaining a method for calculating a score based on cost. As shown in Figure 7, the synthesis unit 132 can calculate a priority (score) according to the cost of each regular expression operator included in a predetermined template.

演算子ごとのコストは図３に示す通りである。例えば、コストは、演算子及びホールを扱う際の複雑さに応じて大きくなる。図３の例では、固定文字のコストは０であり、任意の文字を示す演算子（例えば、範囲文字）のコストは１である。なお、「a」、「1」、「/」等の固定文字は、それぞれ「[a]」、「[1]」、「[/]」等の略記であり、範囲文字の一種であるが、複雑性は極めて低いためコストは０と定められる。The cost for each operator is as shown in Figure 3. For example, the cost increases depending on the complexity of handling the operator and holes. In the example in Figure 3, the cost of a fixed character is 0, and the cost of an operator that indicates any character (for example, a range character) is 1. Note that fixed characters such as "a", "1", and "/" are abbreviations for "[a]", "[1]", and "[/]", respectively, and are a type of range character, but their complexity is extremely low, so their cost is set to 0.

さらに、文字及び文字列の選択又は繰り返しを行う演算子のコストは３である。また、指定の文字列又はパターンを取得、参照、又は検索する演算子（キャプチャ、後方参照、各種先読み及び各種後読み等）のコストは７である。 Furthermore, operators that select or repeat characters and strings have a cost of 3. Operators that obtain, reference, or search for specified strings or patterns (capture, backreference, various lookaheads, various lookbehinds, etc.) have a cost of 7.

また、ホールは、「□□, □|□, □*, (□), ＼i, (?=□), (?!□), (?<=□), (?<!□)」のいずれかに置換される、または探索の結果範囲文字が割り当てられるものである。このため、ホールのコストは、置換後に追加される演算子の最大コストである７と、範囲文字のコストである１と、それらのいずれかが選択されることから選択演算子のコストである３と、の和である１１と定められる。 A hole is one that is replaced with one of "□□, □|□, □*, (□), ＼i, (?=□), (?!□), (?<=□), (?<!□)" or assigned a range character as a search result. Therefore, the cost of a hole is set to 11, which is the sum of 7, the maximum cost of operators added after replacement, 1, the cost of a range character, and 3, the cost of a selection operator since one of them is selected.

なお、コストの定め方は、ここで説明したものに限られず、任意の方法で定められたものであってよい。 Note that the method for determining costs is not limited to that described here and may be determined in any manner.

図７に戻り、テンプレート「(□|□)*.*=.*」には、コスト１１のホールが２つ、コスト７のキャプチャが１つ、コスト３の選択が１つ、コスト３の繰り返しが３つ、コスト１の範囲文字が２つ含まれる。このため、合成部１３２は、コストを合計し、テンプレート「(□|□)*.*=.*」のスコアを４３と算出する。 Returning to Figure 7, the template "(□|□)*.*=.*" contains two holes with a cost of 11, one capture with a cost of 7, one choice with a cost of 3, three repeats with a cost of 3, and two range characters with a cost of 1. Therefore, the synthesis unit 132 adds up the costs and calculates the score of the template "(□|□)*.*=.*" as 43.

また、テンプレート「□*□*=.*」には、コスト１１のホールが２つ、コスト３の繰り返しが３つ、コスト１の範囲文字が１つ含まれる。このため、合成部１３２は、コストを合計し、テンプレート「□*□*=.*」のスコアを３２と算出する。 In addition, the template "□*□*=.*" contains two holes with a cost of 11, three repetitions with a cost of 3, and one range character with a cost of 1. Therefore, the synthesis unit 132 adds up the costs and calculates the score of the template "□*□*=.*" as 32.

また、テンプレート「(?=□)*@hoge＼.com」には、コスト１１のホールが１つ、コスト７の肯定先読みが１つ、コスト３の繰り返しが１つ含まれる。このため、合成部１３２は、コストを合計し、テンプレート「(?=□)*@hoge＼.com」のスコアを２１と算出する。 Furthermore, the template "(?=□)*@hoge＼.com" contains one hole with a cost of 11, one positive lookahead with a cost of 7, and one repetition with a cost of 3. Therefore, the synthesis unit 132 adds up the costs and calculates the score of the template "(?=□)*@hoge＼.com" as 21.

図８は、文字数に基づくスコアの算出方法を説明する図である。図８に示すように、合成部１３２は、テンプレートの文字数に応じて優先度を算出する。 Figure 8 is a diagram explaining how to calculate a score based on the number of characters. As shown in Figure 8, the synthesis unit 132 calculates the priority according to the number of characters in the template.

図８に示すように、テンプレート「(□|□)*.*=.*」の文字数は１１であるため、合成部１３２は、テンプレート「(□|□)*.*=.*」のスコアを１１と算出する。 As shown in Figure 8, the number of characters in the template "(□|□)*.*=.*" is 11, so the synthesis unit 132 calculates the score of the template "(□|□)*.*=.*" as 11.

また、テンプレート「□*□*=.*」の文字数は７であるため、合成部１３２は、テンプレート「□*□*=.*」のスコアを７と算出する。 Furthermore, since the number of characters in the template "□*□*=.*" is 7, the synthesis unit 132 calculates the score of the template "□*□*=.*" as 7.

また、テンプレート「(?=□)*@hoge＼.com」の文字数は１６であるため、合成部１３２は、テンプレート「(?=□)*@hoge＼.com」のスコアを１６と算出する。 Furthermore, since the number of characters in the template "(?=□)*@hoge＼.com" is 16, the synthesis unit 132 calculates the score of the template "(?=□)*@hoge＼.com" to be 16.

図９は、コスト及び文字数に基づくスコアの算出方法を説明する図である。図９に示すように、合成部１３２は、図７で説明したコストに基づくスコアと、図８で説明した文字数に基づくスコアとを掛ける（乗じる）ことによって最終的なスコアを計算してもよい。 Figure 9 is a diagram illustrating a method for calculating a score based on cost and number of characters. As shown in Figure 9, the synthesis unit 132 may calculate a final score by multiplying the score based on cost described in Figure 7 by the score based on the number of characters described in Figure 8.

図９に示すように、テンプレート「(□|□)*.*=.*」のコストに基づくスコアは４３であり、文字数に基づくスコアは１１であるため、合成部１３２は、各スコアを掛けて、最終的なスコアを４７３と算出する。 As shown in Figure 9, the cost-based score of the template "(□|□)*.*=.*" is 43, and the score based on the number of characters is 11, so the synthesis unit 132 multiplies each score to calculate a final score of 473.

また、テンプレート「□*□*=.*」のコストに基づくスコアは３２であり、文字数に基づくスコアは７であるため、合成部１３２は、各スコアを掛けて、最終的なスコアを２２４と算出する。 Furthermore, since the cost-based score of the template "□*□*=.*" is 32 and the score based on the number of characters is 7, the synthesis unit 132 multiplies each score and calculates a final score of 224.

また、テンプレート「(?=□)*@hoge＼.com」のコストに基づくスコアは２１であり、文字数に基づくスコアは１６であるため、合成部１３２は、各スコアを掛けて、最終的なスコアを３３６と算出する。 Furthermore, since the cost-based score of the template "(?=□)*@hoge＼.com" is 21 and the score based on the number of characters is 16, the synthesis unit 132 multiplies each score and calculates a final score of 336.

図１０は、文字列としてのかい離度合いに基づくスコアの算出方法を説明する図である。ここでは、合成部１３２は、テンプレートと修正前の正規表現との間のレーベンシュタイン距離をスコアとして算出する。レーベンシュタイン距離は、文字列間のかい離度合いの一例である。なお、修正前の正規表現は「.*.*＝.*」であるものとする。 Figure 10 is a diagram explaining a method for calculating a score based on the degree of deviation as a string. Here, the synthesis unit 132 calculates the Levenshtein distance between the template and the regular expression before correction as the score. The Levenshtein distance is an example of the degree of deviation between strings. Note that the regular expression before correction is assumed to be ".*.* = .*".

図１０に示すように、テンプレート「(□|□)*.*=.*」と正規表現「.*.*＝.*」との間のレーベンシュタイン距離は５であるため、合成部１３２は、スコアを５と算出する。 As shown in Figure 10, the Levenshtein distance between the template "(□|□)*.*=.*" and the regular expression ".*.*＝.*" is 5, so the synthesis unit 132 calculates the score as 5.

なお、この場合、「(□|□)*.*=.*」→「□|□)*.*＝.*」→「|□)*.*＝.*」→「□)*.*＝.*」→「)*.*＝.*」→「.*.*＝.*」の５ステップで変形可能であるため、レーベンシュタイン距離は５となる。 In this case, the transformation can be done in five steps: "(□|□)*.*=.*" → "□|□)*.*＝.*" → "|□)*.*＝.*" → "□)*.*＝.*" → ")*.*＝.*" → ".*.*＝.*", so the Levenshtein distance is 5.

また、テンプレート「□*□*=.*」と正規表現「.*.*＝.*」との間のレーベンシュタイン距離は２であるため、合成部１３２は、スコアを２と算出する。 Furthermore, since the Levenshtein distance between the template "□*□*=.*" and the regular expression ".*.*＝.*" is 2, the synthesis unit 132 calculates the score as 2.

ここで説明した方法によれば、可読性が良いほど優先度が高くなる（スコアが小さくなる）。例えば、コストが大きい演算子（先読み、キャプチャ等）ほど、可読性を悪化させることが考えられる。また、例えば、文字数が多いほど、可読性が悪化することが考えられる。また、例えば、修正後の正規表現が修正前の正規表現からかい離しているほど、人間にとっては修正後の正規表現の意味を読み取り辛くなり、可読性は悪化すると考えられる。 According to the method described here, the better the readability, the higher the priority (the lower the score). For example, the more costly an operator (lookahead, capture, etc.) is, the worse the readability is likely to be. Also, for example, the more characters there are, the worse the readability is likely to be. Also, for example, the more the corrected regular expression deviates from the regular expression before correction, the more difficult it will be for humans to decipher the meaning of the corrected regular expression, and the worse the readability is likely to be.

［第１の実施形態の処理］
図１１は、第１の実施形態に係る修正装置の処理の流れを示すフローチャートである。まず、修正装置１０は、正規表現の入力を受け付ける（ステップＳ１０）。 [Processing of the First Embodiment]
11 is a flowchart showing the flow of processing by the editing device according to the first embodiment. First, the editing device 10 receives input of a regular expression (step S10).

次に、修正装置１０は、入力された正規表現によって受理される文字列の集合（Positive Examples）を生成する（ステップＳ２０）。また、修正装置１０は、入力された正規表現によって拒否される文字列の集合（Negative Examples）を生成する（ステップＳ３０）。Next, the correction device 10 generates a set of strings (Positive Examples) that are accepted by the input regular expression (step S20). The correction device 10 also generates a set of strings (Negative Examples) that are rejected by the input regular expression (step S30).

例えば、修正装置１０は、入力された修正前の正規表現から拡張オートマトンを作成し、当該拡張オートマトンのパスを全てカバーするように文字列の集合を生成することができる。 For example, the modification device 10 can create an extended automaton from the input pre-modification regular expression and generate a set of strings that cover all paths of the extended automaton.

続いて、修正装置１０は、入力された正規表現、受理される文字列及び拒否される文字列を基に正規表現を生成（合成）する（ステップＳ４０）。そして、修正装置１０は、生成した正規表現を出力する（ステップＳ５０）。Next, the modification device 10 generates (synthesizes) a regular expression based on the input regular expression, the accepted string, and the rejected string (step S40). The modification device 10 then outputs the generated regular expression (step S50).

図１２は、正規表現の合成処理の流れを示すフローチャートである。図１２の処理は、図１１のステップＳ４０に相当する。まず、修正装置１０は、入力された正規表現を、テンプレートとしてキューに格納する（ステップＳ４０１）。 Figure 12 is a flowchart showing the flow of the regular expression synthesis process. The process in Figure 12 corresponds to step S40 in Figure 11. First, the correction device 10 stores the input regular expression in a queue as a template (step S401).

次に、修正装置１０は、最も優先度が高いテンプレートをキューから取得する（ステップＳ４０２）。 Next, the correction device 10 retrieves the template with the highest priority from the queue (step S402).

続いて、修正装置１０は、修正装置１０は、受理される文字列を受理し、拒否される文字列を拒否し、かつ脆弱性に関する条件を満たすような、ホールへの範囲文字の割り当てを探索する（ステップＳ４０３）。 Next, the modification device 10 searches for an assignment of range characters to the hole that accepts accepted strings, rejects rejected strings, and satisfies the vulnerability conditions (step S403).

修正装置１０は、探索結果の割り当てが存在するか否かを判定する（ステップＳ４０４）。探索結果の割り当てが存在しない場合（ステップＳ４０４、Ｎｏ）、修正装置１０は、範囲文字をホールに置換するか、又はホールを所定のパターンに置換する（ステップＳ４０５）。所定のパターンは、例えば「□□」、「□|□」、「□*」、「(□)」、「＼i」、「(?=□)」、「(?!□)」、「(?<=□)」、「(?<!□)」といった構文である。なお、ステップＳ４０１でキューに格納された、入力された正規表現がステップＳ４０３での探索の対象となった場合、ステップＳ４０４では割り当てが存在しないもの（Ｎｏ）とみなされる。 The modification device 10 determines whether an assignment exists for the search results (step S404). If an assignment does not exist for the search results (step S404, No), the modification device 10 replaces the range character with a hole or replaces the hole with a predetermined pattern (step S405). The predetermined pattern is, for example, a syntax such as "□□", "□|□", "□*", "(□)", "\i", "(?=□)", "(?!□)", "(?<=□)", or "(?<!□)". Note that if the input regular expression stored in the queue in step S401 becomes the target of the search in step S403, it is considered that no assignment exists (No) in step S404.

そして、修正装置１０は、ステップＳ４０５で処理済みのテンプレートをキューに格納する（ステップＳ４０６）。ここでの処理済みのテンプレートは、範囲文字がホールに置換されたテンプレート、又はホールが所定のパターンに置換されたテンプレートである。 Then, the correction device 10 stores the template processed in step S405 in a queue (step S406). The processed template here is a template in which the range character has been replaced with a hole, or a template in which the hole has been replaced with a specified pattern.

一方、探索結果の割り当てが存在する場合（ステップＳ４０４、Ｙｅｓ）、修正装置１０は、探索結果の割り当てを基に脆弱でない正規表現を合成する（ステップＳ４０７）。 On the other hand, if a search result assignment exists (step S404, Yes), the modification device 10 synthesizes a non-vulnerable regular expression based on the search result assignment (step S407).

図１３は、テンプレートの格納処理の流れを示すフローチャートである。図１３の処理は、図１２のステップＳ４０６に相当する。図１３に示すように、まず、修正装置１０は、テンプレートのスコアを計算する（ステップＳ４０６１）。修正装置１０は、コスト、文字数、テンプレートと元の正規表現とのかい離度合い等に応じてスコアを計算することができる。 Figure 13 is a flowchart showing the flow of the template storage process. The process in Figure 13 corresponds to step S406 in Figure 12. As shown in Figure 13, first, the correction device 10 calculates the score of the template (step S4061). The correction device 10 can calculate the score based on the cost, number of characters, the degree of deviation between the template and the original regular expression, etc.

次に、修正装置１０は、スコアを基に優先度を判定する（ステップＳ４０６２）。例えば、修正装置１０は、スコアが小さいほど優先度が高いと判定する。Next, the editing device 10 determines the priority based on the score (step S4062). For example, the editing device 10 determines that the smaller the score, the higher the priority.

そして、修正装置１０は、優先度とともにテンプレートをキューに格納する（ステップＳ４０６３）。キュー内のテンプレートは、優先度が高い順に取り出される。 Then, the correction device 10 stores the template in a queue along with its priority (step S4063). The templates in the queue are retrieved in order of highest priority.

［第１の実施形態の効果］
これまで説明してきたように、合成部１３２は、第１の正規表現における範囲文字の少なくとも一部をプレースホルダに置換したテンプレートのそれぞれについて、テンプレートのそれぞれの正規表現としての処理コスト又は文字列としての性質の少なくともいずれかに応じた優先度を算出する。合成部１３２は、優先度が高い順に、テンプレートのそれぞれを、第１の正規表現の脆弱性が解消された第２の正規表現に変換する。これにより、修正後の正規表現の可読性を改善することが可能になる。 [Effects of the First Embodiment]
As described above, for each template in which at least some of the range characters in the first regular expression have been replaced with placeholders, the synthesis unit 132 calculates a priority based on at least one of the processing cost of the template as a regular expression and the properties of the template as a character string. The synthesis unit 132 converts each template in descending order of priority into a second regular expression in which the vulnerability of the first regular expression has been resolved. This makes it possible to improve the readability of the corrected regular expression.

また、生成部１３１は、第１の正規表現によって受理される文字列の集合である第１の集合、及び、第１の正規表現によって拒否される文字列の集合である第２の集合を生成する。合成部１３２は、テンプレートにおけるプレースホルダを所定の構文に置換した正規表現であって、第１の集合の文字列を受理し、かつ第２の集合の文字列を拒否するような正規表現である第２の正規表現に、テンプレートを変換する。 The generation unit 131 also generates a first set, which is a set of strings accepted by the first regular expression, and a second set, which is a set of strings rejected by the first regular expression. The synthesis unit 132 converts the template into a second regular expression, which is a regular expression in which placeholders in the template are replaced with a predetermined syntax and which accepts strings in the first set and rejects strings in the second set.

合成部１３２は、あらかじめ定められたテンプレートに含まれる正規表現の演算子ごとのコストに応じて優先度を算出する。これにより、修正後の正規表現でコストが大きい演算子（先読み、キャプチャ等）が使用されることをなるべく避け、可読性を改善することができる。The synthesis unit 132 calculates the priority according to the cost of each operator in the regular expression included in the predetermined template. This avoids the use of high-cost operators (lookahead, capture, etc.) in the modified regular expression as much as possible, thereby improving readability.

また、合成部１３２は、テンプレートの文字数に応じて優先度を算出する。これにより、修正後の正規表現の文字数をなるべく少なくすることで、可読性を改善することができる。 The synthesis unit 132 also calculates the priority based on the number of characters in the template. This allows the number of characters in the corrected regular expression to be reduced as much as possible, thereby improving readability.

合成部１３２は、テンプレートと第１の正規表現との間の文字列としてのかい離度合いに応じて優先度を算出する。これにより、修正後の正規表現をなるべく修正前の正規表現と近いものとすることで、可読性を改善することができる。また、優先度が高いほど後の処理で最終的な出力として採用されやすくなるため、修正後の正規表現の可読性を改善することができる。 The synthesis unit 132 calculates a priority based on the degree of deviation between the template and the first regular expression as character strings. This allows the corrected regular expression to be as close as possible to the original regular expression, thereby improving readability. Furthermore, the higher the priority, the more likely it is to be adopted as the final output in subsequent processing, thereby improving the readability of the corrected regular expression.

［システム構成等］
また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示のように構成されていることを要しない。すなわち、各装置の分散及び統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的又は物理的に分散又は統合して構成することができる。さらに、各装置にて行われる各処理機能は、その全部又は任意の一部が、ＣＰＵ（Central Processing Unit）及び当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。なお、プログラムは、ＣＰＵだけでなく、ＧＰＵ等の他のプロセッサによって実行されてもよい。 [System configuration, etc.]
Furthermore, the components of each device shown in the figure are functional concepts and do not necessarily have to be physically configured as shown. In other words, the specific form of distribution and integration of each device is not limited to that shown, and all or part of the devices can be functionally or physically distributed or integrated in any unit depending on various loads, usage conditions, etc. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or can be realized as hardware using wired logic. Note that the program may be executed not only by the CPU but also by other processors such as a GPU.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部又は一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Furthermore, among the processes described in this embodiment, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information, including the processing procedures, control procedures, specific names, various data, and parameters shown in the above documents and drawings, can be changed as desired unless otherwise specified.

［プログラム］
一実施形態として、修正装置１０は、パッケージソフトウェアやオンラインソフトウェアとして上記の修正処理を実行する修正プログラムを所望のコンピュータにインストールさせることによって実装できる。例えば、上記の修正プログラムを情報処理装置に実行させることにより、情報処理装置を修正装置１０として機能させることができる。ここで言う情報処理装置には、デスクトップ型又はノート型のパーソナルコンピュータが含まれる。また、その他にも、情報処理装置にはスマートフォン、携帯電話機やＰＨＳ（Personal Handyphone System）等の移動体通信端末、さらには、ＰＤＡ（Personal Digital Assistant）等のスレート端末等がその範疇に含まれる。 [program]
In one embodiment, the correction device 10 can be implemented by installing a correction program that executes the above-described correction process as package software or online software on a desired computer. For example, by executing the above-described correction program on an information processing device, the information processing device can function as the correction device 10. The information processing device referred to here includes desktop and notebook personal computers. Other information processing devices also include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone Systems), as well as slate terminals such as PDAs (Personal Digital Assistants).

また、修正装置１０は、ユーザが使用する端末装置をクライアントとし、当該クライアントに上記の修正処理に関するサービスを提供する修正サーバ装置として実装することもできる。例えば、修正サーバ装置は、修正前の正規表現を入力とし、修正後の正規表現を出力とする修正サービスを提供するサーバ装置として実装される。この場合、修正サーバ装置は、Ｗｅｂサーバとして実装することとしてもよいし、アウトソーシングによって上記の修正処理に関するサービスを提供するクラウドとして実装することとしてもかまわない。 The correction device 10 can also be implemented as a correction server device that uses a terminal device used by a user as a client and provides services related to the above correction processing to the client. For example, the correction server device is implemented as a server device that provides a correction service that takes an uncorrected regular expression as input and outputs a corrected regular expression. In this case, the correction server device may be implemented as a web server, or as a cloud that provides services related to the above correction processing through outsourcing.

図１４は、修正プログラムを実行するコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０、ＣＰＵ１０２０を有する。また、コンピュータ１０００は、ハードディスクドライブインタフェース１０３０、ディスクドライブインタフェース１０４０、シリアルポートインタフェース１０５０、ビデオアダプタ１０６０、ネットワークインタフェース１０７０を有する。これらの各部は、バス１０８０によって接続される。 Figure 14 is a diagram showing an example of a computer that executes a patch program. The computer 1000 has, for example, memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１及びＲＡＭ（Random Access Memory）１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ１１００に挿入される。シリアルポートインタフェース１０５０は、例えばマウス１１１０、キーボード１１２０に接続される。ビデオアダプタ１０６０は、例えばディスプレイ１１３０に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.

ハードディスクドライブ１０９０は、例えば、ＯＳ１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３、プログラムデータ１０９４を記憶する。すなわち、修正装置１０の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール１０９３として実装される。プログラムモジュール１０９３は、例えばハードディスクドライブ１０９０に記憶される。例えば、修正装置１０における機能構成と同様の処理を実行するためのプログラムモジュール１０９３が、ハードディスクドライブ１０９０に記憶される。なお、ハードディスクドライブ１０９０は、ＳＳＤ（Solid State Drive）により代替されてもよい。 The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process of the modification device 10 are implemented as program modules 1093 in which computer-executable code is written. The program modules 1093 are stored, for example, on the hard disk drive 1090. For example, a program module 1093 for executing processes similar to the functional configuration of the modification device 10 is stored on the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

また、上述した実施形態の処理で用いられる設定データは、プログラムデータ１０９４として、例えばメモリ１０１０やハードディスクドライブ１０９０に記憶される。そして、ＣＰＵ１０２０は、メモリ１０１０やハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてＲＡＭ１０１２に読み出して、上述した実施形態の処理を実行する。 In addition, the setting data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as needed, and executes the processing of the above-described embodiment.

なお、プログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ１１００等を介してＣＰＵ１０２０によって読み出されてもよい。あるいは、プログラムモジュール１０９３及びプログラムデータ１０９４は、ネットワーク（ＬＡＮ（Local Area Network）、ＷＡＮ（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール１０９３及びプログラムデータ１０９４は、他のコンピュータから、ネットワークインタフェース１０７０を介してＣＰＵ１０２０によって読み出されてもよい。 The program module 1093 and program data 1094 may not necessarily be stored on the hard disk drive 1090, but may also be stored on a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100, etc. Alternatively, the program module 1093 and program data 1094 may be stored on another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.

１０修正装置
１１インタフェース部
１２記憶部
１３制御部
１２１置換候補構文情報
１２２コスト情報
１３１生成部
１３２合成部 REFERENCE SIGNS LIST 10 Correction device 11 Interface unit 12 Storage unit 13 Control unit 121 Replacement candidate syntax information 122 Cost information 131 Generation unit 132 Synthesis unit

Claims

a calculation unit that calculates a priority for each template in which at least a part of range characters in a first regular expression is replaced with a placeholder, in accordance with a cost of each operator of the regular expression included in the template that is determined in advance ;
a conversion unit that converts each of the templates in descending order of priority into a second regular expression that eliminates the vulnerability of the first regular expression;
A repair device comprising:

a generator that generates a first set of strings that are accepted by a first regular expression and a second set of strings that are rejected by the first regular expression;
2. The correction device according to claim 1, wherein the conversion unit converts the template into the second regular expression, which is a regular expression in which the placeholder in the template is replaced with a predetermined syntax, and which accepts character strings in the first set and rejects character strings in the second set.

The correction device described in claim 1 or 2, characterized in that the calculation unit calculates the priority based on the number of characters in the template.

The correction device described in claim 1 or 2, characterized in that the calculation unit calculates the priority based on the degree of deviation between the template and the first regular expression as character strings.

A repair method performed by a repair device, comprising:
a calculation step of calculating a priority for each template in which at least a part of range characters in a first regular expression is replaced with a placeholder, in accordance with a predetermined cost for each operator of the regular expression included in the template ;
a conversion step of converting each of the templates in descending order of priority into a second regular expression in which the vulnerability of the first regular expression is eliminated;
A correction method comprising:

a calculation step of calculating a priority for each template obtained by replacing at least a part of range characters in a first regular expression with a placeholder, in accordance with a predetermined cost for each operator of the regular expression included in the template ;
a conversion step of converting each of the templates in descending order of priority into a second regular expression in which the vulnerability of the first regular expression is eliminated;
A fix program characterized by causing a computer to execute the following.