Last updated: Thu, 31 May 2007

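CXX. Regular Expression Functions (Perl-Compatible)

Introduction

The syntax for patterns used in these functions closely resembles Perl. The expression must be enclosed in delimiters, for example a forward slash (/). Any character other than an alphanumeric character or a backslash (\) can be used as a delimiter. If the delimiter character has to be used in the expression itself, it must be escaped with a backslash. Since PHP 4.0.4, the Perl-style matching delimiters (), {}, [], and <> can also be used. See Pattern Syntax for a detailed explanation of patterns.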
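The delimiter rules above can be illustrated with a minimal sketch. The subject string is made up for the example; the three patterns are meant to be equivalent and differ only in the delimiter chosen.

```php
<?php
// Sample subject, invented for this sketch.
$subject = 'text</p>';

// Slash delimiters: the slash inside the pattern must be escaped.
$withSlash = preg_match('/<\/\w+>/', $subject);

// A different delimiter (#) makes the escape unnecessary.
$withHash = preg_match('#</\w+>#', $subject);

// Perl-style bracket delimiters use a matching open/close pair.
$withBraces = preg_match('{</\w+>}', $subject);

var_dump($withSlash, $withHash, $withBraces); // each is int(1)
```

Picking a delimiter that does not occur in the pattern usually keeps the expression easier to read than escaping every occurrence.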

The ending delimiter may be followed by various modifiers that affect the matching. See Pattern Modifiers.
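As a small sketch of how trailing modifiers change matching, the example below runs the same expression with no modifier, with i (case-insensitive), and with i and m (multiline) combined. The subject string is invented for the example; the full list of modifiers is in Pattern Modifiers.

```php
<?php
// Two lines of sample text, invented for this sketch.
$subject = "PHP5\nphp4";

// No modifiers: case-sensitive, and ^ anchors only at the start of the subject.
$plain = preg_match_all('/^php[345]/', $subject, $m1);

// i: case-insensitive, so "PHP5" at the start now matches.
$caseless = preg_match_all('/^php[345]/i', $subject, $m2);

// i and m together: ^ also matches right after each newline.
$multiline = preg_match_all('/^php[345]/im', $subject, $m3);

var_dump($plain, $caseless, $multiline); // int(0), int(1), int(2)
```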

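PHP also supports regular expressions using the POSIX-extended syntax through the POSIX-extended regex functions.

Note: This extension maintains a per-thread global cache of compiled regular expressions (up to 4096 entries).

Warning

PCRE has some limitations. See http://www.pcre.org/pcre.txt for details.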

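Requirements

No external libraries are needed.

Installation

Since PHP 4.2.0, these functions are enabled by default. The PCRE functions can be disabled with --without-pcre-regex. If you are not using the bundled library, use --with-pcre-regex=DIR to specify the directory DIR that contains the PCRE include and library files. For earlier versions, PHP has to be configured and compiled with --with-pcre-regex[=DIR] in order to use these functions.

The Windows version of PHP has built-in support for this extension. No additional extension needs to be loaded in order to use these functions.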

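Runtime Configuration

The behaviour of these functions is affected by settings in php.ini.

Table 241. PCRE configuration options

Name	Default	Changeable	Changelog
pcre.backtrack_limit	100000	PHP_INI_ALL	Available since PHP 5.2.0
pcre.recursion_limit	100000	PHP_INI_ALL	Available since PHP 5.2.0

For further details and definitions of the PHP_INI_* constants, see Appendix I, php.ini directives.

Here is a short explanation of the configuration directives.

pcre.backtrack_limit integer: PCRE's backtracking limit.
pcre.recursion_limit integer: PCRE's recursion limit. Note that if you set this value high, you may use up all the available process stack and crash PHP (by reaching the stack size limit imposed by the operating system).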
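Because both directives are PHP_INI_ALL, they can be changed at runtime with ini_set(). The sketch below lowers pcre.backtrack_limit to an artificially small value (100, chosen only to provoke a failure) and then checks preg_last_error(). The pattern and subject are illustrative assumptions, and the exact outcome may vary with the PCRE version and build options, so no specific output is claimed.

```php
<?php
// Remember the current limit so it can be restored afterwards.
$previous = ini_get('pcre.backtrack_limit');

// Deliberately tiny limit, used only to trigger the error in this sketch.
ini_set('pcre.backtrack_limit', '100');

// Nested alternation and repetition that backtracks heavily because the
// subject contains neither "!" nor "?".
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

if ($result === false) {
    // false signals an error; preg_last_error() tells which one.
    if (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
        echo "Backtrack limit was exhausted\n";
    } else {
        echo "PCRE error code: ", preg_last_error(), "\n";
    }
} else {
    echo "Match result: ", $result, "\n";
}

// Restore the original setting.
ini_set('pcre.backtrack_limit', $previous);
```

Comparing the return value with === false matters here, since 0 ("no match") and false ("error") are otherwise easy to confuse.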

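Resource Types

This extension has no resource types defined.

Predefined Constants

The constants below are defined by this extension, and are only available when the extension has either been compiled into PHP or dynamically loaded at runtime.

Table 242. PREG constants

Constant	Description
PREG_PATTERN_ORDER	Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first capturing subpattern, and so on. This flag is only used with preg_match_all().
PREG_SET_ORDER	Orders results so that $matches[0] is an array of the values captured by the first match, $matches[1] is an array of the values captured by the second match, and so on. This flag is only used with preg_match_all().
PREG_OFFSET_CAPTURE	See the description of PREG_SPLIT_OFFSET_CAPTURE. This flag is available since PHP 4.3.0.
PREG_SPLIT_NO_EMPTY	This flag tells preg_split() to return only non-empty pieces.
PREG_SPLIT_DELIM_CAPTURE	This flag tells preg_split() to also return the values captured by parenthesized subpatterns in the delimiter pattern. This flag is available since PHP 4.0.5.
PREG_SPLIT_OFFSET_CAPTURE	If this flag is set, the string offset of every match is also returned. Note that this changes the return value: each element becomes an array whose element 0 is the matched string and whose element 1 is that string's offset into the subject. This flag is available since PHP 4.3.0 and is only used with preg_split().
PREG_NO_ERROR	Returned by preg_last_error() if there were no errors. Available since PHP 5.2.0.
PREG_INTERNAL_ERROR	Returned by preg_last_error() if there was an internal PCRE error. Available since PHP 5.2.0.
PREG_BACKTRACK_LIMIT_ERROR	Returned by preg_last_error() if the backtrack limit was reached. Available since PHP 5.2.0.
PREG_RECURSION_LIMIT_ERROR	Returned by preg_last_error() if the recursion limit was reached. Available since PHP 5.2.0.
PREG_BAD_UTF8_ERROR	Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when running a regular expression in UTF-8 mode). Available since PHP 5.2.0.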
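The ordering and splitting flags in the table can be compared side by side. This is a minimal sketch with made-up subjects, not an exhaustive tour of the flags.

```php
<?php
$subject = 'a1 b2';

// PREG_PATTERN_ORDER: one array per subpattern.
preg_match_all('/([a-z])(\d)/', $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['a1', 'b2'], [1] is ['a', 'b'], [2] is ['1', '2']

// PREG_SET_ORDER: one array per match.
preg_match_all('/([a-z])(\d)/', $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['a1', 'a', '1'], [1] is ['b2', 'b', '2']

// Drop empty pieces but keep the captured delimiters.
$parts = preg_split('/(,)/', 'x,,y', -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
// $parts is ['x', ',', ',', 'y']

// Each piece becomes [string, offset into the subject].
$withOffsets = preg_split('/,/', 'x,y', -1, PREG_SPLIT_OFFSET_CAPTURE);
// $withOffsets is [['x', 0], ['y', 2]]

print_r($byPattern);
print_r($bySet);
print_r($parts);
print_r($withOffsets);
```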

Examples

Example 1666. Examples of valid patterns

/<\/\w+>/
|(\d{3})-\d+|Sm
/^(?i)php[34]/
{^\s+(\s+)?$}
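To see the valid patterns above in use, the sketch below pairs each one with a sample subject. The subjects are assumptions invented for illustration; only the patterns come from the list.

```php
<?php
// pattern => sample subject (the subjects are made up for this sketch)
$checks = array(
    '/<\/\w+>/'       => '</div>',
    '|(\d{3})-\d+|Sm' => '555-1234',
    '/^(?i)php[34]/'  => 'PHP4',
    '{^\s+(\s+)?$}'   => "  \t ",
);

$results = array();
foreach ($checks as $pattern => $subject) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[$pattern] = preg_match($pattern, $subject);
    printf("%-18s %s\n", $pattern, var_export($results[$pattern], true));
}
```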

Example 1667. Examples of invalid patterns

/href='(.*)' - missing ending delimiter
/\w+\s*\w+/J - unknown modifier 'J'
1-\d3-\d3-\d4| - missing starting delimiter
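A pattern that cannot be compiled makes the preg functions emit a warning and return false. The sketch below feeds the two delimiter mistakes from the list to preg_match(), using the @ operator only to keep the demo output quiet. The modifier case is left out on purpose, because the set of accepted modifiers differs between PHP versions.

```php
<?php
// Missing ending delimiter.
$missingEnd = @preg_match("/href='(.*)'", "href='index.html'");

// Missing starting delimiter: "1" is alphanumeric, so it cannot be a delimiter.
$missingStart = @preg_match('1-\d3-\d3-\d4|', '1-800-555-1234');

// false means "the pattern could not be used", which is different from 0 ("no match").
var_dump($missingEnd, $missingStart); // bool(false), bool(false)
```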

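Pattern Modifiers - modifiers that can be used in regular expression patterns
Pattern Syntax - description of PCRE regular expression syntax
preg_grep - return array entries that match the pattern
preg_last_error - return the error code of the last PCRE regex execution
preg_match_all - perform a global regular expression match
preg_match - perform a regular expression match
preg_quote - quote regular expression characters
preg_replace_callback - perform a regular expression search and replace using a callback
preg_replace - perform a regular expression search and replace
preg_split - split a string by a regular expression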


User Contributed Notes

tabac (at) uu (dot) dk
27-Jul-2007 10:19


Hello bermi



<?php
  if (preg_match("/((a+)?)+/", "a")) {
    echo "Matched";
  }
?>



Segfault is always bad, but realize what you are asking here:



"Is there one or more occurrences of zero or one sequences of one or more 'a' ?"



Considering the backtracking algorithm used, the RE engine must consider an infinite sequence of submatches, all but one of which have a length of zero.



This is a bug, but it is in line with the famous "ls -l /usr/../*/../*/../*/../*/../*" bug

misc at e2007 dot cynergi dot com
05-May-2007 08:16


PCRE faster than POSIX RE? Not always.



In a recent search-engine project here at Cynergi, I had a simple loop with a few cute ereg_replace() calls that took 3 min to process data. I changed that 10-line loop into 100 lines of hand-written replacement code and the loop now took 10 s to process the same data! This opened my eyes to what can *IN SOME CASES* be very slow regular expressions.



Lately I decided to look into Perl-compatible regular expressions (PCRE). Most pages claim PCRE are faster than POSIX, but a few claim otherwise. I decided on benchmarks of my own.



My first few tests confirmed PCRE to be faster, but... the results were slightly different from what others were getting, so I decided to benchmark every case of RE usage I had in an 8000-line secure (and fast) Webmail project here at Cynergi to check it out.



The results? Inconclusive! Sometimes PCRE *are* faster (sometimes by a factor greater than 100x faster!), but some other times POSIX RE are faster (by a factor of 2x).



I have yet to find a rule for when one or the other is faster. It's not only about search data size, amount of data matched, or "RE compilation time", which would show up when you repeat the function often: in that case one would *always* be faster than the other. I didn't find a pattern here. Truth be told, I also didn't take the time to look into the source code and analyse the problem.



I can give you some examples, though. The POSIX RE



([0-9]{4})/([0-9]{2})/([0-9]{2})[^0-9]+

([0-9]{2}):([0-9]{2}):([0-9]{2})



is 30% faster in POSIX than when converted to PCRE (even if you use \d and \D and non-greedy matching). On the other hand, a similarly complex PCRE pattern



/[0-9]{1,2}[ \t]+[a-zA-Z]{3}[ \t]+[0-9]{4}[ \t]+[0-9]{1,2}:[0-9]{1,2}(:[0-9]{1,2})?[ \t]+[+-][0-9]{4}/



is 2.5x faster in PCRE than in POSIX RE. Simple replacement patterns like



ereg_replace( "[^a-zA-Z0-9-]+", "", $m );



are 2x faster in POSIX RE than PCRE. And then we get confused again because a POSIX RE pattern like



(^|\n|\r)begin-base64[ \t]+[0-7]{3,4}[ \t]+......



is 2x faster as POSIX RE, but the case-insensitive PCRE



/^Received[ \t]*:[ \t]*by[ \t]+([^ \t]+)[ \t]/i



is 30x faster than its POSIX RE version!



When it comes to case sensitivity, PCRE has so far seemed to be the best option. But I found some really strange behaviour from ereg/eregi. On a very simple POSIX RE



(^|\r|\n)mime-version[ \t]*:



I found eregi() taking 3.60s (just a number in a test benchmark), while the corresponding PCRE took 0.16s! But if I used ereg() (case-sensitive) the POSIX RE time went down to 0.08s! So I investigated further. I tried to make the POSIX RE case-insensitive itself. I got as far as this:



(^|\r|\n)[mM][iI][mM][eE]-vers[iI][oO][nN][ \t]*:



This version also took 0.08s. But if I try to apply the same rule to any of the 'v', 'e', 'r' or 's' letters that are not changed, the time is back to the 3.60s mark, and not gradually, but immediately so! The test data didn't have any "vers" in it, other "mime" words in it or any "ion" that might be confusing the POSIX parser, so I'm at a loss.



Bottom line: always benchmark your PCRE / POSIX RE to find the fastest!



Tests were performed with PHP 5.1.2 under Windows, from the command line.



Pedro Freire

cynergi.com
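The advice in the note above ("always benchmark") can be followed with a very small timing harness. This is only a sketch: time_pattern() is a hypothetical helper, the subject line and the second candidate pattern are invented, and only PCRE patterns are compared so that the code does not depend on the POSIX functions being available. Timings depend on the machine and PHP build, so no figures are shown.

```php
<?php
// Hypothetical helper: seconds taken by $iterations calls of preg_match().
function time_pattern($pattern, $subject, $iterations = 10000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Sample header line, made up for the benchmark.
$subject = "Received: by mail.example.com with SMTP";

// Two ways of writing a similar case-insensitive match.
$candidates = array(
    'i modifier'       => '/^Received[ \t]*:[ \t]*by[ \t]+([^ \t]+)[ \t]/i',
    'explicit classes' => '/^[Rr]eceived[ \t]*:[ \t]*[Bb][Yy][ \t]+([^ \t]+)[ \t]/',
);

foreach ($candidates as $label => $pattern) {
    printf("%-18s %.4fs\n", $label, time_pattern($pattern, $subject));
}
```

Running each candidate several times against representative data, rather than a single short string, should give a more trustworthy comparison.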

nickspring at mail dot ru
14-Oct-2006 11:47


A regular expressions tutorial in Russian is available at http://www.pcre.ru

lgandras at hotmail dot com
20-Feb-2006 07:19


I read this part, but I couldn't understand a single word because I first needed to know basic regular expressions. Somebody posted a link for Perl, which is almost like PHP, but here is one totally dedicated to PHP:



http://weblogtoolscollection.com/regex/regex.php

Gokul
06-Feb-2006 05:59


I came across this nice tutorial for regular expressions in Perl

http://perldoc.perl.org/perlretut.html

richardh at phpguru dot org
23-Sep-2005 03:50


There's a printable PDF PCRE cheat sheet available here:



http://www.phpguru.org/article.php?ne_id=67



Has the common metacharacters, quantifiers, pattern modifiers, character classes and assertions with short explanations.

hfuecks at nospam dot org
04-Jul-2005 06:21


Good PCRE tutorial at http://www.tote-taste.de/X-Project/regex/ - well explained but still in depth

Ned Baldessin
24-Oct-2004 10:08


If you want to perform regular expressions on Unicode strings, the PCRE functions will NOT be of any help. You need to use the Multibyte extension: mb_ereg(), mb_eregi(), mb_ereg_replace() and so on. When doing so, be careful to set the default text encoding to the same encoding used by the text you are searching and replacing in. You can do that with the mb_regex_encoding() function. You will probably also want to set the default encoding for the other mb_* string functions with mb_internal_encoding().



So when dealing with, say, French text, I start with these:

<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
setlocale(LC_ALL, 'fr-fr');
?>
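As a hedged counterpoint to the note above: the constants table on this page mentions a UTF-8 mode (see PREG_BAD_UTF8_ERROR), which is switched on with the u pattern modifier. The sketch below assumes a PCRE build with UTF-8 support and writes the sample strings as explicit byte escapes, so it does not depend on the encoding of the source file.

```php
<?php
// "cafe" with an accented final letter, written as UTF-8 bytes (C3 A9).
$word = "caf\xC3\xA9";

// Byte mode: "." consumes only the first of the two bytes, so "$" cannot match.
$byteMode = preg_match('/^caf.$/', $word);

// UTF-8 mode (u modifier): "." consumes the whole code point.
$utf8Mode = preg_match('/^caf.$/u', $word);

// Malformed UTF-8 in u mode is reported as an error, not as "no match".
$broken = @preg_match('/./u', "\xFF");
$brokenIsUtf8Error = ($broken === false
    && preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($byteMode, $utf8Mode, $brokenIsUtf8Error); // int(0), int(1), bool(true)
```

Whether PCRE with the u modifier or the mb_* functions is the better fit probably depends on the encodings involved, since the u modifier handles UTF-8 only.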

steve at stevedix dot de
20-Jul-2004 09:17


Something to bear in mind is that regex is actually a declarative programming language like Prolog: your regex is a set of rules which the regex interpreter tries to match against a string. During this matching, the interpreter will assume certain things, and continue assuming them until it comes up against a failure to match, which then causes it to backtrack. Regex assumes "greedy matching" unless explicitly told not to, which can cause a lot of backtracking. A general rule of thumb is that the more backtracking, the slower the matching process.



It is therefore vital, if you are trying to optimise your program to run quickly (and if you can't do without regex), to optimise your regexes to match quickly.



I recommend the use of a tool such as "The Regex Coach" to debug your regex strings.



http://weitz.de/files/regex-coach.exe (Windows installer) http://weitz.de/files/regex-coach.tgz (Linux tar archive)
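The point about greedy matching in the note above can be made concrete with a short sketch. The sample string is invented; the three patterns show the greedy default, a non-greedy quantifier, and a negated character class, which often needs the least backtracking of the three.

```php
<?php
// Sample markup, made up for this sketch.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the end of the string, then backtracks to the LAST </b>.
preg_match('/<b>(.*)<\/b>/', $html, $greedy);

// Non-greedy: .*? grows one character at a time and stops at the FIRST </b>.
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);

// Negated class: [^<]* can never run past the "<", so little backtracking is needed.
preg_match('/<b>([^<]*)<\/b>/', $html, $negated);

var_dump($greedy[1], $lazy[1], $negated[1]);
// string(18) "one</b> and <b>two", string(3) "one", string(3) "one"
```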

Biju
21-Sep-2003 01:00


Regular expression tutorials from non-PHP sites:

   http://www.amk.ca/python/howto/regex/
   http://sitescooper.org/tao_regexps.html
   http://www.english.uga.edu/humcomp/perl/regex2a.html
   http://www.english.uga.edu/humcomp/perl/regexps.html
   http://www.english.uga.edu/humcomp/perl/regular_expressions.HTML
   http://www.english.uga.edu/humcomp/perl/
   http://java.sun.com/docs/books/tutorial/extra/regex/
   http://gnosis.cx/publish/programming/regular_expressions.html
   http://www.zvon.org/other/PerlTutorial/Books/Book1/
   http://it.metr.ou.edu/regex/
   http://www.regular-expressions.info/

hrz at geodata dot soton dot ac dot uk
07-Mar-2002 04:33


If you're venturing into new regular expression territory with a lack of useful examples then it would pay to get familiar with this page:



http://www.pcre.org/man.txt
