If you want to test for FALSE use === instead.
$result = preg_match("/badtest/J",$string);
if($result === FALSE) {
// bad query
error_log("Whoops!");
} else {
echo("Matched " . $result . " times");
}
preg_match
(PHP 4, PHP 5)
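preg_match — Perform a regular expression match
Description
Searches subject for a match to the regular expression given in pattern.
Parameters
- pattern
-
The pattern to search for, as a string.
- subject
-
The input string.
- matches
-
If matches is provided, it is filled with the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will contain the text that matched the first capturing subpattern, and so on.
- flags
-
flags can be the following flag:
- PREG_OFFSET_CAPTURE
- If this flag is passed, the string offset of each match is returned as well. Note that this turns every element of matches into an array whose element 0 is the matched string and whose element 1 is the offset of that string within subject.
- offset
-
Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify an alternate place from which to start the search (in bytes).
Note: Using offset is not equivalent to passing substr($subject, $offset) to preg_match() in place of the subject string, because pattern can contain assertions such as ^, $ or (?<=x). Compare: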
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>
The above example will output:
Array ( )
Now compare this example:
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>
It will output:
Array ( [0] => Array ( [0] => def [1] => 0 ) )
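As a small illustration of the matches and flags parameters described above, the following sketch shows how the shape of $matches changes once PREG_OFFSET_CAPTURE is passed. The subject string, the pattern and the variable names are made up for this example.

```php
<?php
$subject = "id=42";        // hypothetical input
$pattern = '/id=(\d+)/';   // hypothetical pattern with one subpattern

// Without flags: each element of the matches array is a plain string.
// Element 0 is the full match, element 1 the first subpattern.
preg_match($pattern, $subject, $plain);

// With PREG_OFFSET_CAPTURE: each element is array(string, byte offset).
preg_match($pattern, $subject, $withOffsets, PREG_OFFSET_CAPTURE);

print_r($plain);
print_r($withOffsets);
```

Code that reads $matches[1] as a string has to be changed to $matches[1][0] as soon as the flag is added.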
Return Values
preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time, because preg_match() stops searching after the first match. preg_match_all(), on the contrary, continues until it reaches the end of subject. preg_match() returns FALSE if an error occurred.
Changelog
| Version | Description |
|---|---|
| 4.3.3 | The offset parameter was added. |
| 4.3.0 | The PREG_OFFSET_CAPTURE flag was added. |
| 4.3.0 | The flags parameter was added. |
Examples
Example #1 Find the string "php"
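The return values described above can be seen side by side in a short sketch. The subject string is invented for the illustration, and only preg_match() and preg_match_all() are used.

```php
<?php
$subject = "one two three";   // hypothetical input

// preg_match() stops at the first match, so the result is 1, not 3.
$first = preg_match('/\w+/', $subject);

// preg_match_all() keeps searching to the end of the subject.
$all = preg_match_all('/\w+/', $subject, $matches);

// "No match" is 0; only an error yields FALSE, hence the === comparison.
$none = preg_match('/\d+/', $subject);

var_dump($first, $all, $none === 0);
```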
<?php
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
?>
Example #2 Find the word "web"
<?php
/* The \b in the pattern indicates a word boundary, so only the distinct
* word "web" is matched, and not part of a word such as "webbing" or "cobweb" */
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
echo "A match was found.";
} else {
echo "A match was not found.";
}
?>
Example #3 Getting the domain name out of a URL
<?php
// get host name from URL
preg_match('@^(?:http://)?([^/]+)@i',
"http://www.php.net/index.html", $matches);
$host = $matches[1];
// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
?>
The above example will output:
domain name is: php.net
Notes
preg_match
07-Mar-2008 01:21
05-Mar-2008 07:56
To the comment below about the validation of phone numbers.
PEAR offers some brilliant classes for phone number validation.
Check out http://pear.php.net/packages.php?catpid=50&catname=Validate
Regards
Thijs
22-Feb-2008 06:49
<?php
// After not being able to find a comprehensive phone number expression,
// I came up with my own which handles many ways to format a number
//
// Accepted
// 490-5473 (559) 585-1635
// (231)-826-4402 2072444529 315-789-7555 x52
// 708.333.0003 1-559-584-9639 308-882-7111 ext 7234
//
// Rejected
// 765-0600 489-4151 60-415-5389 315-789-7555, x52
function validatePhoneNumber( $sPhoneNum ) {
// The dots are escaped so that they match a literal "." separator only
return preg_match('/^(1(\/|-|\s|\.|))?(\(?\d{3}\)?)?(\/|-|\s|\.|)?\d{3}'
. '(\/|-|\s|\.|)?\d{4}(\/|-|\s|\.|)?((x|ext)(.*)\d+)?$/i', $sPhoneNum );
}
?>
13-Feb-2008 04:35
In regard to Adlez below:
Your function 'Entities' returns an uninitialized variable. Rather a waste of time, don't you think? Perhaps you should check your code before submitting it and save everyone time ....
05-Feb-2008 09:01
> This fixes BK's bugs when checking an email address:
>
> "/^([a-z0-9._-](\+[a-z0-9])*)+@[a-z0-9.-]+\.[a-z]{2,6}$/i"
misplaced that * I guess
>
> Features:
> -Accepts + addressing, and must have characters after the +
> -No special characters such as: ! # $ % & ' * / = ? ^ ` { | } who needs 'em!
> -No spaces.
>
> Caveats still remaining:
> -You should use trim() on your email address first.
Trimming... easy fix, add [:space:] or \s to the front and end of the expression
> -Allows multiple ...... dots
Needs a restructure of the expression - having a different character class before and after dots
> -Allows dots in the .wrong places.
> -Allows domain names with dashes - in the wrong places.
I would come up with the following (mind that I am just getting started with regexes as well):
"/^\s*[a-z][a-z0-9]*(\.[a-z0-9][a-z0-9-]*)*(\+[a-z0-9]*)?
@[a-z0-9][a-z0-9-]*(\.[a-z0-9][a-z0-9-]*)*\.[a-z]{2,6}\s*$/i"
(line break added because of technical problems with php.net)
The only problem that still applies, is the dashes in the wrong places (though the only wrong place is right before a dot). Fixing this is tricky, I tried using lookbehind, but that did not work. You do have to keep in mind that there may only be one letter between two dots.
22-Jan-2008 02:03
This fixes BK's bugs when checking an email address:
"/^([a-z0-9._-](\+[a-z0-9])*)+@[a-z0-9.-]+\.[a-z]{2,6}$/i"
Features:
-Accepts + addressing, and must have characters after the +
-No special characters such as: ! # $ % & ' * / = ? ^ ` { | } who needs 'em!
-No spaces.
Caveats still remaining:
-You should use trim() on your email address first.
-Allows multiple ...... dots
-Allows dots in the .wrong places.
-Allows domain names with dashes - in the wrong places.
17-Jan-2008 05:01
Another pointer regarding the example from BK below.... Email addresses can actually contain a great deal more non-alphanumeric characters than this regexp implies. The characters are listed here in http://www.faqs.org/rfcs/rfc2822.html section 3.2.4.
For validating email addresses prior to an actual emailed challenge, I have been using the following regexp with eregi for years to match the left-hand side of an email address:
^[a-z0-9][a-z0-9!#$%&'*+/=?^_`{|}-]+$
Of course, the right-hand side is this:
^([a-z0-9][a-z0-9-]+\.)+[a-z][a-z]+$
Anything more restrictive violates standards.
Watch your quotes.
15-Jan-2008 10:57
Hi BK, your expression contains two errors:
- any address with a space in the middle of the first part will also be accepted
- addresses with a plus sign in the first part will not be accepted (though they are valid)
12-Jan-2008 11:35
Hey BK, why not just:
$email = trim($_POST["email"]);
You should make a habit out of doing this on anything submitted by a user anyway.
28-Dec-2007 05:01
A quick example of using named recursion and negative lookaheads for finding the outermost div. You can use this same idea for any type of nested tags.
<?php
$sample =
"lead in text to capture <div>
outside div text
<div>
inner div text
<div>
deep nested text
</div>
</div>
bottom of outside div text
</div> end of text to capture";
preg_match(
'#^(?P<a>.*?)(?P<b>.?<div((.(?!<div))|(?P>b))*?.</div>)(?P<c>.*?)$#s',
$sample, $matches);
echo "<pre>";
var_dump($matches);
//$matches['a'] == "lead in text to capture"
//$matches['b'] == the outermost <div> and child contents (with a leading space)
//$matches['c'] == " end of text to capture"
?>
28-Dec-2007 03:19
One note on the regular expressions provided that claim to validate e-mail addresses: They're incomplete. To quote the note submission page, just a couple paragraphs above the box where you type in your note:
"(And if you're posting an example of validating email addresses, please don't bother. Your example is almost certainly wrong for some small subset of cases. See this information from O'Reilly Mastering Regular Expressions book [http://examples.oreilly.com/regex/readme.html] for the gory details.)"
That said, the expressions as provided aren't COMPLETELY irrelevant -- they WILL validate MOST e-mail addresses, and you won't really be blocking any significant portion of the population by using them. Just be aware of the limitations.
23-Dec-2007 05:20
I found Frebby's post below from 28-Oct-2007 to be a rock solid way to validate a user's e-mail address, and it even accepts subdomains such as .co.uk. However, it fails if the user or the browser adds a space before the text entry (this sometimes happens when clicking into a form field, particularly in IE 6).
There may be other ways to address that problem, but here's a simple fix that works well within basic form processing scripts:
<?php
$email = $_POST["email"];
$errorurl = "your error page URL here" ;
if (!preg_match("/^[\ a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,6}$/i", $email)) {
header( "Location: $errorurl" );
exit ;
}
?>
The difference? Just add a backslash and a space before the a-z portion of the first segment:
[\ a-z0-9._-]+@
That's it! Enjoy
18-Dec-2007 08:52
To test if a regular expression is syntactically correct:
<?php
function preg_test($regex)
{
    // preg_match() returns FALSE (not 0) when the pattern does not compile
    if (@preg_match($regex, '') === false)
    {
        $error = error_get_last();
        throw new Exception(substr($error['message'],70));
    }
    else
        return true;
}
?>
usage:
<?php
if (preg_test('/.*/i'))
    print "correct!";
// Prints "correct!"
?>
<?php
if (preg_test('/.**/i'))
    print "correct!";
// Throws exception with message 'Compilation failed: nothing to repeat at offset 2'
?>
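A variation on the same idea, as a rough sketch only: instead of suppressing the warning and reading it back afterwards, a temporary error handler can collect the message. The helper name regex_compile_error() is hypothetical, and the exact wording of the warning depends on the PHP and PCRE versions, so the sketch only checks whether a message was produced.

```php
<?php
// Hypothetical helper: returns null when the pattern compiles, or the
// warning text PHP emitted when it does not.
function regex_compile_error($regex)
{
    $message = null;

    // Temporarily route warnings into $message instead of the output.
    set_error_handler(function ($errno, $errstr) use (&$message) {
        $message = $errstr;
        return true; // mark the warning as handled
    });
    $result = preg_match($regex, '');
    restore_error_handler();

    // FALSE (not 0) signals that preg_match() itself failed.
    return $result === false ? $message : null;
}

var_dump(regex_compile_error('/.*/i') === null);
var_dump(is_string(regex_compile_error('/(unclosed/')));
```

Returning the message rather than throwing leaves the caller free to decide whether a bad pattern is fatal.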
07-Dec-2007 04:08
Not quite sure why no one has posted this before (unless I missed it somewhere) but "Example 1531. Getting the domain name out of a URL" clearly doesn't work for domains such as .co.uk.
Here is a simple improvement that does a much better job at extracting a domain from a URL (though not perfect). It assumes the following: country code TLD's have two letters (eg .uk, .jp, .au), and their subdomains have two or three letters, (eg .gov.uk, .co.uk). These are parsed in three parts. Anything else is parsed in two parts.
Hope it helps!
Noel
<?php
function extract_domain($url){
preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
$host = $matches[1];
// get last three segments of host name if country code TLD with sub domain, eg .co.uk
preg_match('/[^.]+\.[^.]{2,3}\.[^.]{2}$/', $host, $matches);
if (empty($matches)) {
// get last two segments of host name if generic TLD
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
}
return $matches[0];
}
?>
02-Dec-2007 06:03
Here's a nice workaround to check if your regex is valid.
Sometimes PHP may throw an error like:
Warning: preg_match() [function.preg-match]: Unknown modifier '$' in foo.php on line 2
You can't really tell whether the 'false' value was returned because the rule isn't valid, or because your regex rule doesn't really match your string.
To find out what is the deal with your regex rule (maybe you're building it on fly, etc), you can find out if the "false" returned result is really coming because the string doesn't match, or because a warning was issued.
For the last part, I like to use try... catch expressions, so it's highly recommended, let's have a look:
<?php
function testMyRule($rule, $string) {
// don't forget to enable warning reporting if disabled
try {
/*
catch the preg match output warning, inside buffer
*/
ob_start(); // start buffer
$result = serialize(preg_match($rule, $string));
$pwarnings = ob_get_contents(); // get results, including the warning if any
ob_end_clean(); // clean output
if (strpos($pwarnings, 'Warning') !== false) { // is warning? (position 0 counts too)
throw new Exception($pwarnings);
}
return unserialize($result); //
} catch(Exception $e) {
echo $e->getMessage();
die();
}
} // end of function
?>
Now, you have to make sure error reporting will allow warnings, and next, we'll serialize the result of your preg_match function applied against your string.
If this issues a warning, we catch inside the buffer, and later see if the buffer contains the warning.
We're serializing/unserializing the result of our preg_match, because if we didn't serialize it, it would come back as a string instead of a boolean.
Enjoy!
Vladimir Ghetau
19-Nov-2007 07:19
If you try to find the offset when searching in a UTF-8 string (containing multibyte characters, such as Cyrillic characters) with preg_match, using the PREG_OFFSET_CAPTURE flag, you may get a different result from what you expected.
First of all you must have compiled PHP with Multibyte Support (mbstring). Then you must either use the Multibyte Support functions (mb_*) or turn on some PHP runtime configuration settings (php.ini, Apache vhost conf file, .htaccess or somewhere else):
php_value default_charset UTF-8
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
php_value mbstring.detect_order UTF-8
When using preg_match with the PREG_OFFSET_CAPTURE flag on a UTF-8 string, the function counts bytes and NOT characters, so a multibyte character counts as 2 or more bytes rather than 1 character. That's why the offset will be larger than you expected.
My simple solution is using mb_strpos:
...
preg_match($pattern, $found_text, $matches, PREG_OFFSET_CAPTURE);
// This will convert $matches[0][1] multibyte byte length to multibyte character length (UTF-8)
$matches[0][1] = mb_strpos($found_text, $matches[0][0]);
...
P.S. The $pattern variable must use "/u" switch for Unicode!!!
-------------------------------------------------
PHP Version 5.2.4
Multibyte regex (oniguruma) version 4.4.4
-------------------------------------------------
28-Oct-2007 06:12
If you wonder how to check for correct e-mail and such (you can use it for usernames and anything you want, but this is for e-mail) you can use this little code to validate the users e-mail:
We'll assume that they have been processing a form, entering their e-mail as "email" and now PHP will take care of the rest:
$emailcheck = $_POST["email"];
if(!preg_match("/^[a-z0-9\å\ä\ö._-]+@
[a-z0-9\å\ä\ö.-]+\.[a-z]{2,6}$/i", $emailcheck))
$errors[] = "- Your e-mail is missing or is not valid.";
(note that the preg_match had to be cut or I couldn't post it since it was too long so I cut it after @ so just put them together again.)
If we split the parts it would look like this:
[a-z0-9._-]+@
This is the name of the email, such as greatguy3 (then @domain.com), so this allows dot, underscore and - as well as alphabetical letters and digits.
[a-z0-9.-]+\.
This is the domain part, note that there must be a dot after domain name, so it's harder to fake an email. Same here though, A-Z, 0-9, dot and - (if your domain has - in it, such as nintendo-wii.com)
[a-z]{2,6}$/i
This is the last part of your email, the .com/.net/.info, whichever you use. The numbers between {} is how many letters are limited (in this case min 2 and max 6) and it would allow "us" up to "org.uk" and "museum" and only A-Z letters are used for obvious reasons. The "i" there is there so you can use both uppercase and lowercase characters. (A-Z & a-z)
So a valid email address with this code would be "coolguy3@cooldomain.com" and a nonvalid one would be "zûmgz^;*@hot_mail.bananá"
This is only the email part, so this is not the fullcode. Paste this in your form process to use it with the rest of your code!
Hope this helps.
11-Oct-2007 04:39
Here is a sample code to check for alphabetic characters only with an exception to space, hyphen and single quotes using preg_match().
$alpha = "some very funny string'9-2'";
/* check for alphabets and hyphens, quotes and space in the string but no numbers */
if(preg_match("/^[a-zA-Z\-\'\ ]+$/u", $alpha)){
return 1;
}else
return 0;
To allow another character, add a '\' followed by that character to the class, e.g. [\@].
I hope this will be helpful to someone.
-Erandra
26-Sep-2007 07:22
Quick function to filter input.
Filters any javascript, html, sql injections, and RFI.
<?php
function entities($text){
$text = "";
for ( $i = 0; $i <= strlen($text) - 1; $i += 1) {
$text .= "&#" .ord($text{$i});
}
return $eresult;
}
function filter($text){
if (preg_match("#(on(.*?)\=|script|xmlns|expression|
javascript|\>|\<|http)#si","$text",$ntext)){
$re = entities($ntext[1]);
$text = str_replace($ntext[0],$re,$text);
}
$text = mysql_real_escape_string($text);
return $text;
}
foreach ($_POST as $x => $y){
$_POST[$x] = filter($y);
}
foreach ($_GET as $x => $y){
$_GET[$x] = filter($y);
}
foreach ($_COOKIE as $x => $y){
$_COOKIE[$x] = filter($y);
}
?>
26-Aug-2007 03:17
regex for validating emails, from Perl's RFC2822 package:
http://en.wikipedia.org/wiki/Talk:E-mail_address
01-Aug-2007 10:06
>>what about .mil, .golf,.tv etc etc
ICANN Does not list .golf TLD
A complete List of Top Level Domains from ICANN here:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
I also found this article about verifying email addresses:
http://www.regular-expressions.info/email.html
26-Jul-2007 08:47
Maybe it will sound obvious, but I've encountered this a few times...
If you are using preg_match() to validate user input, remember to include ^ and $ in your regex, or take the input from $matches[0] after successfully matching a pattern, i.e.
preg_match('/[0-9]+/', '123 UNION SELECT ... --') will return 1, but when you use the input in a SQL statement, the injected code will probably be executed (if you don't escape the user argument). Note that $matches[0] == '123', so it can be used as valid input.
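To make the difference concrete, here is a minimal sketch that runs the hostile input from this note against an unanchored and an anchored pattern. The use of \A and \z instead of ^ and $ is an addition to the note: a plain $ also matches just before a trailing newline, which \z does not.

```php
<?php
$input = "123 UNION SELECT ... --";  // the hostile input from the note

// Unanchored: succeeds because "123" appears somewhere in the input.
$unanchored = preg_match('/[0-9]+/', $input);

// Anchored with \A and \z: the whole string must consist of digits.
$anchored = preg_match('/\A[0-9]+\z/', $input);

// "$" alone still accepts a trailing newline; "\z" rejects it.
$dollar = preg_match('/^[0-9]+$/', "123\n");
$strict = preg_match('/\A[0-9]+\z/', "123\n");

var_dump($unanchored, $anchored, $dollar, $strict);
```

Anchoring is only a format check; the value still needs proper escaping or a bound parameter before it reaches SQL.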
25-Jul-2007 05:44
Match and replace for arrays. Useful for parsing entire $_POST
Only array_preg_match examples:
<?php
function array_preg_match(array $patterns, array $subjects, &$errors = array()) {
$errors = array();
foreach ($patterns as $k => $v) preg_match($v, $subjects[$k]) or $errors[$k] = TRUE;
return count($errors) == 0 ? TRUE : FALSE;
}
function array_preg_replace(array $patterns, array $replacements, array $subject) {
$r = array();
foreach ($patterns as $k => $v) $r[$k] = preg_replace($v, $replacements[$k], $subject[$k]);
return $r+$subject;
}
$arr1 = array('name' => 'Alexandre', 'phone' => '44559999');
$arr2 = array('name' => '', 'phone' => '44559999c');
array_preg_match(array(
'name' => '#.+#', //Not empty
'phone' => '#^\d*$#' // Only digits, optional (anchored so that trailing letters are rejected)
), $arr1, $match_errors);
print_r($match_errors); // Empty, it is ok.
array_preg_match(array(
'name' => '#.+#', //Not empty
'phone' => '#^\d*$#' // Only digits, optional (anchored so that trailing letters are rejected)
), $arr2, $match_errors);
print_r($match_errors); // Two indexes, name and phone, both not ok.
?>
23-Jul-2007 07:22
Ne'er try to verify email address by using some random regex you just invented sitting on the toilet seat. It will not work properly. The proper regex for email validation is something along the lines of
"([-!#$%&'*+/=?_`{|}~a-z0-9^]
+(\.[-!#$%&'*+/=?_`{|}~a-z0-9
^]+)*|"([\x0b\x0c\x21\x01-\x08\
x0e-\x1f\x23-\x5b\x5d-\x7f]|\\[\x
0b\x0c\x01-\x09\x0e-\x7f])*")@((
[a-z0-9]([-a-z0-9]*[a-z0-9])?\.)+[
a-z0-9]([-a-z0-9]*[a-z0-9]){1,1}|
\[((25[0-5]|2[0-4][0-9]|[01]?[0-9]
[0-9]?)\.){3,3}(25[0-5]|2[0-4][0-9
]|[01]?[0-9][0-9]?|[-a-z0-9]*[a-z0
-9]:([\x0b\x0c\x01-\x08\x0e-\x1f\
x21-\x5a\x53-\x7f]|\\[\x0b\x0c\x0
1-\x09\x0e-\x7f])+)\])".
However, you shouldn't even try that regex. If you do not understand what that regexp does, then please do not try to write one yourself. If you need a _truly_ _valid_ e-mail address, no regexp is going to help you - just send a verification message to the user-supplied address with a link or code the user can paste to verify the address. IF you still WISH - against my recommendation - to use some validating regexp then *please* just make it warn loudly that the address may be invalid; do not write code that throws a fatal error outright. I am quite fed up with sites that do not accept my .name e-mail address, or some other valid, working forms for that matter.
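For a loose pre-check before sending the verification message, PHP's own filter extension can stand in for a hand-written regexp. This is a different technique from the regexp above, shown only as a sketch: the helper name looks_like_email() is hypothetical, and passing the check says nothing about whether the mailbox actually exists.

```php
<?php
// Hypothetical helper: a format pre-check only. A verification message
// is still needed to know that the address really works.
function looks_like_email($address)
{
    // filter_var() returns the address on success and FALSE otherwise.
    return filter_var($address, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(looks_like_email("someone@example.com"));
var_dump(looks_like_email("not an address"));
```

In the spirit of the note, a failed check is better treated as a warning to the user than as a fatal error.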
11-Jul-2007 05:33
I just started using PHP and this section doesn't clarify whether or not you must use "/" as your regular expression delimiters.
I want to clarify that you can use almost any character as your delimiter. The delimiter is automatically the first character of your regular expression string. This makes it a bit easier if you are looking for things that might contain a forward slash. For example:
preg_match('#</b>#', $string);
Instead of:
preg_match('/<\/b>/', $string);
Or:
preg_match('@/my/dir/name/@', $string);
Instead of:
preg_match('/\/my\/dir\/name\//', $string);
This can greatly boost readability. Not quite as flexible as in Perl (You can't use control characters or \n which can really come in handy when you aren't quite sure what characters might be in your regular expression), but switching to another delimiter can make your code a bit easier to read.
06-Dec-2006 11:19
This is a function to convert byte offsets into (UTF-8) character offsets (this is regardless of whether you use the /u modifier):
<?php
function mb_preg_match($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = 0, $pn_offset = 0, $ps_encoding = NULL) {
    // WARNING! - All this function does is to correct offsets, nothing else:
    //
    if (is_null($ps_encoding))
        $ps_encoding = mb_internal_encoding();
    // turn the character offset passed in into the byte offset preg_match() expects
    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);
    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
        // turn each reported byte offset back into a character offset
        foreach($pa_matches as &$ha_subpattern)
            $ha_subpattern[1] = mb_strlen(substr($ps_subject, 0, $ha_subpattern[1]), $ps_encoding);
        unset($ha_subpattern); // drop the reference left behind by foreach
    }
    return $ret;
}
?>
18-Aug-2006 04:27
Concerning the German umlauts (and other language-specific chars as accented letters etc.): If you use unicode (utf-8), you can match them easily with the unicode character property \pL (match any unicode letter) and the "u" modifier, so e.g.
<?php preg_match("/[\w\pL]/u",$var); ?>
would really match all "words" in $var - whether they contain umlauts or not. Took me a while to figure this out, so maybe this comment will save the day for someone else :-)
30-Jan-2006 03:17
The assertion \G can be used in a regular expression together with the offset parameter. \G matches only if the current position in 'subject' is the same as specified by the index 'offset'. It is comparable to the ^ assertion, but whereas ^ matches at position 0, \G matches at position 'offset'.
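A short sketch of the difference, reusing the "abcdef" subject from the manual's own offset example (the variable names are made up):

```php
<?php
$subject = "abcdef";

// "^" is tied to the start of the subject, so it fails at offset 3.
$caret = preg_match('/^def/', $subject, $m1, 0, 3);

// "\G" is tied to the starting offset, so it succeeds there.
$g = preg_match('/\Gdef/', $subject, $m2, 0, 3);

var_dump($caret, $g, $m2);
```

This makes \G handy for tokenizers that call preg_match() repeatedly and advance the offset after each token.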
26-Jan-2006 08:18
Intending to use preg_match to check whether an email address is in a valid format? The following page contains some very useful information about possible formats of email addresses, some of which may surprise you: http://en.wikipedia.org/wiki/E-mail_address
28-Dec-2005 01:27
Here's a format for matching US phone numbers in the following formats:
###-###-####
(###) ###-####
##########
It restricts the area codes to >= 200 and exchanges to >= 100, since values below these are invalid.
<?php
$pattern = "/(\([2-9]\d{2}\)\s?|[2-9]\d{2}-|[2-9]\d{2})"
. "[1-9]\d{2}"
. "-?\d{4}/";
?>
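The pattern above is not anchored, so it also matches a phone number embedded in a longer string. A possible way to use it for validation, assuming the whole input should be a phone number, is to wrap it in \A and \z. The anchors and the sample numbers below are additions for illustration, not part of the original note.

```php
<?php
// The pattern from the note, wrapped in \A ... \z so that the whole
// input has to be a phone number (the anchors are an addition).
$pattern = '/\A(\([2-9]\d{2}\)\s?|[2-9]\d{2}-|[2-9]\d{2})'
         . '[1-9]\d{2}'
         . '-?\d{4}\z/';

// Made-up sample numbers and the result expected for each.
$samples = array(
    '212-555-0100'   => 1,  // ###-###-####
    '(212) 555-0100' => 1,  // (###) ###-####
    '2125550100'     => 1,  // ##########
    '112-555-0100'   => 0,  // area code below 200
    '212-055-0100'   => 0,  // exchange below 100
);

foreach ($samples as $number => $expected) {
    $result = preg_match($pattern, $number);
    echo $number, ' => ', $result, "\n";
}
```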
27-Oct-2005 02:37
Test for valid US phone number, and get it back formatted at the same time:
function getUSPhone($var) {
$US_PHONE_PREG ="/^(?:\+?1[\-\s]?)?(\(\d{3}\)|\d{3})[\-\s\.]?"; //area code
$US_PHONE_PREG.="(\d{3})[\-\.]?(\d{4})"; // seven digits
$US_PHONE_PREG.="(?:\s?x|\s|\s?ext(?:\.|\s)?)?(\d*)?$/"; // any extension
if (!preg_match($US_PHONE_PREG,$var,$match)) {
return false;
} else {
$tmp = "+1 ";
if (substr($match[1],0,1) == "(") {
$tmp.=$match[1];
} else {
$tmp.="(".$match[1].")";
}
$tmp.=" ".$match[2]."-".$match[3];
if ($match[4] <> '') $tmp.=" x".$match[4];
return $tmp;
}
}
usage:
$phone = $_REQUEST["phone"];
if (!($phone = getUSPhone($phone))) {
//error gracefully :)
}
05-Jul-2005 01:03
Do not forget PCRE has many compatible features with Perl.
One that is often neglected is the ability to return the matches as an associative array (Perl's hash).
For example, here's a code snippet that will parse a subset of the XML Schema 'duration' datatype:
<?php
$duration_tag = 'PT2M37.5S'; // 2 minutes and 37.5 seconds
// drop the milliseconds part
preg_match(
'#^PT(?:(?P<minutes>\d+)M)?(?P<seconds>\d+)(?:\.\d+)?S$#',
$duration_tag,
$matches);
print_r($matches);
?>
Here is the corresponding output:
Array
(
[0] => PT2M37.5S
[minutes] => 2
[1] => 2
[seconds] => 37
[2] => 37
)
12-Feb-2005 03:03
Pointing to the post of "internet at sourcelibre dot com": Instead of using PerlRegExp for e.g. german "Umlaute" like
<?php
$bolMatch = preg_match("/^[a-zA-ZäöüÄÖÜ]+$/", $strData);
?>
use the setlocale() function and the POSIX character class like
<?php
setlocale (LC_ALL, 'de_DE');
$bolMatch = preg_match("/^[[:alpha:]]+$/", $strData);
?>
This works for any country related special character set.
Remember since the "Umlaute"-Domains have been released it's almost mandatory to change your RegExp to give those a chance to feed your forms which use "Umlaute"-Domains (e-mail and internet address).
Life can be so easy reading the manual ;-)
13-Jan-2005 10:11
Note that the PREG_OFFSET_CAPTURE flag, as far as I've tested, returns the offset in bytes not characters, which may not be what you're expecting if you're using the /u pattern modifier to make the regex UTF-8 aware (i.e. multibyte characters will result in a greater offset than you expect)
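The effect is easy to reproduce. The sketch below uses a made-up UTF-8 subject, written with byte escapes so that the encoding is unambiguous, and converts the byte offset into a character offset by counting the characters in front of the match. Counting with preg_match_all() is just one possible approach, chosen here to avoid depending on mbstring.

```php
<?php
// "héllo wörld", with the two accented letters as explicit UTF-8 bytes.
$subject = "h\xC3\xA9llo w\xC3\xB6rld";

preg_match('/w\S+/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];   // offset in bytes, as reported by PCRE

// Count the UTF-8 characters in front of the match to get the
// offset in characters instead.
$prefix = substr($subject, 0, $byteOffset);
$charOffset = preg_match_all('/./us', $prefix, $unused);

var_dump($byteOffset, $charOffset);
```

The two values differ by one here because the "é" in front of the match occupies two bytes.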
03-Feb-2004 11:30
<?php // some may find this useful... :)
$iptables = file ('/proc/net/ip_conntrack');
$services = file ('/etc/services');
$GREP = '!([a-z]+) ' .// [1] protocol
'\\s*([^ ]+) ' .// [2] protocol in decimal
'([^ ]+) ' .// [3] time-to-live
'?([A-Z_]|[^ ]+)?'.// [4] state
' src=(.*?) ' .// [5] source address
'dst=(.*?) ' .// [6] destination address
'sport=(\\d{1,5}) '.// [7] source port
'dport=(\\d{1,5}) '.// [8] destination port
'src=(.*?) ' .// [9] reversed source
'dst=(.*?) ' .//[10] reversed destination
'sport=(\\d{1,5}) './/[11] reversed source port
'dport=(\\d{1,5}) './/[12] reversed destination port
'\\[([^]]+)\\] ' .//[13] status
'use=([0-9]+)!'; //[14] use
$ports = array();
foreach($services as $s) {
if (preg_match ("/^([a-zA-Z-]+)\\s*([0-9]{1,5})\\//",$s,$x)) {
$ports[ $x[2] ] = $x[1];
} }
for($i=0;$i < count($iptables);$i++) { // "<" so the loop stays inside the array
if ( preg_match ($GREP, $iptables[$i], $x) ) {
// translate known ports... . .
$x[7] =(array_key_exists($x[7],$ports))?$ports[$x[7]]:$x[7];
$x[8] =(array_key_exists($x[8],$ports))?$ports[$x[8]]:$x[8];
print_r($x);
} // on a nice sortable-table... bon appetite!
}
?>
18-Jan-2004 04:31
As I did not find any working IPv6 regexp, I just created one. Here it is:
$pattern1 = '([A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}';
$pattern2 = '[A-Fa-f0-9]{1,4}::([A-Fa-f0-9]{1,4}:){0,5}[A-Fa-f0-9]{1,4}';
$pattern3 = '([A-Fa-f0-9]{1,4}:){2}:([A-Fa-f0-9]{1,4}:){0,4}[A-Fa-f0-9]{1,4}';
$pattern4 = '([A-Fa-f0-9]{1,4}:){3}:([A-Fa-f0-9]{1,4}:){0,3}[A-Fa-f0-9]{1,4}';
$pattern5 = '([A-Fa-f0-9]{1,4}:){4}:([A-Fa-f0-9]{1,4}:){0,2}[A-Fa-f0-9]{1,4}';
$pattern6 = '([A-Fa-f0-9]{1,4}:){5}:([A-Fa-f0-9]{1,4}:){0,1}[A-Fa-f0-9]{1,4}';
$pattern7 = '([A-Fa-f0-9]{1,4}:){6}:[A-Fa-f0-9]{1,4}';
Patterns 1 to 7 represent different cases. $full is the complete pattern, which should work for all correct IPv6 addresses.
$full = "/^($pattern1)$|^($pattern2)$|^($pattern3)$"
. "|^($pattern4)$|^($pattern5)$|^($pattern6)$|^($pattern7)$/";
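As a point of comparison rather than a replacement for the patterns above, PHP's filter extension offers a built-in IPv6 check. The helper name is_ipv6() is hypothetical, and the addresses are documentation examples.

```php
<?php
// Hypothetical helper built on PHP's own validator rather than a regex.
function is_ipv6($address)
{
    // filter_var() returns the address on success and FALSE otherwise.
    return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(is_ipv6("2001:db8::1"));   // compressed IPv6 form
var_dump(is_ipv6("192.0.2.1"));     // an IPv4 address
```

Running both the regexp and this check over the same list of addresses is a cheap way to find cases the hand-written patterns miss.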
24-Nov-2003 06:23
A web server log record can be parsed as follows:
$line_in = '209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"';
if (preg_match('!^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) ([^ ]+) (.+)!',
$line_in,
$elements))
{
print_r($elements);
}
Array
(
[0] => 209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
[1] => 209.6.145.47
[2] => -
[3] => -
[4] => 22/Nov/2003:19:02:30 -0500
[5] => GET
[6] => /dir/doc.htm
[7] => HTTP
[8] => 1.0
[9] => 200
[10] => 6776
[11] => "http://search.yahoo.com/search?p=key+words=UTF-8"
[12] => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
)
Notes:
1) For the referer field ($elements[11]), I intentionally capture the double quotes (") and don't use them as delimiters, because sometimes double quotes do appear in a referer URL. Double quotes can appear as %22 or \". Both have to be handled correctly. So, I strip off the double quotes in a second step.
2) The URLs should be further parsed, using parse_url, which is quicker and more reliable than preg_match.
3) I assume the requested protocol (HTTP/1.1) always has a slash character in the middle, which might not always be the case, but I'll take the risk.
4) The agent field ($elements[12]) is the most unstructured field, so I make no assumptions about its format. If the record is truncated, the agent field will not be delimited properly with a quote at the end. So, both cases must be handled.
5) A hyphen (- or "-") means a field has no value. It is necessary to convert these to an appropriate value (such as empty string, null, or 0).
6) Finally, there should be appropriate code to handle malformed web log entries, which are common, due to junk data. I never assume I've seen all cases.
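Notes 1, 2 and 5 could be followed up along these lines. This is only a sketch: the $field value is the referer captured in the example output above, and the variable names are made up.

```php
<?php
// Referer field as captured in $elements[11] above, quotes included.
$field = '"http://search.yahoo.com/search?p=key+words=UTF-8"';

// Strip the surrounding double quotes in a second step (note 1) and
// treat a lone hyphen as "no value" (note 5).
$referer = trim($field, '"');
$referer = ($referer === '-') ? null : $referer;

// Let parse_url() split the URL instead of another regex (note 2).
$parts = ($referer === null) ? array() : parse_url($referer);

print_r($parts);
```

parse_url() can return FALSE for seriously malformed URLs, so real log processing should check for that as well.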
01-Apr-2003 10:56
If you want to match all Scandinavian characters (æÆøØåÅöÖäÄ) in addition to those matched by \w, you might want to use this regexp:
/^[\w\xe6\xc6\xf8\xd8\xe5\xc5\xf6\xd6\xe4\xc4]+$/
Remember that \w respects the current locale used in PCRE's character tables.