preg_quote


preg_match_all

Last updated: Thu, 31 May 2007

preg_match

(PHP 4, PHP 5)

preg_match — Perform a regular expression match

Description

int preg_match ( string $pattern, string $subject [, array &$matches [, int $flags [, int $offset]]] )

Searches subject for a match to the regular expression given in pattern.

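Parameters

pattern

The pattern to search for, as a string.

subject

The input string.

matches

If matches is provided, it is filled with the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will contain the text that matched the first captured parenthesized subpattern, and so on.

flags

flags can be the following flag:

PREG_OFFSET_CAPTURE: If this flag is passed, the offset of each match within the subject string is returned as well. Note that this turns every element of matches into an array whose element 0 is the matched string and whose element 1 is the offset of that string in subject.

offset

Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify an alternate place from which to start the search.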

Note: Using offset is not equivalent to passing substr($subject, $offset) to preg_match() in place of the subject string, because pattern can contain assertions such as ^, $ or (?<=x). Compare:
<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>
The above example will output:
Array
(
)

         
Now compare with this alternative:

<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, substr($subject, 3), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>
This produces the following output:
Array
(
    [0] => Array
        (
            [0] => def
            [1] => 0
        )

)

         

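Return Values

preg_match() returns the number of times pattern matches. That will be either 0 times (no match) or 1 time, because preg_match() stops searching after the first match. preg_match_all(), by contrast, continues until it reaches the end of subject. preg_match() returns FALSE if an error occurred.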
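Because the function can return 0 as well as FALSE, a strict comparison helps tell "no match" apart from an error. The following is only a sketch: the helper name describe_match() is made up for illustration and is not part of PHP, and the @ operator is used here merely to keep the warning for the deliberately broken pattern out of the output.

```php
<?php
// Hypothetical helper: classify the three possible results of preg_match().
function describe_match($pattern, $subject)
{
    // @ hides the warning that a malformed pattern triggers
    $result = @preg_match($pattern, $subject);

    if ($result === false) {
        return 'error';      // e.g. the pattern failed to compile
    }
    return $result === 1 ? 'match' : 'no match';
}

echo describe_match('/php/i', 'PHP is the web scripting language of choice.'), "\n";
echo describe_match('/perl/', 'PHP is the web scripting language of choice.'), "\n";
echo describe_match('/(unclosed/', 'anything'), "\n";
```

A loose check such as `if (!preg_match(...))` would treat the error case and the no-match case alike, which may or may not be what you want.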

ChangeLog

Version	Description
4.3.3	The `offset` parameter was added.
4.3.0	The `PREG_OFFSET_CAPTURE` flag was added.
4.3.0	The `flags` parameter was added.

Examples

Example 1671. Find the string of text "php"


<?php

// The "i" after the pattern delimiter indicates a case-insensitive search

if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {

    echo "A match was found.";

} else {

    echo "A match was not found.";

}

?>

Example 1672. Find the word "web"


<?php

/* The \b in the pattern indicates a word boundary, so only the distinct
 * word "web" is matched, and not a word partial like "webbing" or "cobweb" */

if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {

    echo "A match was found.";

} else {

    echo "A match was not found.";

}



if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {

    echo "A match was found.";

} else {

    echo "A match was not found.";

}

?>

Example 1673. Getting the domain name out of a URL


<?php

// get host name from URL

preg_match('@^(?:http://)?([^/]+)@i',

    "http://www.php.net/index.html", $matches);

$host = $matches[1];



// get last two segments of host name

preg_match('/[^.]+\.[^.]+$/', $host, $matches);

echo "domain name is: {$matches[0]}\n";

?>

The above example will output:


domain name is: php.net

Notes

Tip

Do not use preg_match() if you only want to check whether one string is contained in another string. Use strpos() or strstr() instead, as they will be faster.
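As a rough illustration of the tip, the two checks below give the same answer for a fixed needle. The variable names are invented for this sketch; preg_quote() is included only to show what the regular expression variant would need if the needle came from a variable.

```php
<?php
$haystack = 'PHP is the web scripting language of choice.';

// strpos() returns the byte position of the first occurrence, or false.
// The position can be 0, so compare with !== false instead of testing truthiness.
$found_plain = strpos($haystack, 'web') !== false;

// The equivalent regular expression check; preg_quote() escapes any
// characters in the needle that are special in a pattern.
$found_regex = preg_match('/' . preg_quote('web', '/') . '/', $haystack) === 1;

var_dump($found_plain, $found_regex);
```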

See Also

preg_match_all()
preg_replace()
preg_split()


User Contributed Notes

creature
01-Aug-2007 10:06


>>what about .mil, .golf,.tv etc etc



ICANN Does not list .golf TLD



A complete List of Top Level Domains from ICANN here:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt



I also found this article about verifying email addresses:

http://www.regular-expressions.info/email.html

razortongue
26-Jul-2007 08:47


Maybe it will sound obvious, but I've encountered this a few times...



If you are using preg_match() to validate user input, remember to anchor your regex with ^ and $, or take the input from $matches[0] after a successful match. For example,

preg_match('/[0-9]+/', '123 UNION SELECT ... --') will return 1, but if you then use the whole input in an SQL statement, the injected code will probably be executed (if you don't escape the user-supplied argument). Note that $matches[0] == '123', so it can be used as valid input.
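A small sketch of the difference, using the same input as above. The variable names are illustrative. \A and \z are used instead of ^ and $ because, by default, $ also matches just before a trailing newline.

```php
<?php
$input = '123 UNION SELECT ... --';

// Unanchored: succeeds because the subject merely contains digits.
$loose = preg_match('/[0-9]+/', $input, $matches);

// Anchored: \A and \z force the whole subject to consist of digits.
$strict = preg_match('/\A[0-9]+\z/', $input);

var_dump($loose);        // int(1)
var_dump($matches[0]);   // string(3) "123"
var_dump($strict);       // int(0)
```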

alexandre at NO-DAMN-SPAM-BOTS-gaigalas dot net
25-Jul-2007 05:44


Match and replace for arrays. Useful for parsing entire $_POST



Only array_preg_match examples:



<?php



function array_preg_match(array $patterns, array $subjects, &$errors = array()) {

    $errors = array();

    foreach ($patterns as $k => $v) preg_match($v, $subjects[$k]) or $errors[$k] = TRUE;    

    return count($errors) == 0 ? TRUE : FALSE;

}



function array_preg_replace(array $patterns, array $replacements, array $subject) {

    $r = array();

    foreach ($patterns as $k => $v) $r[$k] = preg_replace($v, $replacements[$k], $subject[$k]);    

    return $r+$subject;

}



$arr1 = array('name' => 'Alexandre', 'phone' => '44559999');



$arr2 = array('name' => '', 'phone' => '44559999c');



        array_preg_match(array(

            'name' => '#.+#', //Not empty

            'phone' => '#^\d*$#' // Only digits, optional (anchored, so stray characters are rejected)

        ), $arr1, $match_errors);

        print_r($match_errors); // Empty, it is ok.



        array_preg_match(array(

            'name' => '#.+#', //Not empty

            'phone' => '#^\d*$#' // Only digits, optional (anchored, so stray characters are rejected)

        ), $arr2, $match_errors);

        print_r($match_errors); // Two indexes, name and phone, both not ok.



?>

Antti Haapala
23-Jul-2007 07:22


Ne'er try to verify email address by using some random regex you just invented sitting on the toilet seat. It will not work properly. The proper regex for email validation is something along the lines of 



"([-!#$%&'*+/=?_`{|}~a-z0-9^]

+(\.[-!#$%&'*+/=?_`{|}~a-z0-9

^]+)*|"([\x0b\x0c\x21\x01-\x08\

x0e-\x1f\x23-\x5b\x5d-\x7f]|\\[\x

0b\x0c\x01-\x09\x0e-\x7f])*")@((

[a-z0-9]([-a-z0-9]*[a-z0-9])?\.)+[

a-z0-9]([-a-z0-9]*[a-z0-9]){1,1}|

\[((25[0-5]|2[0-4][0-9]|[01]?[0-9]

[0-9]?)\.){3,3}(25[0-5]|2[0-4][0-9

]|[01]?[0-9][0-9]?|[-a-z0-9]*[a-z0

-9]:([\x0b\x0c\x01-\x08\x0e-\x1f\

x21-\x5a\x53-\x7f]|\\[\x0b\x0c\x0

1-\x09\x0e-\x7f])+)\])". 



However, you shouldn't even try that regex. If you do not understand what that regexp does, then please do not try to write one yourself. If you need a _truly_ _valid_ e-mail address, no regexp is going to help you - just send a verification message to the user-supplied address with a link or code the user can paste to verify the address. IF you still WISH - against my recommendation - to use some validating regexp then *please* just make it warn loudly that the address may be invalid; do not write code that throws a fatal error outright. I am quite fed up with sites that do not accept my .name e-mail address, or some other valid, working forms for that matter.

iamdecal at gmail dot com
17-Jul-2007 08:09


>>..'com|org|net|gov|biz|info|name|aero|biz|info|jobs|'

.'museum)



what about .mil, .golf,.tv etc etc



the point of the code should be that you don't have to continuously go and update it

David W.
11-Jul-2007 05:33


I just started using PHP and this section doesn't clarify whether or not you must use "/" as your regular expression delimiters.



I want to clarify that you can use almost any character as your delimiter. The delimiter is automatically the first character of your regular expression string. This makes it a bit easier if you are looking for things that might contain a forward slash. For example:



preg_match('#</b>#', $string);



Instead of:



preg_match('/<\/b>/', $string);



Or:



preg_match('@/my/dir/name/@', $string);



Instead of:



preg_match('/\/my\/dir\/name\//', $string);



This can greatly boost readability. Not quite as flexible as in Perl (You can't use control characters or \n which can really come in handy when you aren't quite sure what characters might be in your regular expression), but switching to another delimiter can make your code a bit easier to read.
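To see that the delimiter choice does not change the result, here is a quick sketch. The sample string and variable names are made up; the bracket-style delimiter and the second argument to preg_quote() are additions not mentioned in the note above.

```php
<?php
$string = 'See <b>/my/dir/name/</b> for details.';

// The same pattern written with different delimiters.
$with_slash = preg_match('/\/my\/dir\/name\//', $string);
$with_at    = preg_match('@/my/dir/name/@', $string);

// Bracket-style delimiters are accepted too: the closing one is the matching bracket.
$with_brace = preg_match('{/my/dir/name/}', $string);

// When part of the pattern comes from a variable, preg_quote() can
// escape the chosen delimiter as well.
$dir    = '/my/dir/name/';
$quoted = preg_match('#' . preg_quote($dir, '#') . '#', $string);

var_dump($with_slash, $with_at, $with_brace, $quoted);
```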

florian dot beisel at gmail dot com
29-Jun-2007 10:36


<?php

// Returns true if $mail looks like an e-mail address under a known TLD.
function checkEmailAddress($mail)
{
    $regex = '/\A(?:[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+'
        . '(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@'
        . '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[a-z]{2}|'
        . 'com|org|net|gov|biz|info|name|aero|jobs|'
        . 'museum)\b)\Z/i';

    return preg_match($regex, $mail) === 1;
}

?>

Elier Delgado
12-Jun-2007 12:55


I'm not happy with any pattern of email address that I have seen.



The following addresses are invalid:



email1..@myserver.com

email1.-@myserver.com

email1._@myserver.com

email1@2sub.myserver.com

email1@sub.sub.2sub.myserver.com



So, this is my pattern:

// Built by concatenation so that no line break ends up inside the pattern.
$pat = '/^[a-z]+[a-z0-9]*[._-]?[a-z0-9]+'
     . '@([a-z]+[a-z0-9]*[.-]?[a-z]+[a-z0-9]*[a-z0-9]+){1,4}'
     . '\.[a-z]{2,4}$/';



Best Regards, Elier



http://www.faqs.org/rfcs/rfc1035.html

RFC 1035 - Domain names - implementation and specification

chuckie
06-Dec-2006 11:19


This is a function to convert byte offsets into (UTF-8) character offsets (this is regardless of whether you use the /u modifier):



<?php



function mb_preg_match($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = 0, $pn_offset = 0, $ps_encoding = NULL) {

  // WARNING! - All this function does is to correct offsets, nothing else:

  //

  if (is_null($ps_encoding))

    $ps_encoding = mb_internal_encoding();



  $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));

  $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);



  if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))

    foreach($pa_matches as &$ha_subpattern)

      $ha_subpattern[1] = mb_strlen(substr($ps_subject, 0, $ha_subpattern[1]), $ps_encoding);



  return $ret;

  }



?>

preg regexp
21-Sep-2006 06:37


If you want to have perl equivalent regexp match:

$`, $& and $'

before the match, the match itself, after the match



Here's one way to do it:



echo preg_match("/(.*?)(and)(.*)/", "this and that",$matches);

print_r($matches);



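$` corresponds to $matches[1]

$& corresponds to $matches[2]

$' corresponds to $matches[3]

Notice the trailing (.*); without it the end of the string won't be captured.

Note that if you only need $&, simply use $matches[0].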



Here's another way, which is a bit simpler to remember:



echo preg_match("/^(.*?)(and)(.*?)$/", "this and that",$matches);

print_r($matches);
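A different route to the same three pieces, sketched here as an alternative and not taken from the note above: capture the offset with PREG_OFFSET_CAPTURE and slice the subject with substr(). The variable names are illustrative. Since no (.*) group is involved, this also works when the subject contains line breaks.

```php
<?php
$subject = 'this and that';

if (preg_match('/and/', $subject, $m, PREG_OFFSET_CAPTURE)) {
    $match  = $m[0][0];                                     // like Perl's $&
    $before = substr($subject, 0, $m[0][1]);                // like Perl's $`
    $after  = substr($subject, $m[0][1] + strlen($match));  // like Perl's $'

    var_dump($before, $match, $after);
}
```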

Izzy
18-Aug-2006 04:27


Concerning the German umlauts (and other language-specific chars as accented letters etc.): If you use unicode (utf-8), you can match them easily with the unicode character property \pL (match any unicode letter) and the "u" modifier, so e.g.



<?php preg_match("/[\w\pL]/u",$var); ?>



would really match any letter in $var, whether it is an umlaut or not (add a quantifier such as + to match whole words). Took me a while to figure this out, so maybe this comment will save the day for someone else :-)
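A minimal sketch of the effect, assuming the subject is UTF-8. The umlaut is written as an explicit byte sequence so the example does not depend on how the file is saved; the variable names are illustrative.

```php
<?php
// "Müller" encoded as UTF-8 (ü is the two bytes C3 BC).
$var = "M\xC3\xBCller";

// [a-zA-Z] only covers ASCII letters, so the match fails at the umlaut.
$ascii_only = preg_match('/^[a-zA-Z]+$/', $var);

// With the u modifier and \pL every Unicode letter counts, umlauts included.
$unicode = preg_match('/^\pL+$/u', $var);

var_dump($ascii_only); // int(0)
var_dump($unicode);    // int(1)
```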

steve at webcommons dot biz
06-Jul-2006 03:48


This function (for PHP 4.3.0+) uses preg_match to return the regex position (like strpos, but using a regex pattern instead):



  function preg_pos($sPattern, $sSubject, &$FoundString, $iOffset = 0) {

      $FoundString = NULL;

      

      if (preg_match($sPattern, $sSubject, $aMatches, PREG_OFFSET_CAPTURE, $iOffset) > 0) {

        $FoundString = $aMatches[0][0];

        return $aMatches[0][1];

      }

      else {

        return FALSE;

      }

  }



It also returns the actual string found using the pattern, via $FoundString.

roberta at lexi dot net
14-Feb-2006 03:25


How to verify a Canadian postal code!



if (!preg_match("/^[a-z]\d[a-z] ?\d[a-z]\d$/i" , $postalcode)) 

{

     echo "Your postal code has an incorrect format.";

}

patrick at procurios dot nl
30-Jan-2006 03:17


The assertion \G is useful together with the 'offset' parameter (which preg_match_all() accepts as well). \G matches only if the current position in 'subject' is the same as specified by the index 'offset'. It is comparable to the ^ assertion, but whereas ^ matches at position 0, \G matches at position 'offset'.
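A short sketch contrasting the two assertions, reusing the "abcdef" subject from the manual text; the variable names are illustrative.

```php
<?php
$subject = 'abcdef';

// ^ still refers to the start of the subject, so this fails at offset 3.
$caret = preg_match('/^def/', $subject, $m1, 0, 3);

// \G anchors the match at the position given by offset.
$g_hit = preg_match('/\Gdef/', $subject, $m2, 0, 3);

// \G does not scan ahead: "def" does not begin at offset 2.
$g_miss = preg_match('/\Gdef/', $subject, $m3, 0, 2);

var_dump($caret, $g_hit, $g_miss); // int(0) int(1) int(0)
```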

bloopy.org
26-Jan-2006 08:18


Intending to use preg_match to check whether an email address is in a valid format? The following page contains some very useful information about possible formats of email addresses, some of which may surprise you: http://en.wikipedia.org/wiki/E-mail_address

john at recaffeinated d0t c0m
28-Dec-2005 01:27


Here's a format for matching US phone numbers in the following formats:



###-###-####

(###) ###-####

##########



It restricts the area codes to >= 200 and exchanges to >= 100, since values below these are invalid.



<?php

$pattern = "/(\([2-9]\d{2}\)\s?|[2-9]\d{2}-|[2-9]\d{2})" 

         . "[1-9]\d{2}"

         . "-?\d{4}/";

?>

phpnet_spam at erif dot org
27-Oct-2005 02:37


Test for valid US phone number, and get it back formatted at the same time:



  function getUSPhone($var) {

    $US_PHONE_PREG ="/^(?:\+?1[\-\s]?)?(\(\d{3}\)|\d{3})[\-\s\.]?"; //area code

    $US_PHONE_PREG.="(\d{3})[\-\.]?(\d{4})"; // seven digits

    $US_PHONE_PREG.="(?:\s?x|\s|\s?ext(?:\.|\s)?)?(\d*)?$/"; // any extension

    if (!preg_match($US_PHONE_PREG,$var,$match)) {

      return false;

    } else {

      $tmp = "+1 ";

      if (substr($match[1],0,1) == "(") {

        $tmp.=$match[1];

      } else {

        $tmp.="(".$match[1].")";

      }

      $tmp.=" ".$match[2]."-".$match[3];

      if ($match[4] <> '') $tmp.=" x".$match[4];

      return $tmp;

    }

  }



usage:



  $phone = $_REQUEST["phone"];

  if (!($phone = getUSPhone($phone))) {

    //error gracefully :)

  }

tlex at NOSPAM dot psyko dot ro
23-Sep-2005 03:34


To check a Romanian landline phone number, and to return "Bucharest", "Proper" or "Unknown", I've used this function:



<?

function verify_destination($destination) {

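    // Default, so the return value is defined even when the length is not 10
    $destination_match = "Unknown";

    $dst_length = strlen($destination);

    if ($dst_length == 10) {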

        if(preg_match("/^021[2-7]{1}[0-9]{6}$/",$destination)) {

            $destination_match="Bucharest";

        } elseif (preg_match("/^02[3-6]{1}[0-9]{1}[1-7]{1}[0-9]{5}$/",$destination)) {

            $destination_match = "Proper";

        } else {

            $destination_match = "Unknown";

        }

    }

    return ($destination_match);

}

?>

hippiejohn1020 --- attT --- yahoo.com
27-Jul-2005 01:38


Watch out when using c-style comments around a preg_match or preg_* for that matter. In certain situations (like example below) the result will not be as expected. This one is of course easy to catch but worth noting.



/* 

    we will comment out this section



    if (preg_match ("/anything.*/", $var)) {

        code here;

    }

*/



This is (I believe) because comments are interpreted first when parsing the code (and they should be). So in the preg_match the asterisk (*) and the ending delimiter (/) are interpreted as the end of the comment and the rest of your (supposedly commented) code is interpreted as PHP.

Rasqual
05-Jul-2005 01:03


Do not forget PCRE has many compatible features with Perl.

One that is often neglected is the ability to return the matches as an associative array (Perl's hash).



For example, here's a code snippet that will parse a subset of the XML Schema 'duration' datatype:



<?php

$duration_tag = 'PT2M37.5S';  // 2 minutes and 37.5 seconds



// drop the milliseconds part

preg_match(

  '#^PT(?:(?P<minutes>\d+)M)?(?P<seconds>\d+)(?:\.\d+)?S$#',

  $duration_tag,

  $matches);



print_r($matches);

?>



Here is the corresponding output:

Array

(

    [0] => PT2M37.5S

    [minutes] => 2

    [1] => 2

    [seconds] => 37

    [2] => 37

)

carsten at senseofview dot de
14-Mar-2005 08:57


The ExtractString function does not have a real error, but it does malfunction in some cases. What if it is called like this:



ExtractString($row, 'action="', '"');



It would find 'action="' correctly, but perhaps not the first " after the $start-string. If $row consists of



<form method="post" action="script.php">



strpos($str_lower, $end) would return the first " in the method-attribute. So I made some modifications and it seems to work fine.



function ExtractString($str, $start, $end)
{
    $str_low = strtolower($str);
    $pos_start = strpos($str_low, $start);
    if ($pos_start === false) {
        return null; // $start not found
    }

    // Search for $end only after the end of $start
    $pos1 = $pos_start + strlen($start);
    $pos_end = strpos($str_low, $end, $pos1);
    if ($pos_end === false) {
        return null; // $end not found after $start
    }

    return substr($str, $pos1, $pos_end - $pos1);
}

info at reiner-keller dot de
12-Feb-2005 03:03


Pointing to the post of "internet at sourcelibre dot com": Instead of using PerlRegExp for e.g. german "Umlaute" like



<?php



$bolMatch = preg_match("/^[a-zA-ZäöüÄÖÜ]+$/", $strData);



?>



use the setlocale() function and the POSIX format like



<?php



setlocale (LC_ALL, 'de_DE');

$bolMatch = preg_match("/^[[:alpha:]]+$/", $strData);



?>



This works for any country related special character set.



Remember since the "Umlaute"-Domains have been released it's almost mandatory to change your RegExp to give those a chance to feed your forms which use "Umlaute"-Domains (e-mail and internet address).



Life can be so easy reading the manual ;-)

hfuecks at phppatterns dot com
13-Jan-2005 10:11


Note that the PREG_OFFSET_CAPTURE flag, as far as I've tested, returns the offset in bytes not characters, which may not be what you're expecting if you're using the /u pattern modifier to make the regex UTF-8 aware (i.e. multibyte characters will result in a greater offset than you expect)
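The behaviour can be reproduced with a small sketch. The subject is spelled out as UTF-8 bytes so the result does not depend on the file encoding, and the conversion to a character offset via preg_match_all() is just one possible approach, chosen here because it needs no extra extension.

```php
<?php
// "für PHP" in UTF-8: ü takes two bytes (C3 BC).
$subject = "f\xC3\xBCr PHP";

preg_match('/PHP/u', $subject, $matches, PREG_OFFSET_CAPTURE);
$byte_offset = $matches[0][1];

// Convert to a character offset by counting the characters in front of the match.
// With the u modifier, "." consumes one UTF-8 character per match.
$prefix      = substr($subject, 0, $byte_offset);
$char_offset = preg_match_all('/./su', $prefix, $unused);

var_dump($byte_offset); // int(5)
var_dump($char_offset); // int(4)
```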

29-Dec-2004 05:44


This constant helps validate a US phone number without requiring one particular format.



Phone number can be in many variations of the following:

(Xxx) Xxx-Xxxx

(Xxx) Xxx Xxxx

Xxx Xxx Xxxx

Xxx-Xxx-Xxxx

XxxXxxXxxx

Xxx.Xxx.Xxxx



define( "REGEXP_PHONE", '/^\(?[2-9][0-9]{2}\)?[. -]?[2-9][0-9]{2}[. -]?[0-9]{4}$/' );

ebiven
06-Jul-2004 05:53


To regex a North American phone number you can assume NxxNxxXXXX, where N = 2 through 9 and x = 0 through 9.  North American numbers cannot start with a 0 or a 1 in either the Area Code or the Office Code.  So, adapted from the other phone number regex here you would get:



/^[2-9][0-9]{2}[-][2-9][0-9]{2}[-][0-9]{4}$/
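For illustration, the pattern can be wrapped in a small function. The name is_nanp_number() and the sample numbers are made up for this sketch.

```php
<?php
// Hypothetical wrapper around the pattern above.
function is_nanp_number($number)
{
    return preg_match('/^[2-9][0-9]{2}[-][2-9][0-9]{2}[-][0-9]{4}$/', $number) === 1;
}

var_dump(is_nanp_number('212-555-0123')); // bool(true)
var_dump(is_nanp_number('112-555-0123')); // bool(false): area code starts with 1
var_dump(is_nanp_number('212-155-0123')); // bool(false): office code starts with 1
var_dump(is_nanp_number('2125550123'));   // bool(false): the hyphens are required
```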

05-May-2004 11:23


A very simple Phone number validation function.

Returns the Phone number if the number is in the xxx-xxx-xxxx format. x being 0-9.

Returns false if missing digits or improper characters are included.



<?

function VALIDATE_USPHONE($phonenumber)
{
    // Keep the pattern on one line: a line break inside the string would become part of the regex.
    if (preg_match('/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/', $phonenumber)) {
        return $phonenumber;
    } else {
        return false;
    }
}



?>

mark at portinc dot net
03-Feb-2004 11:30


<?php // some may find this usefull... :)



$iptables = file ('/proc/net/ip_conntrack'); 

$services = file ('/etc/services');

$GREP = '!([a-z]+) '     .// [1] protocol 

        '\\s*([^ ]+) '     .// [2] protocol in decimal

        '([^ ]+) '        .// [3] time-to-live 

        '?([A-Z_]|[^ ]+)?'.// [4] state 

        ' src=(.*?) '     .// [5] source address 

        'dst=(.*?) '      .// [6] destination address

        'sport=(\\d{1,5}) '.// [7] source port 

        'dport=(\\d{1,5}) '.// [8] destination port 

        'src=(.*?) '      .// [9] reversed source

        'dst=(.*?) '      .//[10] reversed destination

        'sport=(\\d{1,5}) './/[11] reversed source port

        'dport=(\\d{1,5}) './/[12] reversed destination port

        '\\[([^]]+)\\] '    .//[13] status

        'use=([0-9]+)!';   //[14] use



$ports = array();

foreach($services as $s) { 

  if (preg_match ("/^([a-zA-Z-]+)\\s*([0-9]{1,5})\\//",$s,$x)) {

     $ports[ $x[2] ] = $x[1];

} }

for($i=0;$i < count($iptables);$i++) { 

  if ( preg_match ($GREP, $iptables[$i], $x) ) {

     // translate known ports... . . 

     $x[7] =(array_key_exists($x[7],$ports))?$ports[$x[7]]:$x[7]; 

     $x[8] =(array_key_exists($x[8],$ports))?$ports[$x[8]]:$x[8]; 

     print_r($x);

  }  // on a nice sortable-table... bon appetite!

}

?>

nico at kamensek dot de
18-Jan-2004 04:31


As I did not find any working IPv6 regexp, I just created one. Here it is:



$pattern1 = '([A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}';

$pattern2 = '[A-Fa-f0-9]{1,4}::([A-Fa-f0-9]{1,4}:){0,5}[A-Fa-f0-9]{1,4}';

$pattern3 = '([A-Fa-f0-9]{1,4}:){2}:([A-Fa-f0-9]{1,4}:){0,4}[A-Fa-f0-9]{1,4}';

$pattern4 = '([A-Fa-f0-9]{1,4}:){3}:([A-Fa-f0-9]{1,4}:){0,3}[A-Fa-f0-9]{1,4}';

$pattern5 = '([A-Fa-f0-9]{1,4}:){4}:([A-Fa-f0-9]{1,4}:){0,2}[A-Fa-f0-9]{1,4}';

$pattern6 = '([A-Fa-f0-9]{1,4}:){5}:([A-Fa-f0-9]{1,4}:){0,1}[A-Fa-f0-9]{1,4}';

$pattern7 = '([A-Fa-f0-9]{1,4}:){6}:[A-Fa-f0-9]{1,4}';



patterns 1 to 7 represent different cases. $full is the complete pattern which should work for all correct IPv6 addresses.



// Concatenated so that the line break does not become part of the pattern.
$full = "/^($pattern1)$|^($pattern2)$|^($pattern3)$"
      . "|^($pattern4)$|^($pattern5)$|^($pattern6)$|^($pattern7)$/";

thivierr at telus dot net
24-Nov-2003 06:23


A web server log record can be parsed as follows:



$line_in = '209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"';



if (preg_match('!^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) ([^ ]+) (.+)!',

  $line_in,

  $elements))

{

  print_r($elements);

}



Array

(

    [0] => 209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

    [1] => 209.6.145.47

    [2] => -

    [3] => -

    [4] => 22/Nov/2003:19:02:30 -0500

    [5] => GET

    [6] => /dir/doc.htm

    [7] => HTTP

    [8] => 1.0

    [9] => 200

    [10] => 6776

    [11] => "http://search.yahoo.com/search?p=key+words=UTF-8"

    [12] => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"

)



Notes:  

1) For the referer field ($elements[11]), I intentionally capture the double quotes (") and don't use them as delimiters, because sometimes double quotes do appear in a referer URL.  Double quotes can appear as %22 or \".  Both have to be handled correctly.  So, I strip off the double quotes in a second step.

2) The URLs should be further parsed using parse_url, which is quicker and more reliable than preg_match.

3) I assume the requested protocol (HTTP/1.1) always has a slash character in the middle, which might not always be the case, but I'll take the risk.

4) The agent field ($elements[12]) is the most unstructured field, so I make no assumptions about its format.  If the record is truncated, the agent field will not be delimited properly with a quote at the end.  So, both cases must be handled.

5) A hyphen  (- or "-") means a field has no value.  It is necessary to convert these to an appropriate value (such as empty string, null, or 0).

6) Finally, there should be appropriate code to handle malformed web log entries, which are common, due to junk data.  I never assume I've seen all cases.
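The second step from notes 1 and 5 might look roughly like this. The helper name clean_log_field() is invented for the sketch, and mapping a missing value to null is only one of the choices listed in note 5.

```php
<?php
// Hypothetical helper: turn one captured log field into a clean value.
function clean_log_field($value)
{
    // Remove one pair of surrounding double quotes, if both are present.
    if (strlen($value) >= 2 && $value[0] === '"' && substr($value, -1) === '"') {
        $value = substr($value, 1, -1);
    }

    // A lone hyphen means "no value".
    return $value === '-' ? null : $value;
}

var_dump(clean_log_field('"http://search.yahoo.com/search?p=key+words=UTF-8"'));
var_dump(clean_log_field('"-"'));   // NULL
var_dump(clean_log_field('-'));     // NULL
var_dump(clean_log_field('200'));   // string(3) "200"
```

Quotes inside the value are left alone, so a referer containing \" survives; a truncated record without the closing quote is returned unchanged and still needs separate handling.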

nospam at 1111-internet dot com
12-Nov-2003 05:29


Backreferences (ala preg_replace) work within the search string if you use the backslash syntax. Consider:



<?php

if (preg_match("/([0-9])(.*?)(\\1)/", "01231234", $match))

{

    print_r($match);

}

?>



Result: Array ( [0] => 1231 [1] => 1 [2] => 23 [3] => 1 )



This is alluded to in the description of preg_match_all, but worth reiterating here.

bjorn at kulturkonsult dot no
01-Apr-2003 10:56


If you want to match all Scandinavian characters (æÆøØåÅöÖäÄ) in addition to those matched by \w, you might want to use this regexp:



/^[\w\xe6\xc6\xf8\xd8\xe5\xc5\xf6\xd6\xe4\xc4]+$/



Remember that \w respects the current locale used in PCRE's character tables.
