
Working with regular expressions in PHP

What are regular expressions

Put simply, a regular expression is a method of matching patterns in a string. PHP most commonly uses PCRE, or "Perl Compatible Regular Expressions". Today we will leave simple string search methods behind and work with a more powerful tool, which many people use without knowing how it works. Here we will try to decode what everyone takes for meaningless hieroglyphs, and do it with examples. The biggest mistake people make when learning regular expressions is trying to understand everything in one sitting.

Let's start learning

Create an index.php file on your test server and place the code in it:

Let's slightly modify the code and add the preg_match() function.

After running the script, we will get 1, which means that the pattern was found in the string $string (in PHP, 1 is treated as TRUE in a boolean context).

We have given an example of how to find a substring, but for a fixed substring it is faster to use the standard PHP functions strpos() and strstr().
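A minimal sketch of this first step, assuming a test string made of the alphabet followed by the digits (the contents of $string and the other variable names are assumptions for illustration):

```php
<?php
// Assumed test string; any string containing "abc" would do.
$string = 'abcdefghijklmnopqrstuvwxyz0123456789';

// preg_match() returns 1 on a match, 0 on no match and false on error.
$found = preg_match('/abc/', $string);
echo $found, PHP_EOL;

// For a fixed substring, strpos() does the same job without a regular expression.
$position = strpos($string, 'abc');
echo $position === false ? 'not found' : 'found at offset ' . $position, PHP_EOL;
```

Note that strpos() returns the offset 0 here, so the result has to be compared with false using ===.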

Specifying the beginning of a string in a regular expression

To indicate the beginning of a string in an expression, add the ^ sign. Let's talk less and try it in practice by changing our code to the following form:

After executing the code, you will see the message “This line starts with abc”, since our string really starts with the letters “abc”. The symbol (^) restricts the search to the beginning of the string rather than the whole string. This construct is case-sensitive by default.
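A sketch of such a check, assuming the same alphabet-plus-digits test string (the string and the $message variable are assumptions):

```php
<?php
$string = 'abcdefghijklmnopqrstuvwxyz0123456789'; // assumed test string

// ^ anchors the pattern to the start of the string.
if (preg_match('/^abc/', $string)) {
    $message = 'This line starts with abc';
} else {
    $message = 'No match found';
}
echo $message, PHP_EOL;
```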

Search for a substring at the beginning, case insensitive

If we changed the earlier check to if (preg_match("/^ABC/", $string)), it would fail to find a match, because the pattern is case-sensitive. Consider PHP code that is not case-sensitive when searching.

The construction changed only slightly: the i modifier was added, preg_match("/^ABC/i", $string), which makes the match case-insensitive. After this amendment, the script will find the substring.
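The difference can be seen side by side in a short sketch (the test string and variable names are assumptions):

```php
<?php
$string = 'abcdefghijklmnopqrstuvwxyz0123456789'; // assumed test string

// Without the i modifier the uppercase pattern does not match.
$caseSensitive = preg_match('/^ABC/', $string);

// With the i modifier letter case is ignored.
$caseInsensitive = preg_match('/^ABC/i', $string);

echo $caseSensitive, ' ', $caseInsensitive, PHP_EOL;
```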

How to find a string by its ending

In many ways, searching for a string by its ending is similar to the previous example. All that is needed is to add \z to the end of the search pattern.

Since our string ends with 89 and the search pattern is anchored to the end of the string, the result will be “End of line is 89”.
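A sketch of the end-of-string check, assuming the same test string as before (the string and variable names are assumptions):

```php
<?php
$string = 'abcdefghijklmnopqrstuvwxyz0123456789'; // assumed test string

// \z matches only at the very end of the string.
$endsWith89 = preg_match('/89\z/', $string);
$message = $endsWith89 ? 'End of line is 89' : 'No match found';
echo $message, PHP_EOL;
```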

Meta characters

We have already used special characters such as (^) and ($). These characters, along with others, are called meta characters. Here is the list of meta characters used in regular expressions:

. (Full stop) - matches any single character except a newline.
^ (Caret) - start of line.
$ (Dollar sign) - end of line.
* (Asterisk) - the preceding character or expression occurs any number of times, including zero.
+ (Plus) - the preceding character or expression occurs 1 or more times. It plays the same role as the asterisk (*), except that zero occurrences do not match.
? (Question mark) - the preceding character or expression occurs 0 or 1 times. Mainly used to make single characters optional.
{ (Opening curly brace)
} (Closing curly brace) - {a,b} is the number of occurrences of the preceding character or subpattern, from a to b. If b is not specified, there is no upper limit. For example, * is the same as {0,}, ? is the same as {0,1}, and {5,7} means 5, 6 or 7 repetitions.
[ (Opening bracket)
] (Closing bracket) - designed to specify a set of characters. Square brackets inside a regular expression count as one character, which can take any of the values listed inside the brackets.
\ (Backslash) - used to escape special characters. It means that the escaped characters should be interpreted literally, i.e. not as meta characters but as simple symbols.
| (Pipe) - acts as a logical “OR” in regular expressions and is used to specify a set of alternatives, for example re(a|e)d matches "read" and "reed".
( (Opening parenthesis)
) (Closing parenthesis) - designed to group parts of a regular expression. They are useful together with “|” and when extracting a substring as a captured group.

We will look at each of the meta characters with examples during this tutorial, but it is important that you know what they are. If you want to find a string containing one of these characters, for example “1+1”, the program must treat it as a regular character and not as a meta character. To do this, escape the character with a backslash:

In this example, \ escaped the plus and the expression used it as a regular character; otherwise preg_match() would find no match.
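A sketch of the escaped and unescaped plus, assuming the test string 1+1=2 (the string and variable names are assumptions):

```php
<?php
$string = '1+1=2'; // assumed test string

// \+ matches a literal plus sign.
$escaped = preg_match('/^1\+1/', $string);

// Unescaped, "1+" means "one or more 1s", which must then be followed by another 1.
$unescaped = preg_match('/^1+1/', $string);

echo $escaped, ' ', $unescaped, PHP_EOL;
```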

Consider what other meta characters can do

We have already seen the caret ^ and the dollar $. Let's look at the others, starting with square brackets []. Square brackets are designed to match one character from a set [abcdef] or from a range of characters [a-f]. Let's look at an example of a regular expression:

The expression will return true if the string contains the words big, bog, bug or bag, but not beg.

You can also use the combination [abcdef$]. In this case the $ sign is just a dollar and not a meta character. Inside square brackets almost all meta characters lose their special meaning, with a few exceptions such as ^ at the start of the class, the hyphen, the closing bracket and the backslash.
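A sketch that checks each of these words against a character class, assuming the pattern /b[aoiu]g/ (the pattern and the $results array are assumptions consistent with the words listed above):

```php
<?php
$pattern = '/b[aoiu]g/'; // assumed pattern: b, one of a/o/i/u, then g
$words = ['big', 'bog', 'bug', 'bag', 'beg'];

$results = [];
foreach ($words as $word) {
    // 1 when the word matches the pattern, 0 otherwise.
    $results[$word] = preg_match($pattern, $word);
    echo $word, ' -> ', $results[$word], PHP_EOL;
}
```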

Let's try the following script:

The output of the script will be 0 -> a: preg_match() stops at the first match, and the first character that is not b is a.

Let's try to modify our script a bit and use the function preg_match_all().

As you can see from the output of the above script, it prints all characters of the string that do not match “b”:
acefghijklmnopqrstuvwxyz0123456789
Let's take it one step further and filter all digits out of the string:

This script returns the string:
abcefghijklmnopqrstuvwxyz

Based on the above code, we can see that the ^ sign at the start of a character class means negation (all but the listed characters).

Stay with us, it gets even more interesting
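The three negation steps can be sketched as follows, assuming a test string of letters (without d) followed by digits, which is consistent with the outputs shown above (the string and variable names are assumptions):

```php
<?php
$string = 'abcefghijklmnopqrstuvwxyz0123456789'; // assumed test string

// preg_match() stops at the first character that is not "b".
preg_match('/[^b]/', $string, $first);
echo '0 -> ', $first[0], PHP_EOL;

// preg_match_all() collects every character that is not "b".
preg_match_all('/[^b]/', $string, $notB);
$withoutB = implode('', $notB[0]);
echo $withoutB, PHP_EOL;

// The same idea with a range: every character that is not a digit.
preg_match_all('/[^0-9]/', $string, $nonDigits);
$lettersOnly = implode('', $nonDigits[0]);
echo $lettersOnly, PHP_EOL;
```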

Let's try escaping meta characters in order to search for them. It is easiest to understand from the result, for example:

The result of the script will be:

[]

This is because we have indicated that we want all the characters that match [ or ]. For the expression to work correctly, we escaped the brackets with backslashes. If you want the backslash itself to be treated as a simple character, you will need to write two backslashes \\, for example for a path such as c:\dirfile.php.
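A sketch of matching literal square brackets, assuming a test string with one bracketed word (the string and variable names are assumptions):

```php
<?php
$string = 'This is a [templateVar]'; // assumed test string

// Inside the character class, \[ and \] stand for the literal brackets.
preg_match_all('/[\[\]]/', $string, $matches);
$brackets = implode('', $matches[0]);
echo $brackets, PHP_EOL;
```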

Consider working with the dot operator ( . ) with a simple example:

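As a result, we will get 1, since our string contains the word sex. This expression will also match sox, sux and s x (with a space in the middle), but it will not match stix, because the dot stands for exactly one character.

The dot does not match a newline, so let's count the words in a string whose words are separated by \n by matching the newline characters themselves.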

The code above will return this:
sex
at
noon
taxes
4

First we print the string, and the \n characters produce the line breaks. Below it we see the number 4, which is the number of matches found by the function preg_match_all().

Next, let's work with the meta character ( * ). It matches the preceding character any number of times, including zero, so that character may or may not be present. Let's look at the example below:

As a result we will get 1, since one match for the expression was found. The pattern also matches pp (no occurrences of h) and phhhp (multiple occurrences).

If we need to exclude an empty result like pp, we can use the meta character ( + ). Let's look at an example:

The ( + ) works like ( * ), but it does not accept zero occurrences.

Our next meta character is the question mark ( ? ). It means that the previous character may or may not be present. An example would be a phone number written with or without a hyphen: both 1234-5678 and 12345678 will match.
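The two quantifiers can be compared in a short sketch, assuming the patterns /ph*p/ and /ph+p/ (the patterns and array names are assumptions consistent with the pp, php and phhhp examples above):

```php
<?php
$words = ['pp', 'php', 'phhhp'];

$star = [];
$plus = [];
foreach ($words as $word) {
    $star[$word] = preg_match('/ph*p/', $word); // h may occur zero or more times
    $plus[$word] = preg_match('/ph+p/', $word); // h must occur at least once
    echo $word, ': * -> ', $star[$word], ', + -> ', $plus[$word], PHP_EOL;
}
```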

The following code gives the same result:
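A sketch of the optional hyphen, assuming the pattern /^1234-?5678$/ (the pattern and variable names are assumptions):

```php
<?php
$pattern = '/^1234-?5678$/'; // the hyphen may appear zero or one time

$withHyphen    = preg_match($pattern, '1234-5678');
$withoutHyphen = preg_match($pattern, '12345678');
$twoHyphens    = preg_match($pattern, '1234--5678');

echo $withHyphen, ' ', $withoutHyphen, ' ', $twoHyphens, PHP_EOL;
```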

Next we have the braces, or {} meta characters. They specify the number of occurrences, or range of occurrences, of the previous expression. In PCRE the braces are written unescaped, as in "[0-9]{5}"; escaping them with a backslash (\{ and \}) turns them into literal brace characters.
The expression "[0-9]{5}" matches a substring of five decimal digits (characters from the range 0 to 9, inclusive).

Next, we will use an expression in which the text “PHP” must be followed by exactly 3 digits.

The result of the regular expression will be true (1). The regular expression states that the string must begin with the text PHP and end with three digits from 0 to 9.
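A sketch of this check, assuming the test string PHP123 and an anchored pattern (the string, pattern and variable names are assumptions):

```php
<?php
$pattern = '/^PHP[0-9]{3}\z/'; // "PHP" followed by exactly three digits

$exact    = preg_match($pattern, 'PHP123');
$tooShort = preg_match($pattern, 'PHP12');
$tooLong  = preg_match($pattern, 'PHP1234');

echo $exact, ' ', $tooShort, ' ', $tooLong, PHP_EOL;
```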

Special Sequences

The backslash is also used for special sequences. Let's see what these sequences are.

  • \d - matches any digit, like the expression [0-9]
  • \D - matches any non-digit character, like [^0-9]
  • \s - matches any whitespace character, like [ \t\n\r\f\v]
  • \S - matches any non-whitespace character, like [^ \t\n\r\f\v]
  • \w - matches any alphanumeric character or underscore, like [a-zA-Z0-9_]
  • \W - matches any character that is not alphanumeric or an underscore, like [^a-zA-Z0-9_]

These sequences can be used to shorten your regular expressions. The following example shows how you can clear a string of extra characters.

Such an expression is useful if you need to clean extra and invalid characters out of a user's login.

Also, when cleaning a string, it is often necessary to make sure that the string does not start with a digit. This can be done with the following example:

This example will report that the string starts with a digit (2).

Let's use a dot to determine whether a string contains at least one character.

The dot ( . ) means any single character except the newline character (\n).

Let's try to use the \s sequence to get the number of words in a string separated by \n.
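One way to sketch such a cleanup is with preg_replace() and \W, assuming a hypothetical login string (the string and variable names are assumptions, and the original listing may have used a different function):

```php
<?php
$login = 'john.doe!@# 2024_x'; // hypothetical user input

// Remove every character that is not a letter, a digit or an underscore.
$clean = preg_replace('/\W/', '', $login);
echo $clean, PHP_EOL;
```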

Expression result:

sex
at
noon
taxes
4

Let's summarize our knowledge

Let's start combining our expressions into more complex forms. The following expression requires that the string contain one of the words This, That or There.

Another interesting example is a script that checks the beginning of a word.

Let's expand on the above code so that we can see which characters the word begins with, and display both these characters and the word itself on the screen:

If you did everything right, then the result of the expression will be:

0->Hello
1->He

$matches[0] contains the full text matched by the pattern: Hello.
$matches[1] contains the text captured by the first parenthesized group: He.
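A sketch that produces this output, assuming the test string Hello World and a pattern with one capturing group (the string, the pattern /^(He|Je)llo/ and the $lines array are assumptions):

```php
<?php
$string = 'Hello World'; // assumed test string

$lines = [];
// The parentheses capture whichever alternative matched at the start of the word.
if (preg_match('/^(He|Je)llo/', $string, $matches)) {
    foreach ($matches as $key => $value) {
        $lines[] = $key . '->' . $value;
    }
}
echo implode(PHP_EOL, $lines), PHP_EOL;
```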

Modifiers and assertions

As we saw earlier in this tutorial, we were able to create a regular expression that was case-insensitive by using /i. This is a modifier, one of the many used in regular expressions to change pattern matching behavior. Here is a list of the regex modifiers and assertions used in PHP.

Modifiers

i - case insensitive.
U - inverts greediness. Quantifiers become lazy by default, matching as few characters as possible, and become greedy only when followed by ?.
s - if used, the symbol . also matches a newline \n. Otherwise it does not.
m - multiline: ^ and $ match at the start and end of every line, not only of the whole string.
x - causes all unescaped whitespace characters in the pattern to be ignored, unless they are inside a character class. Convenient when you want to use line breaks and spaces to make a regular expression easier to read.
e - if this modifier is used, preg_replace(), after performing the standard substitutions in the replacement string, interprets it as PHP code and uses the result to replace the matched string. Single and double quotes, backslashes (\) and NULL characters are escaped with backslashes in substituted backreferences. It only works with preg_replace(), was deprecated in PHP 5.5 and removed in PHP 7.0; use preg_replace_callback() instead.
S - if you plan to reuse the pattern, it makes sense to spend a little more time analyzing it in order to reduce the matching time. When this modifier is used, this additional analysis of the pattern is carried out. Currently this only makes sense for "unanchored" patterns that do not start with any specific character. More on this later.

Assertions

\b - word boundary.
A word boundary is a position between a word character and a non-word character. Wrapping a pattern in two \b assertions requires an EXACT whole-word match: the text must match the pattern enclosed in \b and nothing more.
For example, the pattern "\bcat\b" does not match "catalog".
\B - not a word boundary.
This assertion is related to the previous one, but \B negates the word boundary instead of requiring it. It is useful when you need to find something that is inside a word, but not at the very beginning or end of it.
\A - start of the subject string (regardless of multiline mode).
The A modifier (PCRE_ANCHORED) has a similar effect: if it is used, the pattern matches only if it is "anchored", i.e. matches at the beginning of the string being searched. The same effect can be achieved with a suitable construction inside the pattern itself, which is the only way to implement this behavior in Perl.
\Z - end of the string: end of data or the position before the last newline (regardless of multiline mode).
\z - end of the string: end of data (regardless of multiline mode).
\G - the first matching position in the string.

Let's work with modifiers and assertions with examples

As you can see from the list above, there are many ways to change the behavior of regular expressions. Let's work through the modifiers and assertions one by one using simple examples.

Modifier (i)

If you have read the previous parts of this tutorial, it will be no surprise that a pattern such as /ABC/i matches both “ABC” and “abc”, because it uses the case-insensitive modifier (i).

Modifier (s)

Let's continue and consider the modifier (s). If this modifier is used, the symbol ( . ) also matches a newline \n. Otherwise it does not. First we will try an example without the modifier (s).

As you can see, this example returned 0. For the result to be 1, that is, for the symbol ( . ) to also match \n, the modifier ( s ) must be added to the expression. Let's rebuild our example.

The above code will display the number 1, since the string matching the expression pattern was found.
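Both variants can be sketched together, assuming a test string whose words are separated by newlines (the string, the pattern and the variable names are assumptions):

```php
<?php
$string = "sex\nat\nnoon\ntaxes\n"; // assumed test string

// Without s, the dot cannot match the newline between the words.
$withoutS = preg_match('/sex.at.noon/', $string);

// With s, the dot also matches \n.
$withS = preg_match('/sex.at.noon/s', $string);

echo $withoutS, ' ', $withS, PHP_EOL;
```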

Modifier (m)

When the m modifier is added to a pattern, something interesting happens: the regular expression treats one string as multiple lines if it contains newline characters (\n). To make the effect of the modifier easier to understand, look at an example.

In this example, we use the meta character ( ^ ) to look for the word noon at the beginning of a line. Our string starts with the word sex, which means that normally we would not find the desired word. Since in our example all words are separated by \n and the pattern has the modifier ( m ), each word is treated as the beginning of a line. So that the search ignores letter case, we also added the modifier ( i ). If you look closely at the example above, you will see that several modifiers can be used together.

The modifier ( x ) makes our expression longer, but it allows us to split the regular expression over multiple lines and to comment on each step in the expression. Comments make regular expressions clearer. There is no point in describing it further: just see how this regular expression works. It behaves like the previous one but has comments and the modifiers ( imx ).

Our next modifier is ( S ). With it, the pattern is analyzed in more detail before matching, which can speed up matching when the same pattern is applied many times.

Consider an example of multiple occurrences (matches):

The modifier ( S ) is rarely used, but if you do come across it, you will know what it is for, or at least where you can read about it.

Next we will work with the word boundary \b. This assertion allows us to define clearly where a word starts and ends. A common mistake is to use \b when searching for a substring inside a word: such a search finds no match. Let's look at an example:

The search word lab was not found in the word available because of the \b in the pattern. With \b in the pattern, cat and catalog are different words.
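A sketch of the multiline search, assuming the newline-separated test string from the earlier examples (the string and variable names are assumptions):

```php
<?php
$string = "sex\nat\nnoon\ntaxes\n"; // assumed test string

// Without m, ^ matches only at the very start of the string.
$singleLine = preg_match('/^noon/i', $string);

// With m, ^ also matches right after every newline.
$multiLine = preg_match('/^noon/im', $string);

echo $singleLine, ' ', $multiLine, PHP_EOL;
```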
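A sketch of a commented pattern, assuming the same search for noon at the start of a line (the string, pattern layout and variable names are assumptions):

```php
<?php
$string = "sex\nat\nnoon\ntaxes\n"; // assumed test string

// With x, whitespace in the pattern is ignored and # starts a comment.
$pattern = '/
    ^      # start of a line, because of the m modifier
    noon   # the word we are looking for
/imx';

$found = preg_match($pattern, $string);
echo $found, PHP_EOL;
```

The comments must not contain the delimiter character, here the forward slash, because it would end the pattern early.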

Consider another example of finding an expression:

Search successful!
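Both cases can be sketched as follows, assuming two hypothetical sentences (the sentences and variable names are assumptions):

```php
<?php
$pattern = '/\blab\b/i'; // "lab" as a whole word only

// "lab" occurs inside "available", but not as a separate word.
$insideWord = preg_match($pattern, 'This item is available');

// Here "lab" stands on its own, so the search succeeds.
$wholeWord = preg_match($pattern, 'The lab is open');

echo $insideWord, ' ', $wholeWord, PHP_EOL;
```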

Modifier B

The assertion \B negates a word boundary. It is useful when you need to find something that is inside a word, but not at the very beginning or end of it.

Example with a word that ends with a given substring

From the example you should understand why the standalone word “the” was not found: with the regular expression “/\Bthe\b/” we indicated that “the” must be the end of a longer word, not a whole word.

Modifier U is used to invert greediness.

This modifier inverts the greediness of quantifiers, so that they are not greedy by default but become greedy if they are followed by ?. The modifier can also be set inside the pattern with (?U). Without it, a single quantifier can be made lazy by adding a question mark after it (for example, .*?).
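A sketch of this behavior, assuming two hypothetical sentences (the sentences and variable names are assumptions):

```php
<?php
$pattern = '/\Bthe\b/'; // "the" at the end of a longer word

// The standalone word "the" has a word boundary in front of it, so \B fails.
$standalone = preg_match($pattern, 'This is the end');

// In "bathe", "the" is preceded by a letter and followed by a boundary.
$wordEnding = preg_match($pattern, 'I like to bathe');

echo $standalone, ' ', $wordEnding, PHP_EOL;
```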

An example of greedy and lazy expressions from Wikipedia

The expression (<.*>) matches an entire string containing multiple HTML markup tags:
<p><b>wp-admin.com.ua</b> — website development lessons and <i>cms wordpress</i> </p>
To pick out the individual tags, you can apply the lazy version of this expression: (<.*?>). It does not match the entire line shown above, but each tag separately:
<p>, <b>, </b>, <i>, </i>, </p>
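The greedy, lazy and U variants can be compared in a short sketch, assuming a shortened version of the HTML line above (the string and variable names are assumptions):

```php
<?php
$html = '<p><b>wp-admin.com.ua</b> and <i>cms wordpress</i></p>'; // shortened sample

// Greedy: .* runs to the last ">" and swallows the whole line.
preg_match_all('/<.*>/', $html, $greedy);

// Lazy: .*? stops at the first ">" and yields each tag separately.
preg_match_all('/<.*?>/', $html, $lazy);

// The U modifier makes the plain .* lazy as well.
preg_match_all('/<.*>/U', $html, $ungreedy);

echo count($greedy[0]), ' ', count($lazy[0]), PHP_EOL;
echo implode(' ', $lazy[0]), PHP_EOL;
```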

Using preg_replace

I think it will be easiest for you to type in the code presented and look at the result of the function.

Many developers may object that str_replace() is much faster for this, but we have only given a simple example; it gets more interesting.

Consider a more complex replacement example using the preg_replace() function.

Having worked with such simple code, we can see how bloated template engines and management systems are, when it can be this simple.
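As a rough sketch of both a simple and a slightly more involved replacement, assuming hypothetical input strings (the strings, patterns and variable names are all assumptions, not the original listings):

```php
<?php
// Simple case-insensitive replacement of one word.
$sentence = preg_replace('/perl/i', 'PHP', 'We love Perl regular expressions');
echo $sentence, PHP_EOL;

// Backreferences: $1, $2 and $3 refer to the three captured groups,
// which lets us reorder the parts of a date.
$date = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3.$2.$1', '2024-01-31');
echo $date, PHP_EOL;
```

The second call is something str_replace() cannot do, because the replacement depends on what the pattern captured.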

https://www.php.su/lessons/?lesson_17

https://www.skillz.ru/dev/php/article-Regulyarnye_vyrazheniya_dlya_chaynikov.html

https://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html

https://www.compileonline.com/execute_php_online.php

Nikolaenko Maxim

Director of the ProGrafika web studio. I do website development, design and promotion. Always glad to see new blog readers and good clients.

