Advanced Regular Expressions: No Witchcraft
Admittedly, they look strange at first, and I avoided them myself for a very long time. But for my new plugin “Divi – PageSpeed Booster” I finally had to face them. Regular Expressions look like nothing you already know. But once you understand the principle, you will realize how ingenious the idea behind them is and how many possibilities they open up, provided you make the effort to understand them.
I’m not going into the basics in this post. These are already well explained elsewhere: for example, there is the course by Jeffrey Way, and regexr.com is a recommended site for testing your patterns. This article is about more advanced uses. I use PHP for the examples, but in principle the regexes themselves should not differ much between languages.
On regexr.com you will also find descriptions of what the different characters mean and how to use them in context. Take a look at it at your leisure if you haven’t dealt with it yet, and then let’s get started.
Get image, iframe, source and audio tags in something like a foreach loop
return preg_replace_callback (
    '/(?<media><(?<tag>img|iframe|source|audio)(?![^>]*(?:divilazy|nolazy))[^>]*>(?:\s*<\s*\/\s*iframe\s*>)?)/',
    function ( $match ) {
        $this->tag   = $match['tag'];
        $this->media = $this->hostToCdn( $match['media'] );
        // Fallback so that $return is always defined, e.g. for a source tag
        // without a video or audio type (assumption: keep the tag as it is)
        $return = $this->media;
        switch ( $this->tag ) :
            case 'img':
                $return = $this->image();
                break;
            case 'source':
                if ( DALL()->isIn( 'type="video', $this->media ) ) $return = $this->video();
                if ( DALL()->isIn( 'type="audio', $this->media ) ) $return = $this->audio();
                break;
            case 'audio':
                $return = $this->audio();
                break;
            case 'iframe':
                $return = $this->iframe();
                break;
        endswitch;
        return apply_filters( DALL()->prefix() . '_return_media', $return );
    },
    $this->output
);
As you can see, we use the function “preg_replace_callback”. The first parameter is the regex that finds the individual instances in the source code, and the second parameter is a callback that edits each instance. In my case I use a switch to perform different operations depending on the tag.
But what exactly does the regex do? With “?<media>” we name the whole match and can access it via “$match['media']”. With “?<tag>” we get the corresponding element name via “$match['tag']” and can use it for further actions in the callback. The elements in this case are “img”, “iframe”, “source” or “audio”. With a negative lookahead “(?![^>]*(?:divilazy|nolazy))” we exclude 2 classes that can be assigned to the attribute “class”. With “[^>]” we allow all characters except the closing angle bracket “>”, so we get the whole tag from the opening to the closing angle bracket. With “(?:\s*<\s*\/\s*iframe\s*>)?” we optionally get the closing iframe tag, taking into account that it can also contain spaces. Pay attention at this point, because there are indeed extensions that output the tag with such spaces. Fortunately this is not the rule, but we must be prepared.
In any case, you should hold back on “.*” whenever possible, because it fetches every following character and this can lead to unwanted results.
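To see the named groups, the lookahead and the optional closing iframe tag at work, here is a minimal, standalone sketch. The sample markup and the marker comment used as replacement are assumptions made up for illustration; only the regex is taken from the snippet above.

```php
<?php
// Hypothetical sample markup: one normal image, one image excluded via the
// "nolazy" class, and an iframe whose closing tag contains spaces.
$html = '<img src="a.jpg"><img class="nolazy" src="b.jpg">'
      . '<iframe src="https://example.com/embed"></ iframe ><p>Text</p>';

$pattern = '/(?<media><(?<tag>img|iframe|source|audio)(?![^>]*(?:divilazy|nolazy))[^>]*>(?:\s*<\s*\/\s*iframe\s*>)?)/';

$found  = [];
$result = preg_replace_callback(
    $pattern,
    function (array $match) use (&$found): string {
        // Collect the tag name of every instance that was matched.
        $found[] = $match['tag'];
        // Illustrative replacement only: prefix the match with a marker comment.
        return '<!--' . $match['tag'] . '-->' . $match['media'];
    },
    $html
);

print_r($found);        // the tag names that were matched, in order
echo $result, PHP_EOL;  // the markup with a marker in front of each match
```

The image with the class “nolazy” is skipped by the lookahead, and the iframe match includes its closing tag despite the extra spaces.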
Get YouTube and Vimeo IDs from the tags
'/https([^\"\'])*youtu[^\"\']*\/(?<id>\w{11})[^\"\']*?([\'\"])/'
'/https([^\"\'])*vimeo[^\"\']*\/(?<id>\d{7,12})(?:[^\"\']*?([\'\"]))/'
YouTube IDs are strings of 11 characters. We get them with “(?<id>\w{11})” and can use them further with “$match['id']”.
With Vimeo IDs it is a bit more difficult to find out how long they actually are. They are always at least 7 digits long, sometimes 8 or more, so the pattern allows anything from 7 to 12 digits. In any case, they consist exclusively of digits, and so we get them with “(?<id>\d{7,12})”. We can also access them with “$match['id']”.
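As a quick check, the two patterns can be tried with “preg_match” on sample tags. This is only a sketch: the helper name “extractVideoId”, the URLs and the IDs are invented for illustration.

```php
<?php
$youtube = '/https([^\"\'])*youtu[^\"\']*\/(?<id>\w{11})[^\"\']*?([\'\"])/';
$vimeo   = '/https([^\"\'])*vimeo[^\"\']*\/(?<id>\d{7,12})(?:[^\"\']*?([\'\"]))/';

// Hypothetical helper: returns the named group "id" or null if nothing matched.
function extractVideoId(string $pattern, string $tag): ?string
{
    return preg_match($pattern, $tag, $match) === 1 ? $match['id'] : null;
}

// Made-up sample tags; the IDs are placeholders, not real videos.
$youtubeTag = '<iframe src="https://www.youtube.com/embed/abcDEF12345?rel=0"></iframe>';
$vimeoTag   = "<iframe src='https://player.vimeo.com/video/123456789'></iframe>";

$youtubeId = extractVideoId($youtube, $youtubeTag);
$vimeoId   = extractVideoId($vimeo, $vimeoTag);
$noId      = extractVideoId($youtube, '<img src="a.jpg">');

var_dump($youtubeId, $vimeoId, $noId);
```

One thing to keep in mind: “\w” does not match a hyphen, which can occur in YouTube IDs, so a class such as “[\w-]{11}” may be the safer choice.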
Get background images
'/(?<start><(?<tag>\w{1,12})\s(?![^>]*(divilazy|nolazy|url\(\)))[^>]*(?<!lazy)style=[^>]*)((background-image:|background:)[^>]*(?<value>url\([^\)]*\))[\s;]?)(?<end>[^>]*>)/'
These are a bit more complicated because we don’t know which tag they are assigned to: in principle you can assign them to any tag using the “style” attribute. So we use the rule “<(?<tag>\w{1,12})\s”, which expects a tag name of at most 12 characters after the opening angle bracket “<”, followed by a whitespace. We can then retrieve the tag name with “$match['tag']”.
Here, too, we use a negative lookahead “(?![^>]*(divilazy|nolazy|url\(\)))” to exclude the 2 classes, and in addition an empty url assignment “url()”. Then follows a negative lookbehind “(?<!lazy)”, which allows only “style="…"” and excludes “lazystyle="…"”.
Then we search with “(background-image:|background:)[^>]*(?<value>url” for a background property that continues with url and, in case of a match, get it with “$match['value']”.
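The following sketch runs the pattern against three made-up tags to show which ones it accepts. The helper “findBackground” and the sample markup are assumptions for illustration only.

```php
<?php
$pattern = '/(?<start><(?<tag>\w{1,12})\s(?![^>]*(divilazy|nolazy|url\(\)))[^>]*(?<!lazy)style=[^>]*)((background-image:|background:)[^>]*(?<value>url\([^\)]*\))[\s;]?)(?<end>[^>]*>)/';

// Hypothetical helper: returns the tag name and the url(...) value, or null.
function findBackground(string $pattern, string $tag): ?array
{
    if (preg_match($pattern, $tag, $match) !== 1) {
        return null;
    }
    return ['tag' => $match['tag'], 'value' => $match['value']];
}

// A regular inline background image: expected to match.
$plain = findBackground($pattern, '<div class="hero" style="color:red;background-image:url(img/hero.jpg);">');
// "lazystyle" instead of "style": rejected by the negative lookbehind.
$lazy  = findBackground($pattern, '<div lazystyle="background:url(x.png)">');
// Empty url(): rejected by the negative lookahead.
$empty = findBackground($pattern, '<div style="background:url()">');

var_dump($plain, $lazy, $empty);
```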
Get noscript tags
/**
 * Remove noscript elements to prepare html for lazyLoad
 *
 * @since 1.0
 */
public function clearHtml( $html ) {
    return preg_replace_callback (
        '/(?<match>(?:<\s?noscript\s?>)(?:.|\n)*?(?:\/\s?noscript\s?>))/',
        function ( $match ) {
            $this->counter++;
            $replace = "%%noscript{$this->counter}%%";
            $this->noscript[$replace] = $match['match'];
            return $replace;
        },
        $html
    );
} // end clearHtml
Here we have a crucial exception: a noscript tag can contain any content. Therefore we use “(?:.|\n)*?” to match all characters, including line breaks. The alternation “.|\n” is responsible for this, and the trailing “?” makes the quantifier lazy, so the match stops at the first closing noscript tag “(?:\/\s?noscript\s?>)”. The whole tag is stored in “$match['match']”.
In my case I store the matches in an array “$this->noscript” and replace the noscript tags with “%%noscript{$this->counter}%%”. This way I can run further operations on the output and bring the placeholders back later very easily with a foreach loop and “str_replace”. In this case this happens in an “ob_start” buffer, which closes automatically.
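The complete round trip, replacing the noscript tags with placeholders and restoring them afterwards, can be sketched without the class context like this. The sample markup and the plain variables “$noscript” and “$counter” stand in for the class properties and are assumptions for illustration.

```php
<?php
// Stand-ins for $this->noscript and $this->counter.
$noscript = [];
$counter  = 0;

// Made-up markup with a noscript tag that spans several lines.
$html = "<p>Intro</p><noscript>\n<img src=\"a.jpg\">\n</noscript><p>Outro</p>";

// Step 1: swap every noscript tag for a numbered placeholder.
$cleared = preg_replace_callback(
    '/(?<match>(?:<\s?noscript\s?>)(?:.|\n)*?(?:\/\s?noscript\s?>))/',
    function (array $match) use (&$noscript, &$counter): string {
        $counter++;
        $replace = "%%noscript{$counter}%%";
        $noscript[$replace] = $match['match'];
        return $replace;
    },
    $html
);

// Step 2: further processing of $cleared would happen here.

// Step 3: bring the original tags back with a foreach loop and str_replace.
$restored = $cleared;
foreach ($noscript as $placeholder => $original) {
    $restored = str_replace($placeholder, $original, $restored);
}

echo $cleared, PHP_EOL;   // <p>Intro</p>%%noscript1%%<p>Outro</p>
var_dump($restored === $html);
```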
Do something with a video source
/**
 * Set video background attributes to corresponding elements
 *
 * @since 1.0
 */
public function setVideoBgAttributes( array $data, string $output ) {
    foreach ( $data as $key => $url ) :
        $suffix = DALL()->isIn( 'mp4', $url ) ? 'mp4' : 'webm';
        // preg_quote escapes the delimiter "/" and all other regex metacharacters
        // (such as "." and "?") so that the url is matched literally
        $find = "/(<[^>]*source[^>]*)src=(['\"\s])" . preg_quote( $url, '/' ) . "([^>]*>)/";
        $repl = "$1class=\"divilazy bg\" src=\"" . $this->hostToCdn( DALL()->videos() ) . DALL()->dummy() . ".{$suffix}\" data-lazyvideo=$2{$this->hostToCdn( $url )}$3";
        $output = preg_replace( $find, $repl, $output );
    endforeach;
    return $output;
} // end setVideoBgAttributes
In this case the video url was known, and it was a matter of converting the source tag as needed for the lazy load plugin. This is rather rarely the case, but perhaps helpful in principle.
The regex searches for the source tag that contains the video link “$url”. With “(<[^>]*source[^>]*)” we capture everything before the attribute “src” and can put it back later with “$1”. With “src=(['\"\s])” we capture the kind of opening quote and put it back via the 2nd capture group with “$2”. With “([^>]*>)” we capture everything that follows the matched url and reinsert it with “$3”.
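Here is a reduced, standalone sketch of the same idea, showing what ends up in the three capture groups. The url, the placeholder file and the markup are invented for illustration, and the url is escaped with “preg_quote” so that characters such as “.” are matched literally.

```php
<?php
// Hypothetical values standing in for the plugin's data.
$url    = 'https://example.com/media/clip.mp4';
$dummy  = 'https://cdn.example.com/dummy.mp4';
$output = '<video autoplay><source type="video/mp4" src="https://example.com/media/clip.mp4" /></video>';

// Group 1: everything before src, group 2: the opening quote,
// group 3: everything after the url, including the closing quote.
$find = "/(<[^>]*source[^>]*)src=(['\"\s])" . preg_quote($url, '/') . "([^>]*>)/";

// $1, $2 and $3 stay literal in the PHP string and are resolved by preg_replace.
$repl = "$1class=\"divilazy bg\" src=\"{$dummy}\" data-lazyvideo=$2{$url}$3";

$result = preg_replace($find, $repl, $output);

echo $result, PHP_EOL;
```

The original url moves into the “data-lazyvideo” attribute with the same kind of quote it had before, while “src” points to the placeholder file.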
In between we can do all the operations we need to output our source tag.
Final words for the extended use of Regular Expressions
Even if they scare you a bit at first (I haven’t heard of anyone who did not avoid them at first), Regular Expressions are a very valuable and powerful tool for getting a handle on really difficult tasks. From a performance standpoint, however, you should try to work around them where you can. So if there is a safe way to solve something without Regular Expressions, use it. Otherwise, they are the way out when all other tools fail.
Do you have suggestions for improving this article? Just use the comment area below. Do you want support with the implementation, or do you need help elsewhere? You can book us: simply use our contact form to get in touch.