Writing Regular Expressions for .htaccess and IIS 7 URL Rewrites

When I was in the throes of transferring this site from Joomla to WordPress, one issue I had to contend with was the change in URLs. The solution is the 301 redirect. It passes link juice from the old URL to the new one and, where it can be implemented, is far more helpful to users.

This would be a huge pain to implement manually for every URL, so the approach had to be systematic. The easiest way to accomplish this is with regular expressions, either in a .htaccess file on Linux/Apache or in the URL Rewrite module, the equivalent functionality in Microsoft’s IIS 7.

To use this guide, you’ll need to know the basics of setting up rewrite rules. It turns out that the regex part is not that hard, though. In fact, a simple chart can describe most of the behaviors that regular expressions follow.

Values:

All the values you will find in this section will be matching against a single character.

. Period matches any single character including numbers, letters and symbols.
ex.:
A, a, 1, %, etc.

[ ] Square brackets match a single character from the set or range listed inside them.
ex.:
[a-z] matches any lowercase letter
[0-9] matches any digit
[abC] matches a, b or C
[a-f0-3z] matches a to f, 0 to 3 or z

[^] Matches against anything not in brackets. Basically a negation.
ex.: [^a-c0] matches any character that is not a to c or 0.
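
If you want to try these single-character values without touching a live server, here is a minimal sketch using Python’s built-in re module as a stand-in. This is an assumption on my part: mod_rewrite and IIS use their own regex engines, but the simple constructs covered here should behave the same way. The helper name matches_one_char is made up for illustration.

```python
import re

def matches_one_char(pattern, char):
    # fullmatch succeeds only when the pattern consumes the entire string,
    # so each call tests exactly one character against the pattern.
    return re.fullmatch(pattern, char) is not None

# The period accepts letters, digits and symbols alike.
dot_hits = [c for c in "Aa1%" if matches_one_char(r".", c)]

# A range accepts only the characters it lists.
lower_hits = [c for c in "aZ5" if matches_one_char(r"[a-z]", c)]
mixed_hits = [c for c in "af3zg4" if matches_one_char(r"[a-f0-3z]", c)]

# A leading ^ inside the brackets flips the set.
negated_hits = [c for c in "abc0dz9" if matches_one_char(r"[^a-c0]", c)]

print(dot_hits)
print(lower_hits)
print(mixed_hits)
print(negated_hits)
```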

Anchors

The symbols in this section do not match a character themselves. Instead, they pin the match to a position in the string.

^ Means the string must start here
ex.:
cats will match “I like cats”
^cats will not match “I like cats”
^cats will match “cats like me”

$ Means the string must end here.
ex.:
cats$ will match “I like cats”
cats$ will not match “cats like me”
^cats$ will match only “cats”
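
The anchor examples above can be checked the same way. This is only a sketch in Python (my choice for illustration, not something the rewrite modules use), and the helper name found is hypothetical. It relies on re.search, which looks for the pattern anywhere in the string unless an anchor says otherwise.

```python
import re

def found(pattern, text):
    # re.search scans the whole string for the first place the pattern fits.
    return re.search(pattern, text) is not None

checks = [
    (r"cats", "I like cats"),
    (r"^cats", "I like cats"),
    (r"^cats", "cats like me"),
    (r"cats$", "I like cats"),
    (r"cats$", "cats like me"),
    (r"^cats$", "cats"),
    (r"^cats$", "cats like cats"),
]

anchor_results = [found(pattern, text) for pattern, text in checks]

for (pattern, text), hit in zip(checks, anchor_results):
    print(f"{pattern} against '{text}': {hit}")
```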

Quantifiers

The values in this section let a pattern match many characters at once. You should always use the most specific one possible to reduce overhead and false positives.

* matches zero or more of the previous character.
ex.: lo*l will match “ll” “lol” “lool” “loool” and so on.

+ matches one or more of the previous expression
ex.: lo+l will match “lol” “lool” “loool” and so on.

? matches zero or one of the previous character
ex.: pi?e will match “pe” and “pie”
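
Here is a rough sketch of the three quantifiers side by side, again using Python’s re module purely as a convenient test bench (an assumption, since the article itself targets .htaccess and IIS). The helper name whole is made up. It uses fullmatch so that nothing outside the pattern is allowed.

```python
import re

def whole(pattern, text):
    # True only when the pattern accounts for every character in text.
    return re.fullmatch(pattern, text) is not None

candidates = ["ll", "lol", "loool", "lxl"]

# * allows zero or more "o" characters between the two "l"s.
star_hits = [w for w in candidates if whole(r"lo*l", w)]

# + requires at least one "o", so "ll" drops out.
plus_hits = [w for w in candidates if whole(r"lo+l", w)]

# ? allows the "i" to appear once or not at all, but never twice.
optional_hits = [w for w in ["pe", "pie", "piie"] if whole(r"pi?e", w)]

print(star_hits)
print(plus_hits)
print(optional_hits)
```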

Others

| means OR.
ex.: “gray|grey” will match “gray” or “grey”

() will group an expression together and allow it to be reused later (more on this later)
ex.:
The example above could be simplified to “gr(e|a)y”.
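
To see that the two spellings of the pattern accept the same words, and what the parentheses hold on to, here is a small Python sketch. Python is just my pick for a quick test, and the variable names are illustrative.

```python
import re

words = ["gray", "grey", "groy"]

# The long form spells out both alternatives in full.
plain_hits = [w for w in words if re.fullmatch(r"gray|grey", w)]

# The grouped form limits the OR to the one letter that differs.
grouped_hits = [w for w in words if re.fullmatch(r"gr(e|a)y", w)]

# The parentheses also remember what they matched, as group 1.
captured = re.fullmatch(r"gr(e|a)y", "grey").group(1)

print(plain_hits)
print(grouped_hits)
print(captured)
```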

Escape Character

Since regex uses special characters as defined above, we need a way to tell the computer to read a character literally. This is done with a backslash (\).
ex.: “.htm” must be escaped as “\.htm” to treat the period literally.
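
The difference an escape makes is easy to miss, so here is a short Python sketch of it (Python being my stand-in for testing, not part of the rewrite setup). The file names are invented. The unescaped period happily accepts an “x” where the dot should be, while the escaped one does not.

```python
import re

samples = ["page.htm", "pagexhtm"]

# Unescaped: the period stands for any single character.
unescaped_hits = [s for s in samples if re.search(r".htm", s)]

# Escaped: the backslash makes the period a literal dot.
escaped_hits = [s for s in samples if re.search(r"\.htm", s)]

# Python can also add the backslash for you.
auto_escaped = re.escape(".htm")

print(unescaped_hits)
print(escaped_hits)
print(auto_escaped)
```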

Example in Practice

So let’s put all this together and put it to good use. We’ll use the case of this site as an example. The first thing to do is observe the structure of what you have. See which strings can be easily grouped together, which parts of the URL you want to keep and which you don’t. Take this for instance:

Old URL
www.cmoullas.net/reviews/34-mobile/46-latitudereview

New URL
www.cmoullas.net/latitudereview

We can see a few things here. First, the end of the URL, “latitudereview”, stays the same. That means we want to isolate the end of the URL. We also want to be as specific as possible to avoid false positives and overhead. Finally, we know that we only need to match after the first slash and don’t need to worry about the www.cmoullas.net/. So consider the following solution:

[a-z]+/ matches “reviews/”
[0-9]+- matches “34-”
[a-z]+/ matches “mobile/”
[0-9]+- matches “46-”
(.+) matches “latitudereview” and saves it for later referencing

Putting that all together:

[a-z]+/[0-9]+-[a-z]+/[0-9]+-(.+)

This will work great, but we can still improve on it. How, you might ask? Remember that in this case our only goal is to get at the string “latitudereview”. As a result, we actually only need to match “34-mobile/46-latitudereview”, as that is sufficiently unique to avoid false positives. Thus we can get away with just:

[0-9]+-[a-z]+/[0-9]+-(.+)

A couple more notes on this and you can call yourself an expert! Notice the specificity here. We matched as follows:

Any numbers and a dash
Any letters and a forward slash
Any numbers and a dash
Any string and saved it

We can do this because we know that the URL will always follow that pattern. At the end, we had to match any string because the title could be far more complex with any letters, numbers or even symbols, such as latitude-review-2, for example. The rest will always follow the same specific pattern.
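
Before putting a pattern like this into a live rewrite rule, it can help to try it on a few paths first. The sketch below does that in Python, which is only a test bench here and an assumption on my part; the real rule would live in .htaccess or the IIS URL Rewrite module. The function name new_path and the “about” test path are made up for illustration. The two patterns are the ones from this example.

```python
import re

# Both patterns are copied from the example above.
LONG_PATTERN = r"[a-z]+/[0-9]+-[a-z]+/[0-9]+-(.+)"
SHORT_PATTERN = r"[0-9]+-[a-z]+/[0-9]+-(.+)"

def new_path(old_path, pattern=SHORT_PATTERN):
    """Return the saved title, or None when the path does not fit the pattern."""
    match = re.search(pattern, old_path)
    return match.group(1) if match else None

old = "reviews/34-mobile/46-latitudereview"

# Both versions save the same title from the old path.
long_result = new_path(old, LONG_PATTERN)
short_result = new_path(old)

# The final (.+) also copes with dashes and digits in the title.
complex_result = new_path("reviews/34-mobile/46-latitude-review-2")

# A path that does not follow the old structure is left alone.
unrelated_result = new_path("about")

print(long_result)
print(short_result)
print(complex_result)
print(unrelated_result)
```

The value returned here is the piece that the parentheses saved, which is what a rewrite rule would reuse when building the new URL.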

And that’s it! If you followed all this, you’ll be following the best practices for regex and rewriting like an expert! Look forward to another article in the near future explaining how to use back-references in both mod_rewrite .htaccess files and IIS!