The right tool for the right job
-
So Eric Wise posted a small snippet, IsEmailAddress, that stars a long-winded regular expression to determine whether an input string is in fact an email address. As written, IsEmailAddress(“try.sending.this.email@____________________________________________.zzzzz”) would return True for the suspect email.
I don’t want this to come across as flaming Eric but I do feel it’s important to make a few points (and they’re not directly pointed at Eric, though the first one certainly is relevant, the rest are just more general rants):
1st, if you really want to use the email address for the purpose of sending email, there are far better ways to do this. I’ve blogged about this before but it’s worth repeating: email validation for every new email address, plus parsing your bounce logs, is a surefire way to satisfy the requirement that the email address works and is at least accessible to the registering user (whether they own it or not is a discussion for some other day and some other blog). If you’re not into validation, which is probably the best method, you can try to poll the SMTP server for the domain to see if the account exists, which might tell you it’s a valid email but not necessarily one that will get to the registering user.
2nd, Regular Expressions are a tool; just like anything else, they are only appropriate for certain situations. This isn’t to say you can’t go out of your way to use regex to solve problems at considerable expense to yourself, just for the sake of using regex. Religious zealots abound: look at the junior programmer in the next cube who just learned about the Facade pattern and is trying to use it on every project he ever encounters, right or wrong. Knowing the tools isn’t enough; knowing when it’s appropriate to use which tool is far more important.
3rd, if you’re really interested in what a valid email address is, go read RFC 822, browse down to page 7 and start looking through the lexical symbols. If you think it’s an easy thing to do in a regex, just check out the number of varying results on regex lib and you’ll see pages of tries (and no, I’m not endorsing regex lib; I think it’s a fairly bad idea to insert a bit of code when you have no idea what it’s saying and just hope that it always works). If you want to see what the real regex looks like, go here: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html and it still doesn’t tell you if the email exists, only that it meets the standard.
|
-
In a previous post I complained about the regex replace capabilities inside VS.NET 2003. That’s not to say I don’t use them. I’m certainly going to use PowerGREP for any multiple-file replacements, but I still use the built-in ones, especially with NUnitAsp.
Typically, when starting some GUI testing, we’ll define a bunch of Testers and then “new them up” so we can get on with writing our tests. One handy trick we use is copying our definition list and then regex replacing the copy into the “new” version.
The whole process looks something like this:

I’m about midway through my replacement at this point but it’s a whole lot easier to do this (easier == fewer keystrokes) than it is to rekey or try to manually cut around some of those breaks to do the work yourself.
This does get to some of my annoyances with VS.NET 2k3’s implementation of Regular Expressions. I have to use an expression like this:
protected:b+{[^:b]+}:b+{[^:b]+}:b+=:b+null;
Instead of something like (called out visually different to make a point, black bordered code is what you would use in the Find/Replace with regex dialog in VS.NET):
protected\s+([^\s]+)\s+([^\s]+)\s+=\s+null;
The major differences are that groups are called out with { } instead of ( ) and all my shortcuts (like \s for whitespace) are different (:b being the equivalent for whitespace; well, :Wh is there as well, thank god!). For those that are interested, the replacement looks like this:
\2 = new \1("\2", CurrentWebForm);
Not that Regular Expressions are a giant mystery, and I’m happier to have people learn the concepts than the syntax, but it is nicer to learn one syntax and use it everywhere than to be bothered with mapping one implementation to another.
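For comparison, here is a rough sketch of the same replacement done in code with System.Text.RegularExpressions, using the Perl-style pattern from above. The class name, method name and the two sample declarations are made up for illustration. Note that .NET writes backreferences in the replacement string as $1 and $2 rather than \1 and \2.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TesterDeclarations
{
    // Perl-style equivalent of the VS.NET pattern:
    // group 1 is the tester type, group 2 is the field name.
    private static readonly Regex Declaration = new Regex(
        @"protected\s+([^\s]+)\s+([^\s]+)\s+=\s+null;");

    // Turns each "protected Type name = null;" into
    // "name = new Type("name", CurrentWebForm);".
    public static string ToInitializers(string declarations)
    {
        return Declaration.Replace(
            declarations, "$2 = new $1(\"$2\", CurrentWebForm);");
    }

    public static void Main()
    {
        // Hypothetical declaration list, invented for this example.
        string declarations =
            "protected TextBoxTester userName = null;\n" +
            "protected ButtonTester submit = null;";

        Console.WriteLine(ToInitializers(declarations));
    }
}
```

The pattern is character-for-character the one shown earlier; only the replacement syntax changes between the IDE dialog and the library.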
|
-
I mentioned in my last blog post that Visual Studio .NET search and replace using regular expressions wasn’t all that great for me. I’ve hosed studio a number of times and we tend to use regex replaces quite often in the IDE, especially when coding up an NUnitAsp TestFixture, but we’ll save that for another day.
So I went searching for a tool and found one, PowerGREP. Damn is it fast. It found and replaced 586 matches in 54 files in microseconds. This tool is amazing and uses Perl compatible regular expressions instead of whatever oddball implementation the VS.NET 2003 team decided to use.
Eye candy to follow as always. It’s $99 to buy but it’s worth every cent.

Now playing: KMFDM - Sturm & Drang Tour 2002 - D.I.Y.
|
-
Last night I gave a talk on Regular Expressions for PAFOX, a Visual FoxPro user group. Apparently there was a three part article in FoxTalk2.0 by Lauren Clarke and Randy Pearson about text manipulation that described an implementation/fusion of Regular Expressions with native VFP string/text functions. Tim Yeaney wanted to fill in some blanks for the group so he got in touch with Geoff Snowman, our Developer Champion, who put him in touch with me.
The slide deck for the presentation is over on the Thycotic site but I'll warn you, there's not a whole lot of meat. That's not to say it was a quick talk; it went for an hour and a half. The problem with discussing regular expressions (regex) is that you have to strike a balance. No one in my audience came from a pure computer science background or had any experience in language construction, compiler design or lexers, or (though I didn't ask) with *nix tools like sed, awk, grep or Perl (no, it's not just a *nix tool, I know). So the theory / background leaned a lot more towards "it's a pattern describing a set of strings" than towards formal language theory.
We quickly set aside theory and went into practical usage, starting with some typical contrived examples you might see in data validation. I was using The Regulator during the presentation, a choice I regret and I'll explain why in a moment, to demonstrate matching. Another thing you won't find in my slide deck is a list of characters and symbols: [], [^], \d, \W, etc. Not because I don't think that's useful, but because I wanted to focus on the intent of regex and not necessarily the details. Sure, I'm discussing a specific implementation (System.Text.RegularExpressions) for use with VFP, but I'm also talking about a tool that has many different implementations (look at VS.NET search and replace with regex for example, or for that matter just about any text editor, each with its own version of regex). It's important to understand the concepts; you can get the details from a reference.
I showed a few examples where it could be used in search, how string replacement worked, and we ended by working through log parsing. The log parsing example was interesting because it tied a lot of pieces together. If you ever try something like this, make your log short; 10, 15, 20 lines is probably plenty. I started with a 24K file and quickly pared it down.
One last note about my slide deck: I'm probably the first guy to talk about regex in .NET and not mention regexlib. That's not an oversight. I think regex is a fairly useful tool but like every tool there are appropriate times to use it and times not to. I've discussed this before over on my regex blog. Yes, it is possible to define a single regex to do a lot of the things I don't agree with; that doesn't mean they're wrong, but I'm not going to encourage those things.
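To give a flavor of the log parsing piece, here is a minimal sketch rather than the example from the talk. The log layout ("date time LEVEL message"), the sample lines and the names LogParsing and CountLevel are all assumptions invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogParsing
{
    // Assumed layout, invented for this sketch: "<date> <time> <LEVEL> <message>".
    // Multiline makes ^ and $ match at the start and end of every line.
    private static readonly Regex Entry = new Regex(
        @"^(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>[A-Z]+) (?<message>.*)$",
        RegexOptions.Multiline);

    // Counts the entries logged at the given level, e.g. "ERROR".
    public static int CountLevel(string log, string level)
    {
        int count = 0;
        foreach (Match entry in Entry.Matches(log))
        {
            if (entry.Groups["level"].Value == level)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        // A handful of lines is plenty for a demonstration.
        string log =
            "2003-05-04 13:22:10 INFO Application started\n" +
            "2003-05-04 13:22:15 ERROR Could not open settings file\n" +
            "2003-05-04 13:23:01 INFO Request handled\n" +
            "2003-05-04 13:23:09 ERROR Timeout talking to database";

        Console.WriteLine(CountLevel(log, "ERROR"));
    }
}
```

Named groups keep the pattern readable, which matters more than cleverness when the goal is to show intent.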
Regrets, thoughts, etc:
- Using The Regulator. I'm not busting on Roy or his tool. I think it is a great tool but it doesn't have a place--at least for me--during a presentation for several reasons:
  - I can't change font size. If you can, let me know how; I couldn't find it.
  - I want to turn off "intellisense"; it just gets in my way. This was especially true when not building the complete expression from left to right. I like to go back and refactor the expression, and then I'd be caught in the intellisense bit and have to remember to escape out of it before trying to move the cursor somewhere.
  - I want to turn off scope matching for brackets, parens and braces, for pretty much the same reason as intellisense. I think they're a great idea for a beginner but I'd really like to turn that off.
- Maybe install PowerGrep or something so the search demonstration could be a little more flexible.
- Not having made a minimal quick reference dealing just with basic ECMA: no lookaround stuff and no code, just something I could pop back and forth to when I'm describing a new pattern.
There weren't a whole lot of questions. I'm not sure whether this was good or bad. I think neither; like any other tool, unless you have a specific need where it could be applied to make your life better when you learn about it, you don't know what questions to ask yet. I could have spent some time quizzing the audience; this would probably work a lot better if there was a whiteboard around (or they could just pass the keyboard around or something -- it's not wireless, so I'll have to remember that if I want to try it).
|
-
I had this come up in a project I am working on for a customer, and just answered a message on http://groups.yahoo.com/group/dotnetregex/ about time validation (nope, sorry, answered it somewhere else, suggested they use above list).
“The easiest input to validate is input that can't be wrong”.
Seems like a pretty silly thing to say, self evident to say the least.
For time it's a little more intensive. There are 1440 minutes in a given day, so we really don't want to have a drop down list with 1440 items (plus one as a default). A digital clock isn't that bad:
H: 1-12 (or 0-23)
MTens: 0-5
MOnes: 0-9
AM/PM (or nothing if 0-23)
Taking a lot of ambiguity out of a user interface leads to fewer things to validate on the back end.
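As a rough sketch of what the back end is left with, assuming a 12-hour clock and four separate inputs (the names ClockInput, ToTimeOfDay and the parameter names are made up for illustration): the tens digit of the minutes only runs 0-5, since minutes stop at 59, and the range checks are all that remain of "validation".

```csharp
using System;

public static class ClockInput
{
    // hour: 1-12, minuteTens: 0-5, minuteOnes: 0-9, isPm: the AM/PM choice.
    public static TimeSpan ToTimeOfDay(int hour, int minuteTens, int minuteOnes, bool isPm)
    {
        // With drop down lists these checks should never fire;
        // they only guard against a tampered post.
        if (hour < 1 || hour > 12) throw new ArgumentOutOfRangeException("hour");
        if (minuteTens < 0 || minuteTens > 5) throw new ArgumentOutOfRangeException("minuteTens");
        if (minuteOnes < 0 || minuteOnes > 9) throw new ArgumentOutOfRangeException("minuteOnes");

        int hour24 = hour % 12;   // 12 AM becomes hour 0
        if (isPm)
        {
            hour24 += 12;         // 12 PM becomes hour 12, 11 PM becomes hour 23
        }

        return new TimeSpan(hour24, minuteTens * 10 + minuteOnes, 0);
    }

    public static void Main()
    {
        Console.WriteLine(ToTimeOfDay(11, 5, 9, true));
    }
}
```

There is no string to parse and no format to argue about, which is the whole point of constraining the input.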
Every once in a while the obvious needs to be stated, and I'll do it if no one else will :)
|
-
Let's assume you want to store a valid date. I'll further assume that you'll want to do Date things on that date (either in SQL or in code or somewhere, but you probably want to use it).
Religious tool fanatics will say “regex is the only way to go”. It might be, but I think there is a much nicer way to do this. Use the built-in tools of your language of choice, or write a custom method to do this for you.
I'd do something like this:
// requires: using System;
public bool IsDate(string date)
{
    bool isDate = true; // assume the input is a date until parsing fails
    try
    {
        DateTime.Parse(date);
    }
    catch (FormatException)
    {
        // failed, so not a date.
        isDate = false;
    }
    catch (ArgumentException)
    {
        // null input or an out-of-range value, so not a date.
        isDate = false;
    }
    return isDate;
}
Use that function in a custom validator, or put it in a utility class and use it all the time. It works for a wide variety of formats, as shown by this impromptu test:
string[] dates = new string[] {
    "4/4/2003", "4/4/03", "May 4, 2003", "4 May 2003", "2003-04-04",
    "4/41/2003", "14/4/03", "Masy 4, 2003", "4 Maya 2003", "20023-04-04"
};
foreach (string date in dates)
{
    bool isDate = IsDate(date);
    litResults.Text += date + " => ";
    if (isDate)
        litResults.Text += " Valid Date \n";
    else
        litResults.Text += " Invalid Date \n";
}
Generating the following output:
4/4/2003 => Valid Date
4/4/03 => Valid Date
May 4, 2003 => Valid Date
4 May 2003 => Valid Date
2003-04-04 => Valid Date
4/41/2003 => Invalid Date
14/4/03 => Invalid Date
Masy 4, 2003 => Invalid Date
4 Maya 2003 => Invalid Date
20023-04-04 => Invalid Date
Just trying to check within a single format is difficult enough with regex; trying to accept all those formats would give anyone a headache. You also win by doing this in code if your language of choice supports localization.
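On .NET 2.0 and later the same check can be written without exceptions using DateTime.TryParse. The sketch below is not the method from this post; the class name DateCheck is made up, and the invariant culture is picked only to keep the results predictable. Passing a different CultureInfo is where the localization win comes in.

```csharp
using System;
using System.Globalization;

public static class DateCheck
{
    // Returns true when the text parses as a date under the given culture.
    // TryParse returns false for null or malformed input instead of throwing.
    public static bool IsDate(string text, CultureInfo culture)
    {
        DateTime parsed;
        return DateTime.TryParse(text, culture, DateTimeStyles.None, out parsed);
    }

    public static void Main()
    {
        // Assumption for this sketch: the invariant culture (month/day/year).
        CultureInfo culture = CultureInfo.InvariantCulture;

        Console.WriteLine(IsDate("4/4/2003", culture));   // True
        Console.WriteLine(IsDate("4/41/2003", culture));  // False
    }
}
```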
No, every post won't be about avoiding regex, I just have to get a few things off my chest. Remember my grain of salt. It's all about the right tool for the right job.
|
-
Email validation seems to be one of the most popular uses for regex. I imagine a large part of that, at least in the ASP.NET community, is due to Validation Controls. Sure, you want to validate user input. Hell, you probably even want to do it client side to save your server a little extra load, save the client some time in a post back, and make yourself feel good by watching the little, red, * pop up next to the email field.
I've got to ask you, why are you going to bother validating the format of an email address if you're never going to use it? Sure, this.is.not.an.account@myserver.com is a valid email address, but what good is it, if it doesn't exist?
Oh, you want to use the email address. Well, by the simple logic above, format validation is not sufficient. If you're collecting email to use (and why else would you collect the information, just to waste space on your precious server?) then you need the email address authenticated--every time it changes--and you need to cull through your mail server logs looking for bounced emails. [Parsing through logs, now that is a great use for regex, more on that later]
Email authentication? Sure, you send off an email to the address the user provided with a URL like (http://mysite.com/email/87asdfjasd0fas7df-0235235asdf) and if the hash matches what you expect--stored in the database--you flip the switch on the bit column or put a date in the “emailAuthenticated” column and you're golden. URL Email Authentication would be a very good use for an HTTP module: you expect no input from the user and minimal output from the server (Thanks, you're authenticated), so why bother with the overhead of a webform? So that's an ASP.NET solution, but it's just as easy to deal with $PATH_INFO in PHP or the Query String in any web language for that matter.
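The bookkeeping behind that link can be sketched in a few lines. This is only an illustration under stated assumptions: a random token stands in for the hash, an in-memory dictionary stands in for the database, and the class and method names (EmailAuthenticator, CreateLink, Confirm) are made up. Actually sending the mail and wiring up the HTTP module are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public class EmailAuthenticator
{
    private const string BaseUrl = "http://mysite.com/email/";

    // Stand-ins for the database: token -> address awaiting confirmation,
    // plus the set of addresses that have been confirmed.
    private readonly Dictionary<string, string> pending = new Dictionary<string, string>();
    private readonly HashSet<string> authenticated = new HashSet<string>();

    // Creates a random token for the address and returns the link to mail out.
    public string CreateLink(string email)
    {
        byte[] bytes = new byte[16];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        string token = BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
        pending[token] = email;
        return BaseUrl + token;
    }

    // Called when someone requests the link; each token works only once.
    public bool Confirm(string token)
    {
        string email;
        if (!pending.TryGetValue(token, out email))
        {
            return false;
        }
        pending.Remove(token);
        authenticated.Add(email);
        return true;
    }

    public bool IsAuthenticated(string email)
    {
        return authenticated.Contains(email);
    }
}
```

Since the address can change, the same steps would run again every time it does.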
It's all about the job. If you want an email to use, you better make sure you can use it. Validation just lets you know it's in the right format, Authentication lets you know it's a real email address and hopefully the guy on the other end was expecting it, or doesn't mind it getting sent to him. If you're not going to use the email, why waste the space and collect it, and above all why waste the time to validate it? (Remember the women at your feet?)
|
-
I'll try to avoid the obligatory “first post” scene and instead describe--what I hope will be--the overall theme for my blogging here: “the right tool for the right job”.
I've seen this happen in other places, but most often I see it in software development. You read about “the next best thing” and suddenly get all religious about how sure you are that if you could only use “the next best thing” your life would be:
- 100 times more rewarding,
- 100 times easier, and
- your hardest decision would be which of the women thrown at your feet would you choose first.
Unfortunately that never happens (as I look back at the countless virgins in my wake). What is more likely to happen is that you spend hours trying to force a square peg into a round hole. If you're lucky, you're only frustrated, burnt out and minus a few hours of your life when it's all said and done. If you're not lucky you're all those things, not done, and very close to throwing “the next best thing” out the window because it didn't help at all.
Don't get me wrong, regular expressions (regex) are a great tool to add to your toolbox. I just don't think they should be your only tool. So as you read my posts let that be your grain of salt. When I say (and I will) that coming up with a 7 line regex to validate dates is a bad idea, I'm not saying regexes are a bad idea, but that they're the wrong tool for that particular job.
|