Perl::Critic::Policy::RegularExpressions::ProhibitComplexRegexes - Split long regexps into smaller qr// chunks.
This Policy is part of the core Perl::Critic distribution.
Big regexps are hard to read, perhaps even the hardest part of Perl.
A good practice is to write digestible chunks of regexp and put them
together. This policy flags any regexp that is longer than N characters,
where N is a configurable value that defaults to 60.
If the regexp uses the x flag, then the length is computed after
parsing out any comments or whitespace.
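To give a feel for how the x flag changes the measured length, here is a rough sketch. The helper approx_length is hypothetical and is not how the policy is implemented; it naively strips "#" comments and whitespace, and it ignores edge cases such as escaped whitespace or a "#" inside a character class.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of
# Perl::Critic): approximate the length of an extended-syntax (/x) pattern
# by dropping "# ..." comments and all whitespace.
sub approx_length {
    my ($pattern) = @_;
    $pattern =~ s/(?<!\\)\#[^\n]*//g;    # drop unescaped "# comment" up to end of line
    $pattern =~ s/\s+//g;                # drop all whitespace
    return length $pattern;
}

# A small date pattern written in /x style, held as plain text.
my $pattern = <<'END_PATTERN';
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
END_PATTERN

printf "raw length: %d, approximate /x length: %d\n",
    length $pattern, approx_length($pattern);
```

The point of the sketch is only that generous commenting and layout under the x flag should not, by themselves, push a regexp over the limit.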
As an example, look at the regexp used to match email addresses in Email::Valid::Loose (tweaked lightly to wrap for POD):
    (?x-ism:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]
    \000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015
    "]*)*")(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[
    \]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n
    \015"]*)*")|\.)*\@(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,
    ;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
    )(?:\.(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000
    -\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]))*)
which is constructed from the following code:
    my $esc         = '\\\\';
    my $period      = '\.';
    my $space       = '\040';
    my $open_br     = '\[';
    my $close_br    = '\]';
    my $nonASCII    = '\x80-\xff';
    my $ctrl        = '\000-\037';
    my $cr_list     = '\n\015';
    my $qtext       = qq/[^$esc$nonASCII$cr_list\"]/; # "
    my $dtext       = qq/[^$esc$nonASCII$cr_list$open_br$close_br]/;
    my $quoted_pair = qq<$esc>.qq<[^$nonASCII]>;
    my $atom_char   = qq/[^($space)<>\@,;:\".$esc$open_br$close_br$ctrl$nonASCII]/;# "
    my $atom        = qq<$atom_char+(?!$atom_char)>;
    my $quoted_str  = qq<\"$qtext*(?:$quoted_pair$qtext*)*\">; # "
    my $word        = qq<(?:$atom|$quoted_str)>;
    my $domain_ref  = $atom;
    my $domain_lit  = qq<$open_br(?:$dtext|$quoted_pair)*$close_br>;
    my $sub_domain  = qq<(?:$domain_ref|$domain_lit)>;
    my $domain      = qq<$sub_domain(?:$period$sub_domain)*>;
    my $local_part  = qq<$word(?:$word|$period)*>; # This part is modified
    $Addr_spec_re   = qr<$local_part\@$domain>;
If you read the code from bottom to top, it is quite readable. You can
even see the one violation of RFC 822 that Tatsuhiko Miyagawa
deliberately put into Email::Valid::Loose to allow periods. Look for
the |\. in the upper regexp to see that same deviation.
One could certainly argue that the top regexp could be rewritten more
legibly with m//x and comments. But the bottom version is
self-documenting and, for example, doesn't repeat \x80-\xff 18 times.
Furthermore, it's much easier to compare the second version
against the source BNF grammar in RFC 822 to judge whether the
implementation is sound even before running tests.
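As a smaller illustration of the same approach, here is a minimal sketch that builds a pattern from qr// chunks. The variable names and the simplified timestamp format are assumptions made for this example; they do not come from Email::Valid::Loose or from RFC 822.

```perl
use strict;
use warnings;

# Illustrative chunks for a simplified timestamp such as 2008-02-29T23:59:59.
# All names and the format itself are assumptions for this example only.
my $year   = qr/\d{4}/;
my $month  = qr/(?:0[1-9]|1[0-2])/;
my $day    = qr/(?:0[1-9]|[12]\d|3[01])/;
my $hour   = qr/(?:[01]\d|2[0-3])/;
my $minsec = qr/[0-5]\d/;

# Combine the small chunks into larger named pieces.
my $date = qr/$year-$month-$day/;
my $time = qr/$hour:$minsec:$minsec/;

# Each regexp literal stays short, and the final one reads like the grammar.
my $timestamp_re = qr/\A ($date) T ($time) \z/x;

for my $candidate ('2008-02-29T23:59:59', '2008-13-01T00:00:00') {
    my $verdict = $candidate =~ $timestamp_re ? 'matches' : 'does not match';
    print "$candidate $verdict\n";
}
```

Read from bottom to top, each name documents what its chunk is for, and a shared piece such as $minsec is written only once.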
This policy allows regexps up to N characters long, where N
defaults to 60. You can override this to set it to a different number
with the max_characters setting. To do this, put entries in a
.perlcriticrc file like this:

    [RegularExpressions::ProhibitComplexRegexes]
    max_characters = 40
Initial development of this policy was supported by a grant from the Perl Foundation.
Chris Dolan <cdolan@cpan.org>
Copyright (c) 2007-2008 Chris Dolan. Many rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The full text of this license can be found in the LICENSE file included with this module.