D issues are now tracked on GitHub. This Bugzilla instance remains as a read-only archive.
Issue 7551 - Regex parsing bug for right bracket in character class
Summary: Regex parsing bug for right bracket in character class
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos (show other issues)
Version: D2
Hardware: All All
: P2 enhancement
Assignee: No Owner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-20 03:06 UTC by Magnus Lie Hetland
Modified: 2016-04-07 00:21 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description Magnus Lie Hetland 2012-02-20 03:06:36 UTC
It seems that a bug has appeared for charsets in the std.regex. In previous versions, a right bracket could be included in a character set by placing it first, as is the case in many other languages/libraries. In the current version (I'm using the canned DMD 2.058 for OS X), that doesn't work:

import std.regex;
void main() {
    auto r = regex("[]]");
}

This gives the following exception:

std.regex.RegexException@/usr/share/dmd/src/phobos/std/regex.d(1951): wrong CodepointSet
Pattern with error: `[]` <--HERE-- `]`

This should probably be permitted, as a "least surprise" practice, and to preserve compatibility with older versions. (It doesn't seem to be explicitly documented in the standard library docs, though. Then again, as far as I can see, no other mechanism for including right brackets in charsets is documented either.)
Comment 1 Dmitry Olshansky 2012-02-24 11:28:25 UTC
It perfectly fine to use escapes for special characters:

import std.regex;
void main() {
    auto r = regex("[\]]");
}

The reason for killing first bracket doesn't count rule (if ever knew it existed)
is that new regex allows doing things like 
[[abc0-9]--[bcd||1-9]] 
i.e. set operations 
the above should get you [bc0], it's more useful with \p{xxx} things.
Basically braces do matter more now. 
But this many other languages... (or better libraries) - which ones? Unless there is strong precident I'm not doing another special case.
Comment 2 Magnus Lie Hetland 2012-02-27 00:44:59 UTC
It did exist in the previous version -- my code broke with the new regexp engine, but worked before :-)

If this is a conscious choice, then that's totally fine by me. Special cases aren't the right way to go when the general mechanism works. I had some trouble getting this to work (did something like what you wrote here, which won't work -- but double-escaping does, of course), so I ended up with using the or-operator, which was kind of hackish ;-)

So, yeah, I guess I "retract" my bug report :->

As for other languages: Yeah, I think this is pretty common. E.g., Python (http://docs.python.org/library/re.html) and in Perl and Perl-compatible regexps, as used in all kinds of places, such as PHP, Apache, Safari, … (http://www.php.net/manual/en/regexp.reference.character-classes.php).

So I think the "place member end brackets as first character" is the "industry standard" behavior.

But as a compromise: Perhaps a useful error message pointing out the escape thing could be added? Or it could be explicitly pointed out in a note in the documentation (to avoid special-casing the error code)?

I think some kind of "least surprise" handling for people coming from basically anywhere else might be useful ;-)
Comment 3 Magnus Lie Hetland 2012-02-27 00:51:18 UTC
This whole thing goes for start brackets, too, I guess. As far as I can see, they, too, must be escaped when used inside character classes, now. This follows from the definition in the docs, for sure, but wasn't entirely obvious to me -- especially given that it worked before. (I.e., that was another thing that broke in my code recently, when upgrading.)
Comment 4 Dmitry Olshansky 2012-02-27 02:36:06 UTC
Full backwards compatibility looked like a nice idea at start. 
I'm increasingly regret that decision, as things still got broken as I had to add new features that block some undocumented behavior.

Ehm escape sequences were partly broken in 2.057 ... sorry about that.

BTW this page shows that [ and ] should be escaped, and not a single word on it used as first character (unlike '-' that is supported).
http://www.php.net/manual/en/regexp.reference.character-classes.php

About Python, heh, I'm eager to see how would they go about adding set operations without breaking compatibility (they count [ as plain '[' in the middle of charset). I guess a brand new module if it they ever will.

> 
> But as a compromise: Perhaps a useful error message pointing out the escape
> thing could be added? Or it could be explicitly pointed out in a note in the
> documentation (to avoid special-casing the error code)?
> 
> I think some kind of "least surprise" handling for people coming from basically
> anywhere else might be useful ;-)

Hm.. that's a good idea. Hereby it's an enhacement request ;)
Comment 5 Magnus Lie Hetland 2012-02-27 03:18:50 UTC
Quoting Dmitry:
> BTW this page shows that [ and ] should be escaped, and not a single word on it
> used as first character (unlike '-' that is supported).
> http://www.php.net/manual/en/regexp.reference.character-classes.php

Huh? Did you read the first paragraph…?-)

Quoted, for your convenience (my highlight):
> An opening square bracket introduces a character class, terminated by a closing square
> bracket. A closing square bracket on its own is not special. **If a closing square bracket is
> required as a member of the class, it should be the first data character in the class** (after
> an initial circumflex, if present) or escaped with a backslash.

It says so right there, no? This is the way it's been in several languages I've used throughout the years. I guess they just didn't have escaping inside character classes in the olden days ;-)
Comment 6 Dmitry Olshansky 2012-02-27 05:18:12 UTC
(In reply to comment #5)
> Quoting Dmitry:
> > BTW this page shows that [ and ] should be escaped, and not a single word on it
> > used as first character (unlike '-' that is supported).
> > http://www.php.net/manual/en/regexp.reference.character-classes.php
> 
> Huh? Did you read the first paragraph…?-)

Searching gets the better of me :( I 'greped' for "[" 

> 
> Quoted, for your convenience (my highlight):
> > An opening square bracket introduces a character class, terminated by a closing square
> > bracket. A closing square bracket on its own is not special. **If a closing square bracket is
> > required as a member of the class, it should be the first data character in the class** (after
> > an initial circumflex, if present) or escaped with a backslash.
> 
> It says so right there, no? This is the way it's been in several languages I've
> used throughout the years. I guess they just didn't have escaping inside
> character classes in the olden days ;-)

Apparently it's one of these historical kind of things.
Comment 7 github-bugzilla 2016-04-07 00:21:06 UTC
Commits pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/afd16eac09d8178d47d2e39ba8fe87c765369fc7
Fix issue 7551 - Regex parsing bug for right bracket in character class

https://github.com/D-Programming-Language/phobos/commit/0ce66bc7aa9aee44757bda5b6102b8180dda3b19
Merge pull request #4161 from DmitryOlshansky/issue-11765

Fix issue 7551 - Regex parsing bug for right bracket in character class