Issue 3465 - isIdeographic can be wrong in std.xml
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos
Version: D2
Hardware: Other All
Importance: P2 minor
Assignee: Shin Fujishiro
Reported: 2009-11-01 21:51 UTC by Michael Rynn
Modified: 2015-06-09 01:26 UTC
Description Michael Rynn 2009-11-01 21:51:25 UTC
The std.xml function isIdeographic caused my parser to fail one of the XML conformance tests on the character 0x4E00.

// As implemented in XML Piece Parser Project,  http://source.miryn.org/
// but I took it from std.xml

//WRONG in std.xml
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

// RIGHT, because the lookup function requires
// the range pairs in the table to be ordered!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];

// PERFORMANCE SUGGESTION
// A table lookup pays off for larger tables.
// For a small table like this one, or for a single character,
// a hard-coded comparison should be faster.
//
// It generates hardly any more code, and it avoids both
// the call to lookup and the array slicing.

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3007 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}

// Only a suggestion here:
// isChar has to be called for every single character in the document,
// so it must be worth a bit of optimisation,
// especially for the common cases.

/**
 * Returns true if the character is a character according to the XML standard.
 * Character references must refer to one of these.
 * Any Unicode character, excluding the surrogate blocks, FFFE, and FFFF:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Avoid [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF].
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * The standard ASCII case takes at most 3 value comparisons.
 *
 * Params:
 *    c = the character to be tested
 */
bool isChar(dchar c) 
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* Disabled: some conformance tests use 0x10FFFF.
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
    if (c <= 0x0039 && c >= 0x0030)
        return true;
    else
        return lookup(DigitTable, c);
}
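
To illustrate why the order of the range pairs matters, here is a minimal sketch of a binary search over a flat table of [low, high] pairs. The name lookupRange and the two table names are hypothetical, and the code only assumes that the lookup in std.xml works along these lines; it is not copied from Phobos.

```d
// Hypothetical helper: binary search over a flat array of [low, high] pairs.
// It assumes the pairs are sorted in ascending order and do not overlap.
bool lookupRange(const(dchar)[] table, dchar c)
{
    while (table.length != 0)
    {
        // Index of the low bound of the middle pair (always even).
        immutable m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];      // continue in the lower pairs
        else if (c > table[m + 1])
            table = table[m + 2 .. $];  // continue in the upper pairs
        else
            return true;                // low <= c <= high
    }
    return false;
}

// The table order reported as wrong, and the ordered one suggested above.
immutable dchar[] unsortedTable = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
immutable dchar[] sortedTable   = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];
```

With the unsorted table, a search for 0x4E00 starts at the middle pair (0x3007, 0x3007), moves to the upper pairs, and never sees the (0x4E00, 0x9FA5) pair at the front, so it returns false. With the sorted table the same search returns true.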
Comment 1 Michael Rynn 2009-11-01 21:58:11 UTC
// A check on my code indicates afternoon doziness, so here is the better version

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c <= 0x3029 && c >= 0x3021)
        return true;
    if (c <= 0x9FA5 && c >= 0x4E00)
        return true;
    return false;
}
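
The difference between the version in the description and the one in this comment is the lower bound of the second test. A small side-by-side sketch makes the effect visible; the names isIdeographicFirst and isIdeographicFixed are hypothetical and exist only for this comparison.

```d
// Version from the description: the second test starts at 0x3007,
// so it also accepts 0x3008 through 0x3020.
bool isIdeographicFirst(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3007 && c <= 0x3029)
        return true;
    return c >= 0x4E00 && c <= 0x9FA5;
}

// Version from Comment 1: the second test starts at 0x3021.
bool isIdeographicFixed(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3021 && c <= 0x3029)
        return true;
    return c >= 0x4E00 && c <= 0x9FA5;
}
```

For example, 0x3008 is accepted by the first version but rejected by the second, while both accept 0x3007, 0x3021, and 0x4E00.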
Comment 2 Shin Fujishiro 2010-05-23 21:36:54 UTC
Fixed in svn r1552.
Thanks for your contribution!

Excuse me: I removed certain parts of your code from the actual commit. The contributed code took newer Unicode standards into account. I like new things, but to support XML 1.0, we have to stick to Unicode 2.0.
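
The comment above does not say exactly which parts were dropped. As a rough sketch only, and assuming the removed parts were the "Avoid" ranges ([#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF]), a check limited to the Char production quoted in the description's doc comment could look like this. The name isCharXml10 is hypothetical and this is not the committed code.

```d
// Hypothetical sketch: accepts exactly
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isCharXml10(dchar c)
{
    // Most common case first: everything from space up to the surrogates.
    if (c >= 0x20 && c <= 0xD7FF)
        return true;
    // The three permitted control characters: tab, line feed, carriage return.
    if (c == 0x9 || c == 0xA || c == 0xD)
        return true;
    // Above the surrogate blocks, excluding FFFE and FFFF.
    if (c >= 0xE000 && c <= 0xFFFD)
        return true;
    // Supplementary planes.
    return c >= 0x10000 && c <= 0x10FFFF;
}
```

Under this sketch, 0x7F and 0xFDD0 are accepted, whereas the isChar in the description rejects them; 0xFFFE, 0xFFFF, and the surrogates 0xD800 to 0xDFFF are rejected by both.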