D issues are now tracked on GitHub. This Bugzilla instance remains as a read-only archive.
Issue 7689 - splitter() on invalid UTF-8 sequences
Summary: splitter() on invalid UTF-8 sequences
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos
Version: D2
Hardware: x86 Windows
Importance: P2 normal
Assignee: monarchdodra
Reported: 2012-03-11 14:07 UTC by bearophile_hugs
Modified: 2013-11-18 02:36 UTC
Description bearophile_hugs 2012-03-11 14:07:42 UTC
Is this difference/inconsistency between split() and splitter() desired and good?


import std.string, std.array, std.algorithm, std.range;
void main() {
    char[] s = cast(char[])[131, 64, 32, 251, 22];
    assert(std.string.split(s).length == 2); // no error
    assert(walkLength(std.array.splitter(s)) == 2); // Invalid UTF-8 sequence
    assert(walkLength(std.algorithm.splitter(s)) == 2); // Invalid UTF-8 sequence
}


Output, DMD 2.059head:

std.utf.UTFException@std\utf.d(645): Invalid UTF-8 sequence (at index 1)
----------------
...\dmd2\src\phobos\std\array.d(469): dchar std.array.front!(char[]).front(char[])
...\dmd2\src\phobos\std\algorithm.d(2110): D3std9algorithm47__T8splitterS28...
...\dmd2\src\phobos\std\range.d(971): D3std5range97__...
----------------
Comment 1 monarchdodra 2012-10-22 23:06:22 UTC
(In reply to comment #0)

This is a bug in string.split (which is actually a public import of array.split).

Currently, array.split only recognizes ASCII whitespace and is oblivious to longer (multi-byte) UTF whitespace characters, although it otherwise works on Unicode input.
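
To illustrate the distinction, here is a minimal sketch (not the actual Phobos implementation) of why a code-unit-based split on ASCII whitespace never notices invalid UTF-8, while anything that decodes the input does. The helper name countAsciiFields is hypothetical and exists only for this example; the byte array is the one from the report.

```d
import std.algorithm.iteration : splitter;
import std.ascii : isWhite;
import std.exception : assertThrown;
import std.range : walkLength;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: counts the pieces produced by splitting on ASCII
// whitespace, looking only at raw code units (no UTF-8 decoding at all).
size_t countAsciiFields(const(char)[] s)
{
    // representation() reinterprets the chars as ubytes, so the range
    // primitives do not auto-decode and cannot throw a UTFException.
    return s.representation.splitter!(b => isWhite(b)).walkLength;
}

void main()
{
    // Same bytes as in the report; 131 is a stray continuation byte.
    char[] s = cast(char[]) [131, 64, 32, 251, 22];

    // Anything that decodes the input rejects it.
    assertThrown!UTFException(validate(s));

    // The byte-wise split only sees the single ASCII space (32).
    assert(countAsciiFields(s) == 2);
}
```

The trade-off is the one described above: a byte-wise scan tolerates malformed input but can only recognize single-byte (ASCII) whitespace.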
Comment 2 bearophile_hugs 2013-11-18 02:36:29 UTC
Seems fixed:

https://github.com/D-Programming-Language/phobos/pull/1502