Issue 24190 - Identifier tokenizer is greedy and steals newline characters
Summary: Identifier tokenizer is greedy and steals newline characters
Status: NEW
Alias: None
Product: D
Classification: Unclassified
Component: dmd
Version: D2
Hardware: All All
Importance: P1 enhancement
Assignee: No Owner
Reported: 2023-10-18 00:03 UTC by Richard (Rikki) Andrew Cattermole
Modified: 2023-10-27 11:20 UTC

Description Richard (Rikki) Andrew Cattermole 2023-10-18 00:03:16 UTC
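Currently, the tokenizer for identifiers is too greedy. When it reaches a non-ASCII newline character such as U+2028, it consumes the character and rejects it as invalid in an identifier, when it should probably end the identifier there and defer to the outer loop, which can then treat the character as a line break or report the error itself.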

```d
$ cat lsps.d
void main ()
{
    enum b = 8;
    mixin ("enum a1 =\u2028b; pragma (msg, a1);");
    mixin ("enum a2\u2028= b; pragma (msg, a2);");
    mixin ("enum\u2028a3 = b; pragma (msg, a3);");
}
$ dmd lsps.d
8
lsps.d-mixin-5(5): Error: char 0x2028 not allowed in identifier
lsps.d-mixin-6(6): Error: char 0x2028 not allowed in identifier
```

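That character, U+2028 (LINE SEPARATOR), is a valid newline character, as the first mixin shows: it compiles and prints 8. The error appears only in the second and third mixins, where the character directly follows an identifier or keyword (`a2`, `enum`).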
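
To illustrate the requested behaviour, here is a minimal sketch of an identifier scanner that stops at a Unicode line break instead of reporting it. This is not the dmd lexer's code: `isUnicodeEndOfLine` and `scanIdentifier` are hypothetical names, and `std.uni.isAlpha`/`isNumber` only approximate D's real identifier rules (the sketch also does not reject a leading digit).

```d
import std.uni : isAlpha, isNumber;
import std.utf : decode;

/// U+2028 (line separator) and U+2029 (paragraph separator) are the
/// non-ASCII end-of-line characters in D's lexical grammar.
bool isUnicodeEndOfLine(dchar c)
{
    return c == '\u2028' || c == '\u2029';
}

/// Hypothetical helper: returns the index one past the identifier that
/// starts at `start`. A character that cannot continue the identifier ends
/// the scan without an error, so the caller decides what it means.
size_t scanIdentifier(string src, size_t start)
{
    size_t i = start;
    while (i < src.length)
    {
        size_t next = i;
        immutable dchar c = decode(src, next); // advances `next` past one code point
        if (isUnicodeEndOfLine(c))
            break; // leave the line break for the outer loop
        if (c != '_' && !isAlpha(c) && !isNumber(c))
            break; // any other non-identifier character also ends the token
        i = next;
    }
    return i;
}

unittest
{
    // Mirrors the second mixin: the identifier `a2` starts at index 5.
    string src = "enum a2\u2028= b;";
    assert(src[5 .. scanIdentifier(src, 5)] == "a2");
    assert(src[0 .. scanIdentifier(src, 0)] == "enum");
}
```

The explicit `isUnicodeEndOfLine` check is redundant with the general test below it, but it makes the intent visible: the scanner ends the token and the outer loop consumes U+2028 as an ordinary line break. The sketch can be exercised with `dmd -unittest -main -run`.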