Issue 23186 - wchar/dchar do not have their endianness defined
Summary: wchar/dchar do not have their endianness defined
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: dlang.org
Version: D2
Hardware: All All
Importance: P1 enhancement
Assignee: No Owner
Keywords: pull
Reported: 2022-06-13 23:52 UTC by Richard (Rikki) Andrew Cattermole
Modified: 2022-09-02 17:10 UTC
Description Richard (Rikki) Andrew Cattermole 2022-06-13 23:52:46 UTC
UTF-16 and UTF-32 each come in little-endian and big-endian versions.

Even if the endianness is target-defined, it would be good to have the spec say so.
Comment 1 Dennis 2022-06-15 10:12:41 UTC
This is relevant when e.g. converting a `ubyte[]` to a `wchar[]` or `dchar[]`, but I don't think the language ever does that itself. A `wchar` and `dchar` are defined as "unsigned 16/32 bit" basic types, just like `ushort` or `uint`, and endianness in general is already specified to be target defined here:

https://dlang.org/spec/abi.html#endianness

Would it suffice to add char types to the table below it?

https://dlang.org/spec/abi.html#basic_types
Comment 2 Richard (Rikki) Andrew Cattermole 2022-06-15 16:33:27 UTC
No, this isn't an ABI thing, it's about encodings.

Ideally, wchar/dchar would have little and big endian versions so that we can represent both forms of the encoding in the type system.

It has to be in: https://dlang.org/spec/type.html#basic-data-types

However, it can be kept pretty simple, something like ``Unicode 8-bit code point with matching target endian``.
Comment 3 Dennis 2022-06-15 21:34:38 UTC
(In reply to Richard Cattermole from comment #2)
> No, this isn't an ABI thing, it's about encodings.

I don't follow, do you have a reference for me? I'm looking at:

https://en.wikipedia.org/wiki/UTF-16

"Each Unicode code point is encoded either as one or two 16-bit code units. How these 16-bit codes are stored as bytes then depends on the 'endianness' of the text file or communication protocol."

The `wchar` type is an integer, the 16-bit code. No integral operations on a `wchar` reveal the endianness, only once you reinterpret cast 'the text file' (a `ubyte[]`) will endianness come up, but at that point I think it's no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and LE `short` types either.

> However, it can be kept pretty simple something like `Unicode 8-bit code
> point with matching target endian`.

There's no endian difference for 8-bit code points, or are we talking about bit order instead of byte order?
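
To make the reinterpretation point concrete, here is a minimal sketch (an illustration only, not taken from the spec or from any pull request). `decodeUtf16` is a hypothetical helper written for this example; `bigEndianToNative`, `littleEndianToNative` and `Endian` come from Phobos (`std.bitmanip` and `std.system`). It assumes the input holds a whole number of 16-bit units and does no surrogate validation.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative;
import std.system : Endian;

/// Hypothetical helper (not part of Phobos): turn raw UTF-16 bytes with an
/// explicitly stated byte order into native `wchar` code units.
wchar[] decodeUtf16(const(ubyte)[] bytes, Endian order)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must be a whole number of 16-bit units");
    auto units = new wchar[](bytes.length / 2);
    foreach (i, ref unit; units)
    {
        // Copy one 16-bit unit's worth of bytes into a fixed-size buffer.
        ubyte[2] pair;
        pair[0] = bytes[2 * i];
        pair[1] = bytes[2 * i + 1];
        unit = order == Endian.bigEndian
            ? bigEndianToNative!wchar(pair)
            : littleEndianToNative!wchar(pair);
    }
    return units;
}

void main()
{
    // "Hi" serialized as UTF-16BE and as UTF-16LE.
    immutable ubyte[] be = [0x00, 0x48, 0x00, 0x69];
    immutable ubyte[] le = [0x48, 0x00, 0x69, 0x00];

    // Once decoded, both are the same sequence of 16-bit integers.
    assert(decodeUtf16(be, Endian.bigEndian) == "Hi"w);
    assert(decodeUtf16(le, Endian.littleEndian) == "Hi"w);

    // A plain reinterpret cast picks up the target's byte order,
    // exactly as a cast to ushort[] would.
    auto asWchar = cast(const(wchar)[]) le;
    auto asUshort = cast(const(ushort)[]) le;
    assert(asWchar[0] == asUshort[0]);
}
```

The byte order only appears as a parameter of the conversion from bytes; the resulting `wchar` values themselves carry no endianness.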
Comment 4 Richard (Rikki) Andrew Cattermole 2022-06-15 21:44:37 UTC
(In reply to Dennis from comment #3)
> (In reply to Richard Cattermole from comment #2)
> > No, this isn't an ABI thing, it's about encodings.
> 
> I don't follow, do you have a reference for me? I'm looking at:
> 
> https://en.wikipedia.org/wiki/UTF-16
> 
> "Each Unicode code point is encoded either as one or two 16-bit code units.
> How these 16-bit codes are stored as bytes then depends on the 'endianness'
> of the text file or communication protocol."
> 
> The `wchar` type is an integer, the 16-bit code. No integral operations on a
> `wchar` reveal the endianness, only once you reinterpret cast 'the text
> file' (a `ubyte[]`) will endianness come up, but at that point I think it's
> no different than casting a `ubyte[]` to a `ushort[]`. We don't have BE and
> LE `short` types either.

Indeed. With integers you kind of expect them to match the CPU's endianness, but you cannot assume the same for UTF (hence we should document it).

> > However, it can be kept pretty simple something like `Unicode 8-bit code
> > point with matching target endian`.
> 
> There's no endian difference for 8-bit code points, or are we talking about
> bit order instead of byte order?

That should have been UTF-16 or UTF-32, but it's the same.
Comment 5 Dlang Bot 2022-06-16 09:33:54 UTC
@dkorpel created dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" fixing this issue:

- Fix 23186 - wchar/dchar do not have their endianess defined

https://github.com/dlang/dlang.org/pull/3319
Comment 6 Dlang Bot 2022-09-02 17:10:26 UTC
dlang/dlang.org pull request #3319 "Fix 23186 - wchar/dchar do not have their endianess defined" was merged into master:

- d3e822cf7d4acfd38fcf3dc3a632c3644741c6d3 by Dennis Korpel:
  Fix 23186 - wchar/dchar do not have their endianess defined

https://github.com/dlang/dlang.org/pull/3319