Issue 6829 - Unsigned rotate standard function in Phobos, or rotate intrinsic in core.bitop
Summary: Unsigned rotate standard function in Phobos, or rotate intrinsic in core.bitop
Status: RESOLVED WORKSFORME
Alias: None
Product: D
Classification: Unclassified
Component: phobos
Version: D2
Hardware: All
OS: All
Importance: P2 enhancement
Assignee: No Owner
URL:
Keywords:
Duplicates: 1116
Depends on:
Blocks:
 
Reported: 2011-10-18 15:10 UTC by bearophile_hugs
Modified: 2021-06-06 16:21 UTC
CC: 7 users

See Also:


Description bearophile_hugs 2011-10-18 15:10:25 UTC
I think Phobos needs one or two functions to rotate the bits of an unsigned integral value, because DMD is now able to optimize a specific coding pattern to use specific CPU instructions, but it's easy to write the operation in a way that prevents the pattern from being recognized. 

A Phobos template function (with a constraint that lets it accept only the right input types) would let programmers explicitly request this operation from every present and future D compiler, with no risk of mistakes on the programmer's part.

See also:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=146892
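(Editor's note) The coding pattern under discussion, a pair of complementary shifts combined with a bitwise OR, has the following general shape; whether a given back-end collapses it into a single rotate instruction is exactly what the comments below investigate, so treat this only as a sketch:

```d
// Rotate left written with shifts; a back-end may (or may not)
// recognize this shape and emit a single rol instruction.
uint rotateLeft(uint x, uint n)
{
    // Assumes 0 < n < 32; the n == 0 edge case is discussed later.
    return (x << n) | (x >> (32 - n));
}
```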
Comment 1 hsteoh 2013-07-09 09:05:22 UTC
The implementation should probably go into std.bitmanip or core.bitop.

What's the pattern that DMD recognizes for rotate instructions?
Comment 2 bearophile_hugs 2013-07-09 10:13:38 UTC
(In reply to comment #1)

> What's the pattern that DMD recognizes for rotate instructions?

Walter offers this example of recognizable rotation:


void test236() {
    uint a;
    int shift;
    a = 7;
    shift = 1;
    int r;
    r = (a >> shift) | (a << (int.sizeof * 8 - shift));
    assert(r == 0x8000_0003);
    r = (a << shift) | (a >> (int.sizeof * 8 - shift));
    assert(a == 7);
}
Comment 3 hsteoh 2013-07-09 12:06:57 UTC
Hmph. That doesn't work. Compiling without -O just makes DMD translate it literally into shl/shr followed by or. Compiling with -O computes the values via CTFE, so I changed the code by making a and shift parameters, but DMD still uses shl/shr followed by or, instead of replacing it with rol/ror, as claimed.

So basically, the pattern either isn't there, or isn't working properly.
Comment 4 hsteoh 2013-07-09 12:20:16 UTC
In fact, not even gdc -O3 recognizes this pattern.
Comment 5 bearophile_hugs 2013-07-10 02:11:50 UTC
I think this should be a recognizable rotate left function:

    private static uint rol(in uint x, in uint y) pure nothrow {
        return (x << y) | (x >> (32 - y));
    }
Comment 6 Iain Buclaw 2013-07-10 02:38:38 UTC
(In reply to comment #5)
> I think this should be a recognizable rotate left function:
> 
>     private static uint rol(in uint x, in uint y) pure nothrow {
>         return (x << y) | (x >> (32 - y));
>     }

It is (in gdc with -O :)

_D3rol3rolFNaNbxkxkZk:
        mov x, %eax
        mov y, %ecx
        rol %cl, %eax
        ret
Comment 7 Iain Buclaw 2013-07-10 02:50:07 UTC
(In reply to comment #5)
> I think this should be a recognizable rotate left function:
> 
>     private static uint rol(in uint x, in uint y) pure nothrow {
>         return (x << y) | (x >> (32 - y));
>     }

And likewise for rotate right function:

uint
ror (in uint x, in uint y) pure nothrow
{
  return (x >> y) | (x << (32 - y));
}
Comment 8 bearophile_hugs 2013-07-10 03:28:19 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > I think this should be a recognizable rotate left function:
> > 
> >     private static uint rol(in uint x, in uint y) pure nothrow {
> >         return (x << y) | (x >> (32 - y));
> >     }
> 
> It is (in gdc with -O :)
> 
> _D3rol3rolFNaNbxkxkZk:
>         mov x, %eax
>         mov y, %ecx
>         rol %cl, %eax
>         ret


Both dmd and ldc2 recognize the patterns:


uint rol(in uint x, in uint y) pure nothrow {
    return (x << y) | (x >> (32 - y));
}

uint ror(in uint x, in uint y) pure nothrow {
  return (x >> y) | (x << (32 - y));
}

extern(C) uint rol2(in uint x, in uint y) pure nothrow {
    return (x << y) | (x >> (32 - y));
}

void main() {}


/*
dmd -O

_D5temp23rolFNaNbxkxkZk:
        push    EAX
        mov EAX,8[ESP]
        mov ECX,[ESP]
        rol EAX,CL
        pop ECX
        ret 4

_D5temp23rorFNaNbxkxkZk:
        push    EAX
        mov EAX,8[ESP]
        mov ECX,[ESP]
        ror EAX,CL
        pop ECX
        ret 4

_rol2 :
        mov EAX,4[ESP]
        mov ECX,8[ESP]
        rol EAX,CL
        ret

----------------

ldmd2 -O -output-s

__D5temp23rolFNaNbxkxkZk:
    movl    4(%esp), %edx
    movb    %al, %cl
    roll    %cl, %edx
    movl    %edx, %eax
    ret $4

__D5temp23rorFNaNbxkxkZk:
    movl    4(%esp), %edx
    movb    %al, %cl
    rorl    %cl, %edx
    movl    %edx, %eax
    ret $4

_rol2:
    movb    8(%esp), %cl
    movl    4(%esp), %eax
    roll    %cl, %eax
    ret

*/
Comment 9 Iain Buclaw 2013-07-10 03:49:49 UTC
(In reply to comment #8)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > I think this should be a recognizable rotate left function:
> > > 
> > >     private static uint rol(in uint x, in uint y) pure nothrow {
> > >         return (x << y) | (x >> (32 - y));
> > >     }
> > 
> > It is (in gdc with -O :)
> > 
> > _D3rol3rolFNaNbxkxkZk:
> >         mov x, %eax
> >         mov y, %ecx
> >         rol %cl, %eax
> >         ret
> 
> 
> Both dmd and ldc2 recognize the patterns:
> 

Excellent, so could put these as templates then. :)

T rol (T)(in T x, in uint y) pure nothrow
{
  return cast(T)((x << y) | (x >> ((T.sizeof * 8) - y)));
}

T ror (T)(in T x, in uint y) pure nothrow
{
  return cast(T)((x >> y) | (x << ((T.sizeof * 8) - y)));
}


// Tests to check for assembly output.
extern ubyte xb;
extern ushort xs;
extern uint xi;
extern ulong xl;

extern uint yi;

void main()
{
  rol (xb, yi);   // rolb
  ror (xb, yi);   // rorb

  rol (xs, yi);   // rolw
  ror (xs, yi);   // rorw

  rol (xi, yi);   // roll
  ror (xi, yi);   // rorl

  rol (xl, yi);   // version(X86_64) rolq
  ror (xl, yi);   // version(X86_64) rorq
}
Comment 10 bearophile_hugs 2013-07-10 04:15:16 UTC
(In reply to comment #9)

> Excellent, so could put these as templates then. :)

I have improved your code a little. The 32-bit asm for ldc2 and dmd follows:


import std.traits: isIntegral, isUnsigned;

T rol(T)(in T x, in uint y) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
{
    return cast(T)((x << y) | (x >> ((T.sizeof * 8) - y)));
}

T ror(T)(in T x, in uint y) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
{
    return cast(T)((x >> y) | (x << ((T.sizeof * 8) - y)));
}

void main() {
    // Tests to check for assembly output.
    {
        __gshared static ubyte xb;
        __gshared static ushort xs;
        __gshared static uint xi;
        __gshared static ulong xl;
        __gshared static uint yi;

        rol(xb, yi);   // rolb
        ror(xb, yi);   // rorb

        rol(xs, yi);   // rolw
        ror(xs, yi);   // rorw

        rol(xi, yi);   // roll
        ror(xi, yi);   // rorl

        rol(xl, yi);   // version(X86_64) rolq
        ror(xl, yi);   // version(X86_64) rorq
    }
}


/*
ldmd2 -O -output-s


__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
    pushl   %esi
    movzbl  8(%esp), %edx
    movb    %al, %cl
    movl    %edx, %esi
    shll    %cl, %esi
    movl    $8, %ecx
    subl    %eax, %ecx
    shrl    %cl, %edx
    orl %esi, %edx
    movzbl  %dl, %eax
    popl    %esi
    ret $4

__D5temp210__T3rorThZ3rorFNaNbNfxhxkZh:
    pushl   %esi
    movzbl  8(%esp), %edx
    movb    %al, %cl
    movl    %edx, %esi
    shrl    %cl, %esi
    movl    $8, %ecx
    subl    %eax, %ecx
    shll    %cl, %edx
    orl %esi, %edx
    movzbl  %dl, %eax
    popl    %esi
    ret $4

__D5temp210__T3rolTtZ3rolFNaNbNfxtxkZt:
    pushl   %esi
    movzwl  8(%esp), %edx
    movb    %al, %cl
    movl    %edx, %esi
    shll    %cl, %esi
    movl    $16, %ecx
    subl    %eax, %ecx
    shrl    %cl, %edx
    orl %esi, %edx
    movzwl  %dx, %eax
    popl    %esi
    ret $4

__D5temp210__T3rorTtZ3rorFNaNbNfxtxkZt:
    pushl   %esi
    movzwl  8(%esp), %edx
    movb    %al, %cl
    movl    %edx, %esi
    shrl    %cl, %esi
    movl    $16, %ecx
    subl    %eax, %ecx
    shll    %cl, %edx
    orl %esi, %edx
    movzwl  %dx, %eax
    popl    %esi
    ret $4

__D5temp210__T3rolTkZ3rolFNaNbNfxkxkZk:
    movl    4(%esp), %edx
    movb    %al, %cl
    roll    %cl, %edx
    movl    %edx, %eax
    ret $4

__D5temp210__T3rorTkZ3rorFNaNbNfxkxkZk:
    movl    4(%esp), %edx
    movb    %al, %cl
    rorl    %cl, %edx
    movl    %edx, %eax
    ret $4

__D5temp210__T3rolTmZ3rolFNaNbNfxmxkZm:
    pushl   %ebp
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    movl    %eax, %ebx
    movl    $64, %ecx
    subl    %ebx, %ecx
    movl    24(%esp), %edx
    movl    20(%esp), %eax
    movl    %eax, %esi
    shrdl   %cl, %edx, %esi
    movl    %edx, %edi
    shrl    %cl, %edi
    xorl    %ebp, %ebp
    testb   $32, %cl
    cmovnel %edi, %esi
    cmovnel %ebp, %edi
    movb    %bl, %cl
    shldl   %cl, %eax, %edx
    movb    %bl, %cl
    shll    %cl, %eax
    testb   $32, %bl
    cmovnel %eax, %edx
    cmovnel %ebp, %eax
    orl %esi, %eax
    orl %edi, %edx
    popl    %esi
    popl    %edi
    popl    %ebx
    popl    %ebp
    ret $8

__D5temp210__T3rorTmZ3rorFNaNbNfxmxkZm:
    pushl   %ebp
    pushl   %ebx
    pushl   %edi
    pushl   %esi
    movl    %eax, %ecx
    movl    24(%esp), %edx
    movl    20(%esp), %eax
    movl    %eax, %esi
    shrdl   %cl, %edx, %esi
    movl    %edx, %edi
    shrl    %cl, %edi
    xorl    %ebp, %ebp
    testb   $32, %cl
    cmovnel %edi, %esi
    cmovnel %ebp, %edi
    movl    $64, %ebx
    subl    %ecx, %ebx
    movb    %bl, %cl
    shldl   %cl, %eax, %edx
    movb    %bl, %cl
    shll    %cl, %eax
    testb   $32, %bl
    cmovnel %eax, %edx
    cmovnel %ebp, %eax
    orl %esi, %eax
    orl %edi, %edx
    popl    %esi
    popl    %edi
    popl    %ebx
    popl    %ebp
    ret $8

--------------------------------

dmd -O

_D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
        push    EAX
        push    EAX
        movzx   ECX,byte ptr 0Ch[ESP]
        mov EAX,ECX
        mov 0[ESP],ECX
        mov CL,4[ESP]
        mov EDX,0[ESP]
        shl AL,CL
        mov ECX,8
        sub ECX,4[ESP]
        sar EDX,CL
        add ESP,8
        or  AL,DL
        ret 4

_D5temp210__T3rorThZ3rorFNaNbNfxhxkZh:
        push    EAX
        push    EAX
        movzx   ECX,byte ptr 0Ch[ESP]
        mov EAX,ECX
        mov 0[ESP],ECX
        mov ECX,8
        sub CL,4[ESP]
        mov EDX,0[ESP]
        shl AL,CL
        mov ECX,4[ESP]
        sar EDX,CL
        add ESP,8
        or  AL,DL
        ret 4

_D5temp210__T3rolTtZ3rolFNaNbNfxtxkZt:
        push    EAX
        mov AX,8[ESP]
        mov CX,[ESP]
        mov DX,8[ESP]
        shl EAX,CL
        mov ECX,010h
        sub ECX,[ESP]
        and EDX,0FFFFh
        sar EDX,CL
        pop ECX
        or  EAX,EDX
        ret 4

_D5temp210__T3rorTtZ3rorFNaNbNfxtxkZt:
        push    EAX
        mov ECX,010h
        mov AX,8[ESP]
        sub ECX,[ESP]
        mov DX,8[ESP]
        shl EAX,CL
        mov ECX,[ESP]
        and EDX,0FFFFh
        sar EDX,CL
        or  EAX,EDX
        pop ECX
        ret 4

_D5temp210__T3rolTkZ3rolFNaNbNfxkxkZk:
        push    EAX
        mov EAX,8[ESP]
        mov ECX,[ESP]
        rol EAX,CL
        pop ECX
        ret 4

_D5temp210__T3rorTkZ3rorFNaNbNfxkxkZk:
        push    EAX
        mov EAX,8[ESP]
        mov ECX,[ESP]
        ror EAX,CL
        pop ECX
        ret 4

_D5temp210__T3rolTmZ3rolFNaNbNfxmxkZm:
        push    EAX
        mov ECX,0[ESP]
        mov EDX,0Ch[ESP]
        push    EBX
        mov EAX,0Ch[ESP]
        test    CL,020h
        jne L1A
        shld    EDX,EAX,CL
        shl EAX,CL
        jmp short   L20
L1A:        shl EAX,CL
        mov EDX,EAX
        xor EAX,EAX
L20:        mov ECX,040h
        mov EBX,0Ch[ESP]
        push    EDX
        mov EDX,014h[ESP]
        sub ECX,8[ESP]
        test    CL,020h
        jne L3E
        shrd    EBX,EDX,CL
        shr EDX,CL
        jmp short   L44
L3E:        shr EDX,CL
        mov EBX,EDX
        xor EDX,EDX
L44:        mov ECX,EDX
        or  EAX,EBX
        pop EDX
        or  EDX,ECX
        pop EBX
        pop ECX
        ret 8

_D5temp210__T3rorTmZ3rorFNaNbNfxmxkZm:
        push    EAX
        mov ECX,0[ESP]
        mov EDX,0Ch[ESP]
        push    EBX
        mov EAX,0Ch[ESP]
        test    CL,020h
        jne L1A
        shrd    EAX,EDX,CL
        shr EDX,CL
        jmp short   L20
L1A:        shr EDX,CL
        mov EAX,EDX
        xor EDX,EDX
L20:        mov ECX,040h
        mov EBX,0Ch[ESP]
        push    EDX
        mov EDX,014h[ESP]
        sub ECX,8[ESP]
        test    CL,020h
        jne L3E
        shld    EDX,EBX,CL
        shl EBX,CL
        jmp short   L44
L3E:        shl EBX,CL
        mov EDX,EBX
        xor EBX,EBX
L44:        mov ECX,EDX
        or  EAX,EBX
        pop EDX
        or  EDX,ECX
        pop EBX
        pop ECX
        ret 8

*/
Comment 11 Iain Buclaw 2013-07-10 04:57:05 UTC
(In reply to comment #10)
> (In reply to comment #9)
> 
> > Excellent, so could put these as templates then. :)
> 
> I have improved your code a little. Follows the 32 bit asm for ldc2 and dmd:
> 

gdc > (ldc && dmd);

It has no problem detecting all those cases.  :o)
Comment 12 bearophile_hugs 2013-07-10 06:45:05 UTC
A bit better, with a pre-condition to catch some bugs:


T rol(T)(in T x, in uint n) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(n < (T.sizeof * 8));
} body {
    return cast(T)((x << n) | (x >> ((T.sizeof * 8) - n)));
}

T ror(T)(in T x, in uint n) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(n < (T.sizeof * 8));
} body {
    return cast(T)((x >> n) | (x << ((T.sizeof * 8) - n)));
}
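(Editor's note) The precondition above still admits n == 0, and in that case the complementary shift amount equals T.sizeof * 8, which is out of range for a shift in D. A common workaround is to mask both shift counts; the name below is hypothetical, and it was not verified here whether this masked form still matches the back-ends' rotate patterns:

```d
import std.traits : isIntegral, isUnsigned;

// Sketch: a rotate left that is also well defined for n == 0,
// by masking both shift counts into range.
T rolMasked(T)(in T x, in uint n) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
{
    enum uint mask = T.sizeof * 8 - 1;
    // For n == 0: -n & mask == 0, so the result is x | x == x.
    return cast(T)((x << (n & mask)) | (x >> (-n & mask)));
}
```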

--------------

(In reply to comment #11)

> gdc > (ldc && dmd);
> 
> It has no problem detecting all those cases.  :o)

But is that asm generated by gdc actually faster than this asm generated by ldc2?

One of the main points of adding those two templated functions to core.bitop is to have a single common standard pair of rotations that all present and future D compilers can recognize and optimize well.
Comment 13 bearophile_hugs 2013-07-10 06:46:44 UTC
(In reply to comment #11)

> It has no problem detecting all those cases.  :o)

Perhaps you want to show the asm generated by gdc for those functions?

(Perhaps here there's material for a small enhancement request for LLVM.)
Comment 14 Iain Buclaw 2013-07-10 06:53:27 UTC
(In reply to comment #12)
> (In reply to comment #11)
> 
> > gdc > (ldc && dmd);
> > 
> > It has no problem detecting all those cases.  :o)
> 
> But is that asm generated by gdc actually faster than this asm generated by
> ldc2?
> 

Well, I would have to assume that ro{lr}{bwlq} is faster than shl/shr.  Someone should get speed reports in.  =)
Comment 15 Iain Buclaw 2013-07-10 06:55:08 UTC
(In reply to comment #13)
> (In reply to comment #11)
> 
> > It has no problem detecting all those cases.  :o)
> 
> Perhaps you want to show the asm generated by gdc for those functions?
> 
> (Perhaps here there's material for a small enhancement request for LLVM.)

You've seen the assembler output for roll/rorl.  It's identical except for the operand suffix, e.g. rolb/rorb for ubyte.

But to give you a comparison.

// LDC

__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
    pushl   %esi
    movzbl  8(%esp), %edx
    movb    %al, %cl
    movl    %edx, %esi
    shll    %cl, %esi
    movl    $8, %ecx
    subl    %eax, %ecx
    shrl    %cl, %edx
    orl %esi, %edx
    movzbl  %dl, %eax
    popl    %esi
    ret $4

vs.

// GDC

__D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
    movl    %edi, %eax
    movl    %esi, %ecx
    rolb    %cl, %al
    ret
Comment 16 Iain Buclaw 2013-07-10 06:59:38 UTC
(In reply to comment #13)
> (In reply to comment #11)
> 
> > It has no problem detecting all those cases.  :o)
> 
> Perhaps you want to show the asm generated by gdc for those functions?
> 
> (Perhaps here there's material for a small enhancement request for LLVM.)

Full listing:

_D4temp10__T3rolThZ3rolFNaNbNfxhxkZh:
        movl    %edi, %eax
        movl    %esi, %ecx
        rolb    %cl, %al
        ret

_D4temp10__T3rorThZ3rorFNaNbNfxhxkZh:
        movl    %edi, %eax
        movl    %esi, %ecx
        rorb    %cl, %al
        ret

_D4temp10__T3rolTtZ3rolFNaNbNfxtxkZt:
        movl    %edi, %eax
        movl    %esi, %ecx
        rolw    %cl, %ax
        ret

_D4temp10__T3rorTtZ3rorFNaNbNfxtxkZt:
        movl    %edi, %eax
        movl    %esi, %ecx
        rorw    %cl, %ax
        ret

_D4temp10__T3rolTkZ3rolFNaNbNfxkxkZk:
        movl    %edi, %eax
        movl    %esi, %ecx
        roll    %cl, %eax
        ret

_D4temp10__T3rorTkZ3rorFNaNbNfxkxkZk:
        movl    %edi, %eax
        movl    %esi, %ecx
        rorl    %cl, %eax
        ret

_D4temp10__T3rolTmZ3rolFNaNbNfxmxkZm:
        movq    %rdi, %rax
        movl    %esi, %ecx
        rolq    %cl, %rax
        ret

_D4temp10__T3rorTmZ3rorFNaNbNfxmxkZm:
        movq    %rdi, %rax
        movl    %esi, %ecx
        rorq    %cl, %rax
        ret
Comment 17 bearophile_hugs 2013-07-10 09:29:51 UTC
(In reply to comment #15)

> But to give you a comparison.
> 
> // LDC
> 
> __D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
>     pushl   %esi
>     movzbl  8(%esp), %edx
>     movb    %al, %cl
>     movl    %edx, %esi
>     shll    %cl, %esi
>     movl    $8, %ecx
>     subl    %eax, %ecx
>     shrl    %cl, %edx
>     orl %esi, %edx
>     movzbl  %dl, %eax
>     popl    %esi
>     ret $4
> 
> vs.
> 
> // GDC
> 
> __D5temp210__T3rolThZ3rolFNaNbNfxhxkZh:
>     movl    %edi, %eax
>     movl    %esi, %ecx
>     rolb    %cl, %al
>     ret

I have asked in the LLVM IRC channel, and it seems a limit/problem of the llvm back-end. So maybe it's worth writing an LLVM enhancement request.
Comment 18 hsteoh 2013-07-11 21:37:37 UTC
Hmph. I'm using bearophile's code with DMD git HEAD, running as dmd -O, and *none* of the cases produce a rotate instruction. :-(  Could it be an issue with 32-bit vs. 64-bit? I'm running in a purely 64-bit environment.
Comment 19 hsteoh 2013-07-11 21:40:34 UTC
I can't get GDC git HEAD to recognize this pattern either. What am I doing wrong??
Comment 20 hsteoh 2013-07-11 21:46:05 UTC
Could it be because amd64 doesn't support this optimization? Seems odd, though. I'd expect it to work at least up to 16-bit argument form, since that's common to both x86 and amd.
Comment 21 Iain Buclaw 2013-07-12 00:58:49 UTC
1. Not my problem. :)
2. When comparing gdc and dmd, make sure you're actually looking at object files generated by gdc, and not dmd. :)
3. That's highly unlikely as I've tested on x86_64. :)
Comment 22 Iain Buclaw 2013-07-12 01:00:57 UTC
(In reply to comment #21)
> 1. Not my problem. :)
> 2. When comparing gdc and dmd, make sure your actually looking at object files
> generated by gdc, and not dmd. :)
> 3. That's highly unlikely as I've tested on x86_64. :)

and 4. Under simple test conditions where all parameter values are const/known, templated function calls have a habit of being inlined / optimised away.
Comment 23 Iain Buclaw 2013-07-12 01:02:52 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > 1. Not my problem. :)
> > 2. When comparing gdc and dmd, make sure your actually looking at object files
> > generated by gdc, and not dmd. :)
> > 3. That's highly unlikely as I've tested on x86_64. :)
> 
> and 4. Under simple test conditions where all parameter values are const/known,
> templated function calls tend to have a habit of being inlined / optimised
> away.

and 5. Make sure that you use bearophile's last implementation example. ;)

http://d.puremagic.com/issues/show_bug.cgi?id=6829#c12
Comment 24 bearophile_hugs 2013-07-12 01:59:52 UTC
(In reply to comment #23)
> and 5. Make sure that you use bearophiles last implementation example. ;)

Updated code, lacks unittests:


import std.traits: isIntegral, isUnsigned;

/// Rotate x left by nBits bits.
T rol(T)(in T x, in uint nBits) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(nBits < (T.sizeof * 8));
} body {
    return cast(T)((x << nBits) | (x >> ((T.sizeof * 8) - nBits)));
}

/// Rotate x right by nBits bits.
T ror(T)(in T x, in uint nBits) @safe pure nothrow
if (isIntegral!T && isUnsigned!T)
in {
    assert(nBits < (T.sizeof * 8));
} body {
    return cast(T)((x >> nBits) | (x << ((T.sizeof * 8) - nBits)));
}

void main() {
    // Tests to check for assembly output.
    {
        __gshared static ubyte xb;
        __gshared static ushort xs;
        __gshared static uint xi;
        __gshared static ulong xl;
        __gshared static uint yi;

        rol(xb, yi);   // rolb
        ror(xb, yi);   // rorb

        rol(xs, yi);   // rolw
        ror(xs, yi);   // rorw

        rol(xi, yi);   // roll
        ror(xi, yi);   // rorl

        rol(xl, yi);   // version(X86_64) rolq
        ror(xl, yi);   // version(X86_64) rorq
    }
}
Comment 25 hsteoh 2013-07-12 08:13:16 UTC
Nope, it's still not working. I copied-n-pasted exactly the code posted above, and compiled with gdc -frelease -O3 test.d, and here is the disassembly output:

00000000004042d0 <_D4test10__T3rolThZ3rolFNaNbNfxhxkZh>:
  4042d0:	40 0f b6 ff          	movzbl %dil,%edi
  4042d4:	b9 08 00 00 00       	mov    $0x8,%ecx
  4042d9:	29 f1                	sub    %esi,%ecx
  4042db:	89 f8                	mov    %edi,%eax
  4042dd:	d3 f8                	sar    %cl,%eax
  4042df:	89 f1                	mov    %esi,%ecx
  4042e1:	d3 e7                	shl    %cl,%edi
  4042e3:	09 f8                	or     %edi,%eax
  4042e5:	c3                   	retq   
  4042e6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4042ed:	00 00 00 

00000000004042f0 <_D4test10__T3rorThZ3rorFNaNbNfxhxkZh>:
  4042f0:	40 0f b6 ff          	movzbl %dil,%edi
  4042f4:	b9 08 00 00 00       	mov    $0x8,%ecx
  4042f9:	29 f1                	sub    %esi,%ecx
  4042fb:	89 f8                	mov    %edi,%eax
  4042fd:	d3 e0                	shl    %cl,%eax
  4042ff:	89 f1                	mov    %esi,%ecx
  404301:	d3 ff                	sar    %cl,%edi
  404303:	09 f8                	or     %edi,%eax
  404305:	c3                   	retq   
  404306:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40430d:	00 00 00 

0000000000404310 <_D4test10__T3rolTtZ3rolFNaNbNfxtxkZt>:
  404310:	0f b7 ff             	movzwl %di,%edi
  404313:	b9 10 00 00 00       	mov    $0x10,%ecx
  404318:	29 f1                	sub    %esi,%ecx
  40431a:	89 f8                	mov    %edi,%eax
  40431c:	d3 f8                	sar    %cl,%eax
  40431e:	89 f1                	mov    %esi,%ecx
  404320:	d3 e7                	shl    %cl,%edi
  404322:	09 f8                	or     %edi,%eax
  404324:	c3                   	retq   
  404325:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40432c:	00 00 00 
  40432f:	90                   	nop

0000000000404330 <_D4test10__T3rorTtZ3rorFNaNbNfxtxkZt>:
  404330:	0f b7 ff             	movzwl %di,%edi
  404333:	b9 10 00 00 00       	mov    $0x10,%ecx
  404338:	29 f1                	sub    %esi,%ecx
  40433a:	89 f8                	mov    %edi,%eax
  40433c:	d3 e0                	shl    %cl,%eax
  40433e:	89 f1                	mov    %esi,%ecx
  404340:	d3 ff                	sar    %cl,%edi
  404342:	09 f8                	or     %edi,%eax
  404344:	c3                   	retq   
  404345:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  40434c:	00 00 00 
  40434f:	90                   	nop

0000000000404350 <_D4test10__T3rolTkZ3rolFNaNbNfxkxkZk>:
  404350:	b9 20 00 00 00       	mov    $0x20,%ecx
  404355:	89 f8                	mov    %edi,%eax
  404357:	29 f1                	sub    %esi,%ecx
  404359:	d3 e8                	shr    %cl,%eax
  40435b:	89 f1                	mov    %esi,%ecx
  40435d:	d3 e7                	shl    %cl,%edi
  40435f:	09 f8                	or     %edi,%eax
  404361:	c3                   	retq   
  404362:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  404369:	00 00 00 
  40436c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000404370 <_D4test10__T3rorTkZ3rorFNaNbNfxkxkZk>:
  404370:	b9 20 00 00 00       	mov    $0x20,%ecx
  404375:	89 f8                	mov    %edi,%eax
  404377:	29 f1                	sub    %esi,%ecx
  404379:	d3 e0                	shl    %cl,%eax
  40437b:	89 f1                	mov    %esi,%ecx
  40437d:	d3 ef                	shr    %cl,%edi
  40437f:	09 f8                	or     %edi,%eax
  404381:	c3                   	retq   
  404382:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  404389:	00 00 00 
  40438c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000404390 <_D4test10__T3rolTmZ3rolFNaNbNfxmxkZm>:
  404390:	b9 40 00 00 00       	mov    $0x40,%ecx
  404395:	48 89 f8             	mov    %rdi,%rax
  404398:	29 f1                	sub    %esi,%ecx
  40439a:	48 d3 e8             	shr    %cl,%rax
  40439d:	89 f1                	mov    %esi,%ecx
  40439f:	48 d3 e7             	shl    %cl,%rdi
  4043a2:	48 09 f8             	or     %rdi,%rax
  4043a5:	c3                   	retq   
  4043a6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4043ad:	00 00 00 

00000000004043b0 <_D4test10__T3rorTmZ3rorFNaNbNfxmxkZm>:
  4043b0:	b9 40 00 00 00       	mov    $0x40,%ecx
  4043b5:	48 89 f8             	mov    %rdi,%rax
  4043b8:	29 f1                	sub    %esi,%ecx
  4043ba:	48 d3 e0             	shl    %cl,%rax
  4043bd:	89 f1                	mov    %esi,%ecx
  4043bf:	48 d3 ef             	shr    %cl,%rdi
  4043c2:	48 09 f8             	or     %rdi,%rax
  4043c5:	c3                   	retq   
  4043c6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  4043cd:	00 00 00 


It's still using shift + bitwise OR instead of substituting a rotate instruction. I'm also on Linux x86_64 (so says uname -m), so I've no idea what I'm doing wrong. And gdc --version says it's gdc (GCC) 4.8.1. Did something go wrong/missing in my gdc build??
Comment 26 Iain Buclaw 2013-07-12 08:21:21 UTC
Don't currently have a gdc 4.8 compiler at hand to test...
Comment 27 hsteoh 2013-07-12 08:40:43 UTC
Huh? So which gdc have you been using to test this earlier?
Comment 28 Walter Bright 2013-07-13 00:32:38 UTC
(In reply to comment #2)
> (In reply to comment #1)
> 
> > What's the pattern that DMD recognizes for rotate instructions?
> 
> Walter offers this example of recognizable rotation:

Ack, the example is bad. This one generates rol/ror:

-----------
void test(int shift)
{
    uint a = 7;
    uint r;
    r = (a >> shift) | (a << (int.sizeof * 8 - shift));
    assert(r == 0x8000_0003);
    r = (r << shift) | (r >> (int.sizeof * 8 - shift));
    assert(r == 7);
}
-----------
compiling with -O:

_D3foo4testFiZv comdat
        assume  CS:_D3foo4testFiZv
L0:             push    EAX
                mov     EDX,7
                mov     ECX,EAX
                push    EAX
                ror     EDX,CL
                cmp     EDX,080000003h
                push    EBX
                mov     4[ESP],EDX
                je      L22
                mov     EAX,6
                call    near ptr _D3foo8__assertFiZv
L22:            mov     EBX,4[ESP]
                mov     ECX,8[ESP]
                rol     EBX,CL
                cmp     EBX,7
                je      L3B
                mov     EAX,8
                call    near ptr _D3foo8__assertFiZv
L3B:            pop     EBX
                add     ESP,8
                ret
Comment 29 Walter Bright 2013-07-13 00:35:45 UTC
Fix test case:

https://github.com/D-Programming-Language/dmd/pull/2341
Comment 30 bearophile_hugs 2013-07-13 02:37:33 UTC
(In reply to comment #28)

> Ack, the example is bad. This one generates rol/ror:

For normal D programmers it's easy to write D "rotation" code that the D compiler doesn't recognize. So here I have proposed adding two standard rol/ror template functions to core.bitop that all D programmers can use safely and all D compilers can optimize well.
Comment 31 hsteoh 2013-07-13 10:25:27 UTC
I still can't get it to work. I copied Walter's code exactly, compiled with dmd -O (from DMD git HEAD) and here's the disassembly:

0000000000416854 <_D4test4testFiZv>:
  416854:       55                      push   %rbp
  416855:       48 8b ec                mov    %rsp,%rbp
  416858:       48 83 ec 20             sub    $0x20,%rsp
  41685c:       53                      push   %rbx
  41685d:       41 54                   push   %r12
  41685f:       89 7d f8                mov    %edi,-0x8(%rbp)
  416862:       41 bc 07 00 00 00       mov    $0x7,%r12d
  416868:       48 89 f9                mov    %rdi,%rcx
  41686b:       41 d3 ec                shr    %cl,%r12d
  41686e:       b8 07 00 00 00          mov    $0x7,%eax
  416873:       48 ba 20 00 00 00 00    movabs $0x20,%rdx
  41687a:       00 00 00 
  41687d:       48 63 d9                movslq %ecx,%rbx
  416880:       48 2b d3                sub    %rbx,%rdx
  416883:       48 89 d1                mov    %rdx,%rcx
  416886:       d3 e0                   shl    %cl,%eax
  416888:       44 0b e0                or     %eax,%r12d
  41688b:       41 81 fc 03 00 00 80    cmp    $0x80000003,%r12d
  416892:       48 89 8d e8 ff ff ff    mov    %rcx,-0x18(%rbp)
  416899:       74 0a                   je     4168a5 <_D4test4testFiZv+0x51>
  41689b:       bf 06 00 00 00          mov    $0x6,%edi
  4168a0:       e8 37 00 00 00          callq  4168dc <_D4test8__assertFiZv>
  4168a5:       41 8b c4                mov    %r12d,%eax
  4168a8:       8b 4d f8                mov    -0x8(%rbp),%ecx
  4168ab:       d3 e0                   shl    %cl,%eax
  4168ad:       41 8b d4                mov    %r12d,%edx
  4168b0:       48 8b 8d e8 ff ff ff    mov    -0x18(%rbp),%rcx
  4168b7:       d3 ea                   shr    %cl,%edx
  4168b9:       0b c2                   or     %edx,%eax
  4168bb:       83 f8 07                cmp    $0x7,%eax
  4168be:       74 0a                   je     4168ca <_D4test4testFiZv+0x76>
  4168c0:       bf 08 00 00 00          mov    $0x8,%edi
  4168c5:       e8 12 00 00 00          callq  4168dc <_D4test8__assertFiZv>
  4168ca:       41 5c                   pop    %r12
  4168cc:       5b                      pop    %rbx
  4168cd:       48 8b e5                mov    %rbp,%rsp
  4168d0:       5d                      pop    %rbp
  4168d1:       c3                      retq   


I don't understand what I'm doing wrong. Is this a platform-specific issue? I'm running Linux/x86_64.
Comment 32 Walter Bright 2013-07-13 12:54:15 UTC
(In reply to comment #31)
> I don't understand what I'm doing wrong. Is this a platform-specific issue? I'm
> running Linux/x86_64.

It works with 32-bit code generation, not with 64-bit. I didn't realize that. I'll look into fixing it.
Comment 33 hsteoh 2013-07-13 14:27:53 UTC
Interesting. Running dmd -m32 -O works, produces the rotate instructions.

For some reason, I still can't coax gdc to do this. I've tried all combinations of -O, -O2, -O3 and -m32, -m64, and it still only produces shift + bitwise OR. Any clues, Iain? :)
Comment 34 Iain Buclaw 2013-07-14 02:47:24 UTC
(In reply to comment #33)
> Interesting. Running dmd -m32 -O works, produces the rotate instructions.
> 
> For some reason, I still can't coax gdc to do this. I've tried all combinations
> of -O, -O2, -O3 and -m32 , -m64, and it still only produces shift + bitwise OR.
> Any clues, Iain? :)

Works for me (gcc vanilla development) - so I'm shrugging my shoulders on this.
Comment 35 hsteoh 2013-07-14 07:45:05 UTC
Interestingly, translating the code into C and compiling with gcc 4.8.1 does produce the rotate instructions. But compiling the D version with gdc 4.8 doesn't. I've no idea why.
Comment 36 Iain Buclaw 2013-07-14 08:10:23 UTC
(In reply to comment #35)
> Interestingly, translating the code into C and compiling with gcc 4.8.1 does
> produce the rotate instructions. But compiling the D version with gdc 4.8
> doesn't. I've no idea why.

You can look at the output of -fdump-tree-gimple and compare the gcc-generated to the gdc-generated code.

But there should be no difference. :)
Comment 37 Iain Buclaw 2013-07-14 08:38:50 UTC
(In reply to comment #36)
> (In reply to comment #35)
> > Interestingly, translating the code into C and compiling with gcc 4.8.1 does
> > produce the rotate instructions. But compiling the D version with gdc 4.8
> > doesn't. I've no idea why.
> 
> You can look at output of -fdump-tree-gimple and compare gcc to gdc generated
> code.
> 
> But there should be no difference. :)

gdc 4.9.0 20130707 - produces rol/ror instructions.
gdc 4.9.0 20130616 - produces rol/ror instructions.
gdc 4.9.0 20130505 - produces shl/shr instructions.


Related commits between 10/05 - 14/05.

http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198770
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198823
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=198864
Comment 38 bearophile_hugs 2013-07-14 09:10:08 UTC
(In reply to comment #37)

> gdc 4.9.0 20130707 - produces rol/ror instructions.
> gdc 4.9.0 20130616 - produces rol/ror instructions.
> gdc 4.9.0 20130505 - produces shl/shr instructions.

I am changing my mind, maybe now I want rol/ror _intrinsics_ in core.bitop... :-)
Comment 39 Iain Buclaw 2013-07-14 10:28:05 UTC
(In reply to comment #38)
> (In reply to comment #37)
> 
> > gdc 4.9.0 20130707 - produces rol/ror instructions.
> > gdc 4.9.0 20130616 - produces rol/ror instructions.
> > gdc 4.9.0 20130505 - produces shl/shr instructions.
> 
> I am changing my mind, maybe now I want rol/ror _intrinsics_ in core.bitop...
> :-)

core.bitop might be a good idea... then I could map the functions to gcc's builtin lrotate and rrotate expressions. :o)

Slightly off the mark, but could also do vector rotate intrinsics too...
Comment 40 bearophile_hugs 2013-07-28 10:19:00 UTC
(In reply to comment #13)

> (Perhaps here there's material for a small enhancement request for LLVM.)

Here it is, in the LLVM Bugzilla:
http://llvm.org/bugs/show_bug.cgi?id=16726
Comment 41 hsteoh 2013-08-13 09:42:34 UTC
Related: issue #1116.
Comment 42 Temtaime 2013-08-13 10:47:52 UTC
Guys, please, fix the bugs, not improve performance of DMD's backend.
For that purpose there are LDC and GDC.
Comment 43 bearophile_hugs 2013-08-13 11:20:41 UTC
(In reply to comment #42)
> Guys, please, fix the bugs, not improve performance of DMD's backend.
> For that purpose there are LDC and GDC.

Walter is free to do what he likes with his time, including enhancing the DMD back-end. And currently the LDC2 back-end is not optimizing the rotations well enough. I now think that having two rotation intrinsics is a good solution.
Comment 44 Iain Buclaw 2013-08-13 12:05:05 UTC
(In reply to comment #42)
> Guys, please, fix the bugs, not improve performance of DMD's backend.
> For that purpose there are LDC and GDC.

Intrinsics benefit GDC code generation also...
Comment 45 kai 2013-09-09 10:58:01 UTC
(In reply to comment #43)
> And currently LDC2 back-end is not optimizing the rotations well
> enough.

I had a closer look at this. The root cause is the uint nBits declaration. Because of this, the x value is promoted to uint if T is ubyte or ushort (according to TDPL, 2.3.3). In this case LLVM is not smart enough to deduce that the zero extension is unnecessary. If both types are ubyte or ushort, then the rolb/rolw and rorb/rorw instructions are generated.

I try to make LLVM smarter. :-)
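(Editor's note) The promotion kai describes can be sketched as follows; this illustrates only the type behaviour, not any proposed API:

```d
void main()
{
    ubyte x = 0x81;
    uint n = 1;
    // Per the usual integral promotions, x is widened before shifting,
    // so the intermediate result is (at least) 32 bits wide, not 8:
    auto wide = (x << n) | (x >> (8 - n));
    static assert(!is(typeof(wide) == ubyte));
    // An explicit cast is needed to get an 8-bit rotate, and it is this
    // widening/zero-extension dance that LLVM failed to see through:
    ubyte rotated = cast(ubyte)((x << n) | (x >> (8 - n)));
    assert(rotated == 0x03);
}
```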
Comment 46 Martin Nowak 2013-11-01 19:06:39 UTC
*** Issue 1116 has been marked as a duplicate of this issue. ***
Comment 47 moonlightsentinel 2021-06-06 16:21:43 UTC
core.bitop now contains ror and rol.
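(Editor's note) A minimal usage sketch of the functions that ended up in core.bitop; the signatures here are paraphrased, so consult the current core.bitop documentation for the exact declarations:

```d
import core.bitop : rol, ror;

void main()
{
    uint x = 1;
    assert(rol(x, 1) == 2);
    assert(ror(x, 1) == 0x8000_0000);
    // Smaller unsigned types rotate within their own width:
    ubyte b = 0x81;
    assert(rol(b, 1) == 0x03);
}
```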