Issue 5100 - [Intel Atom] -O Degrades performance of loop statements
Status: NEW
Alias: None
Product: D
Classification: Unclassified
Component: dmd
Version: D2
Hardware: All Linux
Importance: P3 normal
Assignee: No Owner
Keywords: backend, performance
Reported: 2010-10-22 05:56 UTC by Iain Buclaw
Modified: 2022-12-17 10:45 UTC
Description Iain Buclaw 2010-10-22 05:56:32 UTC
Two example cases:

loop1.d
---------
void main()
{
    for (int i = 0; i < int.max; i++)
    {
    }
}


loop2.d
---------
void main()
{
    int i = 0;
    while(i < int.max)
    {
        i++;
    }
}


$ dmd loop1.d
$ time ./loop1 
real	0m2.914s
user	0m2.884s
sys	0m0.012s

$ ./dmd loop1.d -O
$ time ./loop1 
real	0m5.695s
user	0m5.684s
sys	0m0.004s

$ ./dmd loop2.d 
$ time ./loop2
real	0m2.912s
user	0m2.892s
sys	0m0.004s

$ ./dmd loop2.d -O
$ time ./loop2
real	0m5.703s
user	0m5.688s
sys	0m0.004s


Execution time almost doubles when optimisations are turned on. Something isn't right here...
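To rule out process start-up and shell `time` overhead, the same loop can also be timed in-process. This is a minimal sketch, assuming a recent Phobos that ships std.datetime.stopwatch (not available in 2010-era releases); countToMax and timeOnce are hypothetical names and not part of the test cases above:

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Same loop as loop2.d, returning the counter so the work has a visible result.
int countToMax()
{
    int i = 0;
    while (i < int.max)
    {
        i++;
    }
    return i;
}

// Hypothetical helper: time a single call of `fn` with a monotonic clock.
Duration timeOnce(int function() fn, out int result)
{
    auto sw = StopWatch(AutoStart.yes);
    result = fn();
    sw.stop();
    return sw.peek();
}

void main()
{
    int result;
    immutable elapsed = timeOnce(&countToMax, result);
    writefln("counted to %s in %s ms", result, elapsed.total!"msecs");
}
```

Build it once with plain `dmd` and once with `dmd -O`, then compare the printed times. An optimiser that folds the empty loop away would make the numbers meaningless, so it is worth checking the disassembly as well.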
Comment 1 Stephan Dilly 2010-12-01 16:50:43 UTC
for the lazy #5100 is
Comment 2 Stephan Dilly 2010-12-01 16:51:37 UTC
(In reply to comment #1)
> for the lazy #5100 is

Oops, very sorry: this comment was meant for bug #5294, because I think it is related.
Comment 3 Don 2010-12-06 01:20:51 UTC
Cannot reproduce. On Windows, for both test cases, without -O it's about 5 seconds (does an INC and CMP of a stack variable). With -O it is about 1 second (just does INC and CMP of EAX).
Comment 4 Iain Buclaw 2010-12-06 14:51:32 UTC
objdump without -O on Linux:

push   %ebp
mov    %esp,%ebp
sub    $0x4,%esp
movl   $0x0,-0x4(%ebp)
cmpl   $0x7fffffff,-0x4(%ebp)
jge    1c <_Dmain+0x1c>
addl   $0x1,-0x4(%ebp)
jmp    d <_Dmain+0xd>
xor    %eax,%eax
leave  
ret    


objdump with -O on Linux:

push   %ebp
mov    %esp,%ebp
xor    %eax,%eax
add    $0x1,%eax
cmp    $0x7fffffff,%eax
jb     5 <_Dmain+0x5>
pop    %ebp
xor    %eax,%eax
ret    


This looks the same as what Don described on his Windows box.

I wonder why Linux is slower... (it must be a quirk, or else my Intel Atom CPU is to blame).
Comment 5 Iain Buclaw 2010-12-07 00:59:45 UTC
Been playing about with GCC; its output seems to perform better:

Objdump:

push   %ebp
mov    %esp,%ebp
and    $0xfffffff0,%esp
push   %eax
sub    $0xc,%esp
lea    0x0(%esi),%esi
add    $0x1,%eax
cmp    $0x7fffffff,%eax
jne    30 <_Dmain+0x10>
add    $0xc,%esp
mov    %ebp,%esp
pop    %ebp
ret    



GCC assembly:

.globl _Dmain
        .type   _Dmain, @function
_Dmain:
.LFB0:
        .cfi_startproc
        .cfi_personality 0x0,__gdc_personality_v0
        pushl   %ebp
        .cfi_def_cfa_offset 8
        movl    %esp, %ebp
        .cfi_offset 5, -8
        .cfi_def_cfa_register 5
        andl    $-16, %esp
        pushl   %eax
        .cfi_escape 0x10,0x3,0x7,0x55,0x9,0xf0,0x1a,0x9,0xfc,0x22
        subl    $12, %esp
        .p2align 4,,7
        .p2align 3
.L4:
        addl    $1, %eax
        cmpl    $2147483647, %eax
        jne     .L4
        addl    $12, %esp
        movl    %ebp, %esp
        popl    %ebp
        ret
        .cfi_endproc
.LFE0:
        .size   _Dmain, .-_Dmain


Can attach the full .s file if needed.

Regards
Comment 6 Walter Bright 2012-01-19 14:08:23 UTC
Perhaps it's because gcc is doing:

    ADD EAX,1

instead of:

    INC EAX
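
One way to test this guess in isolation would be to hand-write both loop bodies with DMD-style inline assembler and time them on the same machine. This is only a sketch: it assumes an x86 or x86-64 target and a compiler that accepts DMD inline asm, plus a recent Phobos for std.datetime.stopwatch. loopInc, loopAdd and the HaveAsm version identifier are names made up for illustration.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln, writeln;

// HaveAsm is a made-up identifier: set when DMD-style inline asm is usable.
version (D_InlineAsm_X86) version = HaveAsm;
else version (D_InlineAsm_X86_64) version = HaveAsm;

version (HaveAsm)
{
    // Variant using INC EAX.
    uint loopInc()
    {
        uint r;
        asm
        {
            xor EAX, EAX;
        Lagain:
            inc EAX;
            cmp EAX, 0x7fffffff;
            jb Lagain;
            mov r, EAX;   // copy the final counter out of EAX
        }
        return r;
    }

    // Same loop with ADD EAX,1 in place of INC.
    uint loopAdd()
    {
        uint r;
        asm
        {
            xor EAX, EAX;
        Lagain:
            add EAX, 1;
            cmp EAX, 0x7fffffff;
            jb Lagain;
            mov r, EAX;
        }
        return r;
    }
}

void main()
{
    version (HaveAsm)
    {
        auto sw = StopWatch(AutoStart.yes);
        immutable a = loopInc();
        immutable tInc = sw.peek();
        sw.reset();               // keeps running, elapsed time restarts at zero
        immutable b = loopAdd();
        immutable tAdd = sw.peek();
        writefln("inc: %s ms (eax=%s)", tInc.total!"msecs", a);
        writefln("add: %s ms (eax=%s)", tAdd.total!"msecs", b);
    }
    else
    {
        writeln("DMD-style inline asm is not available on this target.");
    }
}
```

If both variants take about the same time on the affected CPU, the choice between INC and ADD is probably not the cause.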
Comment 7 Iain Buclaw 2012-01-19 15:33:26 UTC
Maybe not...

I actually get the reverse on my new laptop with 2.057:

$ dmd loop2.d
$ objdump loop2.o -d
push   %ebp
mov    %esp,%ebp
sub    $0x4,%esp
movl   $0x0,-0x4(%ebp)
cmpl   $0x7fffffff,-0x4(%ebp)
jge    1b <_Dmain+0x1b>
incl   -0x4(%ebp)
jmp    d <_Dmain+0xd>
xor    %eax,%eax
leave  
ret    

$ time ./loop2 
real	0m11.780s
user	0m11.769s
sys	0m0.004s

$ dmd loop2.d -O
$ objdump loop2.o -d
push   %ebp
mov    %esp,%ebp
xor    %eax,%eax
inc    %eax
cmp    $0x7fffffff,%eax
jb     5 <_Dmain+0x5>
pop    %ebp
xor    %eax,%eax
ret    

$ time ./loop2 
real	0m3.936s
user	0m3.924s
sys	0m0.008s
Comment 8 Iain Buclaw 2012-01-19 15:39:32 UTC
And on my netbook:

$ dmd loop2.d
$ time ./loop2
real    0m2.948s
user    0m2.924s
sys    0m0.012s

$ dmd loop2.d -O
$ time ./loop2
real    0m5.725s
user    0m5.688s
sys    0m0.012s


Specs of Netbook:
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB
cpu cores	: 1


Specs of Laptop:
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 20
model		: 2
model name	: AMD E-450 APU with Radeon(tm) HD Graphics
stepping	: 0
cpu MHz		: 825.000
cache size	: 512 KB
cpu cores	: 2


Regards
Comment 9 Iain Buclaw 2012-01-19 15:48:04 UTC
My gut feeling is that the main source of it slowing down is the needless push and pop of the frame pointer.
Comment 10 Iain Buclaw 2015-08-09 06:27:51 UTC
For a while now I've been thinking that the bottleneck is probably to do with alignment, but I'd have to get out my (now two generations old) Atom netbook to investigate further.
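
A possible way to check the alignment theory, sketched with DMD-style inline assembler: time the same loop with and without an `align 16` directive in front of the loop head, roughly what the `.p2align` lines do in the GCC output. This assumes an x86 or x86-64 target and a recent Phobos; loopPlain, loopAligned and HaveAsm are hypothetical names. Whether the function itself starts on a 16-byte boundary is up to the compiler and linker, so the disassembly should be checked before drawing conclusions.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln, writeln;

// HaveAsm is a made-up identifier: set when DMD-style inline asm is usable.
version (D_InlineAsm_X86) version = HaveAsm;
else version (D_InlineAsm_X86_64) version = HaveAsm;

version (HaveAsm)
{
    // Loop head placed wherever the assembler happens to put it.
    uint loopPlain()
    {
        uint r;
        asm
        {
            xor EAX, EAX;
        Lagain:
            add EAX, 1;
            cmp EAX, 0x7fffffff;
            jne Lagain;
            mov r, EAX;
        }
        return r;
    }

    // Same loop, with NOP padding requested before the loop head.
    uint loopAligned()
    {
        uint r;
        asm
        {
            xor EAX, EAX;
            align 16;
        Lagain:
            add EAX, 1;
            cmp EAX, 0x7fffffff;
            jne Lagain;
            mov r, EAX;
        }
        return r;
    }
}

void main()
{
    version (HaveAsm)
    {
        auto sw = StopWatch(AutoStart.yes);
        immutable a = loopPlain();
        immutable tPlain = sw.peek();
        sw.reset();
        immutable b = loopAligned();
        immutable tAligned = sw.peek();
        writefln("plain:   %s ms (eax=%s)", tPlain.total!"msecs", a);
        writefln("aligned: %s ms (eax=%s)", tAligned.total!"msecs", b);
    }
    else
    {
        writeln("DMD-style inline asm is not available on this target.");
    }
}
```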