Issue 859 - float vector codegen after inlining very different from manual inlined code
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: dmd
Version: D2
Hardware: All All
Importance: P2 enhancement
Assignee: No Owner
Keywords: performance
Reported: 2007-01-20 01:23 UTC by Bradley Smith
Modified: 2015-06-09 05:11 UTC
Attachments
Example to test inlining of a simple function (1.00 KB, text/plain)
2007-01-20 01:24 UTC, Bradley Smith
Assembly code from the example (19.60 KB, text/plain)
2007-01-20 01:25 UTC, Bradley Smith

Description Bradley Smith 2007-01-20 01:23:40 UTC
Compiler inlining of functions gives much worse performance than manually inlined functions (at least in some cases). In the attached example, the compiler-inlined version is about 6 times slower.

C:\>dmd -O -inline -release -g testinline.d
C:\>testinline.exe
compiler inlined time: 374058
manually inlined time: 61362

C:\>obj2asm testinline.obj -ctestinline.asm

See line 486 for the compiler inlined code
See line 544 for the manually inlined code

The compiler-inlined code contains extra instructions like the following:
	lea	ESI,-080h[EBP]
	lea	EDI,-048h[EBP]
	movsd
	movsd
	movsd
	lea	ESI,-074h[EBP]
	lea	EDI,-03Ch[EBP]
	movsd
	movsd
	movsd

These instructions are absent in the manually inlined code, and may be the cause of the poor performance.
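
For illustration only: in 32-bit code, each operand-less movsd copies 4 bytes from [ESI] to [EDI], so each group of three copies 12 bytes, which matches the size of a struct of three floats such as the Vec3 used later in this thread. The source offsets (-080h and -074h) are also 12 bytes apart. The sketch below is an assumption about what those copies correspond to at the source level, not a description of what dmd does internally; dotExpandedLiterally is a hypothetical name.

```d
import core.stdc.stdio : printf;

struct Vec3 { float x, y, z; }

// Three 4-byte floats: copying one Vec3 moves 12 bytes (three dwords).
static assert(Vec3.sizeof == 12);

// By-value version: each call conceptually copies both arguments.
float dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical picture of a literal inline expansion that keeps the
// by-value parameter copies as temporaries instead of eliminating them.
float dotExpandedLiterally(ref Vec3 a, ref Vec3 b)
{
    Vec3 tmpA = a;   // 12-byte copy, possibly one lea/lea/movsd x3 group
    Vec3 tmpB = b;   // second 12-byte copy
    return tmpA.x * tmpB.x + tmpA.y * tmpB.y + tmpA.z * tmpB.z;
}

void main()
{
    auto a = Vec3(1, 2, 3);
    auto b = Vec3(4, 5, 6);
    // Both calls compute the same dot product; only the copies differ.
    printf("%f %f\n", dot(a, b), dotExpandedLiterally(a, b));
}
```

A manually inlined expression reads a.x, b.x and so on directly, so no such temporaries need to exist.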
Comment 1 Bradley Smith 2007-01-20 01:24:40 UTC
Created attachment 92
Example to test inlining of a simple function
Comment 2 Bradley Smith 2007-01-20 01:25:35 UTC
Created attachment 93
Assembly code from the example
Comment 3 Leandro Lucarella 2010-06-27 18:49:49 UTC
To avoid opening a new bug, I'll reuse this ancient bug report, since its summary is pretty much what I would write for this one.

I'm having some performance problems moving some stuff from a lower-level C-style to a higher-level D-style. Here is an example:

---
int find_if(bool delegate(ref int) predicate)
{
        for (int i = 0; i < 100; i++)
                if (predicate(i))
                        return i;
        return -1;
}

int main()
{
//      for (int i = 0; i < 100; i++)
//              if (i == 99)
//                      return i;
//      return -1;
        return find_if((ref int i) { return i == 99; });
}
---

The program produced by this source executes 4 times more instructions than the more direct (lower-level) version commented out. I would expect DMD to inline all functions/delegates and produce the same asm for both, but that's not the case.

This is a reduced test-case, but I'm working on improving the GC and I'm really hitting this problem. If I use this higher-level style in the GC, a Dil run for generating the Tango docs is 3.33 times slower than the C-ish style used by the current GC.

So I think this is a real problem for D: it's really important to be able to encourage people to use the higher-level D constructs.
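
For comparison, a minimal sketch of a different higher-level style that is not part of this report: the predicate is passed as a compile-time alias template parameter instead of a runtime delegate. The names findIf and isNinetyNine are hypothetical, and the predicate takes its argument by value rather than by ref. With an alias parameter the callee is known at compile time, so the call is direct rather than indirect through a delegate; whether a given dmd version actually inlines it is an assumption that would need to be checked in the generated asm.

```d
// Hypothetical variant of find_if: the predicate is a compile-time alias,
// so each instantiation calls a statically known function.
int findIf(alias predicate)()
{
    for (int i = 0; i < 100; i++)
        if (predicate(i))
            return i;
    return -1;
}

// Module-level predicate used to instantiate the template.
bool isNinetyNine(int i)
{
    return i == 99;
}

void main()
{
    // Same search as the delegate version above.
    assert(findIf!isNinetyNine() == 99);
}
```

This does not address the delegate case described in this comment; it only shows a form that gives the compiler a direct call to work with.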
Comment 4 nfxjfg 2010-06-27 19:00:28 UTC
@Leandro Lucarella: ldc seems to inline the predicate just fine, although the generated code is still slightly different.
Comment 5 Leandro Lucarella 2010-06-27 20:01:35 UTC
(In reply to comment #4)
> @Leandro Lucarella: ldc seems to inline the predicate just fine, although the
> generated code is still slightly different.

Yes, LDC is better at inlining because it doesn't use the front-end inlining code; it lets the LLVM optimizer do the job instead (I think they disabled the DMDFE inliner precisely because of these issues).

This bug report is about the DMD implementation.
Comment 6 bearophile_hugs 2010-07-08 05:10:27 UTC
An improved version of the test program that allows comparing dmd and ldc on this inlining problem:


version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.stdlib: atof;
} else {
    import std.c.stdio: printf;
    import std.c.stdlib: atof;
}

struct Vec3 {
    float x, y, z;
}

float dot(Vec3 A, Vec3 B) {
    return A.x * B.x + A.y * B.y + A.z * B.z;
}

struct Timer {
    long starttime;

    static long getTime() {
        asm {
            naked;
            rdtsc;
            ret;
        }
    }

    void start() {
        starttime = getTime();
    }

    void stop() {
        long endTime = getTime();
        printf("time: %lld\n", endTime - starttime);
    }
}

void main() {
    int n = 30_000;
    Vec3 a = Vec3(atof("1.0"), atof("2.0"), atof("3.0"));
    Vec3 b = Vec3(atof("4.0"), atof("5.0"), atof("6.0"));
    Timer t;
    float sum;

    printf("    Auto inlined ");
    sum = 0.0;
    t.start();
    for (int i; i < n; i++) {
        a.x++;
        a.y++;
        a.z++;
        sum += dot(a, b);
    }
    t.stop();
    printf("sum: %f\n", sum);

    printf("Manually inlined ");
    sum = 0.0;
    t.start();
    for (int i; i < n; i++) {
        a.x++;
        a.y++;
        a.z++;
        sum += a.x * b.x + a.y * b.y + a.z * b.z;
    }
    t.stop();
    printf("sum: %f\n", sum);
}
Comment 7 Brad Roberts 2010-07-08 23:02:25 UTC
Guys, piling more stuff into a bug report isn't a good idea.  In fact, I need
to re-classify this bug since it's not a problem with inlining at all.  The call
to dot in the original code _is_ being inlined.  The resulting code is
different from the manually inlined version, but the code IS inlined.

While the two problems might be the same, they're different enough right now to
call them different bugs.  I just split the new report into bug 4440.
Comment 8 Brad Roberts 2010-07-11 09:05:03 UTC
This was fixed by the changes that fixed bug 2008.  This report passes static arrays as a parameter which was one of the things that caused the inliner to reject a function.

I'm going to close this bug.

I've opened bug 4447 to track a remaining issue regarding oddities involving the first function taking significantly longer to execute, regardless of which it is.