Issue 4440 - [patch] Inlining delegate literals
Summary: [patch] Inlining delegate literals
Status: ASSIGNED
Alias: None
Product: D
Classification: Unclassified
Component: dmd (show other issues)
Version: D2
Hardware: All All
: P3 normal
Assignee: Brad Roberts
URL:
Keywords: patch, performance
Depends on:
Blocks: 4264 4820
  Show dependency treegraph
 
Reported: 2010-07-08 22:59 UTC by Brad Roberts
Modified: 2022-12-17 10:45 UTC (History)
4 users (show)

See Also:


Attachments
add support for inlining function literals (2.45 KB, patch)
2010-07-11 14:39 UTC, Brad Roberts
Details | Diff

Note You need to log in before you can comment on or make changes to this issue.
Description Brad Roberts 2010-07-08 22:59:51 UTC
split from bug 859:

Leandro Lucarella      2010-06-27 18:49:49 PDT 

I'm having some performance problems moving some stuff from a lower-level
C-style to a higher-lever D-style. Here is an example:

---
int find_if(bool delegate(ref int) predicate)
{
        for (int i = 0; i < 100; i++)
                if (predicate(i))
                        return i;
        return -1;
}

int main()
{
//      for (int i = 0; i < 100; i++)
//              if (i == 99)
//                      return i;
//      return -1;
        return find_if((ref int i) { return i == 99; });
}
---

The program produced by this source executes 4 times more instructions than the
more direct (lower-level) version commented out. I would expect DMD to inline
all functions/delegates and produce the same asm for both, but that's not the
case.

This is a reduced test-case, but I'm working on improving the GC and I'm really
hitting this problem. If I use this higher-level style in the GC, a Dil run for
generating the Tango docs is 3.33 times slower than the C-ish style used by the
current GC.

So I think this is a real problem for D, it's really important to be able to
encourage people to use the higher-level D constructs.
Comment 1 Leandro Lucarella 2010-07-09 08:38:57 UTC
I'm not very keen at reading asm, but I'm seeing a couple of 'call's in the generated code:

---
$ dmd -release -O -gc -inline -c test.d
$ objdump -d test.o

test.o:     file format elf32-i386


Disassembly of section .text:

00000000 <.text>:
   0:	c3                   	ret    
   1:	60                   	pusha  
   2:	b8 38 00 00 00       	mov    $0x38,%eax
   7:	b9 00 00 00 00       	mov    $0x0,%ecx
   c:	8b 11                	mov    (%ecx),%edx
   e:	89 10                	mov    %edx,(%eax)
  10:	89 01                	mov    %eax,(%ecx)
  12:	61                   	popa   
  13:	c3                   	ret    

Disassembly of section .text._D4test7find_ifFDFKiZbZi:

00000000 <_D4test7find_ifFDFKiZbZi>:
   0:	55                   	push   %ebp
   1:	8b ec                	mov    %esp,%ebp
   3:	83 ec 0c             	sub    $0xc,%esp
   6:	53                   	push   %ebx
   7:	8b 55 0c             	mov    0xc(%ebp),%edx
   a:	8b 45 08             	mov    0x8(%ebp),%eax
   d:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%ebp)
  14:	89 d3                	mov    %edx,%ebx
  16:	8d 4d fc             	lea    -0x4(%ebp),%ecx
  19:	8b 45 08             	mov    0x8(%ebp),%eax
  1c:	51                   	push   %ecx
  1d:	ff d3                	call   *%ebx
  1f:	84 c0                	test   %al,%al
  21:	74 0a                	je     2d <_D4test7find_ifFDFKiZbZi+0x2d>
  23:	8b 45 fc             	mov    -0x4(%ebp),%eax
  26:	5b                   	pop    %ebx
  27:	8b e5                	mov    %ebp,%esp
  29:	5d                   	pop    %ebp
  2a:	c2 08 00             	ret    $0x8
  2d:	ff 45 fc             	incl   -0x4(%ebp)
  30:	83 7d fc 64          	cmpl   $0x64,-0x4(%ebp)
  34:	7c e0                	jl     16 <_D4test7find_ifFDFKiZbZi+0x16>
  36:	5b                   	pop    %ebx
  37:	8b e5                	mov    %ebp,%esp
  39:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
  3e:	5d                   	pop    %ebp
  3f:	c2 08 00             	ret    $0x8
  42:	90                   	nop
  43:	90                   	nop

Disassembly of section .text._Dmain:

00000000 <_Dmain>:
   0:	55                   	push   %ebp
   1:	8b ec                	mov    %esp,%ebp
   3:	b9 00 00 00 00       	mov    $0x0,%ecx
   8:	51                   	push   %ecx
   9:	31 c0                	xor    %eax,%eax
   b:	50                   	push   %eax
   c:	e8 fc ff ff ff       	call   d <_Dmain+0xd>
  11:	5d                   	pop    %ebp
  12:	c3                   	ret    
  13:	90                   	nop

Disassembly of section .text._D4test4mainFZi12__dgliteral1MFKiZb:

00000000 <_D4test4mainFZi12__dgliteral1MFKiZb>:
   0:	55                   	push   %ebp
   1:	8b ec                	mov    %esp,%ebp
   3:	50                   	push   %eax
   4:	8b 4d 08             	mov    0x8(%ebp),%ecx
   7:	b8 01 00 00 00       	mov    $0x1,%eax
   c:	83 39 63             	cmpl   $0x63,(%ecx)
   f:	74 02                	je     13 <_D4test4mainFZi12__dgliteral1MFKiZb+0x13>
  11:	31 c0                	xor    %eax,%eax
  13:	8b e5                	mov    %ebp,%esp
  15:	5d                   	pop    %ebp
  16:	c2 04 00             	ret    $0x4
---

Am I understanding it all wrong?
Comment 2 Brad Roberts 2010-07-11 10:13:14 UTC
Reducing the test case further:
extern(C) int printf(const char*, ...);

bool if_delegate(int i, bool delegate(ref int) predicate)
{
    return predicate(i);
}

bool if_nodelegate(int i)
{
    return i == 99;
}

void main()
{
    foreach(i; 1 .. 100)
        if (if_delegate(i, (ref int j) { return j == 99; }))
            printf("i = %d\n", i);

    foreach(i; 1 .. 100)
        if (if_nodelegate(i))
            printf("i = %d\n", i);
}

In the previous version of the code, the use of the for loops makes the cost of the functions too high for the inliner to bother inlining find_if into main.  Also,in the non-inlined version of the function, there's no way for it to inline the delegate since it's a variable, not a constant thing.

This changed version reduces the problem to: will it inline the delegate?  The answer is still no, but allows that specific problem to be explored.

Leandro, does this stray too far away from the underlying code that led you to file this bug?
Comment 3 Brad Roberts 2010-07-11 11:19:18 UTC
Ok.. still doesn't inline the delegate in this version:

extern(C) int printf(const char*, ...);

void main()
{
    auto d = (ref int j) { return j == 99; };

    foreach(i; 1 .. 100)
        if (d(i))
            printf("i = %d\n", i);
}

There's nothing in the dmd frontend inliner that even tries to deal with this sort of inlining right now.
Comment 4 Leandro Lucarella 2010-07-11 11:27:57 UTC
(In reply to comment #2)
> Reducing the test case further:
> extern(C) int printf(const char*, ...);
> 
> bool if_delegate(int i, bool delegate(ref int) predicate)
> {
>     return predicate(i);
> }
> 
> bool if_nodelegate(int i)
> {
>     return i == 99;
> }
> 
> void main()
> {
>     foreach(i; 1 .. 100)
>         if (if_delegate(i, (ref int j) { return j == 99; }))
>             printf("i = %d\n", i);
> 
>     foreach(i; 1 .. 100)
>         if (if_nodelegate(i))
>             printf("i = %d\n", i);
> }
> 
> In the previous version of the code, the use of the for loops makes the cost of
> the functions too high for the inliner to bother inlining find_if into main. 
> Also,in the non-inlined version of the function, there's no way for it to
> inline the delegate since it's a variable, not a constant thing.
>
> This changed version reduces the problem to: will it inline the delegate?  The
> answer is still no, but allows that specific problem to be explored.
> 
> Leandro, does this stray too far away from the underlying code that led you to
> file this bug?


Yes, the loop is important, because this happens a lot with opApply(), which almost always includes a loop, and when I use (x having an opApply() method):
foreach (a; x) {
   // some code
}
I almost always want // some code to be inlined, since it's not even a function (from the POV of the programmer), but maybe that should be another bug report, I really don't know.
Comment 5 Brad Roberts 2010-07-11 11:31:38 UTC
Ok.  I'll keep that in mind.  One step at a time for now though; need to get _any_ delegate inlining to happen before worrying about popping up the layers up to an opApply function.
Comment 6 Leandro Lucarella 2010-07-11 13:22:56 UTC
(In reply to comment #5)
> Ok.  I'll keep that in mind.  One step at a time for now though; need to get
> _any_ delegate inlining to happen before worrying about popping up the layers
> up to an opApply function.

Agreed, thanks for giving it some time to this :)
Comment 7 Brad Roberts 2010-07-11 14:39:39 UTC
Created attachment 690 [details]
add support for inlining function literals

This patch adds support for inlining function literals.  This won't cover many cases as most hide behind a variable, but it's a building block in the right direction:

The test case for this patch.. very contrived:

extern(C) int printf(const char*, ...);

void main()
{
    int i = 99;
    if ((delegate(ref int j) { return j == 99; })(i))
        printf("i = %d\n", i);
}

same result if the keyword 'delegate' is omitted.

It passes an abbreviated (only used ARGS="-O -inline -release") run of the test suite.
Comment 8 David Simcha 2010-08-29 13:55:31 UTC
Blocks 4264 because I need delegate literal inlining to get decent performance out of opApply-based map, filter, etc.
Comment 9 David Simcha 2010-08-29 14:26:50 UTC
I just tried this patch out and on second thought I'm not sure it's worth integrating as-is because it's so limited.  It won't even handle the case where a function that calls a delegate gets inlined and the delegate passed from the call site is a literal.  Even the following doesn't get inlined:

bool evaluateDelegate(bool delegate() dg) {
    return dg();
}

void main() {
    evaluateDelegate({return false;});
}


Here's the disassembly of main():

__Dmain PROC NEAR
;  COMDEF __Dmain
        sub     esp, 24                                 ; 0000 _ 83. EC, 18
        xor     eax, eax                                ; 0003 _ 31. C0
        mov     ecx, offset _D5test94mainFZv12__dgliteral1MFZb; 0005 _ B9, 00000000(segrel)
        push    ebx                                     ; 000A _ 53
        mov     edx, ecx                                ; 000B _ 8B. D1
        mov     dword ptr [esp+4H], eax                 ; 000D _ 89. 44 24, 04
        mov     eax, dword ptr [esp+4H]                 ; 0011 _ 8B. 44 24, 04
        mov     ebx, dword ptr [esp+4H]                 ; 0015 _ 8B. 5C 24, 04
        mov     dword ptr [esp+8H], ecx                 ; 0019 _ 89. 4C 24, 08
        call    edx                                     ; 001D _ FF. D2
        xor     eax, eax                                ; 001F _ 31. C0
        pop     ebx                                     ; 0021 _ 5B
        add     esp, 24                                 ; 0022 _ 83. C4, 18
        ret                                             ; 0025 _ C3
__Dmain ENDP
Comment 10 Brad Roberts 2010-08-29 14:31:55 UTC
Agreed.  Getting delegate inlining in dmd is going to take serious work.  The problem is that the inliner is no where near a const propagation pass.  It needs to know that the variables in play have a single, unchangeable, value.

At the point of inlining, there's just a variable with an unknown -- to the compiler -- value.

So, while this change does help that one narrow use case, it's of almost no practical value.
Comment 11 Walter Bright 2015-08-23 02:48:58 UTC
You're right, Brad. Fixing this involves doing constant propagation in the front end. It's possible to do it, as constant propagation is the easiest of the data flow optimizations to do.