D issues are now tracked on GitHub. This Bugzilla instance remains as a read-only archive.
Issue 6643 - Very slow compilation for large switch() using -O and -inline
Summary: Very slow compilation for large switch() using -O and -inline
Status: RESOLVED DUPLICATE of issue 2396
Alias: None
Product: D
Classification: Unclassified
Component: dmd (show other issues)
Version: D2
Hardware: All All
: P2 normal
Assignee: No Owner
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2011-09-10 09:07 UTC by Ben Eirich
Modified: 2015-06-09 05:11 UTC (History)
2 users (show)

See Also:


Attachments
Test case (74.58 KB, application/octet-stream)
2011-09-10 09:07 UTC, Ben Eirich
Details
trace.log and trace.def (483.10 KB, application/octet-stream)
2011-09-10 21:13 UTC, Ben Eirich
Details

Note You need to log in before you can comment on or make changes to this issue.
Description Ben Eirich 2011-09-10 09:07:24 UTC
Created attachment 1023 [details]
Test case

I am attempting to port some CPU emulators from C# to D. It uses a large
central switch statement (2221 lines in this case) with a case per opcode; the
contents of each opcode handler are 'manually inlined' because in C# the
automatic inliner is not very aggressive and there is no explicit inline
modifier (same as in D).

Building the attached program takes under 1 second with only -inline, about 4-5
seconds with only -O, and around 15 minutes with -O and -inline. -release
changes outcomes slightly but not in a significant way.

Because of the nature of this issue, it was not possible to reduce this to a
smaller test case. It is not the number of case statements in the switch that
cause this issue, it is the size or complexity of the code inside the switch
statement.

It's very possible the issue is simply related to very large methods rather
than anything at all to do with switch.

The resulting EXE does work and -O -inline is about 17% faster than -O alone
when run in a benchmark. It just takes a very long time to build.

This isn't a total blocker, however, this is only one CPU and I have about 6-7
I would like to port immediately, and long term, maybe around a dozen. The Z80
CPU has a switch statement which is about 4x bigger than this.

As I am very new to the D toolchain, I'm not currently using incremental
builds. I assuming this is possible and it should mitigate the problem
somewhat. It does make performance testing and optimization an issue, though. I
am using DMD 2.054.
Comment 1 bearophile_hugs 2011-09-10 10:37:30 UTC
Are you able to compile DMD and profile it during its compilation of (a reduced version of) this code? This profiling data will probably help to fix this performance bug.
Comment 2 Ben Eirich 2011-09-10 21:13:25 UTC
Created attachment 1024 [details]
trace.log and trace.def

First, I commented out half of the case statements, and this dropped the -O -inline compile time from 15 minutes to 2 minutes. So there is some definite non-linearity happening there.

Then, I compiled DMD 2.055 with it's "trace" makefile setting. I used that version of DMD to compile the half-size test case with -O -inline. This took a very long time with instrumentation - over an hour.

In any case, I attached the resulting trace.log and trace.def. It doesn't really mean anything to me, so I hope it means something to someone here!
Comment 3 bearophile_hugs 2011-09-11 04:05:24 UTC
The head of the trace log, a bit reformatted for clarity:

    Num       Tree        Func
   Calls      Time        Time

        25  687874498   336057504  void cdecl copyprop()
-896850051  294735899   294735899  unsigned cdecl vec_index(unsigned ,unsigned *)
      3318   77706260    66901704  void cdecl flowcp()
    159748   60550176    38034252  LIST *cdecl listrds(unsigned *,elem *,unsigned *)
 137723661    5558797     5558797  int cdecl ERTOL(elem *)
        48    2881023     2670153  void cdecl flowae()
  14607381    1832766     1832766  void cdecl vec_orass(unsigned *,unsigned *)
     30662    1028359     1028359  void cdecl updaterd(elem *,unsigned *,unsigned *)
   9533245    1008464     1008464  void *cdecl mem_calloc(unsigned )
   6762446     972262      972262  void cdecl vec_subass(unsigned *,unsigned *)
   5832042     801878      801878  void cdecl vec_sub(unsigned *,unsigned *,unsigned *)
   9542441    1158120      592461  void cdecl vec_free(unsigned *)
   8264499    1335987      579438  unsigned *cdecl vec_calloc(unsigned )
   9530121     565679      565679  void cdecl mem_free(void *)
   3582398     471076      471076  void cdecl vec_andass(unsigned *,unsigned *)
   1270712     411849      160136  unsigned *cdecl vec_clone(unsigned *)
   1977317     120103      120103  void cdecl vec_copy(unsigned *,unsigned *)
   1281149      97412       97412  int cdecl vec_equal(unsigned *,unsigned *)
        25   59795507       95264  void cdecl loopopt()
Comment 4 bearophile_hugs 2011-09-11 05:32:04 UTC
Maybe it's possible to speed up the vec_index function (in srt\tk\vec.c) using smarter "bithacks".
Comment 5 bearophile_hugs 2011-09-11 06:10:47 UTC
With a test (removing @property, replacing immutable with const, and removing the imports) the LDC1 compiler shows normal compilation speeds with every combination of switches I have tried.
Note: probably LDC1 disables the inlining logic of the D front-end and uses the LLVM one.
Comment 6 yebblies 2011-09-11 08:26:07 UTC
This is (from what I can see) a problem with the optimizer, a similar kind of bug to issue 2396.  It doesn't seem to have anything to do with the switch, but looks like a slow algorithm trying to reorder assignments, and choking on a 2000+ line function.
Doing some regex magic and placing each case block in a (){}(); delegate call gives normal compile times for me.  Maybe this will be a suitable workaround for you.
Unfortunately there are far fewer people working on the dmd backend than the frontend, so depending on the difficulty it may be a while before this is fixed.
Comment 7 yebblies 2012-02-01 06:26:44 UTC

*** This issue has been marked as a duplicate of issue 2396 ***