Issue 13410 - Performance problem with associative array byKey/byValue
Summary: Performance problem with associative array byKey/byValue
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: druntime
Version: D2
Hardware: x86 Windows
Importance: P1 normal
Assignee: No Owner
URL:
Keywords: pull
Depends on:
Blocks:
 
Reported: 2014-09-01 09:14 UTC by bearophile_hugs
Modified: 2015-02-18 03:38 UTC
CC List: 5 users

See Also:


Attachments
simple 'first used bucket' caching for AAs (3.11 KB, patch)
2014-09-01 10:48 UTC, Ketmar Dark
simple 'first used bucket' caching for AAs with '_aaDelX' recaching (3.48 KB, patch)
2014-09-01 11:37 UTC, Ketmar Dark
simple 'first used bucket' caching for AAs (3.63 KB, patch)
2014-09-01 12:01 UTC, Ketmar Dark
simple 'first used bucket' caching for AAs (3.75 KB, patch)
2014-09-29 16:54 UTC, Ketmar Dark

Description bearophile_hugs 2014-09-01 09:14:18 UTC
This is a reduction of a performance problem (ketmar on the newsgroup D.learn has found it). This is the D version:

- - - - - - - - - - -

import core.stdc.stdio;

int main() {
    long[int] aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;

    long total = 0;

    while (aa.length) {
        int k = aa.byKey.front;
        long v = aa.byValue.front;
        aa.remove(k);
        total += k + v;
    }

    printf("%lld\n", total);
    return 0;
}

- - - - - - - - - - -

The C++ version:


#include <stdio.h>
#include <map>

int main() {
    std::map<int, long long> aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;

    long long total = 0;

    while (!aa.empty()) {
        int k = aa.begin()->first;
        long long v = aa.begin()->second;
        aa.erase(aa.begin());
        total += k + v;
    }

    printf("%lld\n", total);
    return 0;
}

- - - - - - - - - - -

The run-time of the C++ version is about 0.05 seconds (the output is 2499950000), while the D code runs in about 3.8 seconds (76 times slower).

To show that the performance problem comes neither from the associative array creation, nor from the associative array remove, nor from the long sum, here is a very similar version that just avoids byKey/byValue (idea suggested by ketmar); note that this code allocates an eager array with aa.keys. It runs in about 0.06 seconds, sufficiently similar to the C++ version:

- - - - - - - - - - -

import core.stdc.stdio;

int main() {
    long[int] aa;
    for (int i = 0; i < 50000; i++)
        aa[i] = i;

    long total = 0;

    foreach (int k; aa.keys) {
        long v = aa[k];
        aa.remove(k);
        total += k + v;
    }

    printf("%lld\n", total);
    return 0;
}

- - - - - - - - - - -
Comment 1 bearophile_hugs 2014-09-01 09:17:01 UTC
The exact problem was found by Daniel Kozak.
Comment 2 Ketmar Dark 2014-09-01 10:48:59 UTC
Created attachment 1407
simple 'first used bucket' caching for AAs

adding 'first used bucket' caching to the AA halves execution time for the sample.

it seems that we can't do better with the current AAs without hurting overall AA performance. aa.remove() hurts caching and there is no sense in recalculating the cache on each remove.

p.s. please note that this patch is not heavily tested, it's just a quick 'hack-in'.
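
for illustration, the core idea looks roughly like this (Impl/Entry below are hypothetical stand-ins for this sketch, not the actual druntime declarations):

    struct Entry { Entry* next; /* hash, key, value… */ }

    struct Impl
    {
        Entry*[] buckets;
        size_t firstUsedBucket;  // cached index of the first non-empty bucket
    }

    // byKey/byValue would start scanning here instead of always from bucket 0
    Entry* firstEntry(Impl* aa)
    {
        foreach (i; aa.firstUsedBucket .. aa.buckets.length)
        {
            if (aa.buckets[i] !is null)
            {
                aa.firstUsedBucket = i;  // remember for the next call
                return aa.buckets[i];
            }
        }
        return null;  // the AA is empty
    }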
Comment 3 Ketmar Dark 2014-09-01 11:37:43 UTC
Created attachment 1408
simple 'first used bucket' caching for AAs with '_aaDelX' recaching

here is another patch which tries to revalidate the cache in `_aaDelX()`. this makes aa.remove() a little slower but… execution time for both samples is nearly identical now.

and for bearophile's sample here ( http://forum.dlang.org/thread/efmjlfpehqqfszcrxmwq@forum.dlang.org ) we have a small improvement over the previous patch too (~0.4 seconds).
Comment 4 Ketmar Dark 2014-09-01 12:01:54 UTC
Created attachment 1409
simple 'first used bucket' caching for AAs

aaaaaand we have a WINNER!

this final patch should not hurt 'normal' AA usage (it will not update the 'first used' cache on each operation) yet speeds up this synthetic test (~3 seconds --> ~0.03 seconds) and the code from the forum (~3.5 seconds --> 1.1 seconds).
Comment 5 bearophile_hugs 2014-09-01 12:18:12 UTC
(In reply to Ketmar Dark from comment #4)

> and the code from the forum (~3.5 seconds --> 1.1 seconds).

For me the C++ version of the code runs in about 0.20 seconds, so there's still a significant performance difference.
Comment 6 Ketmar Dark 2014-09-01 12:35:12 UTC
(In reply to bearophile_hugs from comment #5)
> For me the C++ version of the code runs in about 0.20 seconds, so there's
> still a significant performance difference.
yes, the AA still needs to search for the "first item" many times; caching just makes it faster in some cases. this can't be changed without a complete rewrite of the AA (and with a performance hit for other use cases).

"get first item" will always be costly for AAs, that's the way AAs work. there are a lot of 'buckets' for items, many of which are empty. to get a *known* item from the AA we can choose the necessary bucket with a simple '%' operation, but to find the "first available" item we must loop through all buckets until we find a used one.

my patch caches the "first used bucket" index (and tries to keep that in sync when the AA contents change), but sometimes i have to just "reset" the cache, or get/remove operations will become unnecessarily slow.

besides, a lot of adding/removing between 'byKey' calls will update the cache for no reason at all, doing unnecessary calculations.

i tried to "amortise" those bad cases by not updating the cache when there were no calls to byKey/byValue for the given AA, and by resetting the cache on AA rehashing. this gives us almost no slowdown when we use AAs without iterating, and we can just call aa.rehash before adding/deleting a big amount of data to turn off caching (besides, adding a lot of data will eventually hit rehashing too).

i can't think of anything better for now. trying to speed up our use case further would only lead to code bloat.
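
for illustration, the two access patterns look roughly like this (again hypothetical Impl/Entry internals, just a sketch):

    // known key: '%' picks the right bucket directly, then we walk a short chain
    Entry* lookup(Impl* aa, size_t hash)
    {
        auto e = aa.buckets[hash % aa.buckets.length];
        // …walk e.next, comparing hashes and keys…
        return e;
    }

    // "first item" with no cache: nothing narrows the search, so every call
    // rescans the (mostly empty) buckets from the start
    Entry* firstEntryNoCache(Impl* aa)
    {
        foreach (i; 0 .. aa.buckets.length)
            if (aa.buckets[i] !is null)
                return aa.buckets[i];
        return null;
    }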
Comment 7 Orvid King 2014-09-01 17:44:57 UTC
It would be nice if this were done as a PR on github, as it would be much 
easier to review.
Comment 8 Ketmar Dark 2014-09-01 17:57:44 UTC
(In reply to Orvid King from comment #7)
> It would be nice if this were done as a PR on github, as it would be much 
> easier to review.
sorry, each time i'm thinking about registering at github god kills one kitten.
Comment 9 bearophile_hugs 2014-09-02 01:00:27 UTC
(In reply to Ketmar Dark from comment #8)

> but to find "first available" item we must loop thru all buckets
> until we find the used one.

Right. (But using an unordered_map in the C++ code from the newsgroup, the C++ code only gets a little more than twice as slow, which still seems about twice as fast as the patched D code. I don't know the source of such a performance difference.)


> sorry, each time i'm thinking about registering at github god kills one
> kitten.

Thank you for your work. Probably someone else will create a pull request with this.
Comment 10 safety0ff.bugz 2014-09-02 06:01:24 UTC
(In reply to bearophile_hugs from comment #9)
>
> Right. (But using an unordered_map in the C++ code from the newsgroup, the
> C++ code only gets a little more than twice as slow, which still seems about
> twice as fast as the patched D code. I don't know the source of such a
> performance difference.)
> 

C++'s unordered_map uses templates and can do all kinds of things the D AAs currently can't.
It can inline/devirtualize the comparison functions, it can store the key and value together, etc.

D's AAs essentially go through a C interface, which limits what they can do.

I don't think there's a great data structure for this task out of the box in D:
- Using std.container's red-black tree isn't ideal because there's no equivalent to std::map's bracket operator (without doing multiple lookups).
- You can't roll a bag (linked list) into the hash table because, since the key and value are stored separately, you cannot map a reference to a value to a reference to the key (without storing a copy of the key inside the value and then doing a lookup).

As for ketmar's patch, I don't think we should introduce a slowdown in _aaDelX.
The first entry can simply be treated as a hint, meaning that the first used bucket is greater than or equal to the hint.
I think _aaDelX should leave the cache value alone since _aaGetFirstEntry checks that the bucket is non-null anyway.
Furthermore, I don't think the plus one indexing is necessary.
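
A sketch of that hint semantics (hypothetical names, same caveats as the sketches above): the cached index only promises "no used bucket before me", so deletion can ignore it entirely:

    // the hint is a lower bound, not an exact position
    Entry* firstEntryWithHint(Impl* aa)
    {
        foreach (i; aa.firstUsedBucket .. aa.buckets.length)
        {
            if (aa.buckets[i] !is null)
            {
                aa.firstUsedBucket = i;  // lazily tighten the bound
                return aa.buckets[i];
            }
        }
        return null;
    }

    // under this scheme _aaDelX would not touch the hint at all: a stale
    // (too-small) hint is harmless and only costs a slightly longer scan later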
Comment 11 Ketmar Dark 2014-09-02 06:10:18 UTC
(In reply to bearophile_hugs from comment #9)
> Right. (But using an unordered_map in the C++ code from the newsgroup, the
> C++ code only gets a little more than twice as slow, which still seems about
> twice as fast as the patched D code. I don't know the source of such a
> performance difference.)
we can make the D code a lot faster if we add one pointer to each AA Entry, for example. but i think that's just wasted memory.

i have some other ideas, but they require either additional memory or more support code in the add/remove functions (thus making adding and removing elements slower in general).

but i'm still thinking. maybe i'll do some more patches soon. or maybe someone who already knows the solution but didn't bother to write it ('cause he never saw a use case for it) will. ;-)

i'm still thinking about adding `aa.frontKey` and `aa.frontValue`, 'cause i don't like the fact that using an AA in foreach() turns on caching. but i believe that adding new methods will zero out the merging chances: "we already have too many methods and runtime hooks, there is no sense in adding two more, blah-blah-blah".
Comment 12 Ketmar Dark 2014-09-02 06:27:48 UTC
(In reply to safety0ff.bugz from comment #10)
> As for ketmar's patch, I don't think we should introduce a slowdown in
> _aaDelX.
cache bookkeeping turns on only when the code has used byKey/byValue at least once (i.e. directly or in a foreach loop). these are not that frequent operations for AAs. and the user cannot modify the AA in `foreach(k; aa.byKey)` or `foreach(v; aa.byValue)` anyway.

the second thing is that cache bookkeeping is in effect only when the "first used" bucket becomes empty.

and any rehash operation will turn off caching too.

without that code in _aaDelX the synthetic test from this report is A LOT slower (~1 second vs ~0.3 seconds) and the code from the forum is somewhat slower (~1.5 seconds vs ~1.1 seconds).

yet i'm sure that the overhead for `aa.remove()` is so small that it can be ignored.

ok, we need some performance tests here. somebody?

> The first entry can simply be treated as a hint, meaning that the first
> used bucket is greater than or equal to the hint.
i've tried this 'hinting' strategy too, and it gives ~1.1 and ~1.9 seconds on the samples. don't like that.

> I think _aaDelX should leave the cache value alone since _aaGetFirstEntry
> checks that the bucket is non-null anyway.
see above.

> Furthermore, I don't think the plus one indexing is necessary.
it's a matter of taste, i think. i need a "cache turned off" flag, and zero is fine. sure, we could use "size_t.max", for example, but then we couldn't use this trick:

  if (i+1 < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i+1;

and would have to write

  if (aa.impl.firstUsedBucket != size_t.max && i < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i;

don't like it. ;-)
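
as a starting point for those performance tests, a minimal timing harness sketch (std.datetime.stopwatch is where StopWatch lives in current D releases; everything else here is just for this example):

    import std.datetime.stopwatch : AutoStart, StopWatch;
    import std.stdio : writefln;

    void main()
    {
        long[int] aa;
        foreach (i; 0 .. 500_000)
            aa[i] = i;

        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. 500_000)
            aa.remove(i);       // the operation whose overhead is in question
        sw.stop();
        writefln("500k removes: %s ms", sw.peek.total!"msecs");
    }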
Comment 13 safety0ff.bugz 2014-09-02 06:41:11 UTC
(In reply to Ketmar Dark from comment #12)
> cache bookkeeping turns on only when the code has used byKey/byValue at
> least once (i.e. directly or in a foreach loop). these are not that frequent
> operations for AAs.

Ah, okay, I missed that it wasn't always on.
Comment 14 hsteoh 2014-09-26 03:37:07 UTC
Created a PR using Ketmar's patch:

https://github.com/D-Programming-Language/druntime/pull/967

Thanks for contributing!
Comment 15 Ketmar Dark 2014-09-26 04:03:59 UTC
tnx for turning this into a PR.
Comment 16 hsteoh 2014-09-26 04:52:45 UTC
And this time, I remembered to set you as author, unlike the previous patches I turned into PRs. I know you said it doesn't matter, but still. :-)
Comment 17 Steven Schveighoffer 2014-09-26 15:27:43 UTC
(In reply to Ketmar Dark from comment #12)
> (In reply to safety0ff.bugz from comment #10)
> > Furthermore, I don't think the plus one indexing is necessary.
> it's a matter of taste, i think. i need a "cache turned off" flag, and zero
> is fine. sure, we could use "size_t.max", for example, but then we couldn't
> use this trick:
> 
>   if (i+1 < aa.impl.firstUsedBucket) aa.impl.firstUsedBucket = i+1;
> 
> and have to write
> 
>   if (aa.impl.firstUsedBucket != size_t.max && i < aa.impl.firstUsedBucket)
> aa.impl.firstUsedBucket = i;
> 
> don't like it. ;-)

Hm... I think you can just use 0. When caching is off, you need to start looking at bucket 0 anyway.
Comment 18 Ketmar Dark 2014-09-26 21:38:15 UTC
(In reply to Steven Schveighoffer from comment #17)
> Hm... I think you can just use 0. When caching is off, you need to start
> looking at bucket 0 anyway.
please read the _aaDelX() code too. the trick is harmless, neat looking, and yeah, it confuses some readers who haven't read the whole code (so they will not 'fix' the thing they don't fully understand ;-).

to keep the long story short: i like this trick and will not remove it. but that doesn't mean anyone else can't take my code and rewrite it any other way. ;-)
Comment 19 Ketmar Dark 2014-09-29 16:54:13 UTC
Created attachment 1441
simple 'first used bucket' caching for AAs

found a bug in remove(), updated patch.

2hsteoh: time to update PR. ;-)
Comment 20 bearophile_hugs 2014-09-29 21:14:58 UTC
(In reply to Ketmar Dark from comment #18)

> i like this trick and will not remove it.

I have not taken a look at the code, but in general leaving tricks in the code that some people find unintuitive introduces technical debt. It's not a practice for a wise programmer.
Comment 21 Ketmar Dark 2014-09-30 00:39:47 UTC
it's a widely known trick. at least among us old farts.
Comment 22 hsteoh 2014-09-30 21:14:02 UTC
PR updated.
Comment 23 Martin Nowak 2014-10-01 08:37:53 UTC
I don't really agree with this patch.
Hash tables are unordered and optimized for random indexing by a key.
If you run a tight loop and constantly query front (abusing byKey) then of course that's expensive, O(N^2/2). It's like taking a singly linked list and removing the last element until it's empty.
Instead we should offer a nice and generic remove function that takes a predicate and allows batch-removing entries.
Comment 24 Martin Nowak 2014-10-01 08:46:38 UTC
In other words, this optimizes the following code

    foreach (_; 0 .. 1_000_000)
        aa.byKey();

but pessimizes the following

    aa.remove(key);

It does NOT improve iterating byKey.
So unless there is a valid use case to call aa.byKey.front in a loop this would only improve a pointless microbenchmark.
Batch deletion is much better done with a dedicated remove function.

    aa.remove((k, v) => pred(k));

Any volunteers for implementing the latter?
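
For scale, a user-level sketch of that shape (removeIf is a hypothetical name, not an existing runtime hook; keys are collected first because mutating an AA while iterating it is not allowed):

    void removeIf(K, V)(ref V[K] aa, scope bool delegate(K, V) pred)
    {
        K[] doomed;
        foreach (k, v; aa)
            if (pred(k, v))
                doomed ~= k;         // collect now…
        foreach (k; doomed)
            aa.remove(k);            // …delete after iteration is done
    }

    // usage: drop all even keys
    // aa.removeIf((int k, long v) => k % 2 == 0);

A dedicated runtime hook could of course avoid the temporary key array.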
Comment 25 Ketmar Dark 2014-10-01 08:50:34 UTC
and we can make it a lot less expensive without losing speed. and all this with only 4/8 bytes of overhead per AA (not per AA entry, but for the whole AA).

the only questionable thing with this patch is that it assumes that the AA is mutable. but we can't have AA literals anyway, AFAIK.

and no, it doesn't really pessimize remove() if no one has used .byKey/.byValue on the AA, and even if those properties were used, the next rehashing will turn off caching again. so i daresay the remove() slowdown is not noticeable at all.
Comment 26 bearophile_hugs 2014-10-01 09:43:37 UTC
(In reply to Martin Nowak from comment #24)

> So unless there is a valid use case to call aa.byKey.front in a loop this
> would only improve a pointless microbenchmark.

See here, it's not just a pointless microbenchmark:
http://forum.dlang.org/thread/efmjlfpehqqfszcrxmwq@forum.dlang.org

As shown by others, there are ways to rewrite that code to avoid most of the very slow part, but popping the "first" item from an associative array is a sufficiently common operation (I have written and seen plenty of Python code that does it).
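
The pattern in question, as a tiny helper (popAny is a hypothetical name, just to pin down what popping an arbitrary item means here):

    import std.typecons : Tuple, tuple;

    Tuple!(K, V) popAny(K, V)(ref V[K] aa)
    {
        assert(aa.length > 0, "popAny on an empty AA");
        auto k = aa.byKey.front;   // the call this issue makes cheap(er)
        auto v = aa[k];
        aa.remove(k);
        return tuple(k, v);
    }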
Comment 27 hsteoh 2014-10-01 18:07:50 UTC
@bearophile: perhaps that's an indication that what you want is actually an *ordered* map, not a hash map?
Comment 28 Steven Schveighoffer 2014-10-01 19:38:20 UTC
New simpler pull which shouldn't affect performance of aa.remove

https://github.com/D-Programming-Language/druntime/pull/979
Comment 29 bearophile_hugs 2014-10-01 19:46:15 UTC
(In reply to hsteoh from comment #27)
> @bearophile: perhaps that's an indication that what you want is actually an
> *ordered* map, not a hash map?

Nope. If you take a look at the link, that code doesn't need an ordered container. The "pop" operation just needs to give one arbitrary item.
Comment 30 Steven Schveighoffer 2014-10-01 20:57:29 UTC
bearophile or ketmar, can you test the new PR? The timings I get on my system are not very different from the original.
Comment 31 Martin Nowak 2014-10-01 22:20:28 UTC
(In reply to bearophile_hugs from comment #29)
> Nope. If you take a look at the link, that code doesn't need an ordered
> container. The "pop" operation just needs to give one arbitrary item.

Take an array or a stack for that.
Comment 32 bearophile_hugs 2014-10-01 22:37:36 UTC
(In reply to Martin Nowak from comment #31)
> (In reply to bearophile_hugs from comment #29)
> > Nope. If you take a look at the link, that code doesn't need an ordered
> > container. The "pop" operation just needs to give one arbitrary item.
> 
> Take an array or a stack for that.

Nope, the associative array nature is needed for other reasons by the code. And keeping the same keys in two collections in parallel is a waste.
Comment 33 Ketmar Dark 2014-10-02 04:04:03 UTC
(In reply to Steven Schveighoffer from comment #30)
> bearophile or ketmar, can you test the new PR?
for what reason? see #2 and #3: i tried the variant without caching in remove(). i'm telling you again that remove() is NOT slowed down by any noticeable amount. and if you're doing a lot of removes which all happen to hit the first used bucket, and each such hit empties that bucket, and you used .byKey/.byValue before… this is a very unusual scenario, and if you *know* that it goes like this, you can force rehashing.

yet i'm not interested in defending the code: i've been using it in my local builds since i published the patch and it's ok for me. i don't really care if it makes it to mainline in one form or another.
Comment 34 Steven Schveighoffer 2014-10-02 10:53:47 UTC
(In reply to Ketmar Dark from comment #33)
> (In reply to Steven Schveighoffer from comment #30)
> > bearophile or ketmar, can you test the new PR?
> for what reason? see #2 and #3: i tried the variant without caching in
> remove().

This PR should have identical performance to your latest for the code in question, but not cause any slowdown for remove otherwise. But my box for some reason doesn't even exhibit the original problem with no patch; the speedup on my system is not measurable. So I need someone who does experience the problem to test whether it's actually fixing the issue properly.

> i'm telling you again that remove() is NOT slowed down by any noticeable
> amount. and if you're doing a lot of removes which all happen to hit the
> first used bucket, and each such hit empties that bucket, and you used
> .byKey/.byValue before… this is a very unusual scenario, and if you *know*
> that it goes like this, you can force rehashing.

As has been discussed, this isn't an acceptable tradeoff. Forcing a rehash for this seems unintuitive; users will not get it. Besides, there is no reason to recalculate the cache unless you need it.

> yet i'm not interested in defending the code: i've been using it in my local
> builds since i published the patch and it's ok for me. i don't really care
> if it makes it to mainline in one form or another.

Hm... people here have made a lot of effort to work with your patch given your objections to github. Normally I would suggest that unless you submit a PR, you will not get this code included. Your response confirms that that would have been the correct action, and I'll remember it from now on.
Comment 35 Ketmar Dark 2014-10-02 12:43:33 UTC
(In reply to Steven Schveighoffer from comment #34)
> This PR should have identical performance to your latest for the code in
> question, but not cause any slowdown for remove otherwise.
sorry, i misread your patch (yes, i looked at it before posting my answer). yet you're doing a slightly different thing and i don't want to argue. that's why i'm saying that i don't care: both variants are good, and my own has some good points for my use cases. that's why i'm not interested in defending my code — not that i don't care about changes or so.

> Hm... people here have made a lot of effort to work with your patch given
> your objections to github.
i can't remember ever forcing anybody to do anything with this patch. yet if somebody feels bad about it, he can ask for a refund: i'll gladly return his money.

> Normally I would suggest that unless you submit a
> PR, you will not get this code included.
i don't really care. i'm putting my patches here for those who are interested, and they are free to do anything they want with my code. i know that bugzilla patches have very little chance of being included in mainline. i just want to share my vision, in the form of code, for others to decide.
Comment 36 Steven Schveighoffer 2014-10-06 15:30:49 UTC
I can reproduce the issue. I was accidentally using bearophile's second D implementation, which does not call byKey/byValue every time through the loop but instead uses aa.keys. oops!
Comment 37 github-bugzilla 2014-10-10 21:03:10 UTC
Commit pushed to master at https://github.com/D-Programming-Language/druntime

https://github.com/D-Programming-Language/druntime/commit/7cb7307cb5ab8df5f285fd6c5d0d8b0918517d6d
Merge pull request #979 from schveiguy/issue13410

issue 13410 -- Performance problem with associative array byKey/byValue
Comment 38 bearophile_hugs 2014-10-10 21:43:30 UTC
The original D program ( http://forum.dlang.org/thread/efmjlfpehqqfszcrxmwq@forum.dlang.org ) was about 30 times slower than the equivalent C++ code. Now the D code is only 5 times slower than the C++ code. So the situation is much better now.

But the two programs are not exactly equivalent: the C++ code uses a tree-based map while the D code uses the built-in hash-based associative array.
Comment 39 Steven Schveighoffer 2014-10-10 21:59:52 UTC
Bearophile,

With the tree-based solution, when the front element is removed, the "next" element is 1 hop away.

But with a hash, it can be far away in the table. So you are likely better off using a tree-based map, something like RedBlackTree, if you want to come close to C++ performance.

There is also the issue that C++ uses a template-based map, so it can optimize the code, inline comparisons, etc. The AA-based solution cannot do this.
Comment 40 Ketmar Dark 2014-10-10 22:39:29 UTC
2Steven Schveighoffer: thanks for your work.
Comment 41 bearophile_hugs 2014-10-10 23:05:22 UTC
(In reply to Steven Schveighoffer from comment #39)

> With the tree-based solution, when the front element is removed, the "next"
> element is 1 hop away.
> 
> But with a hash, it can be far away in the table. So you are likely better
> off using a tree-based map, something like RedBlackTree, if you want to come
> close to C++ performance.
> 
> There is also the issue that C++ uses a template-based map, so it can
> optimize the code, inline comparisons, etc. The AA-based solution cannot do
> this.

I understand all this (the C++ program with a hash_map is 2.5 times faster than the current D situation).


> So you are likely better off
> using a tree-based map, something like RedBlackTree, if you want to come
> close to C++ performance.

I have not tried to use the Phobos RedBlackTree in this program; it's an interesting test.
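
A sketch of what that test could look like (assuming it's acceptable to order Tuple!(int, long) elements by Tuple's lexicographic opCmp, which compares the int key first):

    import core.stdc.stdio : printf;
    import std.container.rbtree : RedBlackTree;
    import std.typecons : Tuple, tuple;

    int main()
    {
        auto t = new RedBlackTree!(Tuple!(int, long));
        foreach (i; 0 .. 50_000)
            t.insert(tuple(i, cast(long) i));

        long total = 0;
        while (!t.empty)
        {
            auto e = t.front;   // smallest entry: one hop away, as with std::map
            t.removeFront();
            total += e[0] + e[1];
        }

        printf("%lld\n", total);
        return 0;
    }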