Issue 13595 - Extend std.algorithm.groupBy to support non-equivalence relations
Summary: Extend std.algorithm.groupBy to support non-equivalence relations
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos (show other issues)
Version: D2
Hardware: x86 All
: P1 enhancement
Assignee: No Owner
URL:
Keywords: pull
Depends on:
Blocks:
 
Reported: 2014-10-10 10:17 UTC by bearophile_hugs
Modified: 2021-03-18 14:27 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this issue.
Description bearophile_hugs 2014-10-10 10:17:07 UTC
Perhaps I have not fully understood the usage of std.algorithm.groupBy. This is Haskell:

Prelude> import Data.List
Prelude Data.List> groupBy (\x y -> (mod (x*y) 3) == 0) [1,2,3,4,5,6,7,8,9]
[[1],[2,3],[4],[5,6],[7],[8,9]]



D code:

void main() {
    import std.stdio, std.algorithm, std.array, std.range;
    [1,2,3,4,5,6,7,8,9]
    .groupBy!((x, y) => ((x * y) % 3) == 0)
    .take(20)
    .writeln;
}

Output of the D code:

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

Please close down this issue if this behavour is expected and correct.
Comment 1 hsteoh 2014-10-24 15:10:26 UTC
Definitely a bug, the returned subranges should not have empty elements.
Comment 2 hsteoh 2014-10-24 15:21:13 UTC
Actually, I take that back. This is not a bug (w.r.t. the current documentation) because the predicate is expected to be an equivalence relation, but (x,y)=>((x*y)%3)==0 is not a reflexive relation.

So the issue is more, should groupBy be extended to support non-equivalence relations? I have considered this before but decided against it because it would be more inefficient to implement. However, allowing non-equivalence relations between adjacent elements *does* allow for much more interesting groupings (e.g., you could implement word breaks with it), so arguably it should be supported somehow.

Marking this as as enhancement request.
Comment 3 hsteoh 2014-11-04 19:18:48 UTC
https://github.com/D-Programming-Language/phobos/pull/2654

Note that the Haskell example is actually wrong, because the Haskell groupBy function also assumes that the predicate is an equivalence relation. It just so happens that Haskell's groupBy only assumes transitivity, whereas the original implementation of std.algorithm.groupBy also assumes reflexivity, which causes an infinite loop when that assumption fails to hold.

In the above referenced PR, groupBy now defaults to assuming a non-equivalence relation, meaning that the predicate is evaluated for every pair of adjacent elements, so the OP's predicate (x,y) => (x*y)%3 == 0 will produce the grouping [[1], [2,3,4], [5,6,7], [8,9]], because both 2*3 and 3*4 are divisible by 3, as are 5*6 and 6*7. (Notice that Haskell's groupBy evaluates the predicate against the first element of the subrange, thus it checks 2*3 and 2*4 instead of 2*3 and 3*4 for the second subrange.)
Comment 4 hsteoh 2014-11-04 19:21:22 UTC
P.S. Due to the extra performance overhead in evaluating the predicate for every adjacent pair of elements (and also additional logic to ensure correctness), groupBy now takes a compile-time parameter that can be used to get the original performance when the predicate is an equivalence relation. This way we have the best of both worlds. :-)
Comment 5 bearophile_hugs 2014-11-04 23:18:34 UTC
(In reply to hsteoh from comment #4)
> P.S. Due to the extra performance overhead in evaluating the predicate for
> every adjacent pair of elements (and also additional logic to ensure
> correctness), groupBy now takes a compile-time parameter that can be used to
> get the original performance when the predicate is an equivalence relation.
> This way we have the best of both worlds. :-)

Thank you. But there's a risk of "Boolean Blindness" (http://existentialtype.wordpress.com/2011/03/15/boolean-blindness/ ), so perhaps it's better to use std.typecons.Flag instead of a bool.

Regarding groupBy, I hope someday we'll also have a hashGroupBy or something similar (Issue 9842 ).
Comment 6 github-bugzilla 2014-12-01 02:34:13 UTC
Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/5986e740873d6af793610381a942df78ab41f137
Merge pull request #2654 from quickfur/issue13595b

Issue 13595: Extend groupBy to handle non-equivalence relations.
Comment 7 bearophile_hugs 2014-12-01 10:50:50 UTC
As second try I've used groupBy to find the longest anagrams in a dictionary of words. "words.txt" is a text file that contains one different word each line.


void main() {
    import std.stdio, std.algorithm, std.string, std.file, std.array;

    auto anags =
         "words.txt"
         .readText
         .split
         .schwartzSort!((string w) => w.dup.representation.sort().release)
         .groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort()))
         .map!array
         .array
         .sort!((g1, g2) => g1.length > g2.length)
         .release
         .groupBy!((g1, g2) => g1.length == g2.length)
         .front;
    writefln("%(%s\n%)", anags);
}



If I replace "(string w)" with "w" in the first schwartzSort I receive the errors:

...\dmd2\src\phobos\std\algorithm.d(4659,5): Error: field _prev must be initialized in constructor, because it is nested struct
...\dmd2\src\phobos\std\algorithm.d(4770,51): Error: template instance anagrams1c.main.groupBy!(__lambda2, cast(Flag)false, SortedRange!(string[], (a, b) => binaryFun!less(unaryFun!transform(a), unaryFun!transform(b)))) error instantiating
anagrams1c.d(9,10):        instantiated from here: groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort()), SortedRange!(string[], (a, b) => binaryFun!less(unaryFun!transform(a), unaryFun!transform(b))))


If I remove the last ".release" (at line 13) I receive the similar errors:

...\dmd2\src\phobos\std\algorithm.d(4659,5): Error: field _prev must be initialized in constructor, because it is nested struct
...\dmd2\src\phobos\std\algorithm.d(4770,51): Error: template instance anagrams1c.main.groupBy!(__lambda4, cast(Flag)false, SortedRange!(string[][], __lambda3)) error instantiating
anagrams1c.d(14,10):        instantiated from here: groupBy!((g1, g2) => g1.length == g2.length, SortedRange!(string[][], __lambda3))


Is this a problem of groupBy?
Comment 8 hsteoh 2014-12-01 15:48:44 UTC
It looks like a compiler bug, in both cases the type returned by schwartzSort is exactly the same, but somehow the compiler treats it differently.
Comment 9 bearophile_hugs 2014-12-01 16:03:40 UTC
(In reply to hsteoh from comment #8)
> It looks like a compiler bug, in both cases the type returned by
> schwartzSort is exactly the same, but somehow the compiler treats it
> differently.

Yes, it looks like a compiler bug worth a new different bug report. But so far it looks not easy to pinpoint. later I'll open a new bug report and I'll try to reduce the problem a little.
Comment 10 hsteoh 2014-12-01 16:09:42 UTC
Here's a somewhat reduced case:
-----
void main() {
    import std.algorithm, std.string;

    auto anags =
         ["abc", "def"]
         .map!((a) => a)
         .groupBy!((w1, w2) => w1.dup.representation.sort().equal(w2.dup.representation.sort()))
         ;
}
-----

Changing .map!((a) => a) to .map!((string a) => a) fixes the problem.
Comment 11 hsteoh 2014-12-01 16:10:30 UTC
P.S. the groupBy predicate can also be simplified to just: .groupBy!((w1, w2) => w1 == w2)
Comment 12 github-bugzilla 2015-02-18 03:40:11 UTC
Commit pushed to 2.067 at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/5986e740873d6af793610381a942df78ab41f137
Merge pull request #2654 from quickfur/issue13595b
Comment 13 Dlang Bot 2021-03-18 14:27:42 UTC
dlang/phobos pull request #7794 " Fix Issue 13595 - added splitWhen" was merged into master:

- 8122718d96167b84e2e073191e9c63766c7d08c0 by Ate Eskola:
  Fix Issue 13595 - Non-amended commit just to trigger the bot as rebasing didn't work

https://github.com/dlang/phobos/pull/7794