Issue 20270 - [REG2.087] Deadlock in garbage collection when running processes in parallel
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: druntime
Version: D2
Hardware: x86_64 Linux
Importance: P1 regression
Assignee: No Owner
Keywords: pull
Reported: 2019-10-06 09:33 UTC by Vladimir Panteleev
Modified: 2019-11-28 19:22 UTC
Description Vladimir Panteleev 2019-10-06 09:33:49 UTC
///////////// test.d /////////////
import std.parallelism;
import std.process;
import std.range;

void main()
{
    foreach (i; 200.iota.parallel)
        execute(["true"]);
}
//////////////////////////////////

On my machine, this program deadlocks and never finishes in roughly 60% of runs.

Inspecting the program's state with a debugger shows that the threads are generally in one of these states:

Thread 11 (Thread 0x7f2a80ff9700 (LWP 424924)):
#0  0x00007f2a89b82b12 in sigsuspend () from /usr/lib/libc.so.6
#1  0x0000563bd079bb08 in core.thread.thread_suspendHandler(int).op(void*) ()
#2  0x0000563bd079bb68 in core.thread.callWithStackShell(scope void(void*) nothrow delegate) ()
#3  0x0000563bd079ba95 in thread_suspendHandler ()
#4  <signal handler called>
#5  0x00007f2a89e8da6a in read () from /usr/lib/libpthread.so.0
...


Thread 10 (Thread 0x7f2a817fa700 (LWP 424923)):
#0  0x00007f2a89b82b12 in sigsuspend () from /usr/lib/libc.so.6
#1  0x0000563bd079bb08 in core.thread.thread_suspendHandler(int).op(void*) ()
#2  0x0000563bd079bb68 in core.thread.callWithStackShell(scope void(void*) nothrow delegate) ()
#3  0x0000563bd079ba95 in thread_suspendHandler ()
#4  <signal handler called>
#5  0x00007f2a89c11414 in fork () from /usr/lib/libc.so.6
...

Thread 9 (Thread 0x7f2a81ffb700 (LWP 424922)):
#0  0x00007f2a89b82b12 in sigsuspend () from /usr/lib/libc.so.6
#1  0x0000563bd079bb08 in core.thread.thread_suspendHandler(int).op(void*) ()
#2  0x0000563bd079bb68 in core.thread.callWithStackShell(scope void(void*) nothrow delegate) ()
#3  0x0000563bd079ba95 in thread_suspendHandler ()
#4  <signal handler called>
#5  0x00007f2a89c515c9 in __lll_lock_wait_private () from /usr/lib/libc.so.6
#6  0x00007f2a89c51a88 in __run_fork_handlers () from /usr/lib/libc.so.6
#7  0x00007f2a89c113e9 in fork () from /usr/lib/libc.so.6
...

Thread 8 (Thread 0x7f2a827fc700 (LWP 424921)):
#0  0x00007f2a89b82b12 in sigsuspend () from /usr/lib/libc.so.6
#1  0x0000563bd079bb08 in core.thread.thread_suspendHandler(int).op(void*) ()
#2  0x0000563bd079bb68 in core.thread.callWithStackShell(scope void(void*) nothrow delegate) ()
#3  0x0000563bd079ba95 in thread_suspendHandler ()
#4  <signal handler called>
#5  0x00007f2a89e8e145 in nanosleep () from /usr/lib/libpthread.so.0
#6  0x0000563bd077370e in _D4core6thread6Thread5sleepFNbNiSQBf4time8DurationZv ()
#7  0x0000563bd07b3e2e in core.internal.spinlock.SpinLock.yield(ulong) shared ()
#8  0x0000563bd07b3dc4 in core.internal.spinlock.SpinLock.lock() shared ()
#9  0x0000563bd07c9307 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#10 0x0000563bd07c1456 in _D2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#11 0x0000563bd0787fe7 in gc_qalloc ()
...

Thread 7 (Thread 0x7f2a82ffd700 (LWP 424920)):
#0  0x00007f2a89c515cb in __lll_lock_wait_private () from /usr/lib/libc.so.6
#1  0x00007f2a89bd06b3 in calloc () from /usr/lib/libc.so.6
#2  0x0000563bd07c61ad in _D2gc4impl12conservativeQw3Gcx16startScanThreadsMFNbZv ()
#3  0x0000563bd07c5f44 in _D2gc4impl12conservativeQw3Gcx12markParallelMFNbbZv ()
#4  0x0000563bd07c5862 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#5  0x0000563bd07c4050 in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#6  0x0000563bd07c935a in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#7  0x0000563bd07c1456 in _D2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#8  0x0000563bd0787fe7 in gc_qalloc ()
...
Comment 1 igor.khasilev 2019-10-06 09:49:58 UTC
This may (or may not) be related to https://issues.dlang.org/show_bug.cgi?id=20256, if the scan threads do not block SIGUSR1 and SIGUSR2.
Comment 2 Vladimir Panteleev 2019-10-06 09:56:49 UTC
(In reply to igor.khasilev from comment #1)
> May (or may not) be related https://issues.dlang.org/show_bug.cgi?id=20256
> if scanthread do not block SIGUSR1 and SIGUSR2

Unfortunately `digger run stable+druntime#2813 -- dmd -run test` still hangs.
Comment 3 Rainer Schuetze 2019-10-06 10:17:11 UTC
I cannot reproduce locally in a VM. Does the problem go away with --DRT-gcopt=parallel:0 ?
Comment 4 Vladimir Panteleev 2019-10-06 10:18:50 UTC
(In reply to Rainer Schuetze from comment #3)
> Does the problem go away with --DRT-gcopt=parallel:0 ?

Yes.
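
For anyone who needs the workaround without changing how the program is launched, here is a minimal sketch of the same test with the option embedded in the binary. It assumes druntime's documented `rt_options` mechanism, which should have the same effect as passing `--DRT-gcopt=parallel:0` on the command line. The `runAll` helper and the success counter are hypothetical additions for illustration and are not part of the original test.

```d
import core.atomic : atomicLoad, atomicOp;
import std.parallelism : parallel;
import std.process : execute;
import std.range : iota;

// Embedded runtime options (assumption: equivalent to passing
// --DRT-gcopt=parallel:0 on the command line). This turns off
// parallel marking in the GC.
extern(C) __gshared string[] rt_options = ["gcopt=parallel:0"];

/// Hypothetical helper: runs `true` n times from the default task pool
/// and returns how many of the child processes exited with status 0.
int runAll(int n)
{
    shared int succeeded;
    foreach (i; n.iota.parallel)
    {
        if (execute(["true"]).status == 0)
            atomicOp!"+="(succeeded, 1);
    }
    return atomicLoad(succeeded);
}

void main()
{
    runAll(200);
}
```

This only sidesteps the problem by disabling parallel marking; it does not address the underlying cause.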
Comment 5 Vladimir Panteleev 2019-10-06 10:38:19 UTC
(In reply to Rainer Schuetze from comment #3)
> I cannot reproduce locally in a VM.

Experimenting with taskset suggests that the threads need at least 5 physical cores to run on for this bug to reproduce. (It does not reproduce with `taskset f`, but it does reproduce with `taskset 1f`.)
Comment 6 Dlang Bot 2019-10-06 10:45:39 UTC
@rainers created dlang/druntime pull request #2816 "fix Issue 20270 - [REG2.087] Deadlock in garbage collection when runn…" fixing this issue:

- fix Issue 20270 - [REG2.087] Deadlock in garbage collection when running processes in parallel
  
  start scan threads while the world isn't suspended

https://github.com/dlang/druntime/pull/2816
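
The ordering named in the commit message above ("start scan threads while the world isn't suspended") can be illustrated with a small simulation. This is not druntime's code. It assumes one reading of the stack traces in the description: the collecting thread calls `calloc` after the other threads have been stopped, and one of those stopped threads holds the C allocator's lock. All names below (`collectorCanAllocate`, `allocLock`, the semaphores) are hypothetical stand-ins.

```d
import core.sync.mutex : Mutex;
import core.sync.semaphore : Semaphore;
import core.thread : Thread;

/// Simulation only. Returns true if the "collector" manages to take the
/// allocator lock. The worker is parked on a semaphore while holding that
/// lock, standing in for a thread stopped by the GC's suspend signal.
bool collectorCanAllocate(bool allocateBeforeSuspend)
{
    auto allocLock = new Mutex;       // stand-in for the C allocator's internal lock
    auto holding = new Semaphore(0);  // worker -> collector: "I hold the lock"
    auto resume = new Semaphore(0);   // collector -> worker: "world resumed"

    bool ok = false;
    if (allocateBeforeSuspend)
    {
        // Ordering with the fix: allocate while every thread can still run.
        allocLock.lock();
        allocLock.unlock();
        ok = true;
    }

    auto worker = new Thread({
        allocLock.lock();
        holding.notify();
        resume.wait();        // "suspended" while holding the lock
        allocLock.unlock();
    });
    worker.start();
    holding.wait();           // from here on the world counts as suspended

    if (!allocateBeforeSuspend)
    {
        // Ordering before the fix: allocate after suspending. A blocking
        // lock() here would never return, so the sketch uses tryLock().
        ok = allocLock.tryLock();
        if (ok)
            allocLock.unlock();
    }

    resume.notify();          // resume the world so the sketch can exit
    worker.join();
    return ok;
}

void main()
{
    import std.stdio : writeln;
    writeln("allocate after suspend succeeds:  ", collectorCanAllocate(false));
    writeln("allocate before suspend succeeds: ", collectorCanAllocate(true));
}
```

In the sketch, allocating after the worker is parked fails because the lock holder cannot run to release it, while allocating beforehand never contends with a parked thread.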
Comment 7 Rainer Schuetze 2019-10-06 10:47:41 UTC
I have reproduced the issue after running the test a larger number of times. Not sure why it doesn't appear more often. Please try https://github.com/dlang/druntime/pull/2816
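
Since the hang is intermittent, a small harness that runs the test repeatedly with a time limit makes it easier to estimate how often it occurs. The following is a minimal sketch for POSIX systems, not the procedure used in this thread. The `./test` path, the 30 second limit and the helper names `finishesWithin` and `countHangs` are hypothetical.

```d
import core.sys.posix.signal : SIGKILL;
import core.thread : Thread;
import core.time : Duration, MonoTime, msecs, seconds;
import std.process : kill, spawnProcess, tryWait, wait;

/// Runs `cmd` once. Returns true if it exited within `limit`,
/// false if it was still running and had to be killed.
bool finishesWithin(string[] cmd, Duration limit)
{
    auto pid = spawnProcess(cmd);
    immutable deadline = MonoTime.currTime + limit;
    while (MonoTime.currTime < deadline)
    {
        if (tryWait(pid).terminated)
            return true;
        Thread.sleep(10.msecs);   // poll; keeps the sketch portable across POSIX
    }
    kill(pid, SIGKILL);           // a deadlocked process will not exit on its own
    wait(pid);                    // reap the killed child
    return false;
}

/// Counts how many of `runs` invocations of `cmd` exceeded `limit`.
size_t countHangs(string[] cmd, size_t runs, Duration limit)
{
    size_t hangs;
    foreach (i; 0 .. runs)
        if (!finishesWithin(cmd, limit))
            ++hangs;
    return hangs;
}

void main()
{
    import std.stdio : writefln;
    // "./test" is a hypothetical path to the compiled test.d from the description.
    immutable runs = 100;
    writefln("%s of %s runs hung", countHangs(["./test"], runs, 30.seconds), runs);
}
```

The time limit needs to be comfortably longer than a normal run of the test, otherwise slow runs are miscounted as hangs.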
Comment 8 Rainer Schuetze 2019-11-28 19:22:58 UTC
Not sure why this wasn't closed by the bot when https://github.com/dlang/druntime/pull/2816 got merged.