D issues are now tracked on GitHub. This Bugzilla instance remains as a read-only archive.
Issue 10636 - Vector calling convention for D?
Summary: Vector calling convention for D?
Status: NEW
Alias: None
Product: D
Classification: Unclassified
Component: dmd
Version: D2
Hardware: All
OS: Windows
Importance: P4 enhancement
Assignee: No Owner
URL:
Keywords: performance, SIMD
Depends on:
Blocks:
 
Reported: 2013-07-13 13:41 UTC by bearophile_hugs
Modified: 2024-12-13 18:09 UTC
CC: 1 user

See Also:


Description bearophile_hugs 2013-07-13 13:41:13 UTC
The VS2013 designers have added a new calling convention that allows SIMD values to be passed to functions in registers, avoiding the stack in most cases:
http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx


An example D program:


import core.stdc.stdio, core.simd;

struct Particle { float4 x, y; }

Particle addParticles(in Particle p1, in Particle p2)
pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

void main() {
    auto p1 = Particle([1, 2, 3, 4],
                       [10, 20, 30, 40]);
    printf("%f %f %f %f %f %f %f %f\n",
           p1.x.array[0], p1.x.array[1],
           p1.x.array[2], p1.x.array[3],
           p1.y.array[0], p1.y.array[1],
           p1.y.array[2], p1.y.array[3]);

    auto p2 = Particle([100, 200, 300, 400],
                       [1000, 2000, 3000, 4000]);
    printf("%f %f %f %f %f %f %f %f\n",
           p2.x.array[0], p2.x.array[1],
           p2.x.array[2], p2.x.array[3],
           p2.y.array[0], p2.y.array[1],
           p2.y.array[2], p2.y.array[3]);

    auto p3 = addParticles(p1, p2);
    printf("%f %f %f %f %f %f %f %f\n",
           p3.x.array[0], p3.x.array[1],
           p3.x.array[2], p3.x.array[3],
           p3.y.array[0], p3.y.array[1],
           p3.y.array[2], p3.y.array[3]);
}


Compiling that code with ldc2 v0.11.0 on 32-bit Windows with:

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d


produces the following x86 asm:

__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	subl	$16, %esp
	movaps	40(%ebp), %xmm0
	movaps	56(%ebp), %xmm1
	addps	8(%ebp), %xmm0
	addps	24(%ebp), %xmm1
	movups	%xmm1, 16(%eax)
	movups	%xmm0, (%eax)
	movl	%ebp, %esp
	popl	%ebp
	ret	$64

__Dmain:
...
	movaps	160(%esp), %xmm0
	movaps	176(%esp), %xmm1
	movaps	%xmm1, 48(%esp)
	movaps	%xmm0, 32(%esp)
	movaps	128(%esp), %xmm0
	movaps	144(%esp), %xmm1
	movaps	%xmm1, 16(%esp)
	movaps	%xmm0, (%esp)
	leal	96(%esp), %eax
	calll	__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle
	subl	$64, %esp
	movss	96(%esp), %xmm0
	movss	100(%esp), %xmm1
	movss	104(%esp), %xmm2
	movss	108(%esp), %xmm3
	movss	112(%esp), %xmm4
	movss	116(%esp), %xmm5
	movss	120(%esp), %xmm6
	movss	124(%esp), %xmm7
	cvtss2sd	%xmm7, %xmm7
	movsd	%xmm7, 60(%esp)
	cvtss2sd	%xmm6, %xmm6
	movsd	%xmm6, 52(%esp)
	cvtss2sd	%xmm5, %xmm5
	movsd	%xmm5, 44(%esp)
	cvtss2sd	%xmm4, %xmm4
	movsd	%xmm4, 36(%esp)
	cvtss2sd	%xmm3, %xmm3
	movsd	%xmm3, 28(%esp)
	cvtss2sd	%xmm2, %xmm2
	movsd	%xmm2, 20(%esp)
	cvtss2sd	%xmm1, %xmm1
	movsd	%xmm1, 12(%esp)
	cvtss2sd	%xmm0, %xmm0
	movsd	%xmm0, 4(%esp)
	movl	$_.str3, (%esp)
	calll	___mingw_printf
	xorl	%eax, %eax
	movl	%ebp, %esp
	popl	%ebp
	ret


As shown in that article, with a vector calling convention only four movaps are needed to set up the arguments of addParticles (instead of eight plus the leal). With the vector calling convention the body of addParticles also gets shorter, because all the needed operands are already in XMM registers. Probably the code of addParticles reduces to just two addps, a ret, and maybe two movaps to put the result in the right output registers.
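Under such a convention the two Particle arguments would arrive in four XMM registers, so the body could plausibly shrink to something like this (a hypothetical sketch, not compiler output; the register assignments are assumed):

```
; assumed: p1.x in %xmm0, p1.y in %xmm1, p2.x in %xmm2, p2.y in %xmm3
addps	%xmm2, %xmm0	; result.x = p1.x + p2.x
addps	%xmm3, %xmm1	; result.y = p1.y + p2.y
ret			; result returned in %xmm0/%xmm1
```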

D is meant to be useful for people who write fast video games or other numerical code, and both use plenty of SIMD. So I think adding such an optimization could be useful, but I can't estimate how much of an advantage it will give; benchmarks are needed. They write:

>Today on AMD64 target, passed by value vector arguments (such as __m128/__m256/) must be turned into a passed by address of a temporary buffer (i.e. $T1, $T2, $T3 in the figure above) allocated in caller's local stack as shown in the figure above. We have been receiving increasing concerns about this inefficiency in past years, especially from game, graphic, video/audio, and codec domains. A concrete example is MS XNA library in which passing vector arguments is a common pattern in many APIs of XNAMath library. The inefficiency will be intensified on upcoming AVX2/AVX3 and future processors with wider vector registers.<

On the other hand, small functions get inlined, and introducing a new calling convention has disadvantages, as commented by Iain Buclaw:

> I'd vote for not adding more fluff which makes ABI differences 
> between compilers greater.  But it certainly looks like it would 
> be useful if you wish to save the time taken to copy the vector 
> from XMM registers onto the stack and back again when passing 
> values around.

Maybe such a vector calling convention will become more standard in the future, as it seems a useful improvement.
Comment 1 dlangBugzillaToGithub 2024-12-13 18:09:27 UTC
THIS ISSUE HAS BEEN MOVED TO GITHUB

https://github.com/dlang/dmd/issues/18631

DO NOT COMMENT HERE ANYMORE, NOBODY WILL SEE IT, THIS ISSUE HAS BEEN MOVED TO GITHUB