NEON SIMD, is it worth it?

I have tried to benchmark NEON SIMD on my Samsung Galaxy S3 Neo (ARMv7-A Cortex). While NEON is reported as present on my phone, every vectorised instruction takes the same time as doing the same operation on the floats themselves. For example, a 4x32-bit mul took 4 times as long as a regular 32-bit mul. That took me by surprise; after googling it, I read that the data path from the CPU to the ALU can't handle the 256 bits needed and only passes 64 bits per cycle, making the SIMD ineffective.

So, after my disappointment: should I care about NEON, or should I just focus on multithreading and leave SIMD alone?
I really like the idea of SIMD (SSE is really cool), but if NEON doesn't make the cut, why should I go the extra mile?
Something doesn't add up there :) A 4x32 mul is _128 bits_, not 256 bits. So if the NEON core on your CPU does 64 bits a cycle, it should be able to do 4 multiplies in 2 cycles, not 4. What am I missing?

- Casey
A mul reads two 128-bit registers, so the CPU has to pass 256 bits to the ALU. But the connection can only pass 64 bits per cycle.
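So the way I figure it:

2 source registers x 128 bits = 256 bits per multiply
256 bits / 64 bits per cycle  = 4 cycles just to move the operands

which would line up with the 4x timing I measured.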
Ah, I get what you're saying. So we're not talking about the operation, we're talking about the path from the register file to the ALU?

I guess what I would say is that the A7 might be bad, I don't know... looking at the cycle timings for the A8, NEON is definitely not supposed to work that way AFAICT. The default ARM instruction tables seem to suggest that a lot of SIMD ops are either 1 or 2 cycles, which is what I would expect (http://infocenter.arm.com/help/in....arm.doc.ddi0344k/ch16s06s02.html).

I don't know where the equivalent of those tables is for the A7 - a cursory glance in the same manual under the A7 section didn't seem to turn them up, but they may be in there somewhere...

- Casey
If you look here http://www.anandtech.com/show/697...formance-of-modern-arm-processors you can see that a lot of those devices don't give much throughput.
NEON looks to be solidly twice as fast in those benchmarks, though, and sometimes four times as fast. Look at the GFLOPs table.

- Casey
How are you compiling/benchmarking? It could be that the compiler is smart enough to turn your non-NEON path into NEON instructions (autovectorization). Have you looked at the disassembler output to verify which instructions the compiler generated?

I did a small benchmark with the following two functions:
#include <arm_neon.h>  // NEON intrinsics: float32x4_t, vmulq_f32

void fun(int COUNT, float* A, float* B, float* C)
{
    for (int i=0; i<COUNT; i++)
    {
        C[i] = A[i] * B[i];
    }
}

void fun_neon(int COUNT, float* A, float* B, float* C)
{
    float32x4_t* A4 = (float32x4_t*)A;
    float32x4_t* B4 = (float32x4_t*)B;
    float32x4_t* C4 = (float32x4_t*)C;

    for (int i=0; i<COUNT/4; i++)
    {
        C4[i] = vmulq_f32(A4[i], B4[i]);
    }
}


Here's full code, including Android NDK makefiles: https://gist.github.com/mmozeiko/2b4451924eaf14e47b83
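One caveat: casting float* to float32x4_t* assumes the arrays are 16-byte aligned - the compiler emits :128 aligned loads for it (visible in the disassembly below). Here's a sketch of the same loop with explicit load/store intrinsics that don't require that alignment (fun_neon_unaligned is just a name for illustration):

void fun_neon_unaligned(int COUNT, float* A, float* B, float* C)
{
    // assumes COUNT is a multiple of 4, same as fun_neon above
    for (int i = 0; i < COUNT; i += 4)
    {
        // vld1q_f32/vst1q_f32 move 4 floats with no alignment requirement
        float32x4_t a = vld1q_f32(A + i);
        float32x4_t b = vld1q_f32(B + i);
        vst1q_f32(C + i, vmulq_f32(a, b));
    }
}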

On my Nexus 5 (Qualcomm MSM8974 Snapdragon 800, similar to Cortex-A15), the benchmark over 8 million floats gives:
Scalar: 51.26 msec, 112.56 Mcycles
NEON: 21.66 msec, 46.75 Mcycles

[strike]You don't get full 4x speedup, but NEON is ~2.3x faster.[/strike] Bad results, see next post.

Here's the inner loop of both functions, compiled with clang 3.6:
fun
     9d2:	ed92 0a00 	vldr	s0, [r2]
     9d6:	3204      	adds	r2, #4
     9d8:	ed91 1a00 	vldr	s2, [r1]
     9dc:	3104      	adds	r1, #4
     9de:	3801      	subs	r0, #1
     9e0:	ee21 0a00 	vmul.f32	s0, s2, s0
     9e4:	eca3 0a01 	vstmia	r3!, {s0}
     9e8:	d1f3      	bne.n	9d2 <_Z3funiPfS_S_+0x6>

fun_neon
     a00:	f962 0aef 	vld1.64	{d16-d17}, [r2 :128]
     a04:	3001      	adds	r0, #1
     a06:	3210      	adds	r2, #16
     a08:	4560      	cmp	r0, ip
     a0a:	f961 2aef 	vld1.64	{d18-d19}, [r1 :128]
     a0e:	f101 0110 	add.w	r1, r1, #16
     a12:	ff42 0df0 	vmul.f32	q8, q9, q8
     a16:	f943 0aef 	vst1.64	{d16-d17}, [r3 :128]
     a1a:	f103 0310 	add.w	r3, r3, #16
     a1e:	dbef      	blt.n	a00 <_Z8fun_neoniPfS_S_+0x14>

Even with a branch in the inner loop, NEON is faster! Changing the C code so the NEON inner loop doesn't have a branch doesn't give faster code.

Then I compiled the same code on a Raspberry Pi 2 (Cortex-A7) using clang 3.6 (under Arch Linux). Running the executable gives me the following output:
Scalar: 226.56 msec, 203.84 Mcycles
NEON: 192.93 msec, 173.59 Mcycles

Not such a big difference anymore; [strike]NEON is ~1.17x faster.[/strike] I guess older Cortexes are just not so good at NEON - as far as I know, the Cortex-A7 executes 128-bit NEON ops as two 64-bit halves, which would match the 64-bits-per-cycle limit mentioned at the start of the thread. The Samsung Galaxy S3 Neo has a Cortex-A7 class CPU.

Here are the commands I used to compile on RPi2 (-mfpu=vfpv3-d16 restricts the scalar code to the VFP unit, so the compiler can't autovectorize it with NEON):
clang++ -O2 -mfpu=vfpv3-d16 -c fun.cpp main.cpp
clang++ -O2 -c fun_neon.cpp
clang++ *.o -o a.exe

I just downloaded an app from the Google Play store called VFP bench and it showed me benchmarks for a lot of instructions. Thanks for the build files; it took me about two weekends to get an Android build going, and I still don't know how to make an optimized build or how to see the disassembly.
By default ndk-build compiles an optimized build. If you want a debug build you need to add APP_OPTIM=debug to the ndk-build command line or "APP_OPTIM := debug" in the Application.mk file.
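For example, a minimal Application.mk could look like this (the armeabi-v7a line is just what the obj\local\armeabi-v7a path below suggests):

# Application.mk
APP_OPTIM := debug      # debug build instead of the default optimized one
APP_ABI := armeabi-v7a  # 32-bit ARMv7-A target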

To disassemble you can do the following:
path\to\android\ndk\toolchains\arm-linux-androideabi-4.9\prebuilt\windows-x86_64\bin\arm-linux-androideabi-objdump.exe -d obj\local\armeabi-v7a\executableName >diasm.txt

This is on Windows; change windows/exe to match whatever platform you are on. This can disassemble executables, libraries, or object files - change executableName to libSomething.so or objs\name\file.o.

Oops, it's actually worse. Much worse.

I guess the NEON function was reading data that earlier code had already pulled into the cache; that's why it looked faster. So I added one dummy pass over the arrays before measuring time in main.cpp:
    {
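        // dummy pass to warm up the caches before timing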
        fun(COUNT, A, B, C);

        t1 = get_ticks();
        c1 = get_cycles();

        fun(COUNT, A, B, C);
        
        c2 = get_cycles();
        t2 = get_ticks();

        printf("Scalar: %.2f msec, %.2f Mcycles\n", (float)(t2 - t1) / 1e6f, (float)(c2 - c1) / 1e6f);
    }

    {
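        // same dummy warm-up pass for the NEON version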
        fun_neon(COUNT, A, B, C);

        t1 = get_ticks();
        c1 = get_cycles();

        fun_neon(COUNT, A, B, C);
        
        c2 = get_cycles();
        t2 = get_ticks();

        printf("NEON: %.2f msec, %.2f Mcycles\n", (float)(t2 - t1) / 1e6f, (float)(c2 - c1) / 1e6f);
    }


On Nexus 5:
Scalar: 25.81 msec, 58.45 Mcycles
NEON: 21.70 msec, 49.04 Mcycles

NEON is ~1.19x faster.

On Raspberry Pi 2:
Scalar: 173.12 msec, 155.78 Mcycles
NEON: 186.48 msec, 167.81 Mcycles

NEON is SLOWER. ~1.08x slower!

Maybe simple multiplication is too trivial - with just one multiply per element the loop is probably limited by memory bandwidth rather than ALU throughput - and a more complex calculation will give better results for the NEON code.

Yes, adding more calculations to the loop shows better results:

void fun(int COUNT, float* A, float* B, float* C)
{
    for (int i=0; i<COUNT; i++)
    {
        float x = A[i] * B[i];
        C[i] = x * (x + x * x);
    }
}

void fun_neon(int COUNT, float* A, float* B, float* C)
{
    float32x4_t* A4 = (float32x4_t*)A;
    float32x4_t* B4 = (float32x4_t*)B;
    float32x4_t* C4 = (float32x4_t*)C;

    for (int i=0; i<COUNT/4; i++)
    {
        float32x4_t x = vmulq_f32(A4[i], B4[i]);
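        // vmlaq_f32(a, b, c) computes a + b*c, so this stores x * (x + x*x)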
        C4[i] = vmulq_f32(x, vmlaq_f32(x, x, x));
    }
}


On Nexus 5:
Scalar: 56.79 msec, 128.57 Mcycles
NEON: 21.14 msec, 47.75 Mcycles

~2.7x faster.

On Raspberry Pi 2:
Scalar: 326.87 msec, 293.89 Mcycles
NEON: 213.66 msec, 192.13 Mcycles

~1.5x faster.

All is good :)