How are you compiling and benchmarking? It could be that the compiler is smart enough to turn your non-NEON path into NEON instructions (autovectorization). Have you looked at the disassembler output to verify which instructions the compiler generated?
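If autovectorization turns out to be the culprit, one way to keep the scalar path scalar is to switch the vectorizer off for that one loop. Here is a minimal sketch, assuming a clang toolchain; the function name `fun_scalar_only` is made up for illustration, and other compilers simply see the plain loop:

```cpp
#include <cassert>

// Scalar multiply with loop vectorization explicitly disabled, so the
// "non-NEON" side of a benchmark cannot be silently turned into NEON code.
// The pragma is clang-specific, hence the guard.
void fun_scalar_only(int count, const float* a, const float* b, float* c)
{
    assert(count >= 0);
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    for (int i = 0; i < count; i++)
    {
        c[i] = a[i] * b[i];
    }
}
```

It is still worth checking the disassembly afterwards, since the pragma only covers the loop it is attached to.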
I did a small benchmark of the following two functions:
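| #include <arm_neon.h> // float32x4_t, vmulq_f32

// Scalar version: one multiply per iteration.
void fun(int COUNT, float* A, float* B, float* C)
{
    for (int i=0; i<COUNT; i++)
    {
        C[i] = A[i] * B[i];
    }
}

// NEON version: four multiplies per iteration.
// Assumes A, B and C are 16-byte aligned and COUNT is a multiple of 4;
// any remaining COUNT % 4 elements are not processed.
void fun_neon(int COUNT, float* A, float* B, float* C)
{
    float32x4_t* A4 = (float32x4_t*)A;
    float32x4_t* B4 = (float32x4_t*)B;
    float32x4_t* C4 = (float32x4_t*)C;
    for (int i=0; i<COUNT/4; i++)
    {
        C4[i] = vmulq_f32(A4[i], B4[i]);
    }
}
|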
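The pointer casts in fun_neon only work when the arrays are 16-byte aligned and COUNT is a multiple of 4. As a rough sketch of how those restrictions could be lifted (not what I benchmarked; the name `fun_neon_tail` is hypothetical), explicit load/store intrinsics plus a scalar tail would look something like this. The NEON part is guarded so the same file still builds on targets without NEON:

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Variant of fun_neon using explicit loads/stores instead of pointer casts,
// with a scalar loop for the last count % 4 elements.
void fun_neon_tail(int count, const float* a, const float* b, float* c)
{
    assert(count >= 0);
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Process four floats per iteration while at least four remain.
    for (; i + 4 <= count; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vmulq_f32(va, vb));
    }
#endif
    // Scalar tail (or the whole array when NEON is not available).
    for (; i < count; i++)
    {
        c[i] = a[i] * b[i];
    }
}
```

The compiler may emit loads without the alignment hint for this version, so its timings are not necessarily the same as for fun_neon.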
Here's the full code, including the Android NDK makefiles:
https://gist.github.com/mmozeiko/2b4451924eaf14e47b83
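The gist has the actual benchmark. For readers who only want the general shape, a timing helper could look roughly like the sketch below. It is an assumption about how one might measure this, not a copy of the gist: `time_msec` and `MulFn` are made-up names, and it only measures wall-clock time, not cycles.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Scalar function from the post, repeated so the sketch is self-contained.
void fun(int COUNT, float* A, float* B, float* C)
{
    for (int i = 0; i < COUNT; i++)
    {
        C[i] = A[i] * B[i];
    }
}

// Signature shared by fun and fun_neon.
typedef void (*MulFn)(int, float*, float*, float*);

// Returns the best wall-clock time in milliseconds over `runs` calls
// of fn on `count` floats.
double time_msec(MulFn fn, int count, int runs)
{
    assert(count > 0 && runs > 0);
    std::vector<float> a(count, 1.5f), b(count, 2.0f), c(count, 0.0f);
    double best = -1.0;
    for (int r = 0; r < runs; r++)
    {
        auto t0 = std::chrono::steady_clock::now();
        fn(count, a.data(), b.data(), c.data());
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    // Read one result so the work is observably used.
    volatile float sink = c[count - 1];
    (void)sink;
    return best;
}
```

Calling `time_msec(fun, 8000000, 10)` and the same for the NEON function would give two numbers to compare. Taking the best of several runs reduces noise from the first run touching cold memory.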
On my Nexus 5 (Qualcomm MSM8974 Snapdragon 800, similar to Cortex-A15), the benchmark for 8 million floats gives:
| Scalar: 51.26 msec, 112.56 Mcycles
NEON: 21.66 msec, 46.75 Mcycles
|
[strike]You don't get the full 4x speedup, but NEON is ~2.3x faster.[/strike] Bad results, see next post.
Here's the inner loop of both functions as compiled by clang 3.6:
| fun:
  9d2: ed92 0a00   vldr     s0, [r2]
  9d6: 3204        adds     r2, #4
  9d8: ed91 1a00   vldr     s2, [r1]
  9dc: 3104        adds     r1, #4
  9de: 3801        subs     r0, #1
  9e0: ee21 0a00   vmul.f32 s0, s2, s0
  9e4: eca3 0a01   vstmia   r3!, {s0}
  9e8: d1f3        bne.n    9d2 <_Z3funiPfS_S_+0x6>

fun_neon:
  a00: f962 0aef   vld1.64  {d16-d17}, [r2 :128]
  a04: 3001        adds     r0, #1
  a06: 3210        adds     r2, #16
  a08: 4560        cmp      r0, ip
  a0a: f961 2aef   vld1.64  {d18-d19}, [r1 :128]
  a0e: f101 0110   add.w    r1, r1, #16
  a12: ff42 0df0   vmul.f32 q8, q9, q8
  a16: f943 0aef   vst1.64  {d16-d17}, [r3 :128]
  a1a: f103 0310   add.w    r3, r3, #16
  a1e: dbef        blt.n    a00 <_Z8fun_neoniPfS_S_+0x14>
|
Even with the branch in the inner loop, NEON is faster! Changing the C code so that the NEON inner loop doesn't have a branch doesn't give faster code.
Then I compiled the same code on a Raspberry Pi 2 (Cortex-A7) using clang 3.6 (under ArchLinux). Running the executable gives me the following output:
| Scalar: 226.56 msec, 203.84 Mcycles
NEON: 192.93 msec, 173.59 Mcycles
|
Not such a big difference anymore, [strike]NEON is ~1.17x faster.[/strike] I guess older Cortex cores are not so good at NEON. The Samsung Galaxy S3 Neo has a Cortex-A7 class CPU.
Here are the commands I used to compile on the RPi2:
| clang++ -O2 -mfpu=vfpv3-d16 -c fun.cpp main.cpp
clang++ -O2 -c fun_neon.cpp
clang++ *.o -o a.exe
|