
SIMD (Single Instruction, Multiple Data)
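This package provides SIMD support via Go assembly for arithmetic and bitwise operations, allowing parallel element-wise computations.
In the benchmarks below, the SIMD routines run roughly 3.6x to 10.8x faster than plain Go, depending on vector length and architecture.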
Currently AMD64 (x86_64) and ARM64 processors are supported.
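
For orientation, the element-wise semantics that a SIMD routine accelerates can be written in plain Go. This is only a minimal sketch: the name `addFloat32Scalar`, its three-slice signature, and its return value are illustrative assumptions, not the package's documented API. A SIMD version would compute the same sums, but several elements per instruction instead of one.

```go
package main

import "fmt"

// addFloat32Scalar is a hypothetical pure-Go reference, not the package's API.
// It stores left[i] + right[i] in result[i] for every index that all three
// slices share, and returns the number of elements written.
func addFloat32Scalar(left, right, result []float32) int {
	// Only process the range that is valid for all three slices.
	n := len(result)
	if len(left) < n {
		n = len(left)
	}
	if len(right) < n {
		n = len(right)
	}
	// One addition per loop iteration: this is the scalar baseline.
	for i := 0; i < n; i++ {
		result[i] = left[i] + right[i]
	}
	return n
}

func main() {
	left := []float32{1, 2, 3, 4}
	right := []float32{10, 20, 30, 40}
	result := make([]float32, len(left))

	n := addFloat32Scalar(left, right, result)
	fmt.Println(n, result) // prints: 4 [11 22 33 44]
}
```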
Function Documentation
SIMD Support
Function | AMD64 (x86_64) | ARM64 |
---|---|---|
AddFloat32 | SSE / AVX / AVX512VL | NEON |
AddFloat64 | SSE2 / AVX / AVX512VL | NEON |
AddInt32 | SSE2 / AVX2 / AVX512VL | NEON |
AddInt64 | SSE2 / AVX2 / AVX512VL | NEON |
AndInt32 | SSE2 / AVX2 / AVX512VL | NEON |
AndInt64 | SSE2 / AVX2 / AVX512VL | NEON |
DivFloat32 | SSE / AVX / AVX512VL | |
DivFloat64 | SSE2 / AVX / AVX512VL | |
DivInt32 | | |
DivInt64 | | |
MulFloat32 | SSE / AVX / AVX512VL | NEON |
MulFloat64 | SSE2 / AVX / AVX512VL | NEON |
MulInt32 | SSE4.1 / AVX2 / AVX512VL | NEON |
MulInt64 | AVX512VL | |
OrInt32 | SSE2 / AVX2 / AVX512VL | NEON |
OrInt64 | SSE2 / AVX2 / AVX512VL | NEON |
SubFloat32 | SSE / AVX / AVX512VL | NEON |
SubFloat64 | SSE2 / AVX / AVX512VL | NEON |
SubInt32 | SSE2 / AVX2 / AVX512VL | NEON |
SubInt64 | SSE2 / AVX2 / AVX512VL | NEON |
XorInt32 | SSE2 / AVX2 / AVX512VL | |
XorInt64 | SSE2 / AVX2 / AVX512VL | |
Make Targets
Tests
Command | Description |
---|---|
make test | Compiles and runs tests natively on hardware. |
make test_amd64 | Cross-compiles for amd64 and runs tests via QEMU (qemu-x86_64). |
make test_arm64 | Cross-compiles for arm64 and runs tests via QEMU (qemu-aarch64). |
Benchmarks
Command | Description |
---|---|
make bench | Compiles and runs benchmarks natively on hardware. |
make bench_amd64 | Cross-compiles for amd64 and runs benchmarks via QEMU (qemu-x86_64). |
make bench_arm64 | Cross-compiles for arm64 and runs benchmarks via QEMU (qemu-aarch64). |
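
The benchmark code behind these targets is not shown here. As a rough sketch of how a "Go ns/op" style figure can be measured, the standard `testing.Benchmark` function can time a scalar loop over vectors of a given length. The helper names `addScalar` and `benchmarkAdd` are hypothetical, and the choice of float32 addition as the measured operation is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// addScalar is a hypothetical scalar baseline: one addition per iteration.
// All three slices are expected to have the same length.
func addScalar(left, right, result []float32) {
	for i := range result {
		result[i] = left[i] + right[i]
	}
}

// benchmarkAdd times addScalar on vectors of the given length using the
// standard library's benchmark driver, which chooses b.N automatically.
func benchmarkAdd(elements int) testing.BenchmarkResult {
	left := make([]float32, elements)
	right := make([]float32, elements)
	result := make([]float32, elements)
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			addScalar(left, right, result)
		}
	})
}

func main() {
	res := benchmarkAdd(1000)
	// NsPerOp reports the average time of one addScalar call in nanoseconds.
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Timings depend on the machine and, under QEMU, on emulation overhead, so emulated numbers are best used for correctness checks rather than performance comparisons.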
AMD64 Performance (AMD Ryzen 7 7840U / DDR5 SO-DIMM)
Elements | Go ns/op | SIMD ns/op | Performance x |
---|---|---|---|
Small Vectors | | | |
100 | 38.33 | 7.580 | 5.056 |
200 | 79.59 | 12.80 | 6.217 |
300 | 117.0 | 18.45 | 6.341 |
400 | 154.5 | 16.20 | 9.537 |
500 | 191.5 | 20.38 | 9.396 |
600 | 228.6 | 26.37 | 8.668 |
700 | 265.6 | 33.70 | 7.881 |
800 | 303.1 | 29.38 | 10.31 |
900 | 340.3 | 33.54 | 10.14 |
Medium Vectors | | | |
1000 | 377.4 | 39.60 | 9.530 |
2000 | 751.2 | 69.45 | 10.81 |
3000 | 1153 | 148.3 | 7.774 |
4000 | 1499 | 325.1 | 4.610 |
5000 | 1871 | 431.6 | 4.335 |
6000 | 2243 | 523.6 | 4.283 |
7000 | 2614 | 614.1 | 4.256 |
8000 | 2987 | 701.6 | 4.257 |
9000 | 3360 | 792.5 | 4.239 |
Large Vectors | | | |
10000 | 3725 | 878.5 | 4.240 |
20000 | 7458 | 1754 | 4.251 |
30000 | 11187 | 2631 | 4.251 |
40000 | 14908 | 3509 | 4.248 |
50000 | 18677 | 4373 | 4.270 |
60000 | 22363 | 5276 | 4.238 |
70000 | 26107 | 6319 | 4.131 |
80000 | 29854 | 7820 | 3.817 |
90000 | 33613 | 9222 | 3.644 |
ARM64 Performance (Apple M1 Pro / LPDDR5 SDRAM)
Elements | Go ns/op | SIMD ns/op | Performance x |
---|---|---|---|
Small Vectors | | | |
100 | 51.81 | 13.68 | 3.787 |
200 | 102.2 | 24.24 | 4.216 |
300 | 152.8 | 35.93 | 4.252 |
400 | 209.0 | 47.71 | 4.380 |
500 | 258.7 | 64.88 | 3.987 |
600 | 309.8 | 73.42 | 4.219 |
700 | 359.6 | 89.01 | 4.039 |
800 | 410.6 | 101.9 | 4.029 |
900 | 460.3 | 112.5 | 4.091 |
Medium Vectors | | | |
1000 | 511.5 | 124.3 | 4.115 |
2000 | 1015 | 241.0 | 4.211 |
3000 | 1520 | 356.9 | 4.258 |
4000 | 2024 | 473.1 | 4.278 |
5000 | 2527 | 589.9 | 4.283 |
6000 | 3032 | 706.1 | 4.294 |
7000 | 3535 | 822.5 | 4.297 |
8000 | 4039 | 939.2 | 4.300 |
9000 | 4543 | 1056 | 4.302 |
Large Vectors | | | |
10000 | 5046 | 1172 | 4.305 |
20000 | 10107 | 2394 | 4.221 |
30000 | 15139 | 3599 | 4.206 |
40000 | 20178 | 4957 | 4.070 |
50000 | 25218 | 6190 | 4.073 |
60000 | 30253 | 7277 | 4.157 |
70000 | 35285 | 8707 | 4.052 |
80000 | 40346 | 9924 | 4.065 |
90000 | 45378 | 11189 | 4.055 |
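
The Performance x column in the tables above is the ratio of Go ns/op to SIMD ns/op. The following minimal sketch recomputes it for one ARM64 row and also expresses it as a percentage; the function names `speedup` and `percentFaster` are illustrative and not part of the package.

```go
package main

import "fmt"

// speedup returns how many times faster the SIMD run is than the Go run.
func speedup(goNsPerOp, simdNsPerOp float64) float64 {
	return goNsPerOp / simdNsPerOp
}

// percentFaster expresses the same ratio as a percentage improvement,
// so a 4x speedup corresponds to 300% faster.
func percentFaster(goNsPerOp, simdNsPerOp float64) float64 {
	return (speedup(goNsPerOp, simdNsPerOp) - 1) * 100
}

func main() {
	// ARM64, 1000 elements: 511.5 ns/op (Go) vs 124.3 ns/op (SIMD).
	goNs, simdNs := 511.5, 124.3
	fmt.Printf("%.3fx (%.1f%% faster)\n",
		speedup(goNs, simdNs), percentFaster(goNs, simdNs))
	// prints: 4.115x (311.5% faster)
}
```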