Package: github.com/pehringer/simd
1.2.3
Repository: https://github.com/pehringer/simd.git
Documentation: pkg.go.dev

# README

SIMD

SIMD (Single Instruction, Multiple Data)

SIMD support via Go assembly for arithmetic and bitwise operations, allowing parallel element-wise computations and resulting in a 100% to 400% speedup. Currently, AMD64 (x86_64) and ARM64 processors are supported.
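As a point of reference, the element-wise computation that a function such as AddFloat32 accelerates can be sketched in plain Go. The helper name `addFloat32Scalar`, its signature, and its return value are assumptions made for illustration only; they are not taken from the package, so consult the function documentation for the real API.

```go
package main

import "fmt"

// addFloat32Scalar is a hypothetical scalar baseline (not part of the package).
// It writes left[i] + right[i] into result[i] for every index shared by all
// three slices and returns the number of elements processed.
func addFloat32Scalar(left, right, result []float32) int {
	n := len(result)
	if len(left) < n {
		n = len(left)
	}
	if len(right) < n {
		n = len(right)
	}
	for i := 0; i < n; i++ {
		result[i] = left[i] + right[i]
	}
	return n
}

func main() {
	left := []float32{1, 2, 3, 4}
	right := []float32{10, 20, 30, 40}
	result := make([]float32, len(left))
	n := addFloat32Scalar(left, right, result)
	fmt.Println(n, result) // prints "4 [11 22 33 44]"
}
```

A SIMD implementation performs the same additions, but processes several adjacent elements per instruction instead of one per loop iteration.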

Function Documentation

SIMD Support

| Function | AMD64 (x86_64) | ARM64 |
| --- | --- | --- |
| AddFloat32 | SSE / AVX / AVX512VL | NEON |
| AddFloat64 | SSE2 / AVX / AVX512VL | NEON |
| AddInt32 | SSE2 / AVX2 / AVX512VL | NEON |
| AddInt64 | SSE2 / AVX2 / AVX512VL | NEON |
| AndInt32 | SSE2 / AVX2 / AVX512VL | NEON |
| AndInt64 | SSE2 / AVX2 / AVX512VL | NEON |
| DivFloat32 | SSE / AVX / AVX512VL | |
| DivFloat64 | SSE2 / AVX / AVX512VL | |
| DivInt32 | | |
| DivInt64 | | |
| MulFloat32 | SSE / AVX / AVX512VL | NEON |
| MulFloat64 | SSE2 / AVX / AVX512VL | NEON |
| MulInt32 | SSE4.1 / AVX2 / AVX512VL | NEON |
| MulInt64 | AVX512VL | |
| OrInt32 | SSE2 / AVX2 / AVX512VL | NEON |
| OrInt64 | SSE2 / AVX2 / AVX512VL | NEON |
| SubFloat32 | SSE / AVX / AVX512VL | NEON |
| SubFloat64 | SSE2 / AVX / AVX512VL | NEON |
| SubInt32 | SSE2 / AVX2 / AVX512VL | NEON |
| SubInt64 | SSE2 / AVX2 / AVX512VL | NEON |
| XorInt32 | SSE2 / AVX2 / AVX512VL | |
| XorInt64 | SSE2 / AVX2 / AVX512VL | |

Make Targets

Tests

| Command | Description |
| --- | --- |
| make test | Compiles and runs tests natively on hardware. |
| make test_amd64 | Cross compiles for amd64 and runs tests via QEMU (qemu-x86_64). |
| make test_arm64 | Cross compiles for arm64 and runs tests via QEMU (qemu-aarch64). |

Benchmarks

| Command | Description |
| --- | --- |
| make bench | Compiles and runs benchmarks natively on hardware. |
| make bench_amd64 | Cross compiles for amd64 and runs benchmarks via QEMU (qemu-x86_64). |
| make bench_arm64 | Cross compiles for arm64 and runs benchmarks via QEMU (qemu-aarch64). |
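To give a rough idea of how an ns/op figure can be produced, the sketch below times a plain-Go addition loop with the standard library's `testing.Benchmark`. It assumes the make targets wrap Go's usual benchmark tooling, which is not spelled out here; `scalarAdd` and `benchmarkScalarAdd` are hypothetical helpers, not code from this repository.

```go
package main

import (
	"fmt"
	"testing"
)

// scalarAdd is a hypothetical plain-Go baseline (not part of the package).
// The caller must pass slices of equal length.
func scalarAdd(left, right, result []float32) {
	for i := range result {
		result[i] = left[i] + right[i]
	}
}

// benchmarkScalarAdd measures scalarAdd on slices of the given length
// using the standard library benchmark driver.
func benchmarkScalarAdd(elements int) testing.BenchmarkResult {
	left := make([]float32, elements)
	right := make([]float32, elements)
	result := make([]float32, elements)
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			scalarAdd(left, right, result)
		}
	})
}

func main() {
	for _, elements := range []int{100, 1000, 10000} {
		r := benchmarkScalarAdd(elements)
		fmt.Printf("%d elements: %d ns/op\n", elements, r.NsPerOp())
	}
}
```

Absolute timings depend on the machine, and numbers collected under QEMU emulation are unlikely to reflect native hardware performance.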

AMD64 Performance (AMD Ryzen 7 7840U / DDR5 SO-DIMM)

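| Elements | Go ns/op | SIMD ns/op | Performance x |
| --- | --- | --- | --- |
| **Small Vectors** | | | |
| 100 | 38.33 | 7.580 | 5.056 |
| 200 | 79.59 | 12.80 | 6.217 |
| 300 | 117.0 | 18.45 | 9.593 |
| 400 | 154.5 | 16.20 | 9.537 |
| 500 | 191.5 | 20.38 | 9.396 |
| 600 | 228.6 | 26.37 | 8.668 |
| 700 | 265.6 | 33.70 | 7.881 |
| 800 | 303.1 | 29.38 | 10.31 |
| 900 | 340.3 | 33.54 | 10.14 |
| **Medium Vectors** | | | |
| 1000 | 377.4 | 39.60 | 9.530 |
| 2000 | 751.2 | 69.45 | 10.81 |
| 3000 | 1153 | 148.3 | 7.774 |
| 4000 | 1499 | 325.1 | 4.610 |
| 5000 | 1871 | 431.6 | 4.335 |
| 6000 | 2243 | 523.6 | 4.283 |
| 7000 | 2614 | 614.1 | 4.256 |
| 8000 | 2987 | 701.6 | 4.257 |
| 9000 | 3360 | 792.5 | 4.239 |
| **Large Vectors** | | | |
| 10000 | 3725 | 878.5 | 4.240 |
| 20000 | 7458 | 1754 | 4.251 |
| 30000 | 11187 | 2631 | 4.251 |
| 40000 | 14908 | 3509 | 4.248 |
| 50000 | 18677 | 4373 | 4.270 |
| 60000 | 22363 | 5276 | 4.238 |
| 70000 | 26107 | 6319 | 4.131 |
| 80000 | 29854 | 7820 | 3.817 |
| 90000 | 33613 | 9222 | 3.644 |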

ARM64 Performance (Apple M1 Pro / LPDDR5 SDRAM)

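| Elements | Go ns/op | SIMD ns/op | Performance x |
| --- | --- | --- | --- |
| **Small Vectors** | | | |
| 100 | 51.81 | 13.68 | 3.787 |
| 200 | 102.2 | 24.24 | 4.216 |
| 300 | 152.8 | 35.93 | 4.252 |
| 400 | 209.0 | 47.71 | 4.380 |
| 500 | 258.7 | 64.88 | 3.987 |
| 600 | 309.8 | 73.42 | 4.219 |
| 700 | 359.6 | 89.01 | 4.039 |
| 800 | 410.6 | 101.9 | 4.029 |
| 900 | 460.3 | 112.5 | 4.091 |
| **Medium Vectors** | | | |
| 1000 | 511.5 | 124.3 | 4.115 |
| 2000 | 1015 | 241.0 | 4.211 |
| 3000 | 1520 | 356.9 | 4.258 |
| 4000 | 2024 | 473.1 | 4.278 |
| 5000 | 2527 | 589.9 | 4.283 |
| 6000 | 3032 | 706.1 | 4.294 |
| 7000 | 3535 | 822.5 | 4.297 |
| 8000 | 4039 | 939.2 | 4.300 |
| 9000 | 4543 | 1056 | 4.302 |
| **Large Vectors** | | | |
| 10000 | 5046 | 1172 | 4.305 |
| 20000 | 10107 | 2394 | 4.221 |
| 30000 | 15139 | 3599 | 4.206 |
| 40000 | 20178 | 4957 | 4.070 |
| 50000 | 25218 | 6190 | 4.073 |
| 60000 | 30253 | 7277 | 4.157 |
| 70000 | 35285 | 8707 | 4.052 |
| 80000 | 40346 | 9924 | 4.065 |
| 90000 | 45378 | 11189 | 4.055 |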
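
The Performance x column appears to be the Go ns/op figure divided by the SIMD ns/op figure for the same element count. A minimal sketch of that arithmetic follows; the helper name `speedup` is an assumption for illustration and does not come from the repository.

```go
package main

import "fmt"

// speedup is a hypothetical helper: it returns how many times faster the
// SIMD run is than the plain-Go run, given both timings in ns/op.
func speedup(goNsPerOp, simdNsPerOp float64) float64 {
	return goNsPerOp / simdNsPerOp
}

func main() {
	// ARM64 row for 100 elements: 51.81 ns/op (Go) vs 13.68 ns/op (SIMD).
	fmt.Printf("%.3f\n", speedup(51.81, 13.68)) // prints "3.787"
}
```

Because the published timings are rounded to four significant digits, recomputing a ratio from them can differ from the listed value in the last digit.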