    Use scalar functions when less traffic (#24) · 776275e0
    Klaus Post authored
    Switch to scalar assembly when fewer than 3 lanes are filled.
    
    This brings us very close to `crypto/md5` in cases where only a single lane is populated.
    
    When 2 lanes are filled, we use 2 goroutines running the scalar code; with 3 or more lanes, we switch to SIMD.
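    
    The lane thresholds above can be sketched roughly as follows. This is only an illustration under stated assumptions: `path`, `choosePath`, and `hashLanes` are hypothetical names that do not come from this repository, and `crypto/md5` stands in for both the scalar assembly and the SIMD code.
    
    ```go
    package main
    
    import (
    	"crypto/md5"
    	"fmt"
    	"sync"
    )
    
    // path names the code path a batch of lanes would take (hypothetical type).
    type path int
    
    const (
    	scalarSingle path = iota // 0 or 1 lane: scalar code on the calling goroutine
    	scalarPair               // 2 lanes: one scalar goroutine per lane
    	simd                     // 3 or more lanes: SIMD code
    )
    
    // choosePath applies the thresholds from the commit message:
    // fewer than 3 filled lanes use scalar code, otherwise SIMD.
    func choosePath(filled int) path {
    	switch {
    	case filled <= 1:
    		return scalarSingle
    	case filled == 2:
    		return scalarPair
    	default:
    		return simd
    	}
    }
    
    // hashLanes hashes every lane. crypto/md5 is a stand-in for the real
    // scalar and SIMD routines, so only the dispatch shape is illustrated.
    func hashLanes(lanes [][]byte) [][md5.Size]byte {
    	out := make([][md5.Size]byte, len(lanes))
    	switch choosePath(len(lanes)) {
    	case scalarPair:
    		// Two lanes: hash each lane on its own goroutine.
    		var wg sync.WaitGroup
    		for i := range lanes {
    			wg.Add(1)
    			go func(i int) {
    				defer wg.Done()
    				out[i] = md5.Sum(lanes[i])
    			}(i)
    		}
    		wg.Wait()
    	default:
    		// Single lane, or placeholder for the SIMD path (assumption:
    		// the real code would process all lanes in one vectorized call).
    		for i := range lanes {
    			out[i] = md5.Sum(lanes[i])
    		}
    	}
    	return out
    }
    
    func main() {
    	lanes := [][]byte{[]byte("lane 0"), []byte("lane 1")}
    	for i, sum := range hashLanes(lanes) {
    		fmt.Printf("lane %d: %x\n", i, sum[:])
    	}
    }
    ```
    
    Whichever path is taken, the digests must be identical; only throughput and allocation behaviour should differ.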
    
    Before, with a single writer:
    ```
    BenchmarkAvx2SingleWriter/32KB-32              14686       80893 ns/op   405.08 MB/s       976 B/op        8 allocs/op
    BenchmarkAvx2SingleWriter/64KB-32               7498      162843 ns/op   402.45 MB/s      1840 B/op       15 allocs/op
    BenchmarkAvx2SingleWriter/128KB-32              3636      327558 ns/op   400.15 MB/s      3568 B/op       29 allocs/op
    BenchmarkAvx2SingleWriter/256KB-32              1845      650406 ns/op   403.05 MB/s      7024 B/op       57 allocs/op
    BenchmarkAvx2SingleWriter/512KB-32               922     1295010 ns/op   404.85 MB/s     13937 B/op      113 allocs/op
    BenchmarkAvx2SingleWriter/1MB-32                 463     2598272 ns/op   403.57 MB/s     27765 B/op      225 allocs/op
    BenchmarkAvx2SingleWriter/2MB-32                 231     5164500 ns/op   406.07 MB/s     55411 B/op      449 allocs/op
    BenchmarkAvx2SingleWriter/4MB-32                 100    10170000 ns/op   412.42 MB/s    110709 B/op      897 allocs/op
    BenchmarkAvx2SingleWriter/8MB-32                  56    20357161 ns/op   412.07 MB/s    221305 B/op     1793 allocs/op
    ```
    
    After:
    ```
    BenchmarkAvx2SingleWriter/32KB-32              26785       44353 ns/op   738.80 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/64KB-32              13682       87853 ns/op   745.98 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/128KB-32              7058      175829 ns/op   745.45 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/256KB-32              3428      346558 ns/op   756.42 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/512KB-32              1713      686515 ns/op   763.69 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/1MB-32                 874     1366132 ns/op   767.55 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/2MB-32                 439     2740318 ns/op   765.30 MB/s       112 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/4MB-32                 220     5431817 ns/op   772.17 MB/s       113 B/op        1 allocs/op
    BenchmarkAvx2SingleWriter/8MB-32                 100    10840002 ns/op   773.86 MB/s       116 B/op        1 allocs/op
    ```
    
    For comparison, pure `crypto/md5`:
    ```
    BenchmarkCryptoMd5/32KB-32             30612       39004 ns/op   840.11 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/64KB-32             15285       77985 ns/op   840.37 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/128KB-32             7498      156175 ns/op   839.26 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/256KB-32             3870      310336 ns/op   844.71 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/512KB-32             1874      623266 ns/op   841.19 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/1MB-32                960     1243750 ns/op   843.08 MB/s         0 B/op        0 allocs/op
    BenchmarkCryptoMd5/2MB-32                480     2489588 ns/op   842.37 MB/s         0 B/op        0 allocs/op
    ```
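    
    A baseline like the one above could be measured along these lines. This is a minimal sketch, not the repository's benchmark: `benchMD5` is a hypothetical helper, and it uses `testing.Benchmark` so that it runs as a plain program. `b.SetBytes` is what produces the MB/s column, and `b.ReportAllocs` the B/op and allocs/op columns.
    
    ```go
    package main
    
    import (
    	"crypto/md5"
    	"fmt"
    	"testing"
    )
    
    // benchMD5 measures md5.Sum over a buffer of the given size
    // (hypothetical helper, named here for illustration only).
    func benchMD5(size int) testing.BenchmarkResult {
    	data := make([]byte, size)
    	return testing.Benchmark(func(b *testing.B) {
    		b.SetBytes(int64(size)) // enables the MB/s figure
    		b.ReportAllocs()        // enables B/op and allocs/op
    		for i := 0; i < b.N; i++ {
    			md5.Sum(data)
    		}
    	})
    }
    
    func main() {
    	res := benchMD5(32 << 10) // 32KB, the smallest size in the tables
    	fmt.Println(res.String(), res.MemString())
    }
    ```
    
    Absolute numbers depend on the CPU and Go version, so only comparisons made on the same machine are meaningful.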
    
    After optimizing the assembly:
    ```
    BenchmarkAvx2SingleWriter
    BenchmarkAvx2SingleWriter/32KB-32         	   28570	     41941 ns/op	 781.29 MB/s	       0 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/64KB-32         	   14388	     83055 ns/op	 789.06 MB/s	       0 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/128KB-32        	    7500	    167734 ns/op	 781.43 MB/s	       0 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/256KB-32        	    3636	    332508 ns/op	 788.38 MB/s	       1 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/512KB-32        	    1818	    659667 ns/op	 794.78 MB/s	       2 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/1MB-32          	     915	   1315847 ns/op	 796.88 MB/s	       5 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/2MB-32          	     457	   2621787 ns/op	 799.89 MB/s	      11 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/4MB-32          	     229	   5213972 ns/op	 804.44 MB/s	      22 B/op	       0 allocs/op
    BenchmarkAvx2SingleWriter/8MB-32          	     100	  10409999 ns/op	 805.82 MB/s	      51 B/op	       0 allocs/op
    ```