from bit import bit_reverse

fn CRC32(data: List[UInt8], dummy_table: List[UInt32]) -> UInt32:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        crc32 = (bit_reverse(byte[]).cast[DType.uint32]() << 24) ^ crc32
        for _ in range(8):
            if crc32 & 0x80000000 != 0:
                crc32 = (crc32 << 1) ^ 0x04c11db7
            else:
                crc32 = crc32 << 1
    return bit_reverse(crc32^0xffffffff)
In a previous post on parsing PNG images in Mojo, I very briefly mentioned cyclic redundancy checks and posted a rather cryptic-looking function which I claimed was a bit inefficient.
In this post I want to follow up on that a bit and see how we can speed up these calculations.
For reference, this post was done with Mojo 24.5, so a few language details have changed since my last post (e.g. math.bit got moved to the top-level bit module, and a few of its functions have been renamed).
I actually wrote most of this post in June with Mojo 24.4, but ran into a bit of an issue which has now been resolved. It even resulted in a new unit test for Mojo, so thanks to Laszlo Kindrat for fixing the issue, and for soraros and martinvuyk for helping figure out what the actual issue was.
A bit of context
But first, let’s go through a bit of background so we know what we’re dealing with.
Cyclic redundancy checks
CRCs are error-detecting codes that are often used to detect corruption of data in digital files, such as PNG files. In the case of PNGs, for example, a CRC-32 is calculated for the data of each chunk and appended to the end of the chunk, so that a reader can verify whether the data they read is the same as the data that was written.
A CRC check technically does “long division in the ring of polynomials of binary coefficients (\(\Bbb{F}_2[x]\))” 😳.
It’s not as complicated as it sounds. I found the Wikipedia article on Polynomial long division helpful, and if you want an in-depth explanation, then this post by Kareem Omar does an excellent job of explaining both the concept and implementation considerations. I won’t go deep into the explanations, so I recommend you read at least the first part of Kareem’s post for more background. I pretty much use his post as a guide.
Did you read that post? Then welcome back, and we’ll continue from there.
But tl;dr: XOR is equivalent to polynomial long division (over a finite field) for binary numbers, and XOR is a very efficient operation to calculate. Essentially what a CRC check does in practice is to run through a sequence of bytes, and iteratively perform a lot of XORs and bit-shifts. By iteratively going through each bit, one can come up with a value that will (nearly certainly) change if data is corrupted somehow.
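To make that concrete, here is a toy sketch in Python (not CRC-32; a made-up 3-bit CRC with an arbitrarily chosen small polynomial) showing that the shift-and-XOR loop really is just polynomial long division:

```python
# Toy example: divide the 8-bit message 0b11010110 by the polynomial
# 0b1011 (x^3 + x + 1) using only shifts and XORs; the 3-bit remainder
# is the CRC. Message and polynomial are arbitrary illustrative choices.
def crc3(message: int, n_bits: int = 8, poly: int = 0b1011) -> int:
    # Append 3 zero bits (the degree of the polynomial) to the message
    register = message << 3
    # Walk the divisor across the message, most significant bit first
    for shift in range(n_bits - 1, -1, -1):
        # Only "subtract" (XOR) when the leading bit is set, exactly
        # like choosing a quotient digit in long division
        if register & (1 << (shift + 3)):
            register ^= poly << shift
    return register  # what remains is the remainder, i.e. the CRC


print(bin(crc3(0b11010110)))  # → 0b111
```

The CRC-32 algorithms below are exactly this loop, scaled up to a degree-32 polynomial and run bit by bit over a whole bytestream.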
The original
The CRC-32 check from my previous post is the function shown at the top of this post. I’ll step through it in a moment, but the first thing you might notice is that I am reversing a lot of bits. The table argument is a bit of future-proofing which we won’t need for now; its purpose will become apparent soon, so just ignore it.
This is because when I was implementing this function (based off a C example), I implemented a little-endian version of the algorithm, while PNGs are encoded as big-endian. It’s not a huge deal, but it does mean that I am constantly reversing bytes, and then reversing the output again.
The correct bit order
We can make this better by implementing the big-endian version:
fn CRC32_inv(data: List[UInt8], dummy_table: List[UInt32]) -> UInt32:
    """Big endian CRC-32 check using 0xedb88320 as polynomial."""
    # Initialize crc32 as all 1s
    var crc32: UInt32 = 0xffffffff
    # Step through all bytes in the bytestream
    for byte in data:
        # XOR new byte with crc32
        # We need to cast it to a UInt32 first as Mojo doesn't do implicit casts
        crc32 = byte[].cast[DType.uint32]() ^ crc32
        # Step through crc32 8 times
        for _ in range(8):
            # If the trailing bit is 1, bitshift by 1 and XOR with the
            # generator polynomial, otherwise just shift by one
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1
    # Invert upon return
    return crc32^0xffffffff
This is very similar, and just entails using the bit-reverse of the polynomial we used before (if you bit-reverse 0x04c11db7 you get 0xedb88320). This also saves us one 24-bit bit-shift, as we are now working on the bottom 8 bits of the UInt32 instead of the top 8.
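As a quick sanity check on that claim, here is a small Python helper (my own throwaway function, not part of the Mojo code) confirming the two polynomial constants are bit-reversals of each other:

```python
# Verify that 0xedb88320 really is the bit-reversal of 0x04c11db7
def bit_reverse32(x: int) -> int:
    # Format as a 32-bit binary string, reverse it, and parse it back
    return int(f"{x:032b}"[::-1], 2)


print(hex(bit_reverse32(0x04C11DB7)))  # → 0xedb88320
```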
Just to verify that these implementations are equivalent, let’s do a quick test:
var example_list = List[UInt8](5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 42)
print(hex(CRC32(example_list, List[UInt32](1))))
print(hex(CRC32_inv(example_list, List[UInt32](1))))
0x382aa34e
0x382aa34e
And there we go, a more elegant version of the CRC-32 check I implemented last time.
As the theme of today’s post is trying to speed things up, let’s do a little bit of benchmarking to see if this change has saved us any time. As we are doing one fewer bit reverse and bit shift per byte, as well as a final reverse, we should see a bit of a performance uplift.
A bit of benchmarking
So let’s define a benchmarking function. It will take two versions of the CRC-32 function and benchmark their runtimes.
We have to do a little bit of work first. The functions we benchmark need to return None, so we need to wrap our functions in functions with no return value. Note the benchmark.keep(a) calls: the compiler will realize that a is never used and will compile the call away unless you instruct it to keep it:
import benchmark
from testing import assert_equal

alias data_table_func = fn(data: List[UInt8], table: List[UInt32]) -> UInt32

fn wrap_func[testing_function: data_table_func, data: List[UInt8], table: List[UInt32]]():
    var a = testing_function(data=data, table=table)
    benchmark.keep(a)

fn bench[function_1: data_table_func, function_2: data_table_func,
         test_list: List[UInt8], test_table: List[UInt32]](quiet: Bool = False) raises -> Tuple[Float64, Float64, Float64]:
    var CRC32_1 = function_1(test_list, test_table)
    alias function_1_wrapped = wrap_func[function_1]
    var report_1 = benchmark.run[function_1_wrapped[test_list, test_table]](max_runtime_secs=1.0).mean(benchmark.Unit.ms)

    var CRC32_2 = function_2(test_list, test_table)
    assert_equal(CRC32_1, CRC32_2)
    alias function_2_wrapped = wrap_func[function_2]
    var report_2 = benchmark.run[function_2_wrapped[test_list, test_table]](max_runtime_secs=1.0).mean(benchmark.Unit.ms)

    var speedup = report_1/report_2
    if not quiet:
        print("Function 1 runtime (ms): \t", report_1)
        print("Function 2 runtime (ms): \t", report_2)
        print("Speedup factor: \t\t", speedup)
    return report_1, report_2, speedup
Next we need a test case. I’m not sure if there is a nicer way to fill a List with random values yet, but for now we can create an UnsafePointer, alloc some space, and fill it with random numbers using rand. Then we can init a List using that data:
from random import rand

alias fill_size = 2**20
alias g = UnsafePointer[UInt8].alloc(fill_size)
rand[DType.uint8](g, fill_size)

alias rand_list = List[UInt8](unsafe_pointer=g, size=fill_size, capacity=fill_size)
alias dummy_table = List[UInt32](1)
And finally we are ready to benchmark:
original_time, improved_time, speedup = bench[CRC32, CRC32_inv, rand_list, dummy_table]()
Function 1 runtime (ms): 9.437418430000001
Function 2 runtime (ms): 6.8247150400000001
Speedup factor: 1.3828296675666039
Nice! So just by avoiding a few bit reversals and bit shifts we get roughly a 40% uplift in performance, depending on the run.
Note: I am doing this in a Jupyter notebook, so there is a bit of variance from run to run.
A quick Python speed-check
While we’re checking performance, let’s see how this implementation would perform in Python:
%%python
def py_crc32(data: bytearray) -> int:
    crc32 = 0xffffffff
    for byte in data:
        crc32 = byte ^ crc32
        for i in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1
    return crc32^0xffffffff
And let’s just do a sanity check to assure ourselves that we produce the same CRC-32 value given the same bytestream:
%%python
py_example_list = [5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201,
                   5, 78, 138, 1, 54, 172, 104, 99, 54, 167, 94, 56, 22, 184, 204, 90, 201, 42]
py_example_bytearray = bytearray(py_example_list)

print(hex(py_crc32(py_example_bytearray)))
0x382aa34e
%%python
from timeit import timeit
import numpy as np

py_rand_list = np.random.randint(0, 255, size=2**20, dtype=np.uint8)
naive_python_seconds = timeit(lambda: py_crc32(py_rand_list), number=10)/10

print(f"Naive Python runtime (ms): {naive_python_seconds*100}")
Naive Python runtime (ms): 123.02583360986318
That’s pretty slow. In fact, this means that:
print("Speedup over Python: Mojo is", (naive_python_seconds*100/improved_time), "faster")
Speedup over Python: Mojo is 18.026515816235925 faster
So that’s a nice little speedup we can get by writing essentially the same logic in Mojo.
Now, this is a bit unfair. If you actually wrote this function in Python for anything other than educational purposes, then you are using Python wrong, but I’ll get back to how you would actually do it later.
Adding a table
Now, the majority of CRC-32 implementations you see will usually use a table.
This is called the Sarwate algorithm and exploits the fact that as we are operating on data one byte at a time, there are only 256 unique values that the calculation in the inner loop of the algorithm can take. This is a small enough table (1024 bytes for 256 unsigned 32-bit integers) that it can easily be kept in the L1 cache during computation.
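The idea is easy to prototype in Python before writing the Mojo version. Here is a sketch of the Sarwate loop, cross-checked against zlib.crc32 (which computes the same CRC-32 variant that PNG uses); the function names are my own:

```python
import zlib

# Precompute the 256-entry table: table[i] is the CRC state after
# pushing the single byte i through the 8 shift/XOR steps
table = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    table.append(crc)


def crc32_sarwate(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        # One table lookup replaces the entire 8-iteration inner loop
        crc = table[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print(crc32_sarwate(b"hello") == zlib.crc32(b"hello"))  # → True
```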
Let’s try it out to see how much faster this gets us.
First, we need to pre-compute the table:
fn fill_table() -> List[UInt32]:
    var table = List[UInt32](capacity=256)
    table.resize(256, 0)
    for i in range(256):
        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for _ in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1
        table[i] = crc32
    return table

alias little_endian_table = fill_table()
We’ve now effectively amortized the cost of the inner loop and replaced it with a table lookup which should in principle be faster.
Our CRC-32 algorithm now takes this form:
fn CRC32_table(data: List[UInt8], table: List[UInt32]) -> UInt32:
    var crc32: UInt32 = 0xffffffff
    for byte in data:
        var index = (crc32 ^ byte[].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)
    return crc32^0xffffffff
# Sanity check
print(hex(CRC32(example_list, List[UInt32](1))))
print(hex(CRC32_table(example_list, little_endian_table)))
0x382aa34e
0x382aa34e
Now we can test it out!
_, _, speedup_table = bench[CRC32, CRC32_table, rand_list, little_endian_table]()
Function 1 runtime (ms): 8.5761109999999992
Function 2 runtime (ms): 1.9543971584601614
Speedup factor: 4.3881106574863127
So that’s a speedup of around 4.5 times over where we started. This is already a pretty good result, but can we do better?
Two byte table
In principle, we could load the data as a 16-bit integer at a time and use the same trick above to build a table. However, such a table would have 65536 entries, resulting in a 256 KB table, which is a lot bigger than the 48 KB of L1 cache on my machine.
It turns out, we can still do two bytes at a time.
Following the description given by Kareem Omar, we realize that if we have two successive bytes in our message, let’s call them X
and Y
:
X = xxxxxxxx
Y = yyyyyyyy
if we pad these messages with zeroes
Xz = xxxxxxxx00000000
zY = 00000000yyyyyyyy
then
Xz ^ zY = xxxxxxxxyyyyyyyy = M
What this means is that if we have a CRC algorithm that works on a 16-bit message called M then:
CRC(M) = CRC(Xz ^ zY) = CRC(Xz) ^ CRC(zY)
For the operation on Y
, leading zeros don’t affect the calculation, so we can just use the 8-bit CRC algorithm we have developed above:
CRC(zY) = CRC(Y)
The same does not hold for trailing zeroes. However, as Xz will always be of the form xxxxxxxx00000000 and there can only be 256 unique values for Xz, we can build a new 256-entry table to look up these values. We can build two separate 256-entry tables, or just one 512-entry table.
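We can convince ourselves of the CRC(Xz ^ zY) = CRC(Xz) ^ CRC(zY) identity with a small Python check. Note that this linearity only holds for the "raw" CRC, i.e. without the 0xffffffff initialization and final inversion (a sketch, with arbitrary example bytes):

```python
def raw_crc(data: bytes) -> int:
    # CRC-32 core without init/final XOR, so it is linear over GF(2)
    crc = 0
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    return crc


X, Y = 0xAB, 0xCD       # arbitrary example bytes
M = bytes([X, Y])       # the 16-bit message
Xz = bytes([X, 0])      # X padded with a zero byte
zY = bytes([0, Y])      # Y preceded by a zero byte

print(raw_crc(M) == raw_crc(Xz) ^ raw_crc(zY))   # → True
print(raw_crc(zY) == raw_crc(bytes([Y])))        # → True: leading zeros don't matter
```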
So let’s construct this new table:
fn fill_table_2_byte() -> List[UInt32]:
    var table = List[UInt32](capacity=512)
    table.resize(256*2, 0)

    for i in range(256):
        var key = UInt8(i)
        var crc32 = key.cast[DType.uint32]()
        for _ in range(8):
            if crc32 & 1 != 0:
                crc32 = (crc32 >> 1) ^ 0xedb88320
            else:
                crc32 = crc32 >> 1
        table[i] = crc32

    for i in range(256, 512):
        var crc32 = table[i-256]
        table[i] = (crc32 >> 8) ^ table[int(crc32.cast[DType.uint8]())]
    return table

var little_endian_table_2_byte = fill_table_2_byte()
We now have to update our algorithm to do two table lookups instead of one, and to read the data two bytes at a time.
fn CRC32_table_2_byte(data: List[UInt8], table: List[UInt32]) -> UInt32:
    var crc32: UInt32 = 0xffffffff
    var length = len(data)//2
    var extra = len(data) % 2
    # Go through range in steps of 2
    for i in range(start = 0, end = length * 2, step = 2):
        # Concatenate the two bytes
        var val: UInt32 = (data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]()
        # Create the index, and look up values in the table
        var index = crc32 ^ val
        crc32 = table[int((index >> 8).cast[DType.uint8]())] ^ table[256 + int(index.cast[DType.uint8]())] ^ (crc32 >> 16)

    # Tail loop to handle the final byte in case the message length is not divisible by 2
    for i in range(2*length, 2*length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)

    return crc32^0xffffffff
# Sanity check
print(hex(CRC32(example_list, List[UInt32](1))))
print(hex(CRC32_table_2_byte(example_list, little_endian_table_2_byte)))
0x382aa34e
0x382aa34e
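For reference, the same two-byte (slicing-by-2) scheme can be sketched in Python and cross-checked against zlib.crc32. The structure mirrors fill_table_2_byte and CRC32_table_2_byte above, but the names are my own:

```python
import zlib

# 512-entry table: the first 256 entries advance the CRC by 8 bits,
# the next 256 by 16 bits
table = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
    table.append(crc)
for i in range(256):
    crc = table[i]
    table.append((crc >> 8) ^ table[crc & 0xFF])


def crc32_2byte(data: bytes) -> int:
    crc = 0xFFFFFFFF
    n = len(data) // 2 * 2
    for i in range(0, n, 2):
        index = crc ^ (data[i] | (data[i + 1] << 8))
        crc = (table[(index >> 8) & 0xFF]        # second byte: 8-bit advance
               ^ table[256 + (index & 0xFF)]     # first byte: 16-bit advance
               ^ (crc >> 16))
    if len(data) % 2:                            # tail byte for odd lengths
        crc = table[(crc ^ data[-1]) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


msg = b"hello world"
print(crc32_2byte(msg) == zlib.crc32(msg))  # → True
```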
So let’s see if this speeds things up:
alias a_little_endian_table_2_byte = fill_table_2_byte()
_, _, speedup_table_2 = bench[CRC32, CRC32_table_2_byte, rand_list, a_little_endian_table_2_byte]()
Function 1 runtime (ms): 9.4386268461538467
Function 2 runtime (ms): 1.2856673344847436
Speedup factor: 7.3414223049670637
Nice, now we are at around a 7x speedup over where we began.
But can we go further? There’s nothing in the above that forces us to only use two bytes at a time.
4-bytes tabular
Nothing stops us from doing 4 bytes, i.e. 32-bits at a time with similar logic to above.
I’ll quickly create a function that will fill an arbitrarily sized table:
fn fill_table_n_byte[n: Int]() -> List[UInt32]:
    var table = List[UInt32](capacity=256 * n)
    table.resize(256 * n, 0)

    for i in range(256 * n):
        if i < 256:
            var key = UInt8(i)
            var crc32 = key.cast[DType.uint32]()
            for _ in range(8):
                if crc32 & 1 != 0:
                    crc32 = (crc32 >> 1) ^ 0xEDB88320
                else:
                    crc32 = crc32 >> 1
            table[i] = crc32
        else:
            var crc32 = table[i - 256]
            var index = int(crc32.cast[DType.uint8]())
            table[i] = (crc32 >> 8) ^ table[index]

    return table

var little_endian_table_4_byte = fill_table_n_byte[4]()
Now to create the 4-byte function:
fn CRC32_table_4_byte(data: List[UInt8], table: List[UInt32]) -> UInt32:
    var crc32: UInt32 = 0xffffffff
    var size = 4
    var length = len(data)//size
    var extra = len(data) % size
    for i in range(start = 0, end = length*size, step = size):
        var val: UInt32 = (data[i + 3].cast[DType.uint32]() << 24) | (data[i + 2].cast[DType.uint32]() << 16) | (data[i + 1].cast[DType.uint32]() << 8) | data[i].cast[DType.uint32]()
        var index = crc32 ^ val
        crc32 = (
            table[0*256 + int((index >> 24).cast[DType.uint8]())]
            ^ table[1*256 + int((index >> 16).cast[DType.uint8]())]
            ^ table[2*256 + int((index >> 8).cast[DType.uint8]())]
            ^ table[3*256 + int((index >> 0).cast[DType.uint8]())]
        )

    for i in range(size*length, size*length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xff
        crc32 = table[int(index)] ^ (crc32 >> 8)

    return crc32^0xffffffff
# Sanity check
print(hex(CRC32(example_list, List[UInt32](1))))
print(hex(CRC32_table_4_byte(example_list, little_endian_table_4_byte)))
0x382aa34e
0x382aa34e
And presto, it still works!
Let’s see how much faster we are now:
alias a_little_endian_table_4_byte = fill_table_n_byte[4]()
_, _, _ = bench[CRC32, CRC32_table_4_byte, rand_list, a_little_endian_table_4_byte]()
Function 1 runtime (ms): 9.9107345799999997
Function 2 runtime (ms): 0.69239487
Speedup factor: 14.313703075240866
A 14 times speedup over where we started!
But why stop there?
n-bytes implementation
We could in principle explicitly write out versions that take 8, 16, or however many bytes at a time. This gets a little long-winded, so I’ll write some generic functions to produce functions of arbitrary size.
fn CRC32_table_n_byte_compact[size: Int](data: List[UInt8], table: List[UInt32]) -> UInt32:
    var crc32: UInt32 = 0xFFFFFFFF
    alias step_size = 4  # Always smashing 4 bytes into a UInt32
    alias units = size // step_size
    var length = len(data) // size
    var extra = len(data) % size
    var n = 0
    for i in range(start=0, end=length * size, step=size):
        @parameter
        for j in range(units):
            var vals = (
                (data[i + j * step_size + 3].cast[DType.uint32]() << 24)
                | (data[i + j * step_size + 2].cast[DType.uint32]() << 16)
                | (data[i + j * step_size + 1].cast[DType.uint32]() << 8)
                | (data[i + j * step_size + 0].cast[DType.uint32]() << 0)
            )
            if j == 0:
                vals = vals ^ crc32
                crc32 = 0
            n = size - j * step_size
            crc32 = (
                table[(n - 4) * 256 + int((vals >> 24).cast[DType.uint8]())]
                ^ table[(n - 3) * 256 + int((vals >> 16).cast[DType.uint8]())]
                ^ table[(n - 2) * 256 + int((vals >> 8).cast[DType.uint8]())]
                ^ table[(n - 1) * 256 + int((vals >> 0).cast[DType.uint8]())]
                ^ crc32
            )

    for i in range(size * length, size * length + extra):
        var index = (crc32 ^ data[i].cast[DType.uint32]()) & 0xFF
        crc32 = table[int(index)] ^ (crc32 >> 8)

    return crc32 ^ 0xFFFFFFFF
Let’s do a quick sanity check to see if this works:
print(hex(CRC32(example_list, List[UInt32](1))))
print(hex(CRC32_table_n_byte_compact[4](example_list, little_endian_table_4_byte)))
0x382aa34e
0x382aa34e
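The same generic scheme can be sketched in Python and validated against zlib.crc32 for a few block sizes. This mirrors fill_table_n_byte and CRC32_table_n_byte_compact, with names of my own choosing:

```python
import zlib

def make_table(n: int) -> list[int]:
    # Bank k (table[k*256:(k+1)*256]) advances the CRC by 8*(k+1) bits
    table = [0] * (256 * n)
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table[i] = crc
    for i in range(256, 256 * n):
        crc = table[i - 256]
        table[i] = (crc >> 8) ^ table[crc & 0xFF]
    return table


def crc32_slice_n(data: bytes, n: int, table: list[int]) -> int:
    crc = 0xFFFFFFFF
    length = len(data) // n * n
    for i in range(0, length, n):
        acc = 0
        for j in range(0, n, 4):  # consume the block 4 bytes at a time
            vals = int.from_bytes(data[i + j:i + j + 4], "little")
            if j == 0:
                vals ^= crc  # fold the running CRC into the first word
            m = n - j        # bytes of the block not yet consumed
            acc ^= (table[(m - 4) * 256 + ((vals >> 24) & 0xFF)]
                    ^ table[(m - 3) * 256 + ((vals >> 16) & 0xFF)]
                    ^ table[(m - 2) * 256 + ((vals >> 8) & 0xFF)]
                    ^ table[(m - 1) * 256 + (vals & 0xFF)])
        crc = acc
    # One-byte-at-a-time tail for the leftover bytes
    for i in range(length, len(data)):
        crc = table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


msg = b"The quick brown fox jumps over the lazy dog"
print(all(crc32_slice_n(msg, n, make_table(n)) == zlib.crc32(msg)
          for n in (4, 8, 16)))  # → True
```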
How big can we go?
Let’s increase the table size as far as we can, going up in powers of two to see how far we can get.
alias a_little_endian_table_4_byte = fill_table_n_byte[4]()
_, _, speedup_table_4 = bench[CRC32, CRC32_table_n_byte_compact[4], rand_list, a_little_endian_table_4_byte]()
Function 1 runtime (ms): 9.5889589700000002
Function 2 runtime (ms): 0.69029748700000004
Speedup factor: 13.891052988868841
_, _, speedup_table_8 = bench[CRC32, CRC32_table_n_byte_compact[8], rand_list, fill_table_n_byte[8]()]()
Function 1 runtime (ms): 10.002244664999999
Function 2 runtime (ms): 0.48423754050000001
Speedup factor: 20.655657251753283
_, _, speedup_table_16 = bench[CRC32, CRC32_table_n_byte_compact[16], rand_list, fill_table_n_byte[16]()]()
Function 1 runtime (ms): 9.065976534999999
Function 2 runtime (ms): 0.3490761703607127
Speedup factor: 25.971341800936472
_, _, speedup_table_32 = bench[CRC32, CRC32_table_n_byte_compact[32], rand_list, fill_table_n_byte[32]()]()
Function 1 runtime (ms): 9.6776499699999992
Function 2 runtime (ms): 0.22259112022425251
Speedup factor: 43.477250845631744
_, _, speedup_table_64 = bench[CRC32, CRC32_table_n_byte_compact[64], rand_list, fill_table_n_byte[64]()]()
Function 1 runtime (ms): 9.1518207500000006
Function 2 runtime (ms): 0.29884062938339201
Speedup factor: 30.6244193397774
_, _, speedup_table_128 = bench[CRC32, CRC32_table_n_byte_compact[128], rand_list, fill_table_n_byte[128]()]()
Function 1 runtime (ms): 9.4991627550000004
Function 2 runtime (ms): 0.53072807599999994
Speedup factor: 17.898361109126629
And there it is. At least on my machine, 32 bytes is the limit, maxing out at a roughly 40 times speedup over the naive implementation. After that we start to see a performance decrease.
Let’s plot this to see the trend:
from python import Python

var np = Python.import_module("numpy")
var plt = Python.import_module("matplotlib.pyplot")
var logging = Python.import_module("logging")
logging.getLogger('matplotlib.font_manager').disabled = True

plt.xkcd()
plt.plot([1, 2, 4, 8, 16, 32, 64, 128],
         [speedup_table, speedup_table_2, speedup_table_4, speedup_table_8,
          speedup_table_16, speedup_table_32, speedup_table_64, speedup_table_128])
plt.title('CRC32 speedup over naive implementation')
plt.xlabel('Table size')
plt.ylabel('Speedup')
plt.xscale("log", base=2)
plt.show()
As you can see from the above, with a 32-byte table we hit our maximum speedup of around 40 times as fast as the naive implementation. After that, the performance falls off a cliff.
Why would this be? If you read the blog post I linked above you already know the answer: Cache.
Option | num_tables | elements | tbl size
---|---|---|---
1-byte tabular | 1 | 256 | 1 KB
2-byte tabular | 2 | 512 | 2 KB
4-byte tabular | 4 | 1024 | 4 KB
8-byte tabular | 8 | 2048 | 8 KB
16-byte tabular | 16 | 4096 | 16 KB
32-byte tabular | 32 | 8192 | 32 KB
64-byte tabular | 64 | 16384 | 64 KB
In Kareem Omar’s original post the recommendation is to not go above a 16-byte table, as this will take up approximately half of the standard 32 KB of L1 cache on most computers. However, since this post was written in 2019, L1 cache sizes have increased a bit, and on my machine with a 48 KB L1 cache the 32-byte table performs best, but it’s clear that once you go past that you run into cache issues.
This is actually a place where some compile-time metaprogramming might help: depending on the size of your L1 cache, you might want to compile your application with a 16-byte or a 32-byte table. Or, to future-proof your application for the (presumably not-so-distant) case where CPUs have at least 128 KB of L1 cache, you could even add the option of a 64-byte table.
At some point Mojo had an autotune functionality that would have been able to do this, but it was removed and we’ll have to wait for it to be added back.
Now if you read Kareem’s post, you might realize he went even further by calling some (hardware-accelerated) intrinsic functions to go even faster. The caveat is that in that case the polynomial used is baked into the silicon, and the variant this works for is called CRC-32C, where the C stands for Castagnoli. Importantly, this is not the variant used for PNG checksums, so I won’t go further down this path.
A more realistic Python benchmark
Taking our best result from above, we get:
_, result_32_byte, _ = bench[CRC32, CRC32_table_n_byte_compact[32], rand_list, fill_table_n_byte[32]()](quiet = True)
print("Speedup over Python: Mojo is", (naive_python_seconds*100/result_32_byte), "faster")
Speedup over Python: Mojo is 556.3741906871564 faster
So an astounding 500 times speedup over pure Python. Now this is a completely unfair comparison, as I am comparing the naive Python implementation to the optimized Mojo one.
Now as I hinted before, no-one in their right mind would write a CRC-32 check like I did above in Python.
What one would really do is use zlib from Python’s standard library. In this case we get:
%%python
import zlib

py_rand_list = np.random.randint(0, 255, size=2**20, dtype=np.uint8)
realistic_python_seconds = timeit(lambda: zlib.crc32(py_rand_list), number=100)/100

print(f"Realistic Python runtime (ms): {realistic_python_seconds*100}")
Realistic Python runtime (ms): 0.06717531400499865
print("Python is:", result_32_byte/(realistic_python_seconds*100), "faster than Mojo")
Python is: 3.291695363181048 faster than Mojo
So when using Python as you realistically would, Python is actually still more than three times faster than our best effort so far!
Of course, the counterpoint to this is that the zlib implementation is done in C, not Python, so we are effectively comparing Mojo to C at this point.
The last bit of performance
The above raises the question, however: why is the zlib version still faster than Mojo?
What kind of black magic is happening in zlib?
Well, this led me down a rabbit hole, and I ended up reading a pretty informative whitepaper by Andrew Kadatch and Bob Jenkins titled Everything We Know About CRC But Afraid To Forget. Now, I can’t find where this was officially published, and the only record of it seems to be in some guy’s GitHub repo. I’m kidding a little on that last point: it’s in the zlib repo, which is maintained by Mark Adler. Update: I have been informed that it was originally published as part of the release of crcutil on Google Code. Thanks to jorams on HN for pointing this out.
The zlib CRC-32 implementation is written in C that has been optimized to within an inch of its life, and has so many #ifdef statements that it’s hard to know which way is up.
In any case, there is a commit by Mark Adler from 2018 that’s titled Speed up software CRC-32 computation by a factor of 1.5 to 3.. Well, that’s about the amount of performance I am missing, so I guess that’s where I need to start looking. The commit message states:
Use the interleaved method of Kadatch and Jenkins in order to make use of pipelined instructions through multiple ALUs in a single core. This also speeds up and simplifies the combination of CRCs, and updates the functions to pre-calculate and use an operator for CRC combination.
It’s a pretty large commit and reads about as easily as hieroglyphics, so it might take me a moment to digest what’s going on there.
Impact on PNG reading
The whole thing that kicked off this work was reading PNG images in Mojo. So how much faster does this improved CRC-32 check make reading PNGs? Not much, it seems. Reading a 3008 x 1960 pixel PNG image with Mimage is now about 3.5% faster.
I suspect the majority of the time reading PNGs is either spent reading from disk or decompressing the actual data. But hey, faster is faster.
I’m wondering if/when Mojo will get its equivalent of zlib, and what shape that might take.