Mojo 0.5.0 and SIMD

The 0.5.0 update of Mojo and a bit of a dive into Single Instruction Multiple Data in Mojo.
mojo

Author: Ferdinand Schenck
Published: November 17, 2023

Another month, another Mojo release!
It feels like every time I run into a missing feature in Mojo, it gets added in the next version:
In my first post I complained about a lack of file handling, which was then added soon after in version 0.4.0.
For version 0.4.0 I ran into the issue that you couldn't print Tensors; that ability has now been added in the 0.5.0 release. So this means Mojo has now unlocked everyone's favourite method of debugging: printing to stdout.
In addition to that, Tensors can now also be written to and read from files with fromfile() and tofile().

There have also been a couple of updates to the SIMD type, which led me to ask: how does the SIMD type work in Mojo?

For a bit of background: you might have noticed that CPU clock speeds haven't really increased by much in the last decade or so, but computers have definitely gotten faster. One of the factors that has increased processing speed is a focus on vectorization through SIMD, which stands for Single Instruction, Multiple Data, i.e. applying the same operation to multiple pieces of data.
Modern CPUs come with SIMD registers that allow the CPU to apply the same operation over all the data in that register at once, resulting in large speedups in cases where you apply one operation to many pieces of data, e.g. in image processing, where you might apply the same operation to each of the millions of pixels in an image.
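
To make this concrete, here is a minimal sketch using the SIMD type that the rest of this post is about: the first function performs four scalar additions one element at a time, while the second expresses the same computation as a single vector-wide addition that the compiler can lower to one SIMD instruction.

fn scalar_add(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> SIMD[DType.float32, 4]:
    # Four separate scalar additions, one lane at a time.
    var result = SIMD[DType.float32, 4]()
    for i in range(4):
        result[i] = a[i] + b[i]
    return result

fn vector_add(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> SIMD[DType.float32, 4]:
    # The same computation, expressed as one vector-wide addition.
    return a + b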

The SIMD type in Mojo

One of the main goals of the Mojo language is to leverage the ability of modern hardware, both CPUs and GPUs, to execute SIMD operations.

There is no native SIMD support in Python, although NumPy does make this kind of vectorized computation possible.

Note: SIMD is not the same as concurrency, where you have several different threads running different instructions. SIMD is doing the same operation on different data.

Generally, SIMD objects are initialized as SIMD[datatype, size](values), so to create a SIMD object consisting of four 8-bit unsigned integers we would do:

simd_4x_uint8 = SIMD[DType.uint8, 4](1, 2, 4, 8)
print(simd_4x_uint8)
[1, 2, 4, 8]
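
Passing a single value broadcasts it to all lanes, a shorthand I'll use a few times below:

simd_4x_ones = SIMD[DType.uint8, 4](1)
print(simd_4x_ones)
[1, 1, 1, 1]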

And SIMD is so central to Mojo that the builtin Float32 type is actually just an alias for SIMD[DType.float32, 1]:

let x: Float32 = 10.
print(x.type)
float32

let y = SIMD[DType.float32, 1](10.)
print(y.type)
float32

Modern CPUs have SIMD registers, so let's use the sys.info package in Mojo to see what the register width on my computer is:

from sys.info import simdbitwidth, simdwidthof
print(simdbitwidth())
256

This means we can pack 256 bits of data into this register and efficiently vectorize an operation over it. Some CPUs support AVX-512, which, as the name suggests, has 512-bit SIMD registers. Most modern CPUs will apply the same operation to all values in their register in one step, allowing for significant speedups for functions that can exploit SIMD vectorization.

In my case, we'll have to live with 256 bits. This means that in this register we can put either 4 64-bit values, 8 32-bit values, 16 16-bit values, or even 32 8-bit values.

We can use the utility function simdwidthof to tell us how many 32-bit floating point numbers will fit in our register:

print(simdwidthof[DType.float32]())
8
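
The same check works for the other packings listed above (these numbers assume my 256-bit registers):

print(simdwidthof[DType.float64]())
4
print(simdwidthof[DType.int16]())
16
print(simdwidthof[DType.int8]())
32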

One of the new features in Mojo 0.5.0 is that SIMD types will default to the width of the architecture, meaning if we call:

simd_float_32 = SIMD[DType.float32]()
print(len(simd_float_32))
8
simd_int_8 = SIMD[DType.int8]()
print(len(simd_int_8))
32

Mojo will automatically pack 8 32-bit values, or 32 8-bit values, into the register.

This is equivalent to calling:

SIMD[DType.float32, simdwidthof[DType.float32]()]
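
As a quick sanity check that the two spellings agree:

simd_float_32_explicit = SIMD[DType.float32, simdwidthof[DType.float32]()]()
print(len(simd_float_32_explicit))
8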

SIMD operations

Operations over SIMD types are quite intuitive. Let’s try adding two SIMD objects together:

simd_a = SIMD[DType.float32, 8](1.0)
simd_b = SIMD[DType.float32, 8](2.0)
print(simd_a + simd_b)
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
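
Other arithmetic operators work element-wise in the same way, and, if I am reading the standard library correctly, the SIMD type also offers horizontal reductions such as reduce_add(), which sums across the lanes:

print(simd_a * simd_b)
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
print((simd_a + simd_b).reduce_add())
24.0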

Additionally, since version 0.5.0, we can also concatenate SIMD objects with .join():

print(simd_a.join(simd_b))
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
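
The result is itself just a SIMD object, twice the width of its inputs:

print(len(simd_a.join(simd_b)))
16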

Operations applied to a SIMD object are applied element-wise to the data in it, provided the function is set up to handle this:

from math import sqrt
simd_c = SIMD[DType.int32, 4](4, 16, 49, 144)
simd_c_sqrt = sqrt(simd_c)
print(simd_c_sqrt)
[2, 4, 7, 12]
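
Other functions in the math module, e.g. sin, are width-generic in the same way (output truncated here):

from math import sin
print(sin(SIMD[DType.float32, 4](0.0, 1.0, 2.0, 3.0)))
[0.0, 0.8414709…, 0.9092974…, 0.1411200…]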

As far as I can tell, this doesn’t just work automatically. If I define a function as:

fn single_add_two(a: Float32) -> Float32:
    return a + 2.0

Then applying it to a single floating point number works as expected:

print(single_add_two(2.0))
4.0

But trying this on a SIMD object does not:

print(single_add_two(simd_a))
error: Expression [85]:1:21: invalid call to 'single_add_two': argument #0 cannot be converted from 'SIMD[f32, 8]' to 'SIMD[f32, 1]'
print(single_add_two(simd_a))
      ~~~~~~~~~~~~~~^~~~~~~~

Expression [83]:1:1: function declared here
fn single_add_two(a: Float32) -> Float32:
^

expression failed to parse (no further compiler diagnostics)

However, if I define a version of the function to take a SIMD object:

fn simd_add_two[simd_width: Int](a: SIMD[DType.float32, simd_width]) -> SIMD[DType.float32, simd_width]:
    return a + 2.0

Then (with the additional specification of the simd_width parameter, which Mojo infers from the argument at the call site), it will apply the function to all the values:

simd_ones = SIMD[DType.float32, 8](1.0)
print(simd_add_two(simd_ones))
[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]

While it still works on single floating point values, as they are just SIMD objects of width one under the hood:

let one: Float32 = 1.0
print(simd_add_two(one))
3.0

I do miss the flexibility of Julia a bit, where you can define one function and then vectorize it with a dot, i.e. if you have a function myfun(scalar) that operates on scalar values, then calling myfun.(vector) will apply it element-wise to all values of that vector and return a vector of the same shape.
But for the most part, defining functions to operate on SIMD values in Mojo doesn't lose you much generality anyway.
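
In fact, the width-parametric style pays off once you want to sweep a function over a buffer longer than one register. Below is a rough sketch of how I understand this fits together with the vectorize helper from the algorithm package; the buffer-walking function add_two_to_buffer is my own invention, and the exact signatures reflect my best reading of the 0.5-era APIs.

from sys.info import simdwidthof
from algorithm import vectorize
from memory.unsafe import DTypePointer

fn add_two_to_buffer(data: DTypePointer[DType.float32], size: Int):
    alias width = simdwidthof[DType.float32]()

    @parameter
    fn add_chunk[simd_width: Int](offset: Int):
        # Load a chunk, run it through simd_add_two, and store it back.
        data.simd_store[simd_width](
            offset, simd_add_two(data.simd_load[simd_width](offset))
        )

    # vectorize steps through the buffer `width` elements at a time,
    # handling any leftover tail elements with narrower calls.
    vectorize[width, add_chunk](size)

You would allocate the buffer with DTypePointer[DType.float32].alloc, fill it, and free() it when done; I'll save actually benchmarking something like this for a future post.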

Conclusions

To be honest, I was a little bit daunted when I first saw the SIMD datatype in Mojo. I vaguely remember playing around with SIMD in C++, where it can be quite complicated to implement SIMD operations.
But in Mojo, it really is transparent and relatively straightforward to get going with SIMD.

It is clear that exploiting vectorization is a top priority for the Modular team, and a lot of thought has gone into making it easy to exploit the SIMD capabilities of modern hardware.

I might take a look at vectorization vs parallelization in Mojo in the future, and maybe even try my hand at a bit of benchmarking.