Speed comparisons for Arduino Uno/Nano, Due, Teensy 3.5 and ESP32

It’s been more than a year since I published my post on numerical integration on an Arduino. Since then, the post has been quite popular, receiving a steady stream of visitors (mostly via Google). When I originally wrote it, I only had an Arduino Uno at hand – since then I’ve added a couple of Nanos and lately an Arduino Due to my inventory and decided it would be interesting to do a couple of speed tests to see how they perform. The latest addition to my growing circus of microcontroller boards is a Teensy 3.5 board. (Update May 2019: Added an ESP32 dev board.)

As I pointed out in the original post, numerical integration relies heavily on floating-point math – which is something the Arduino’s 8-bit processor is not particularly good at. The Due features a 32-bit processor, a clock frequency of 84 instead of 16 MHz and the possibility to use double (64 bit) instead of float (32 bit) as a data type – so I was curious to see how it would compare to the Arduino Uno. The Nano is supposed to have more or less the same characteristics as an Uno, but is a lot smaller and cheaper – see below for details.

Now added to the comparison, the Teensy 3.5 includes a 32-bit processor with a 120 MHz clock speed and an FPU for speedier floating-point math.

Direct comparison of floating point math

I’ve built a small sketch called ‘speedtest’ that simply performs many +, -, * and / operations on a float variable and measures the time required. It took some iterations to get it right – most importantly, the result needs to be actually used somewhere in the code (I added it to the output via serial), otherwise the compiler does not include the loops in the compiled binary. The looping itself is also included in the time recorded, but since this is about rough estimations, it shouldn’t be an issue.
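To illustrate the point about the compiler dropping unused loops, here is a minimal sketch of such a timing loop. It is not the original ‘speedtest’ code: the function name `microsPerAdd` and its parameters are made up for this example, and it is written as portable C++ with `std::chrono` so it can be tried on a desktop. On an Arduino you would take the timestamps with `micros()` instead and print the result via serial. Note that the `volatile` accumulator used here adds its own load/store overhead, so the numbers are only rough estimates.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not the original 'speedtest' sketch): times n
// floating-point additions and returns the average microseconds per
// loop iteration. The loop overhead is included in the measurement.
template <typename T>
double microsPerAdd(std::uint32_t n, T increment, T* resultOut) {
    assert(n > 0);
    // volatile forces a real load and store in every iteration, so the
    // compiler cannot collapse the loop into a single expression.
    volatile T acc = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < n; ++i) {
        acc = acc + increment;
    }
    const auto stop = std::chrono::steady_clock::now();
    *resultOut = acc;  // hand the result back so it is actually used
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / n;
}
```

Calling `microsPerAdd<float>(1000000, 1.0f, &result)` and `microsPerAdd<double>(1000000, 1.0, &result)` gives one float and one double timing; the other operators would need their own loop bodies.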

For the Nano, results turned out to be exactly the same as for the Uno – which is no surprise given that both appear to use exactly the same chip. Results for the Arduino Uno, Due, Teensy 3.5 (ty3.5) and ESP32 are given below:

Time for floating-point operations:
in seconds per million / ms per thousand / µs per operation

    Uno    Due           Teensy 3.5    ESP32       
    float  float double  float double  float double
+    9.09   1.26   1.87   0.04   1.13   0.02   0.23
-    9.19   1.26   1.88   0.04   1.13   0.03   0.27
*    9.69   0.80   1.70   0.04   0.83   0.04   0.44
/   30.82   2.74  10.62   0.12   5.55   0.22   2.53

There are a few surprises here, things I didn’t expect. My main takeaways are:

  • Floating-point math takes a lot of CPU cycles: even on the 32-bit Due, roughly 100 to 200 cycles are required for one + / – operation. This has been discussed elsewhere and the regular recommendation is to avoid floating-point math wherever possible.
  • Divisions are far more costly, so I’d expect some speed gain from replacing them with multiplications wherever possible (= optimize your equations).
  • On a Due, 32-bit variables (float) show only a moderate speed improvement over 64-bit variables (double) for +/- operations (~1.5 times faster), but significant improvements for * (~2 times faster) and / (~4 times faster!) operations.
  • On the Teensy, the FPU appears to only cover 32-bit (float) variables, giving it a whopping ~250x speed advantage over an Arduino Uno / Nano. If you’re doing something with a microcontroller that needs fast floating-point math, the Teensy is far superior to the normal Arduino models.
  • The ESP32 board I tested for a few minutes appears to also include a 32-bit FPU, giving it results comparable to the Teensy 3.5. Please note that I don’t own one and didn’t check any code beyond this test – but it looks as if ESP32 boards are currently your cheapest option for fast floating-point math on microcontrollers (please write a comment if you replicated these results!).
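The suggestion of trading divisions for multiplications can be sketched as follows. This is only an illustration with a made-up function name (`scaleByReciprocal`): when many values are divided by the same number, the division is done once up front and the loop only multiplies. Keep in mind that multiplying by a reciprocal can differ from a true division in the last bit, so it is a trade of a little accuracy for speed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: scales every sample by 1/divisor.
// One division is done up front; the loop itself only multiplies.
void scaleByReciprocal(float* samples, std::size_t count, float divisor) {
    assert(divisor != 0.0f);
    const float reciprocal = 1.0f / divisor;  // the only division
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] *= reciprocal;  // instead of: samples[i] /= divisor
    }
}
```

Going by the table above, each float division replaced this way would save roughly 20 of the ~31 µs on an Uno and roughly 2 of the ~2.7 µs on a Due.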

Sidenote: If you really want fast floating-point math on a small system, you might also want to consider a Raspberry Pi, offering something between 40 and 300 MFlops depending on benchmark and model (source 1, source 2) – the data above indicates something between 0.1 and 1 MFlops on a Due and up to 25 MFlops on the Teensy 3.5 (note that the Teensy 3.6 offers about 1.5x higher calculation speed).
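For reference, the MFlops and cycle figures follow directly from the table when it is read as seconds per million operations (which is the same as microseconds per operation). A small sketch of the arithmetic, with helper names invented for this example:

```cpp
#include <cassert>

// Throughput in millions of operations per second, given the time
// per operation in microseconds (= seconds per million operations).
double mflopsFromMicros(double microsPerOp) {
    assert(microsPerOp > 0.0);
    return 1.0 / microsPerOp;
}

// Approximate CPU cycles spent per operation, given the time per
// operation in microseconds and the clock frequency in MHz.
double cyclesPerOp(double microsPerOp, double clockMHz) {
    assert(microsPerOp > 0.0 && clockMHz > 0.0);
    return microsPerOp * clockMHz;
}
```

With the measured values, `mflopsFromMicros(0.04)` gives 25 MFlops for float math on the Teensy 3.5, `cyclesPerOp(1.26, 84.0)` gives about 106 cycles for a float addition on the Due, and `cyclesPerOp(9.09, 16.0)` gives about 145 cycles on the Uno. Loop overhead is included in all of these.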

But there is something the Due appears to be amazingly good at: as an experiment, I also added reading analog input to my benchmark. Here the Arduino Due took ~4.2 s for 1 million analog reads (at either 10 or 12 bit resolution) while the Uno/Nano takes ~112 s (and is limited to 10 bits). That works out to ~4.2 µs vs. ~112 µs per read, or roughly 240 kHz vs. 9 kHz. There are also suggestions on improving the Due analog read rate to ~1 MHz (see also this site with in-depth information). From my perspective, an analog read rate of >200 kHz (at 12 bits resolution and with the option to increase it up to 1 MHz) is really nice – especially when it’s as easy to use as on the Arduino.
Then again, if you’re restricted to an Uno / Nano but need higher analog read rates, there also seem to be some options available on how to reduce the analog read time (more in the range of up to 50 kHz)…
Additional sidenote: The Teensy 3.5 ADC appears to be slower than the one on the Due, giving me ~8.4 µs per 12-bit analog read and ~20.2 µs per 13-bit analog read (8.4 ms and 20.2 ms per thousand reads). Update: have a look at pedvide’s ADC library for things like simultaneous reads on both Teensy ADCs and faster read times.

Comparison of possible calculation speeds for Runge-Kutta numerical integration on arduinos

Most considerations for floating-point math are discussed above, but I also did the same speed comparisons for my original Runge-Kutta solver, so here you go:

ODE speeds for the example ODE 
(simple linear system with 4 state variables)

Uno / Nano (single precision):  760 ms / 1000 steps    ~1300 steps / second
Due (double precision):         140 ms / 1000 steps    ~7000 steps / second
Due (single precision):         115 ms / 1000 steps    ~8700 steps / second
ty3.5 (double precision):       106 ms / 1000 steps    ~9400 steps / second
ty3.5 (single precision):         8 ms / 1000 steps  ~125000 steps / second

Considering these numbers, I’d recommend using double over float on the Due, since the extra precision costs only about 20% in speed, but this may depend on your requirements. In particular, I’d love to see some hardware-in-the-loop applications on the Due, given its analog read and write capabilities and decent possibilities for in-between simulation of physical systems…
Update: with the new Teensy 3.5/3.6 models now available, I’d change my recommendations above. If you really want to do something with ODEs on a microcontroller, the new Teensy models with single precision give you an incredible advantage over the other possibilities.
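To make the float-vs-double comparison concrete, here is a minimal sketch of a classical fourth-order Runge-Kutta step for a linear system with 4 state variables, templated on the number type so the same code can run in single or double precision. This is not the solver from the original post: the names (`State`, `Matrix`, `rk4Step`, `integrate`) and the system matrix are assumptions for illustration. It uses `std::array`, so on boards whose toolchain lacks the C++ standard library you would swap in plain arrays.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <typename T>
using State = std::array<T, 4>;

template <typename T>
using Matrix = std::array<std::array<T, 4>, 4>;

// Right-hand side of the (hypothetical) linear system dx/dt = A * x.
template <typename T>
State<T> derivative(const Matrix<T>& a, const State<T>& x) {
    State<T> dx{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            dx[i] += a[i][j] * x[j];
        }
    }
    return dx;
}

// Element-wise x + scale * k.
template <typename T>
State<T> addScaled(const State<T>& x, T scale, const State<T>& k) {
    State<T> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = x[i] + scale * k[i];
    }
    return out;
}

// One classical RK4 step of size h.
template <typename T>
State<T> rk4Step(const Matrix<T>& a, const State<T>& x, T h) {
    const State<T> k1 = derivative(a, x);
    const State<T> k2 = derivative(a, addScaled(x, h / 2, k1));
    const State<T> k3 = derivative(a, addScaled(x, h / 2, k2));
    const State<T> k4 = derivative(a, addScaled(x, h, k3));
    State<T> next{};
    for (std::size_t i = 0; i < 4; ++i) {
        next[i] = x[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]);
    }
    return next;
}

// Repeats rk4Step for the given number of steps.
template <typename T>
State<T> integrate(const Matrix<T>& a, State<T> x, T h, std::size_t steps) {
    assert(steps > 0);
    for (std::size_t s = 0; s < steps; ++s) {
        x = rk4Step(a, x, h);
    }
    return x;
}
```

Instantiating `integrate` with `float` or `double` state and matrix types switches the whole calculation between single and double precision, which is the comparison made in the table above. Each step evaluates the right-hand side four times, so a 4x4 system costs 64 multiplications for the derivatives alone.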

As a side-note: I’ve also updated the code for the ODE solver, fixing some issues with how numbers in the code were written. Please get the new code here or from the original blogpost.

Update on Oct. 23rd, 2016: Updated the blogpost to include the Teensy 3.5 board.
Update on May 31st, 2019: Included preliminary results for an ESP32 dev board.
