Wednesday , 22 February 2017
Home » Electronics » How to precisely measure microseconds with STM32

How to precisely measure microseconds with STM32

I received this apparently simply question from a reader of this blog: how can I delay a fistful of microseconds in STM32? That is, how to measure microseconds precisely in STM32?

The answer is: there are several ways to do this, but some methods are more accurate and other ones are more versatile among different MCUs and clock configuration.

Let's consider one member of the STM32F4 family: STM32F401RE, the MCU that equips the STM32Nucleo-F401RE board. This micro is able to run up to 84Mhz using internal RC clock. This means that ever 1µs, the clock cycles 84 times. So, we need a way to count 84 clock cycles to assert that 1µs is elapsed (I'm assuming that you can tolerate the internal RC clock 1% accuracy).

Sometime is common to find code like this:

But how to establish how many clock cycles are required to compute one step of the while(cycleCount--) instruction? Unfortunately, it's not simple to give an answer. Let's assume that cycleCount is equal to 1. Doing some tests (I'll explain later how I've done them), with compiler optimizations disabled (option -O0 to GCC), we can see that in this case the whole C instruction requires 24 cycles to execute. How is it possible that? You have to figure out that our C statement is unrolled in several assembly instructions, as we can see if we disassemble the firmware binary file:

Moreover, another source of latency is related to the fetch from internal MCU flash. So that instruction has a "basic cost" of 24 cycles. How many cycles are required if cycleCount is equal to 2? In this case the MCU requires 33 cycles, that is 9 additional cycles. This means that if we want to spin for 84 cycles cycleCount has to be equal to (84-24)/9, which is about 7.  So, we can write our delay function in a more general way:

Testing this function with this code:

we can check using a decent oscilloscope, connected to a GPIO configured as GPIO_SPEED_HIGHthat this is what we expect:


Hey dude, why are you using that code to drive the GPIO?

I can't use HAL here, because we want to reduce CPU cycles to the minimum, so we need to modify GPIO peripheral register directly! Don't forget that even a simple procedure call, without parameter passing, costs CPU cycles due to branching and invalidation of cache pipeline.

Is this way to delay 1µs always consistent? The answer is NO. First of all, it works well only when this specific MCU (STM32F401RE) works at full speed (84Mhz). If you decide to use a different clock speed, you need to rearrange it doing tests. Second, it is subject to compiler optimizations.

Let's now enable GCC optimization for "size" (-Os). What results do we obtain? In this case we have that the delayUS() function costs only 72 CPU cycles, that is ~850ns. The scope confirms this:



And what happens if we enable the maximum optimization for speed (-O3)? In this case we have only 64 CPU cycles, that is our delayUS() lasts only ~750ns, as confirmed by our scope:


However, this issue can be addressed using specific GCC pragma directive:

Having said this, the fact remains that if we want use a lower CPU frequency or we want to port our code to a different MCU we need to redo tests again.

However, take in account that the lower the CPU frequency is the more difficult is to delay for 1µs precisely, because the number of cycles are fixed for a given instruction, but there is less amount of cycles in the same unit of time.

So, how can we obtain a precise 1µs delay without doing tests if we change hardware setup? The answer is: we need a hardware timer. And we have several options.

The first one comes from the previous tests. How have I measured CPU cycles? Cortex-M processors can have an optional debug unit that provides watchpoints, data tracing, and system profiling for the processor. One register of this unit is CYCCNT, which counts the number of cycles performed by CPU. So, we can use this special unit available in STM32 to count the number of cycles performed by the MCU during instruction execution.

Using DWT we can build a more generic delayUS() routine in this way:

How much precise this function is? If you are interested to the best resolution at 1µs, this function isn't the best, as shown by the scope.


The best performance is achieved when the higher compiler optimization level is set. As you can see, for a wanted delay of 1µs the function gives about 1.22µs delay (22% slower). However, if we want to spin for 10µs, we obtain a real delay of 10.5µs (5% slower), which is more close to what we want.


Starting from 100µs delay the error is completely negligible.

Why this function is not so precise? To understand why this function is less precise from the other one, you have to figure out that we are using a series of instructions to check how many cycles are expired since the function is started (the while condition). These instructions cost CPU cycles both to update the internal CPU registers with the content of CYCCNT and to do comparison and branching. However, the advantage of this function is that it automatically detects CPU speed, and it works out of the box especially if we are working on faster processors.

Other solutions could be achieved using hardware timers, like TIMx and SYSTICK timers. However, I've obtained performances similar to delayUS_DWT() function.

If you want full control among compiler optimizations, the best 1µs delay can be reached using this macro fully written in assembler:

This is the most optimized way to write the while(counter--) function. Doing tests with the scope, I found that 1µs delay can be obtained when the MCU execute this loop 16 times at 84MHZ. However, this macro has to be rearranged if you processor speed is lower, and keep in mind that being a macro, it is "expanded" every time you use it, causing the increase of firmware size.

These are the best solution to obtain 1µs delay with the STM32 platform. If you think that other solutions can be better than these ones, feel free to comment below this post.

Check Also

Correct way to perform re-annotation of designators in Altium

It's really common that at the end of the board layout we have that all …


  1. Doesn't the actual time also depend on whether the instructions are cached or not when they are called?


    • Yes, it depends also on pipeline cache. I forget to mention that if you disable cache in stm32fxxx_hal_conf.h file:

      it affects the duration of delay routines.

  2. Great post,

    however, if you substract fixed amount of cycles when calculating number of cycles you need with DWT, you will get most accurate delay.

    You have to substract cycles for function call and for calculating SystemCoreClock to MHz. Better option could be using static inline function where function will be placed directly inline without calling it directly.

  3. Thanks Carmine, this is a very useful and informative post!

  4. I was able to get 440ns pulses using SYSTICK based delay function with 20ns variance using 168Mz STM32F4-Disc board. I am using HAL_GPIO__WritePin with GCC.

    I am in a process of testing STM32F7 to improve accuracy as well as smaller size pulses.

    Using your ASM increases variance for logic 0.

    I was able to create any size pulses with 20s variance with minimum size of 440ns. My goal is to get it down to 10 ns accuracy with .5ns variance.

    Any idea on how I might be able to achieve my goal? Thanks in advance.

    • 100Mhz switching signal is really hard to achieve with a STM32, probably even with the STM32F7.
      You need to manipulate directly the GPIO ODR registry, without using the HAL that adds a lot of overhead. Moreover, you need to enable also the compensation cell and to define the GPIO speed (that is, the GPIO duty cycle) to the fastest value.

    • You can checkout the high resolution timer on the F334 where resolution can go down to 217ps.

  5. if you want to generate exactly timed pulses why not using a timer?
    also the timers interrupt capabilty can be used to derive an exact delay time if necessary.
    i havent done this with a STM32 till now, but often with Hitachi/Renasas SH8 MCU's in the past.

    • Handling delays of some microseconds using interrupts would flood the MCU, and it's not that good for precise delays. Even if Cortex-M has deterministic interrupt latency, this can cost up to 16 clock cycles in some Cortex-M (formerly M0+ processors). For a STM32 MCUs running at "low speeds" this is a non-negligible overhead (moreover you have to add the cost of clearing the UIF flag, which costs other 3-5 cycles). This means a delay shift of about 500ns, which is 50% more.

  6. Hi Carmine,
    your articles are always very intersting but I have a question on board timers. I am using TIM3 to generate an interrupt every 1 ms but my interrupt service routine lasts, more or less, 800 us; during this time the timer continues to count or it is freezed? I hope I was cleared.

    • Carmine Noviello

      During that time the timer keeps counting, and when it expires the UIF flag is set: this will cause that the IRQ is set at the NVIC, if enabled. So, if your code lasts more than 1us, you are likely to lose the IRQ sometime. Otherwise it should work.

  7. Hello. Thanks for Article.

    I could not run the code with CYCCNT 🙁
    but with this code:
    #pragma GCC push_options
    #pragma GCC optimize ("O0")
    void delayUS(uint32_t us) {
    volatile uint32_t counter = 3*us;
    #pragma GCC pop_options

    On STM32F103 at 72MHZ, reach 1.02 Us and 980.0 kHz

    I hope you can explain more in depth CYCCNT.


    • Cortex- M3/4/7 processors can have an optional debug unit, named Data Watchpoint and Tracing (DWT), that provides watchpoints, data tracing, and system profiling for the processor. One register of this unit is CYCCNT, which counts the number of cycles performed by CPU. So, we can use this special unit available to count the number of cycles performed by the MCU during instruction execution. Which issue do you have when using CYCCNT?

Leave a Reply

Your email address will not be published. Required fields are marked *