Performance Take 2
The first blog on performance was quite vague in terms of numbers. And there was a rather good reason for it: we didn’t really have that many of them to compare against. Now we have fixed this, so we can come up with actual numbers this time. And not just numbers for CMRX, which were already stated in the previous entry. This time we have actual numbers for FreeRTOS running on the same hardware.
Why was the score so low?
First things first. The whole effort around performance testing began in order to find out how much performance CMRX is leaving on the table. The answer was short: a lot. Now that we know it, the next question is: how much of this unrealized performance can be reclaimed, and how much is the price paid for automatic memory protection management?
The CMRX kernel is not optimized for performance in basically any way. This is no secret to anyone who takes a look at the source code. It’s not that we don’t know how to do it or don’t care about performance. Initially, the task was to prove that automatic memory protection can be realized in a meaningful way on a $1-range microcontroller. Now that this has been successfully proven, the next task is to prove it can be done fast.
The score for the thread benchmark showed that for the one particular test where we had numbers for FreeRTOS, the CMRX kernel turned out to be 5.7x slower.
We decided to pick some of that unrealized performance up and improve the situation.
The reason why the CMRX kernel performs so poorly compared to FreeRTOS (and here it is probably worth mentioning that the comparison is done against FreeRTOS, not FreeRTOS-MPU) is that the CMRX kernel is really configured to fail the very moment something goes wrong. You don’t want to go hunting gremlins in your memory-protected kernel during development. So there are sanity checks, asserts and more checks all over the place, active even in release builds. This adds additional slow-down to code which already avoids optimizations to keep debugging easy.
This was an appropriate approach while we were bringing the memory protection up and making it work reliably. Now that memory protection has been stable for quite some time, it is probably the right time to improve performance a bit.
So, we did it.
Improving kernel performance
The two easiest things we could do were to disable this additional diagnostics: asserts and sanity checks in release builds. Each of them on its own brought some 10% performance increase over the original numbers.
Another performance improvement was trivial to implement and could bring in some more performance: the system call interface. Here, a kernel service lookup has to be done for the service ID requested by the application. While the CMRX kernel currently provides only 22 system call services, in workloads that call the kernel a lot the overhead may pile up if this lookup is not efficient enough. As you might have guessed by now, the original method used by CMRX was by no means efficient: it was a linear search. This meant that different system calls had vastly different lookup overhead. We improved the situation by sorting the system call table and performing a binary search. This simple change improved performance by another 10% on top of the gain from disabling diagnostics.
Another performance improvement was to cache the MPU region settings for the stack. CMRX does process region calculation at boot time. These MPU regions are calculated once and then the values are reused, so swapping processes boils down to several register writes. But the stack was a completely different story. With the stack, the configuration for the MPU region that protects it was calculated on each thread switch. This calculation is not entirely trivial, and the value remains the same for the whole lifetime of the thread, so there is no good reason not to cache it. And so we implemented this cache. This improvement rewarded us with another 8% of performance on top of the gains from disabling diagnostics and speeding up the system call interface.
In total, the cumulative improvement of all these changes is a whopping 44%! Just for this single test. We re-ran the whole benchmark and got the following scores:
| Benchmark | Score Before | Score After | Change |
|---|---|---|---|
| Basic No-Op | 263019565 | 263839672 | +0.3% |
| Message Processing | 817142 | 1050000 | +28% |
| Synchronization | 1861834 | 3010197 | +61% |
| Cooperative Scheduling | 1515575 | 2201941 | +45% |
| Preemptive Scheduling | 675715 | 976250 | +44% |
As you can see, a rather small set of changes improved performance by 30-60%!
Actual numbers
Still, we only had that one number to compare against. So we decided to recreate the FreeRTOS benchmark so we could measure all the numbers. It didn’t take long and the benchmark suite was ported over to FreeRTOS. While doing this, we decided to change the implementation of one test on the CMRX side. The synchronization test used for all previous CMRX runs was based on the notify_object/wait_for_object framework. As the FreeRTOS test uses mutexes internally, we decided that using mutexes in CMRX as well would make for a fairer comparison. This explains why the score for this particular test won’t match the scores stated previously.
And so here are the numbers:
| Benchmark | FreeRTOS 10.3.1 | CMRX 0.1.0 | CMRX : FreeRTOS |
|---|---|---|---|
| Basic No-Op | 265819464 | 263839683 | 99.3% |
| Message Processing | 4689994 | 1056666 | 22.5% |
| Synchronization | 7183628 | 2764998 | 38.5% |
| Cooperative Scheduling | 17044259 | 2187546 | 12.8% |
| Preemptive Scheduling | 4357890 | 977330 | 22.4% |
Note that the synchronization test score differs from the previous score. This is caused by a change in test methodology to match the test on FreeRTOS. The reason the score is lower is that mutexes use the notification framework internally, so CMRX is doing even more work now. Scores for the scheduling tests are cumulative to keep things readable.

Now we have some hard numbers to compare against. The improvements made previously were low-hanging fruit in terms of ease of implementation and brought in some nice performance gains. This does not mean that we have already fixed every place where performance is being wasted, or that the remaining places will bring only small improvements. We played with the CMRX configuration and found that a small configuration change, like changing the size of the thread table, makes a huge difference in benchmark score. This hints that there is still a lot of unrealized potential in the code.
Methodology
Scores presented in this post were obtained by running both CMRX 0.1.0 and FreeRTOS 10.3.1 (integration done by CubeMX) on an STM32L432KB6 microcontroller. The microcontroller was configured to run at 80MHz. No peripherals were configured. Both CMRX and FreeRTOS used TIM1 as the system timer in the CubeMX configuration. Both CMRX and FreeRTOS were kept in their default configuration, except that the FreeRTOS heap was resized to 16kB. As the memory management benchmark is not part of this comparison, this should have no impact on the scores.
The compiler used was arm-none-eabi-gcc (Arm GNU Toolchain 14.2.Rel1 (Build arm-14.52)) 14.2.1 20241119, and both systems were built using -Os -g3 flags (RelWithDebInfo). Other than that, all other flags were kept as generated by CubeMX.
Each individual test ran for 30 seconds. We performed multiple runs of benchmark and obtained stable numbers with no variance. Scheduling tests were configured with 5 threads each.
Conclusion
Now that we have some hard numbers and an easy and reliable way to reproduce them, we can work on improving the CMRX kernel performance. The numbers we see already suggest areas needing improvement: the most critical part is the scheduling algorithm, which is slow. Fortunately, the kernel is now covered by a unit test suite, so we can work on performance improvements confidently, without risking that things will get broken along the way. How big a performance improvement can we reach?