Performance Benchmark
The question of performance is very relevant for the CMRX kernel. You have to consider that CMRX pays the overhead of a microkernel environment, and that memory isolation is active all the time. So naturally, we've been interested in the price of all this added overhead. Or, in other words: what is the price to be paid for essential cybersecurity?
The starting point for this effort was the Beningo RTOS Performance Report, which benchmarks the performance of several commonly used RTOSes. The report is built around the Thread Metric Benchmark, which is part of the Eclipse Foundation ThreadX RTOS. The following RTOSes are tested in the report:
- FreeRTOS
- ThreadX
- PX5 RTOS
- Zephyr RTOS
The original benchmark suite consists of 8 distinct benchmarks. These benchmarks are designed to test the performance of kernel facilities that are regularly used by application software, so their performance directly affects the execution speed of the application. Our interest in this test suite was based primarily on the fact that it benchmarks features which inherently live inside the kernel and trigger thread switching. The whole memory isolation machinery can thus be stressed by these benchmarks. Another factor is that the results of these benchmarks can (under ideal conditions) be compared across various systems (that's what the original report is about).
These eight benchmarks are:
- Basic No-op test. This test does nothing, just increments a counter. It serves as calibration: all systems executing this test should produce roughly the same result.
- Message processing test. This test sends messages via a queue, measuring how fast the queue subsystem of the RTOS is.
- Synchronization test. In this test, the synchronization subsystem is stressed via semaphore blocking and unblocking.
- Memory allocator test. Here the memory allocator is stressed via repeated memory allocation and deallocation.
- Cooperative scheduling test. In this test, threads yield the CPU voluntarily. Speed and fairness of scheduling are tested.
- Preemptive scheduling test. Similar to the above, yet threads are preempted based on their priority. As above, speed and fairness are tested.
- Interrupt processing test. In this test, a thread induces an interrupt and exchanges a semaphore with the interrupt handler.
- Interrupt preemption test. Similar to the above; the main difference is that the interrupt posts a semaphore another, higher-priority thread is waiting for, causing preemption.
It's worth mentioning that this benchmark is not made to test contention. In almost all benchmarks, subsystems are stressed using a single thread, so when semaphores or the memory manager are tested, all accesses come from a single thread. The aim of these benchmarks is to measure the sheer overhead of kernel services in happy-path scenarios.
The Beningo performance report doesn't include benchmarks 7 and 8, probably because interrupts are handled vastly differently between systems. The report also doesn't contain any source code (which is not a problem, as the benchmark source code is in the ThreadX repository), nor the actual measured numbers. The paper only contains a relative performance comparison of the included operating systems and the specs of the hardware the benchmarks were executed on. As we had almost identical hardware lying around, we decided to produce benchmark results for CMRX in order to compare them with the paper.
The report only contains one specific number: the per-thread counter value of test no. 6 for FreeRTOS. Our original goal was to recreate this test for FreeRTOS and compare the CMRX numbers to it, but as it turns out, it would not be representative to extrapolate the results to other RTOSes. In the end, we decided to simply run the test suite on CMRX without making any elaborate direct comparison.
So we took the 6 remaining benchmarks from the Thread Metric Benchmark and started implementing them on CMRX. Soon we had to leave one test out: the Memory allocator test, which tests memory allocator performance. As CMRX is targeted towards fully static operation and full memory isolation, there is no kernel-provided memory allocator. Apps can (and have to) provide their own allocator if they want to use dynamic memory allocation. At least for now.
There are a few more specifics of how these benchmarks are implemented on CMRX. In many real-time operating systems, all features tested in this benchmark are provided by the kernel. As the CMRX kernel is a microkernel, not all benchmarked features are actually kernel features. One such feature is queues: queues are implemented as a user-space library, and there is a user-space server which can provide queue functionality in case you really need to queue data across processes.
As the Message processing test in the Thread Metric Benchmark accesses the queue from a single thread, we decided to use the library version of the queue.
For the synchronization test, we had two options for how to implement it. Originally, the test uses semaphores. CMRX currently doesn't provide semaphores, neither as a kernel function nor as a library function. Instead, mutexes and notifications are available. Mutexes are implemented as mostly user-space futexes, while notifications are one of the core primitives offered by the kernel itself. We decided to implement the synchronization test using notifications, to actually stress the kernel.
The implementation of the no-op test also differs a bit from the one in the ThreadX source code. In our case this test really does nothing, just increments the counter.
Now for the actual numbers. The table below summarizes the results for all benchmarks that were actually executed. The first three benchmarks run single-threaded, either doing nothing, calling the queue library, or issuing notification syscalls; thus only the Cumulative Counter column is filled for them. Both scheduling benchmarks spawn 5 threads and stress the scheduler, so both the Cumulative Counter and the per-thread counters are available.
| Benchmark | Cumulative Counter | Thread 1 | Thread 2 | Thread 3 | Thread 4 | Thread 5 | Deterministic |
|---|---|---|---|---|---|---|---|
| Basic No-Op | 263019565 | - | - | - | - | - | - |
| Message Processing | 817142 | - | - | - | - | - | - |
| Synchronization Processing | 1861834 | - | - | - | - | - | - |
| Cooperative Scheduling | 1515575 | 303114 | 303115 | 303116 | 303114 | 303116 | NO |
| Preemptive Scheduling | 675715 | 135143 | 135143 | 135143 | 135143 | 135143 | YES |
The Deterministic column in the scheduling benchmarks states whether scheduling turned out to be deterministic. The rule for a result being deterministic is that the counter values of individual threads differ from each other by at most 1. This means threads are scheduled round-robin (cooperative scheduling; all threads have the same priority) or strictly following the priority pattern (preemptive scheduling). Similarly to FreeRTOS, the cooperative scheduling test came out as not deterministic. Unlike FreeRTOS, where individual threads' counters are spread across a range of ±2, here the range is ±1. The root cause of this is not completely clear yet.
As for the numbers themselves, they are consistent across multiple runs of the benchmark.
The only numbers the original Beningo report mentions are the numbers for the last test and FreeRTOS. The report states that each FreeRTOS thread scored 778888 ± 2 iterations (3894439 iterations in total). If we compare this score to CMRX, we can see that CMRX performance is roughly 5.7x worse.
So, is the price of cybersecurity so high? Do we really lose 80% of performance?
Fortunately, not.
One fairly obvious fact is that the CMRX kernel hasn't been optimized in basically any way. Much of the kernel uses linear search across whole configured lists whenever these lists are searched for an item, which makes many algorithms slow. Additionally, even the release build has assertions and various sanitization checks active. Many of these sit in the hot path of the thread switching machinery and thus contribute to the overhead of all APIs that cause a thread switch.
To be honest, we knew that. We also expected the result to be "poor". So why bother running the benchmark if you know the result will be poor? The reason is simple: while we know that CMRX is far from optimal, we don't want it to remain that way. We want to optimize it, and for that we need to know how we are doing.
As there are more mechanisms present in CMRX which are not covered by the Thread Metric Benchmark, we'll extend the benchmark and run it periodically.