Circular Lazy STM32 UART

How easily can a low-powered embedded STM32L4xx deploy a fully asynchronous high-speed Universal Asynchronous Receiver Transmitter (UART) using FreeRTOS queues and tasks? Say that an embedded system requires a pair of queues to wrap access to a UART. Writing to the transmitter queue indirectly writes to the serial port via DMA. Incoming serial-port octets arrive by DMA as 8-bit “messages” in a receiver queue.

Many variants of the STM32 series of 32-bit Cortex-M microcontrollers include an LPUART, a low-power UART. It connects to on-chip DMA and can run a high-speed full-duplex serial port with minimal servicing. Bytes appear in memory and the core periodically receives interrupts, which wake it at various stages to process the received data. The core consumes power while awake; sleeping it as often and for as long as possible minimises power consumption.

Assumptions:

  • FreeRTOS tasks and queues
    • circular DMA receiver queue for line idle reception
    • standard non-circular “normal” DMA transmitter queue
  • STM32L4xx HAL drivers

Receive by DMA

The receiver side requires a special task and receive-event callback. They operate on a circular DMA buffer with line idle event handling.

Receiver task

void vUART2RxTask(void *pvParameters) {
	extern UART_HandleTypeDef huart2;
	extern QueueHandle_t xUART2RxQueueHandle;
	uint8_t data[32];
	while (HAL_UARTEx_ReceiveToIdle_DMA(&huart2, data, sizeof(data)) != HAL_OK)
		vTaskDelay(1);
	UBaseType_t ux = 0UL;
	for (;;) {
		uint32_t ulNotified;
		xTaskNotifyWait(0UL, ULONG_MAX, &ulNotified, portMAX_DELAY);
		if (ux == ulNotified)
			continue;
		if (ulNotified > ux) {
			for (; ux < ulNotified; ux++)
				xQueueSend(xUART2RxQueueHandle, data + ux, portMAX_DELAY);
		} else {
			for (; ux < sizeof(data) / sizeof(data[0]); ux++)
				xQueueSend(xUART2RxQueueHandle, data + ux, portMAX_DELAY);
			for (ux = 0UL; ux < ulNotified; ux++)
				xQueueSend(xUART2RxQueueHandle, data + ux, portMAX_DELAY);
		}
		ux = ulNotified;
	}
}

The task accesses the UART handle and the receiver queue, here by using external references.

The expression sizeof(data) / sizeof(data[0]) computes the number of elements in the data array, 32 in this case. The compiler evaluates it at compile time, so the core performs no integer division at run-time. Implementers may prefer to remove the divisor, but doing so adds an implicit assumption that each element occupies exactly one byte.
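As an illustration of the element-count idiom, the host-side sketch below wraps it in a hypothetical ARRAY_LEN macro (an assumption for this example, not part of FreeRTOS or the HAL) and contrasts it with a bare sizeof on an array of wider elements. The function names are likewise illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: element count of a true array, known at compile time. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Dropping the divisor is only safe under this assumption, stated explicitly here. */
static_assert(sizeof(uint8_t) == 1, "octet buffers assume one-byte elements");

/* Element count of a 32-octet buffer like the one in the receiver task. */
size_t octet_buffer_elements(void) {
	uint8_t data[32];
	return ARRAY_LEN(data);
}

/* Element count of a buffer of wider elements: still 32. */
size_t wide_buffer_elements(void) {
	uint16_t wide[32];
	return ARRAY_LEN(wide);
}

/* Bare sizeof of the same wide buffer: counts bytes, not elements. */
size_t wide_buffer_bytes(void) {
	uint16_t wide[32];
	return sizeof(wide);
}
```

For octet-sized elements the two forms agree. For the 16-bit array the bare sizeof reports 64 bytes while the quotient still reports 32 elements, which is the assumption that removing the divisor would hide.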

The task retries ‘receive to idle’ after a one-tick delay until the HAL reports success. The loop retries on any failure, not only on a busy peripheral, so it may need additional diagnostics to catch non-busy failures. Note importantly that ‘receive to idle’ executes once, and once only, at the start of the system.

Receive event callback

void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size) {
	extern TaskHandle_t xUART2TaskHandle;
	xTaskNotifyFromISR(xUART2TaskHandle, Size, eSetValueWithOverwrite, NULL);
}

The callback signals a receive event on the circular DMA buffer. It passes Size by value, overwriting any notification still pending. Importantly, this notification value is not the number of octets received. More about this later.

Transmit by DMA

Implement a DMA transmitter task and the HAL’s transmit-complete callback.

Transmitter task

void vUART2TxTask(void *pvParameters) {
	extern QueueHandle_t xUART2TxQueueHandle;
	extern UART_HandleTypeDef huart2;
	for (;;) {
		uint8_t data[32];
		xQueueReceive(xUART2TxQueueHandle, data, portMAX_DELAY);
		UBaseType_t ux;
		for (ux = 1UL;
			ux < sizeof(data) / sizeof(data[0])
				&& xQueueReceive(xUART2TxQueueHandle, data + ux, 1UL);
			ux++);
		while (HAL_UART_Transmit_DMA(&huart2, data, ux) == HAL_BUSY)
			vTaskDelay(1UL);
		uint32_t ulNotified;
		xTaskNotifyWait(0UL, ULONG_MAX, &ulNotified, portMAX_DELAY);
	}
}

The task reads the transmitter queue lazily, collecting octets; it then starts a DMA transmit operation and finally waits for a transmit-complete notification.

The implementation might also utilise a stream buffer rather than a queue. That would add a trigger level to the mix, allowing the transmitter to block until some trigger level signals a transmit or the delay times out.

Transmit complete callback

void HAL_UART_TxCpltCallback(UART_HandleTypeDef *huart) {
	extern TaskHandle_t xUART2TxTaskHandle;
	xTaskNotifyFromISR(xUART2TxTaskHandle, 1UL, eSetBits, NULL);
}

Signals transmit completion. This example utilises an arbitrary notification bit.

Explanations

This architecture divides the serial port driver into two parallel channels.

Circular receiver

In circular buffer mode, the Size argument is not the number of received bytes. Instead, it indicates the latest buffer position. The receiver task compares it with its local buffer offset. If the offsets match, there is nothing to do. If the offset has grown, every octet between the two offsets moves to the receiver queue. If the new offset falls below the existing offset, the buffer has wrapped and two spans join the receiver queue: first from the existing offset to the end of the buffer, and then from the start of the buffer to the new offset.
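The offset comparison can be sketched on a host machine without FreeRTOS or the HAL. The function below, circular_drain, is a hypothetical helper written for this illustration; it copies the pending span or spans into a flat output array instead of sending each octet to a queue, and it assumes both offsets lie between zero and the buffer length inclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copies the octets that lie between the previous
 * offset (from) and the newly reported offset (to) of a circular buffer
 * into out, and returns the number of octets copied.
 * Assumes out has room for len octets.
 */
size_t circular_drain(const uint8_t *buf, size_t len,
		size_t from, size_t to, uint8_t *out) {
	assert(from <= len && to <= len);
	if (to >= from) {
		/* No wrap: a single span, possibly empty when the offsets match. */
		memcpy(out, buf + from, to - from);
		return to - from;
	}
	/* Wrapped: the tail of the buffer first, then the head. */
	size_t tail = len - from;
	memcpy(out, buf + from, tail);
	memcpy(out + tail, buf, to);
	return tail + to;
}
```

With an 8-octet buffer, moving from offset 2 to offset 5 yields the three octets at positions 2, 3 and 4; moving from offset 6 to offset 2 yields positions 6, 7, 0 and 1 in that order.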

What if the receiver task fails to wake up before another receive-event callback fires? Does the callback overwrite a previous notification? It can, but rarely does. The DMA would need to trigger two consecutive receive events before the receiver task takes the first notification, either because the buffer wraps again or because a line idle condition arises soon after a buffer full event. Either way, the interrupt correctly updates the notification to the new offset, safely presuming that the notification reaches the receiver task atomically.

Lazy transmitter

FreeRTOS offers no way to transfer messages from queues in blocks. A queue receiver collects messages one at a time. This limited behaviour allows for flexible rescheduling, since the queue receive allows for an immediate task switch at the first message, in this case at the first byte to transmit. The transmitter task above reads the first octet from the transmitter queue by waiting indefinitely. It then fills the remaining stack-based transmit buffer space using the shortest possible non-zero receive delay, one tick per octet. This gives other tasks a short time to complete a multi-octet transmission span.

Note that the transmitter latency adds cumulatively up to the buffer size threshold. The transmitter task could delay transmission for up to 31 ticks, one tick for each octet after the first, if the queue loader takes its time filling the transmission queue.
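The bound can be sketched as plain arithmetic: the collection loop polls the queue once for each octet after the first, so the worst case is one timeout per remaining buffer slot. The two functions below are hypothetical helpers for this illustration. The wire-time calculation assumes 8N1 framing, that is ten bits on the wire per octet, which the article does not state.

```c
#include <stdint.h>

/*
 * Worst-case ticks spent collecting octets before the DMA transmit starts:
 * one timeout for every buffer slot after the first.
 */
uint32_t worst_case_collect_ticks(uint32_t buffer_octets,
		uint32_t per_octet_timeout_ticks) {
	if (buffer_octets == 0U)
		return 0U;
	return (buffer_octets - 1U) * per_octet_timeout_ticks;
}

/*
 * Time on the wire in whole microseconds, rounded down.
 * Assumption: 8N1 framing, so each octet costs 10 bit times.
 */
uint32_t wire_time_us(uint32_t octets, uint32_t baud) {
	return (uint32_t)(((uint64_t)octets * 10U * 1000000U) / baud);
}
```

For the 32-octet buffer and one-tick timeout used above, worst_case_collect_ticks(32, 1) gives 31 ticks. Under the 8N1 assumption, wire_time_us(32, 115200) gives 2777 microseconds for a full buffer, which helps when weighing the collection delay against the transfer time.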

Conclusions

The histogram below plots performance in terms of echo latency; it averages around 5 ms at 115,200 baud.

The plot computes in R as:

#' @examples
#' open(com3 <- serial::serialConnection(port = "com3"))
#' hist(replicate(100L, delay(com3)), main = "Latency", xlab = "s")
delay <- function(com) {
  serial::write.serialConnection(com, paste0(as.character(as.numeric(Sys.time())), "\n"))
  repeat {
    was <- serial::read.serialConnection(com)
    if (was != "") break
  }
  now <- Sys.time()
  as.numeric(now) - as.numeric(was)
}

“Receive to idle” is the correct operation for responsiveness. It allows for burst transfers with buffer wrapping but also accounts for short bursts with prompt interrupt response.

The lazy transmitter task has two latency parameters: the size of the temporary stack buffer (32 bytes in the exemplar) and the ‘subsequent octet’ latency time (1 tick above). Systems may tune these parameters. Some transmit latency is useful, provided that it is not too long: the transmitter wants to collect as many octets as possible for an uninterrupted DMA transfer. The core can then sleep until more octets become ready for transmission.

Running the channel over DMA has another advantage when debugging: the channel can receive while the core sits at a breakpoint. All in all, this approach provides the sleepiest, lowest-power technique for full-duplex UART operation while maintaining octet-level responsiveness.