30 Dec
Posted by: frank in: Electronics, Programming, STM32
A very significant limitation with the STM32F4xx family (STM32F405 / 407 / 415 / 417) is that fully a third of its internal RAM is inaccessible to the DMA controller. Of the 192 kB of available RAM, only 128 kB can be accessed by the DMA. The other 64 kB, known as the CCM, cannot be read or written by DMA.
For a Cortex-M4 processor that is promoted using DSP type benchmarks (filters and FFTs etc), this is a glaring oversight. DSP type operations are all about reading data in, processing the data, and writing the resultant data out. Two of those three tasks require the DMA if they’re to be performed efficiently, and on the STM32F4xx family the DMA is unusable for a third of its RAM. For me personally, coming from a long DSP background, this stilted memory architecture is crazy beyond words.
Still, it’s not the first time the hardware designers have made life tough for the software folks, and it won’t be the last. We just have to deal with it as best we can. I’ve been attempting to get the SDIO SD Card interface working under interrupt, so that the additional 64 kB of RAM we’re paying for can be accessed by the SDIO. This post will share a few things I’ve learned.
ST SD Card Interrupt Examples
As far as I can find, there aren’t any. I’ve looked through both the STM32F2xx and STM32F4xx software examples, and they all use DMA exclusively for the data handling. If you come across any ST example code doing SD card data handling via interrupt, please let me know.
Double-Handling the Data
This is an option, and I have considered it. The idea would be (as an example):
Obviously what I’ve listed is worst-case and ugly as sin. You really wouldn’t want to do it. Still, if you did, an efficient software-copy routine would be essential. This Stellaris forum posting contains details for a fast assembler Cortex-M3/M4 memory copy routine. I’ve played with it and it works well.
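As a rough illustration of the double-handling idea, here is a minimal sketch that checks whether a buffer overlaps the CCM and, if so, copies it into a staging buffer the DMA can reach. It assumes the CCM sits at 0x10000000 and is 64 kB long, and that the staging buffer gets linked into ordinary SRAM. The names overlaps_ccm, dma_source_for and staging are made up for this example, and plain memcpy stands in for a hand-tuned copy routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: CCM is 64 kB at 0x10000000 on the STM32F405/407/415/417. */
#define CCM_BASE 0x10000000u
#define CCM_SIZE (64u * 1024u)

/* Returns 1 if any byte of [addr, addr + len) lies inside the CCM window. */
int overlaps_ccm(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t last = addr + (len - 1);
    return addr < (uintptr_t)CCM_BASE + CCM_SIZE && last >= (uintptr_t)CCM_BASE;
}

/* Hypothetical staging buffer: on target it must live in DMA-reachable SRAM. */
uint8_t staging[512];

/* Returns a pointer that is safe to hand to the DMA: the source itself when
 * it is already outside the CCM, otherwise a copy in the staging buffer. */
const uint8_t *dma_source_for(const uint8_t *src, size_t len)
{
    assert(len <= sizeof staging);
    if (!overlaps_ccm((uintptr_t)src, len))
        return src;
    memcpy(staging, src, len);   /* the extra copy is the cost of double-handling */
    return staging;
}
```

With a single staging buffer, the next sector can only be copied once the DMA has finished with the previous one, so a real design would probably want two staging buffers used alternately.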
SDIO Requests More Data Than It Needs
If you’re using the STM32F2xx / STM32F4xx SDIO to transmit data to an SD Card, under interrupt you’ll probably be using the “transmit FIFO half empty” TXFIFOHE interrupt flag. When this triggers, you know your interrupt handler software needs to write 8 words (32 bytes) to the SDIO FIFO.
The problem is that the SDIO will request more data than what it actually requires, which could, if you’re not careful, result in you reading past the end of your data buffer, possibly generating some kind of a bus fault or hard fault. To explain, take a look at this example code snippet from within an SDIO interrupt handler:
    if (SDIO->STA & SDIO_FLAG_TXFIFOHE) {
        ptr = source_addr;          // address of source data to Tx to card
        while (SDIO->STA & SDIO_FLAG_TXFIFOHE) {
            BUTTON_OUT_HIGH
            SDIO->FIFO = *ptr++;    // write first word (32 bits = 4 bytes) to the FIFO
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;
            SDIO->FIFO = *ptr++;    // 8th word of data written to the SDIO FIFO
            BUTTON_OUT_LOW
        }
        source_addr = ptr;          // remember data position for next time
    }
You can see it’s checking to see if the Tx FIFO Half-Empty flag is set, and if so, it writes 8 words (32 bytes) of data to the FIFO, updates its data pointer, and that’s it. We’ve made it slightly more efficient by wrapping it in the while() loop, so it does it repeatedly until the Tx FIFO no longer needs more data; this allows it to fill the FIFO more quickly at startup, when the FIFO is empty.
The BUTTON_OUT_xxxx sets a GPIO pin so we can see on the oscilloscope what’s happening.
When writing a single block / sector to the SD card, which is 512 bytes, we would expect to see 512 / 32 = 16 writes (of 32 bytes) to the FIFO. Let’s look at the scope:
There are a few things of great interest to be seen here.
At the start of the scope plot, on the left, we can see 4 writes in very quick succession. This is thanks to the while() loop in the code. The SDIO Tx FIFO is 32 words deep, so the TXFIFOHE remains set until the FIFO is full, which requires 4 sets of 8 words to be written. This is good – we’re getting the Tx FIFO filled very quickly.
If we count the total number of writes on the scope plot, we see 19. Huh? We expected to see 16; what gives? 19 means we’ve read 608 bytes from our data buffer (actually: right past the end of our data buffer) and given it to the SDIO; that’s too much for a 512 byte write. The reason is the title of this section: the SDIO requests more data than it needs. It appears the designers of the SDIO block did not give it the intelligence to compare its FIFO level with its DCOUNT register. If the FIFO contains sufficient empty space to accept another 8 words, it will set its TXFIFOHE flag to request more data, EVEN THOUGH IT DOES NOT NEED IT TO COMPLETE THE CURRENT TRANSFER. Be aware of this.
Changing our SDIO IRQ handler slightly to consider the DCOUNT register, for example like this:
if ((SDIO->STA & SDIO_FLAG_TXFIFOHE) && (SDIO->DCOUNT >= 32)) {
does not help, because we cannot know the amount of data currently held in the FIFO.
To deal with this, you need to keep your own “data remaining count” variable, which you can count down as you give data to the SDIO FIFO. Then when your count variable reaches zero, you should turn off the TXFIFOHE interrupt (by clearing its bit in the SDIO->MASK register).
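Here is a minimal sketch of that approach. To keep it compilable off-target, it uses a small stand-in struct (sdio_regs_t) instead of the real SDIO register block. The struct and the names sd_tx_state_t, sd_tx_begin and sd_tx_service are all made up for this example. The TXFIFOHE bit position (bit 14 in both STA and MASK) is my reading of the reference manual and should be checked against your device header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the three SDIO registers the handler touches. On target,
 * use SDIO->STA, SDIO->MASK and SDIO->FIFO from the device header. */
typedef struct {
    volatile uint32_t STA;
    volatile uint32_t MASK;
    volatile uint32_t FIFO;
} sdio_regs_t;

#define TXFIFOHE_BIT (1u << 14)   /* assumed position, in both STA and MASK */

typedef struct {
    const uint32_t *next;   /* next word to give to the FIFO */
    uint32_t words_left;    /* our own "data remaining" count */
} sd_tx_state_t;

/* Call before starting the transfer. */
void sd_tx_begin(sd_tx_state_t *tx, const uint32_t *buf, uint32_t bytes)
{
    assert(bytes % 32u == 0u);   /* sketch handles whole 8-word bursts only */
    tx->next = buf;
    tx->words_left = bytes / 4u;
}

/* Call from the SDIO interrupt handler. Returns the number of words written. */
uint32_t sd_tx_service(sdio_regs_t *sdio, sd_tx_state_t *tx)
{
    uint32_t written = 0u;

    /* Feed the FIFO only while it asks for data AND we still have data. */
    while ((sdio->STA & TXFIFOHE_BIT) && tx->words_left >= 8u) {
        for (int i = 0; i < 8; i++)
            sdio->FIFO = *tx->next++;
        tx->words_left -= 8u;
        written += 8u;
    }

    /* Nothing left to send: stop TXFIFOHE from interrupting us again. */
    if (tx->words_left == 0u)
        sdio->MASK &= ~TXFIFOHE_BIT;

    return written;
}
```

For a 512-byte sector, sd_tx_begin() sets the count to 128 words, and sd_tx_service() stops at exactly 128 words however many times the flag is raised afterwards.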
Something else to note from this scope capture is the interrupt rate and CPU utilisation. In this example the SDIO clock is 20 MHz, which on a 4-bit bus means we can write data to the card at 10 MB/s. Given that we’re writing 32 bytes at a time (except at the very beginning where we write 4 times that), we calculate we’re writing data every 3.2 microseconds. The scope shot bears this out. This corresponds to an interrupt rate of 312.5 kHz! This is a very high rate for a small processor, and the CPU utilisation should be expected to be high. From the scope shot we can estimate we’re spending about 12% to 15% of our 120 MHz processor doing nothing except servicing these SDIO interrupts. It’s a steep price to pay for making so much RAM inaccessible to the DMA.
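The arithmetic behind those figures can be written as a few small helpers. This is only a sketch: it assumes a 4-bit data bus (the only width that turns 20 MHz into 10 MB/s) and ignores CRC, start/stop bits and card busy time. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Raw payload rate in bytes/s: one bit per data line per clock. */
uint32_t sd_bytes_per_second(uint32_t clock_hz, uint32_t bus_width_bits)
{
    assert(bus_width_bits == 1u || bus_width_bits == 4u || bus_width_bits == 8u);
    return (uint32_t)(((uint64_t)clock_hz * bus_width_bits) / 8u);
}

/* Steady-state interrupts per second when each one supplies burst_bytes. */
uint32_t sd_irq_rate_hz(uint32_t bytes_per_second, uint32_t burst_bytes)
{
    assert(burst_bytes != 0u);
    return bytes_per_second / burst_bytes;
}

/* Time between interrupts, in nanoseconds. */
uint32_t sd_irq_period_ns(uint32_t bytes_per_second, uint32_t burst_bytes)
{
    assert(bytes_per_second != 0u);
    return (uint32_t)(((uint64_t)burst_bytes * 1000000000u) / bytes_per_second);
}

/* CPU clock cycles that elapse between two consecutive interrupts. */
uint32_t cpu_cycles_per_irq(uint32_t cpu_hz, uint32_t irq_rate_hz)
{
    assert(irq_rate_hz != 0u);
    return cpu_hz / irq_rate_hz;
}
```

With a 20 MHz clock, a 4-bit bus and 32-byte bursts, this gives 10,000,000 bytes/s, 312,500 interrupts per second and 3,200 ns between interrupts. At 120 MHz that leaves 384 CPU cycles per interrupt period, so the 12% to 15% estimate works out to roughly 46 to 58 cycles spent in each interrupt.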
Tx FIFO Underrun
Getting data transmit (send data to the card) to start up properly on the STM32F4xx / 2xx can be very tricky. Here’s my understanding.
When you enable the SDIO (via the DTEN bit in the SDIO_DCTRL register) the FIFO is empty. So the TXFIFOHE interrupt will trigger immediately, and at the same time the SDIO peripheral will start attempting to write data to the SD card. Hence data must appear in the Tx FIFO extremely quickly, otherwise a Tx FIFO underrun will occur and the SDIO peripheral will shut down.
It is not possible to pre-load the FIFO before enabling the SDIO. I’ve tried and it doesn’t work. I believe the FIFO is hardware-cleared until the SDIO is enabled, or something similar to that.
What this means is that at the moment of SDIO turn-on (when the DTEN bit is set), that TXFIFOHE interrupt must trigger. At that point in time it must be the highest priority interrupt in the system, or be the only interrupt. If it’s delayed for any reason, for example because another interrupt occurs at that time, then a Tx FIFO underrun will very quickly follow. Think very carefully about your enabled interrupts at that critical SDIO transmit start-up point. You may want to consider using the NVIC to make the SDIO be the highest priority interrupt, permitted to preempt all other interrupts. Or, come up with some other scheme to ensure that first TXFIFOHE interrupt can execute immediately.
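A sketch of the NVIC option is shown here. On target, NVIC_SetPriority(), NVIC_EnableIRQ() and SDIO_IRQn come from CMSIS and the device header. The block replaces them with host stand-ins so it compiles off-target, and those stand-ins, the two "other" interrupt names and the IRQ numbers are placeholders of mine. It assumes the default priority grouping, where all four implemented priority bits are preemption bits and a lower number means more urgent.

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-ins. On real hardware, delete these and include the device
 * header instead. IRQ numbers and the OTHER_x names are placeholders. */
typedef enum { OTHER_A_IRQn = 0, OTHER_B_IRQn = 1, SDIO_IRQn = 2, IRQ_COUNT = 3 } IRQn_Type;
static uint32_t mock_priority[IRQ_COUNT];
static uint8_t mock_enabled[IRQ_COUNT];

void NVIC_SetPriority(IRQn_Type irq, uint32_t prio)
{
    assert(prio < 16u);            /* assumes 4 implemented priority bits */
    mock_priority[irq] = prio;
}

void NVIC_EnableIRQ(IRQn_Type irq)
{
    mock_enabled[irq] = 1u;
}

/* A pending interrupt preempts a running handler only if its preemption
 * priority is numerically lower. Equal priorities never preempt. */
int would_preempt(uint32_t pending_prio, uint32_t running_prio)
{
    return pending_prio < running_prio;
}

/* Give the SDIO priority 0 on its own; everything else gets 1 or higher. */
void configure_irq_priorities(void)
{
    NVIC_SetPriority(SDIO_IRQn, 0u);
    NVIC_SetPriority(OTHER_A_IRQn, 1u);
    NVIC_SetPriority(OTHER_B_IRQn, 2u);

    NVIC_EnableIRQ(OTHER_A_IRQn);
    NVIC_EnableIRQ(OTHER_B_IRQn);
    NVIC_EnableIRQ(SDIO_IRQn);
}
```

Priorities only help against other interrupt handlers. Code that globally disables interrupts around the moment DTEN is set will still delay that first TXFIFOHE interrupt.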
SDIOIT Status Bit
The SDIO_STA status register contains the SDIOIT bit with a very vague description. I’ve seen this bit being set from time to time but I’ve never worked out what it means. If you understand what it actually represents, please let me know.
6 Responses
Stefan
09|Jun|2012 #1
Very useful, explained in great detail. Thanks a lot! Out of curiosity: Did you try to operate the SD interface at more than 25MHz? 48MHz should be possible? What was the best write speed you could manage?
frank
11|Jun|2012 #2
I’ve not personally run faster than 20 MHz, due to the fact that my hardware has a level converter (1.8V – 3.3V) between the STM32 and the SD Card. Given that, the fastest write speed I’ve achieved has been 9 MB/s, which actually isn’t too bad (I’ll admit it, I was happy). I’ve no doubt faster speeds are possible, but you’d certainly need to run the STM32 at 3.3V so you can avoid a level converter to make those faster SD Card clock speeds possible.
Tobias
15|Jun|2012 #3
Remember to align your SD card (if you use a file system) and write data only in multiples of the SD erase-sector size. You will then easily get a faster write speed.
Christopher James Huff
18|Sep|2012 #4
“For a Cortex-M4 processor that is promoted using DSP type benchmarks (filters and FFTs etc), this is a glaring oversight.”
It’s not an oversight, it’s a feature, one that’s actually oriented toward DMA-heavy DSP applications. The CCM can be accessed by the core without competing with peripherals for the AHB bus. You can DMA to a buffer in main memory, use CCM for intermediate values, and store the result in main memory for DMA output, with the core only hitting main memory when it needs to load new data or store the results of computations.
In fact, the main memory is split into 112 and 16 kB blocks with separate connections to the AHB matrix, so if you arrange things carefully, you can have two simultaneous DMA operations going while the core is happily crunching data, without any contention for memory accesses.
Not having to wait for the AHB to become available can also be important for some hard-realtime tasks. The CCM is always immediately available.
frank
18|Sep|2012 #5
Your comment is very valid and I’m glad you posted it. I personally believe (based on my experience) that the size of the CCM (as a percentage of the total memory of the processor) is far too large for “intermediate values”. Others may disagree with me, and I certainly hope they do.
Christopher James Huff
22|Sep|2012 #6
Well, if you’re devoting the bulk of your memory to one input buffer, one output buffer, and one working buffer, all of equal size, then it’s pretty close to ideal. CCM can also be a good place to put stacks, task state, application variables, etc. I’ve been using it as main memory, with the system memory set aside for DMA buffers. Some more flexibility would be nice, but it seems a reasonable compromise.
If you can execute code from CCM, that’d be another possible application for hard-realtime tasks, due to more deterministic execution speed…no waiting for code to load from flash or for a peripheral to finish up with the system SRAM. I’m not sure if this can actually be done, though…the AHB diagram seems to indicate that the core’s instruction bus can only be connected to the flash, the 112 kB block of system SRAM, or the FSMC. (An aside: the 16 kB block of system RAM appears only accessible by the core via the system bus, which might make it a particularly bad location for a stack…which happens to be a setup I’ve seen often in online examples.)
Copyright © 2009 - Frank's Random Wanderings - is proudly powered by WordPress