A very significant limitation with the STM32F4xx family (STM32F405 / 407 / 415 / 417) is that fully a third of its internal RAM is inaccessible to the DMA controller. Of the 192 kB of available RAM, only 128 kB can be accessed by the DMA. The other 64 kB, known as the CCM, cannot be read or written by DMA.
For a Cortex-M4 processor that is promoted using DSP-type benchmarks (filters, FFTs, etc.), this is a glaring oversight. DSP-type operations are all about reading data in, processing the data, and writing the resultant data out. Two of those three tasks require the DMA if they’re to be performed efficiently, and on the STM32F4xx family the DMA is unusable for a third of its RAM. For me personally, coming from a long DSP background, this stilted memory architecture is crazy beyond words.
Still, it’s not the first time the hardware designers have made life tough for the software folks, and it won’t be the last. We just have to deal with it as best we can. I’ve been attempting to get the SDIO SD Card interface working under interrupt, so that the additional 64 kB of RAM we’re paying for can be accessed by the SDIO. This post will share a few things I’ve learned.
ST SD Card Interrupt Examples
As far as I can find, there aren’t any. I’ve looked through both the STM32F2xx and STM32F4xx software examples, and they all use DMA exclusively for the data handling. If you come across any ST example code doing SD card data handling via interrupt, please let me know.
Double-Handling the Data
This is an option, and I have considered it. The idea would be (as an example):
- DMA data from SD card into the 128 kB of RAM
- Software copy the data into the 64 kB of RAM
- Process the data
- Software copy the results from the 64 kB of RAM into the 128 kB of RAM
- DMA the results from the 128 kB of RAM to the SD card
Obviously what I’ve listed is worst-case and ugly as sin. You really wouldn’t want to do it. Still, if you did, an efficient software-copy routine would be essential. This Stellaris forum posting contains details for a fast assembler Cortex-M3/M4 memory copy routine. I’ve played with it and it works well.
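As a rough illustration of the software-copy step, here is a minimal portable C sketch of a word-wise copy unrolled eight words at a time. It is not the assembler routine from the forum posting, and it will not match its speed; the function name copy_words is made up for this example, and both buffers are assumed to be 32-bit aligned and non-overlapping. On a real part, one buffer would live in the 128 kB of main RAM and the other in the 64 kB CCM, which is something your linker script has to arrange.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `count` 32-bit words from src to dst.
 * Both pointers are assumed to be word aligned and non-overlapping. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    /* Main loop: move eight words (32 bytes) per pass. */
    while (count >= 8u) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
        count -= 8u;
    }

    /* Tail: copy any remaining words one at a time. */
    while (count > 0u) {
        *dst++ = *src++;
        count--;
    }
}
```

A 512-byte sector would be copied with copy_words(dst, src, 128).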
SDIO Requests More Data Than It Needs
If you’re using the STM32F2xx / STM32F4xx SDIO to transmit data to an SD Card, under interrupt you’ll probably be using the “transmit FIFO half empty” TXFIFOHE interrupt flag. When this triggers, you know your interrupt handler software needs to write 8 words (32 bytes) to the SDIO FIFO.
The problem is that the SDIO will request more data than what it actually requires, which could, if you’re not careful, result in you reading past the end of your data buffer, possibly generating some kind of a bus fault or hard fault. To explain, take a look at this example code snippet from within an SDIO interrupt handler:
if (SDIO->STA & SDIO_FLAG_TXFIFOHE) {
    ptr = source_addr;              // address of source data to Tx to card
    while (SDIO->STA & SDIO_FLAG_TXFIFOHE) {
        BUTTON_OUT_HIGH
        SDIO->FIFO = *ptr++;        // write first word (32 bits = 4 bytes) to the FIFO
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;
        SDIO->FIFO = *ptr++;        // 8th word of data written to the SDIO FIFO
        BUTTON_OUT_LOW
    }
    source_addr = ptr;              // remember data position for next time
}
You can see it’s checking to see if the Tx FIFO Half-Empty flag is set, and if so, it writes 8 words (32 bytes) of data to the FIFO, updates its data pointer, and that’s it. We’ve made it slightly more efficient by wrapping it in the while() loop, so it repeats until the Tx FIFO no longer requests more data. This allows it to fill the FIFO more quickly at startup, when the FIFO is empty.
The BUTTON_OUT_xxxx sets a GPIO pin so we can see on the oscilloscope what’s happening.
When writing a single block / sector to the SD card, which is 512 bytes, we would expect to see 512 / 32 = 16 writes (of 32 bytes) to the FIFO. Let’s look at the scope:
There are a few things of great interest to be seen here.
At the start of the scope plot, on the left, we can see 4 writes in very quick succession. This is thanks to the while() loop in the code. The SDIO Tx FIFO is 32 words deep, so the TXFIFOHE remains set until the FIFO is full, which requires 4 sets of 8 words to be written. This is good – we’re getting the Tx FIFO filled very quickly.
If we count the total number of writes on the scope plot, we see 19. Huh? We expected to see 16; what gives? 19 means we’ve read 608 bytes from our data buffer (actually: right past the end of our data buffer) and given it to the SDIO; that’s too much for a 512 byte write. The reason is the title of this section: the SDIO requests more data than it needs. It appears the designers of the SDIO block did not give it the intelligence to compare its FIFO level with its DCOUNT register. If the FIFO contains sufficient empty space to accept another 8 words, it will set its TXFIFOHE flag to request more data, EVEN THOUGH IT DOES NOT NEED IT TO COMPLETE THE CURRENT TRANSFER. Be aware of this.
Changing our SDIO IRQ handler slightly to consider the DCOUNT register, for example like this:
if ((SDIO->STA & SDIO_FLAG_TXFIFOHE) && (SDIO->DCOUNT >= 32)) {
does not help, because we cannot know the amount of data currently held in the FIFO.
To deal with this, you need to keep your own “data remaining count” variable, which you can count down as you give data to the SDIO FIFO. Then when your count variable reaches zero, you should turn off the TXFIFOHE interrupt (by clearing its bit in the SDIO->MASK register).
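Here is a minimal sketch of that "data remaining count" idea, under a few assumptions. The names tx_state_t and tx_service are hypothetical, and the FIFO is passed in as a pointer so the logic can be exercised off-target; on the hardware you would pass the address of SDIO->FIFO. The point is only that the driver's own counter, not the TXFIFOHE flag, decides when to stop feeding data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_BURST_WORDS 8u   /* words supplied per TXFIFOHE event */

/* Hypothetical transmit state kept by the driver, not by the SDIO. */
typedef struct {
    const uint32_t *next;     /* next word to hand to the FIFO   */
    size_t words_remaining;   /* our own "data remaining" count  */
} tx_state_t;

/* Service one TXFIFOHE event by writing at most 8 words to the FIFO.
 * Returns true while more data is pending, and false once the caller
 * should turn off the TXFIFOHE interrupt. */
bool tx_service(tx_state_t *st, volatile uint32_t *fifo)
{
    assert(st != NULL && fifo != NULL);

    size_t n = st->words_remaining;
    if (n > TX_BURST_WORDS) {
        n = TX_BURST_WORDS;
    }

    for (size_t i = 0; i < n; i++) {
        *fifo = *st->next;    /* one 32-bit write to the FIFO */
        st->next++;
    }
    st->words_remaining -= n;

    return st->words_remaining != 0u;
}
```

For a 512-byte block you would start with words_remaining = 128. When tx_service() returns false, the handler clears the TXFIFOHE bit in SDIO->MASK, so the extra requests from the SDIO never reach your buffer.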
Something else to note from this scope capture is the interrupt rate and CPU utilisation. In this example the SDIO clock is 20 MHz, meaning we can write data to the card at 10 MB/s. Given that we’re writing 32 bytes at a time (except at the very beginning where we write 4 times that), we calculate we’re writing data every 3.2 microseconds. The scope shot bears this out. This corresponds to an interrupt rate of 312.5 kHz! This is a very high rate for a small processor, and the CPU utilisation should be expected to be high. From the scope shot we can estimate we’re spending about 12% – 15% of our 120 MHz processor doing nothing except servicing these SDIO interrupts. It’s a steep price to pay for making so much RAM inaccessible to the DMA.
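The arithmetic behind those figures can be written down as a few small helper functions. This is only a sketch: the function names are made up, the 4-bit bus width is my inference from the 20 MHz and 10 MB/s figures, and protocol overhead (CRC, start and stop bits, card busy time) is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Raw bytes per second over the SD data lines, ignoring overhead. */
uint32_t bus_bytes_per_sec(uint32_t clock_hz, uint32_t bus_width_bits)
{
    return (uint32_t)(((uint64_t)clock_hz * bus_width_bits) / 8u);
}

/* Interrupts per second when each interrupt supplies bytes_per_irq. */
uint32_t irq_rate_hz(uint32_t bytes_per_sec, uint32_t bytes_per_irq)
{
    assert(bytes_per_irq != 0u);
    return bytes_per_sec / bytes_per_irq;
}

/* CPU cycles available to each interrupt at a given load percentage
 * (integer division, so the result is rounded down). */
uint32_t cycles_per_irq(uint32_t cpu_hz, uint32_t load_percent,
                        uint32_t irq_hz)
{
    assert(irq_hz != 0u);
    return (uint32_t)(((uint64_t)cpu_hz * load_percent) / 100u / irq_hz);
}
```

With a 20 MHz clock and a 4-bit bus this gives 10,000,000 bytes per second, and at 32 bytes per interrupt that is 312,500 interrupts per second. At 120 MHz, a 12% to 15% load works out to roughly 46 to 57 CPU cycles per interrupt, entry and exit overhead included.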
Tx FIFO Underrun
Getting data transmit (sending data to the card) to start up properly on the STM32F4xx / STM32F2xx can be very tricky. Here’s my understanding.
When you enable the SDIO (via the DTEN bit in the SDIO_DCTRL register) the FIFO is empty. So the TXFIFOHE interrupt will trigger immediately, and at the same time the SDIO peripheral will start attempting to write data to the SD card. Hence data must appear in the Tx FIFO extremely quickly, otherwise a Tx FIFO underrun will occur and the SDIO peripheral will shut down.
It is not possible to pre-load the FIFO before enabling the SDIO. I’ve tried and it doesn’t work. I believe the FIFO is hardware-cleared until the SDIO is enabled, or something similar to that.
What this means is that at the moment of SDIO turn-on (when the DTEN bit is set), that TXFIFOHE interrupt must trigger. At that point in time it must be the highest priority interrupt in the system, or be the only interrupt. If it’s delayed for any reason, for example because another interrupt occurs at that time, then a Tx FIFO underrun will very quickly follow. Think very carefully about your enabled interrupts at that critical SDIO transmit start-up point. You may want to consider using the NVIC to make the SDIO be the highest priority interrupt, permitted to preempt all other interrupts. Or, come up with some other scheme to ensure that first TXFIFOHE interrupt can execute immediately.
SDIOIT Status Bit
The SDIO_STA status register contains the SDIOIT bit with a very vague description. I’ve seen this bit being set from time to time but I’ve never worked out what it means. If you understand what it actually represents, please let me know.
Hi Wolk,
Nice to see you have worked on SDIO with the STM32L151RD. Can you share your project files here? I am a beginner and stuck with SDIO.
Thanks in advance.
Nice find. And thanks for the kind comments.
I’ve done some experiments and found a trickier solution for the 24MHz SDIO DMA underrun problem (without the NOP delay). My code looks like this now:
– configure DPSM and DMA
– disable SDIO_CK
– enable DPSM (DTEN + DMAEN in the DCTRL register; the DMA will start transferring data from RAM to the SDIO FIFO, but at this point SDIO_CK is off and no data will flow to the SD card)
– wait while the TXFIFOHE bit in the SDIO_STA register is set (that is, until at least 16 words reside in the SDIO FIFO)
– enable SDIO_CK (data flow to the SD card)
Frank, thanks again for a great blog.
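The clock-gating sequence described in the comment above could be sketched roughly as follows. This is a sketch under stated assumptions, not tested driver code: sdio_regs_t is a hypothetical three-register stand-in rather than the real SDIO register layout, the bit masks are the positions as I understand them from the ST headers and should be checked against your device header, and the spin limit is my own addition so the wait cannot hang forever. The DPSM (timeout, data length) and the DMA stream are assumed to be configured before the call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the SDIO registers, for illustration only. */
typedef struct {
    volatile uint32_t CLKCR;
    volatile uint32_t DCTRL;
    volatile uint32_t STA;
} sdio_regs_t;

/* Assumed bit positions; verify against the device header. */
#define CLKCR_CLKEN   (1u << 8)
#define DCTRL_DTEN    (1u << 0)
#define DCTRL_DMAEN   (1u << 3)
#define STA_TXFIFOHE  (1u << 14)

/* Start a DMA-fed transmit with SDIO_CK gated off until the FIFO has
 * been filled past the half-empty mark. Returns false if the flag never
 * clears within max_spins polls; the clock is then left disabled. */
bool start_tx_clock_gated(sdio_regs_t *r, uint32_t max_spins)
{
    assert(r != NULL);

    r->CLKCR &= ~CLKCR_CLKEN;               /* disable SDIO_CK          */
    r->DCTRL |= DCTRL_DTEN | DCTRL_DMAEN;   /* DMA starts filling FIFO  */

    while (r->STA & STA_TXFIFOHE) {         /* FIFO still half empty    */
        if (max_spins == 0u) {
            return false;
        }
        max_spins--;
    }

    r->CLKCR |= CLKCR_CLKEN;                /* data now flows to card   */
    return true;
}
```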
Thanks very much for the tips Tim.
As a follow-up, it seems the problem lies in certain DMA settings when dealing with SDIO: the flow master should be the SDIO, not the DMA controller. The transfer size should be word (32 bits) only, and the burst size should be set to 4 increments.
Frank, thanks for a great blog, particularly on the STM32, SDIO and DMA universe. I have been struggling with SDIO and DMA for RX recently; perhaps someone would be able to help me out here. I am using STM32CubeMX with firmware 1.5.0 on an STM32F407Z board.
When trying to transfer 512 bytes from the SD card to memory, the last 4 bytes won’t transfer. I can see that from DMA2_S3NDTR (NDT bits 15-0) still holding 4. DMA2_S3CR has its EN bit set to 1, which means the DMA transfer has not completed. At the same time, the SDIO status register has bit RXACT set to 1, and RXDAVL set to 1.
The reception buffer is aligned properly (I tried 4, 16 and even 512 and 1024 bytes just for fun). Perhaps somebody could give me a clue?
Wolk – thanks for your posts, they helped me a lot. We have just had the same problem using the Keil FlashFS middleware with an SD card on the STM32L151RD. With your suggested mod all works fine at 24MHz + 4-bit. It looks like a processor bug to me, as it is probably the DMA taking time to get going on memory-to-peripheral transfers, but I have not tried the DMA on other peripherals, so cannot be sure. Interestingly, this is not needed for peripheral-to-memory DMA during reading.
Very interesting – I wonder why it needs the delay. Perhaps the card needs some extra time to fully power up.
Hello again. I found a solution for my problem with writing to the SD card (24MHz SDIO clock and 4-bit bus).
The solution is:
– configure DPSM (timeout, data length)
– configure DMA transfer
– disable SDIO_CK clock output (clear CLKEN bit in the SDIO CLKCR register)
– enable DPSM (set DTEN bit in the SDIO DCTRL register)
– wait for a short delay (in my case a minimum of 96 system clocks -> 96 NOP commands)
– enable SDIO_CK clock output (set CLKEN bit)
What this is and why it works, I don’t know. But this trick works… so far so good… Maybe it helps someone.
I’ve tried to use the SPL SDIO library on my STM32L151RD, but had no luck with a 4-bit bus and frequencies above 8MHz. So I decided to write my own, and the information on this page was very useful. Now my lib can read and write with a 16MHz SDIO clock (the next step up is 24MHz, and there I get overrun/underrun errors because my CPU clock is only 32MHz).
Then I wrote read/write functions with DMA. With it, read works fine with a 4-bit bus and 24MHz SDIO clock. But write demonstrates odd behavior. With a 1-bit bus at 24MHz the write goes fine, but with a 4-bit bus, after I send CMD24 (CMD25), configure DMA, configure DPSM and set the DTEN bit in the DCTRL register (to start the actual transfer), a TX UNDERRUN error pops up instantly. It looks like the DMA does not have time to provide the first portion of data for the SDIO TX FIFO.
Can anybody say something about it?
Good find – thanks!
RE:The SDIOIT bit. It turns out it is documented, just not by “SDIO Mode”…..
See https://groups.yahoo.com/neo/groups/nuttx/conversations/messages/7899
Well, if you’re devoting the bulk of your memory to one input buffer, one output buffer, and one working buffer, all of equal size, then it’s pretty close to ideal. CCM can also be a good place to put stacks, task state, application variables, etc. I’ve been using it as main memory, with the system memory set aside for DMA buffers. Some more flexibility would be nice, but it seems a reasonable compromise.
If you can execute code from CCM, that’d be another possible application for hard-realtime tasks, due to more deterministic execution speed…no waiting for code to load from flash or for a peripheral to finish up with the system SRAM. I’m not sure if this can actually be done, though…the AHB diagram seems to indicate that the core’s instruction bus can only be connected to the flash, the 112 kB block of system SRAM, or the FSMC. (An aside: the 16 kB block of system RAM appears only accessible by the core via the system bus, which might make it a particularly bad location for a stack…which happens to be a setup I’ve seen often in online examples.)
Your comment is very valid and I’m glad you posted it. I personally believe (based on my experience) that the size of the CCM (as a percentage of the total memory of the processor) is far too large for “intermediate values”. Others may disagree with me, and I certainly hope they do.
“For a Cortex-M4 processor that is promoted using DSP type benchmarks (filters and FFTs etc), this is a glaring oversight. ”
It’s not an oversight, it’s a feature, one that’s actually oriented toward DMA-heavy DSP applications. The CCM can be accessed by the core without competing with peripherals for the AHB bus. You can DMA to a buffer in main memory, use CCM for intermediate values, and store the result in main memory for DMA output, with the core only hitting main memory when it needs to load new data or store the results of computations.
In fact, the main memory is split into 112 and 16 kB blocks with separate connections to the AHB matrix, so if you arrange things carefully, you can have two simultaneous DMA operations going while the core is happily crunching data, without any contention for memory accesses.
Not having to wait for the AHB to become available can also be important for some hard-realtime tasks. The CCM is always immediately available.
Remember to align your SD card (if you use a file system) and write data only in blocks that are a multiple of the SD erase-sector size. You will then easily get a faster write speed.
I’ve not personally run faster than 20 MHz, due to the fact that my hardware has a level converter (1.8V – 3.3V) between the STM32 and the SD Card. Given that, the fastest write speed I’ve achieved has been 9 MB/s, which actually isn’t too bad (I’ll admit it, I was happy). I’ve no doubt faster speeds are possible, but you’d certainly need to run the STM32 at 3.3V so you can avoid a level converter to make those faster SD Card clock speeds possible.
Very useful, explained in great detail. Thanks a lot! Out of curiosity: Did you try to operate the SD interface at more than 25MHz? 48MHz should be possible? What was the best write speed you could manage?