IO

udev

Serial Programming

PLC

HslCommunication

modbus

com

minicom

secureCRT

copy

JTAG

Flash

I2C

Memory-mapped IO

IOMMU

Linux

udev

NVMe

InfiniBand

AXI

Outstanding

ring bus

DDR


DDR SDRAM

DDR SDRAM is essentially a 2n prefetch architecture with an interface designed to transfer two data words per clock cycle at the I/O pins. A single read or write access for the DDR SDRAM consists of a single 2n–bit wide data transfer at the internal DRAM array during one clock cycle, and two corresponding n–bit wide data transfers at the I/O pins, each during one-half of the clock cycle. Therefore, the internal data bus is twice as wide as the external interface would indicate.
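
As a rough worked example (illustrative numbers, not taken from any particular datasheet): a DDR-400 device runs its I/O clock at 200 MHz and transfers two data words per clock, i.e. 400 MT/s. A x16 part can therefore move up to 400 M × 2 bytes = 800 MB/s at the pins, while internally the array supplies one 2 × 16 = 32-bit word per clock cycle, which is exactly the 2n prefetch described above.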

Read and write accesses to the DDR SDRAM are burst oriented. Accesses start at a selected location and continue for a programmed number of locations in a programmed sequence. The burst length can be programmed to 2, 4, or 8 locations. Accesses begin with the registration of an Active command, which is then followed by a Read or Write command. The address bits registered with an Active command are used to select the bank and row to be accessed. The address bits registered with a Read or Write command are used to select the bank and the starting column location for the burst access. An Auto Precharge function may be enabled to provide a row precharge that is initiated at the end of the burst access.

Because of the high data-transition speeds on a heavily loaded bus, the low-voltage TTL interface cannot be used for the I/O buffers. For that reason, DDR SDRAM uses an I/O interface called SSTL_2 (Stub Series Terminated Logic for 2.5 V).

Read Operation

Burst read operations are initiated with a Read command. The starting column and bank addresses are provided with this command. The Auto Precharge operation is either enabled or disabled for that burst access. If Auto Precharge is enabled, the row that is accessed will start precharge at the completion of the burst operation.
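
A minimal sketch of that command sequence (C, with print stubs standing in for a real memory controller that would drive the command and address pins; the bank/row/column values are made up):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical command stubs: a real controller would drive RAS#/CAS#/WE#
     * and the address bus instead of printing. */
    static void issue_active(int bank, int row)
    { printf("ACTIVE bank=%d row=0x%X\n", bank, row); }

    static void issue_read(int bank, int col, bool auto_precharge)
    { printf("READ%s  bank=%d col=0x%X\n", auto_precharge ? "A" : "", bank, col); }

    int main(void)
    {
        issue_active(1, 0x0C3);        /* ACTIVE: open bank 1, row 0x0C3 */
        issue_read(1, 0x040, true);    /* READ from column 0x040; the programmed
                                          burst (2, 4 or 8 words) follows after the
                                          CAS latency, then auto precharge closes
                                          the row */
        return 0;
    }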

Figure: Consecutive burst Read operations for the DDR SDRAM memory


DMA


Reasons for using DMA

  • The processor can be doing something else while the transfer is in progress
  • Data transformations
  • Lower power
  • Higher data throughput

Types of DMA transfer

  • single-cycle DMA transfers: the controller produces or consumes data one word at a time
  • burst transactions: the controller transfers a block of data in a single, longer bus transaction

Cache Coherency

  • systems with bus snooping (cache snooping): the hardware keeps the caches coherent with DMA traffic automatically
  • when the hardware doesn’t have snooping, DMA-based device drivers usually use one of two techniques to avoid cache coherency problems (see the sketch after this list):
      1. ensure that the data buffers are allocated from a non-cacheable region of memory, or are marked as non-cacheable by the processor’s memory management unit
      2. have the device driver software explicitly flush or invalidate the data cache (depending on the transfer direction) before initiating a transfer or making data buffers available to bus-mastering peripherals
  • as well as ensuring that the data buffers themselves are coherent with the cache, the driver may also have to ensure that control structures and DMA descriptor lists are cache-coherent
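
On Linux, for instance, the second technique is usually hidden behind the streaming DMA-mapping API, which performs whatever clean/invalidate (or bounce-buffering) the platform needs. A minimal sketch for a device-to-memory transfer; the device-specific calls start_device_dma() and wait_for_dma_complete() are hypothetical:

    #include <linux/dma-mapping.h>

    /* Sketch: map a driver-owned buffer for a device-to-memory transfer.
     * dma_map_single() does any cache maintenance needed on non-coherent
     * hardware; the buffer must not be touched by the CPU until it is
     * unmapped again after the transfer completes. */
    static int rx_one_buffer(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t bus_addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

        if (dma_mapping_error(dev, bus_addr))
            return -ENOMEM;

        start_device_dma(bus_addr, len);   /* hypothetical device-specific call */
        wait_for_dma_complete();           /* hypothetical */

        dma_unmap_single(dev, bus_addr, len, DMA_FROM_DEVICE);
        return 0;
    }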

Address Translation

To set up a DMA transfer to or from a user buffer that spans non-contiguous physical pages, a device driver would have to (see the sketch after this list):

  • build a scatter gather table that links together the non-contiguous physical pages that make up the data buffer
  • translate the address of the individual pages into physical addresses
  • (for a PCI device) translate the physical addresses into PCI bus addresses the device can process
  • create a DMA descriptor list using the physical or bus-specific address, as appropriate
  • program the DMA controller or device with the DMA descriptor list
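
On Linux most of these steps are wrapped by the scatter-gather mapping API. A hedged sketch, assuming the pages have already been gathered into a struct scatterlist (for example with sg_init_table()/sg_set_page(), omitted here) and that build_descriptor() stands in for whatever descriptor format the DMA controller expects:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* Sketch: translate a scatterlist into bus addresses and build the
     * controller's descriptor list from it. */
    static int map_and_describe(struct device *dev, struct scatterlist *sgl, int nents)
    {
        struct scatterlist *sg;
        int i, mapped;

        mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
        if (mapped == 0)
            return -ENOMEM;

        for_each_sg(sgl, sg, mapped, i) {
            /* sg_dma_address()/sg_dma_len() return the (possibly bus-specific)
             * address and length of each mapped segment. */
            build_descriptor(sg_dma_address(sg), sg_dma_len(sg));   /* hypothetical */
        }
        return mapped;
    }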

Buffer Alignment

If the application is providing the buffer, it may not be correctly aligned for the device:

  • force the application to use strictly aligned buffers
  • copy the data from the misaligned user buffer into a correctly aligned temporary buffer that is used for the DMA operation -> this adds significant extra processing effort and can cancel out any performance gain from using DMA
  • transfer the misaligned head and tail of the buffer using programmed I/O and the aligned middle using DMA (a sketch of this split follows this list)
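
A small sketch of that split, assuming a 4-byte device alignment requirement and hypothetical pio_write()/dma_write() transfer primitives:

    #include <stddef.h>
    #include <stdint.h>

    #define DMA_ALIGN 4u   /* assumed device alignment requirement */

    /* Hypothetical transfer primitives. */
    void pio_write(const void *buf, size_t len);
    void dma_write(const void *buf, size_t len);

    /* Send the misaligned head by PIO, the aligned middle by DMA,
     * and any tail remainder by PIO. */
    static void send_buffer(const uint8_t *buf, size_t len)
    {
        uintptr_t addr = (uintptr_t)buf;
        size_t head = (DMA_ALIGN - (addr % DMA_ALIGN)) % DMA_ALIGN;

        if (head > len)
            head = len;

        size_t body = ((len - head) / DMA_ALIGN) * DMA_ALIGN;
        size_t tail = len - head - body;

        if (head) pio_write(buf, head);
        if (body) dma_write(buf + head, body);
        if (tail) pio_write(buf + head + body, tail);
    }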

Problems with cache coherency were discussed above but there is also a cache problem caused by incorrect buffer alignment. This occurs when the data buffer used for a DMA transfer shares a cache line with other program data. Consider the situation where a variable used by the program is in the same cache line as part of a buffer that is the destination of a DMA transfer. The application requests a DMA transfer and then updates the variable, which would normally update the value in cache but not immediately in main memory. At the end of the transfer the DMA controller driver invalidates the cache line, to avoid the cache being stale with respect to the data just DMA’d into main memory. At this point, main memory contains the correct value of the data from the DMA transfer along with the stale value of the variable used by the program. Because the cache line is now invalid, when the program next accesses the variable the stale value from the main memory will be loaded into the cache which means that the previous update has been lost.

The simple solution to this problem is to ensure that buffers used for DMA are multiples of the cache line size (typically 32 or 64 bytes) and are aligned to a cache-line size boundary.
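
A minimal C11 sketch of that solution; the 64-byte line size is an assumption and should be replaced by the target's real cache-line size:

    #include <stdlib.h>

    #define CACHE_LINE 64u   /* assumed cache-line size */

    /* Round the size up to whole cache lines and return a buffer that starts
     * on a cache-line boundary, so no unrelated data can share a line with
     * the DMA buffer.  Release with free(). */
    static void *alloc_dma_buffer(size_t size)
    {
        size_t rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
        return aligned_alloc(CACHE_LINE, rounded);
    }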


(Figure: D-cache/DMA coherency example on a Cortex-M7 device: the CPU writes a buffer in SRAM1 through the write-back D-cache, then a DMA master reads stale data directly from SRAM1.)

  • Solution 1: to perform a cache maintenance operation after writing data to a cacheable memory region, by forcing a D-cache clean operation in software through the CMSIS function SCB_CleanDCache() (all the dirty lines are written back to SRAM1). See the sketch after this list.
  • Solution 2: in order to ensure the cache coherency, the user must modify the MPU attribute of the SRAM1 from write-back (default) to write-through policy.
  • Solution 3: to modify the MPU attribute of the SRAM1 by using a shared attribute. By default this prevents the SRAM1 from being cached in the D-cache.
  • Solution 4: to perform a cache maintenance operation by forcing write-through policy for all writes. This can be enabled by setting the force write-through bit (FORCEWT) in the CACR control register.
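
A short sketch of Solution 1, assuming an STM32F7-class (Cortex-M7) device; the buffer name and the commented-out transfer call are placeholders, and SCB_CleanDCache_by_Addr() is used instead of SCB_CleanDCache() to limit the clean to the buffer itself:

    #include "stm32f7xx.h"   /* CMSIS device header, pulls in core_cm7.h (assumed target) */

    /* Buffer placed in SRAM1, aligned and sized to whole 32-byte cache lines. */
    static uint8_t tx_buf[256] __attribute__((aligned(32)));

    void start_tx_dma(void)
    {
        /* ... CPU fills tx_buf through the (write-back) D-cache ... */

        /* Clean: write the dirty lines covering tx_buf back to SRAM1 so the
         * DMA reads up-to-date data. */
        SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, sizeof(tx_buf));

        /* HAL_UART_Transmit_DMA(&huart1, tx_buf, sizeof(tx_buf));  placeholder */
    }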

The data coherency between the core and the DMA is ensured by:

  1. Either making the SRAM1 buffers not cacheable.
  2. Or making the SRAM1 buffers cache enabled with write-back policy, with the coherency ensured by software (clean or invalidate D-Cache).
  3. Or modifying the SRAM1 region in the MPU attribute to a shared region.
  4. Or making the SRAM1 buffer cache enabled with write-through policy.

Another case is when the DMA is writing to the SRAM1 and the CPU is going to read data from the SRAM1. To ensure the data coherency between the cache and the SRAM1, the software must perform a cache invalidate before reading the updated data from the SRAM1.
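
The corresponding sketch for that direction (same assumptions as above; the buffer must be cache-line aligned and a whole number of lines long, otherwise the invalidate can throw away neighbouring data):

    #include "stm32f7xx.h"   /* CMSIS device header (assumed target) */

    static uint8_t rx_buf[256] __attribute__((aligned(32)));   /* in SRAM1 */

    void on_rx_dma_complete(void)
    {
        /* Invalidate: discard any stale cached copy of rx_buf so the next CPU
         * reads fetch the data the DMA just wrote into SRAM1. */
        SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, sizeof(rx_buf));

        /* process(rx_buf, sizeof(rx_buf));  hypothetical consumer */
    }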

Mistakes to avoid and tips

  • After reset, the user must invalidate each cache before enabling it, otherwise UNPREDICTABLE behavior can occur.
  • When disabling the data cache, the user must clean the entire cache to ensure that any dirty data is flushed to the external memory.
  • Before enabling the data cache, the user must invalidate the entire data cache if the external memory might have changed since the cache was disabled.
  • Before enabling the instruction cache, the user must invalidate the entire instruction cache if the external memory might have changed since the cache was disabled.
  • If the software is using cacheable memory regions for the DMA source and/or destination buffers, it must trigger a cache clean before starting a DMA operation to ensure that all the data are committed to the subsystem memory. After the DMA transfer completes, when reading the data from the peripheral, the software must perform a cache invalidate before reading the DMA-updated memory region.
  • It is generally better to use non-cacheable regions for DMA buffers. The software can use the MPU to set up a non-cacheable memory block to use as a shared memory between the CPU and the DMA.
  • Do not enable cache for the memory that is being used extensively for a DMA operation.
  • When using the ART accelerator, the CPU can read an instruction from the internal Flash memory in just one clock cycle (equivalent to 0-wait state), so the I-cache cannot be used for the internal Flash memory.
  • When using NOR Flash, the write-back policy causes problems because the erase and write command sequences can stay in the cache and are not sent to the external Flash memory.
  • If the connected device is a normal memory, a D-cache read is useful. However, if the external device is an ASIC and/or a FIFO, the user must disable the D-cache for reading.

A system containing an external DMA device and a core provides a simple example of possible problems. There are two situations in which a breakdown of coherency can occur. If the DMA reads data from main memory while newer data is held in the core cache, the DMA will read the old data. Similarly, if a DMA writes data to main memory and stale data is present in the core cache, the core can continue to use the old data.