05 Memory (and FIFOs) - alex-aleyan/xilinx GitHub Wiki
Sources:
- DDR Memory Playlist
- https://www.systemverilog.io/ddr4-basics
- https://www.youtube.com/watch?v=kCmu5k6jfXM&list=PLQgOKy0AlWh9SfysI81pNwQVxzq6ROiN6
- "UG954 ZC706 eval board" contains good notes on DDR memory on page 18, section "DDR3 SODIMM Memory (PL)" - YouTube link
- DMA - Xilinx AXI DMA Product Guide
- CLBs:
- ug574-ultrascale-clb
- ug573-ultrascale-memory-resources
- https://www.amd.com/en/products/adaptive-socs-and-fpgas/intellectual-property/dist_mem_gen.html#:~:text=The%20Distributed%20Memory%20Generator%20is,get%20up%20and%20running%20quickly
- https://docs.amd.com/r/en-US/ug901-vivado-synthesis/Distributed-RAM-Examples
- https://docs.amd.com/r/en-US/ug574-ultrascale-clb/Distributed-RAM-Applications
- BRAM:
- ug573-ultrascale-memory-resources
- See Table 9-12 in "Block RAM Port Signals" section.
- https://www.amd.com/en/products/adaptive-socs-and-fpgas/technologies/memory.html
Terminology:
- CPU/Motherboard level:
- Memory Channel
- DIMMs
- RAM Stick level:
- Memory Ranks
- DDR4 IC level:
- Bank Group
- Bank Address
- Row Address
- Column Address
- Data Bus/Strobe x2,x4,x8,x16
- Xilinx
- distributed RAM (0.6 ~ 103Mb; UG574)
- FIXME: add notes about a LUT's configurability as a 32x1 FIFO (in addition to the already mentioned 64x1 dRAM).
- Supports data depths ranging from 16 to 65,536 words
- Supports data widths ranging from 1 to 1024 bits
- 7-Series:
- SLICEMs make up only 25% of the total slices.
- 50 CLBs (Configurable Logic Blocks) per CR (clock region), each CLB with 2 slices: 100 slices per CR in total, of which 25 are SLICEM and 75 are SLICEL.
- 25 SLICEMs * 4 LUTs per SLICEM * 64 bits of storage per LUT (64x1) = 6,400 bits (roughly 6 Kb) of distributed RAM storage per clock region.
- Ultrascale:
- 60 CLBs per CR (also 24 DSP48E2 slices and 12 BRAMs per CR; the 52 I/Os of a bank and 4 GTs (gigabit transceivers) are pitch matched to the CRs).
- CLB-M and CLB-L (what's the percentage distribution? Still 25M/75L?)
- Can be configured as (UG574, "Distributed RAM (SLICEM Only)" section):
  - Single-port 32 x (1 to 16)-bit RAM
  - Single-port 64 x (1 to 8)-bit RAM
  - Single-port 128 x (1 to 4)-bit RAM
  - Single-port 256 x (1 to 2)-bit RAM
  - Single-port 512 x 1-bit RAM
  - Dual-port 32 x (1 to 8)-bit RAM
  - Dual-port 64 x (1 to 4)-bit RAM
  - Dual-port 128 x 2-bit RAM
  - Dual-port 256 x 1-bit RAM
  - Simple dual-port 32 x (1 to 14)-bit RAM
  - Simple dual-port 64 x (1 to 7)-bit RAM
  - Quad-port 32 x (1 to 4)-bit RAM
  - Quad-port 64 x (1 to 2)-bit RAM
  - Quad-port 128 x 1-bit RAM
  - Octal-port 64 x 1-bit RAM
- Block RAM (0.8Mb ~ 174Mb; UG573)
- 7-Series:
- Ultrascale:
- 36Kb single RAM, or two 18Kb RAMs, or one 18Kb RAM + one 18Kb FIFO.
- Up to 12 36Kb BRAM blocks can be cascaded per clock region.
- Adjacent blocks combine into a 64K x 1 RAM without extra logic.
- Can be configured as (depth x width; the 512-deep configurations are available in simple dual-port mode only):

| Block | Depth x width | Single-port | Dual-port | Simple dual-port |
| --- | --- | --- | --- | --- |
| 18Kb | 16K x 1 | yes | yes | yes |
| 18Kb | 8K x 2 | yes | yes | yes |
| 18Kb | 4K x 4 | yes | yes | yes |
| 18Kb | 2K x 9 | yes | yes | yes |
| 18Kb | 1K x 18 | yes | yes | yes |
| 18Kb | 512 x 36 | - | - | yes |
| 36Kb | 32K x 1 | yes | yes | yes |
| 36Kb | 16K x 2 | yes | yes | yes |
| 36Kb | 8K x 4 | yes | yes | yes |
| 36Kb | 4K x 9 | yes | yes | yes |
| 36Kb | 2K x 18 | yes | yes | yes |
| 36Kb | 1K x 36 | yes | yes | yes |
| 36Kb | 512 x 72 | - | - | yes |

- Ultrascale+:
- UltraRAM (6.8Mb ~ 717Mb; UG573)
- 288Kb memory blocks (72b wide x 4K word deep).
- 16 URAMs per clock region per column (4.608Mb per column or 64K word deep 72-bit wide).
- 90~150 columns per ultrascale+ device.
- Dual port, single clock, optional output register, 64-bit ECC/Hamming-Code/SECDED
- Built-in hard cascade (data, address, and control).
- Does not utilize interconnect resources when cascaded within a single column of URAMs.
- Cascading between columns utilizes minimal logic resources at the entry/exit points of the columns.
- High Bandwidth Memory (HBM; 4GB ~ 16GB in UltraScale+ and 4GB ~ 32GB in Versal Adaptive SoC)
- Hard interface for external DDR4 (2133 ~ 2666 MT/s, i.e. mega-transfers per second, or Mb/s per data pin).
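
The capacity figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not anything taken from the user guides: the function names are made up for illustration, the constants are the ones from these notes, and the block RAM helper assumes that 8/9 of a block is data with the remaining 1/9 being parity bits that are reachable only at widths 9, 18, 36 and 72.

```python
# Sanity-check arithmetic for the on-chip memory figures in these notes.
# Function names are illustrative; constants come from the notes above.

def dram_bits_per_clock_region(clbs=50, slices_per_clb=2, slicem_fraction=0.25,
                               luts_per_slice=4, bits_per_lut=64):
    """7-Series distributed RAM bits available in one clock region."""
    slicems = int(clbs * slices_per_clb * slicem_fraction)  # 25 of 100 slices
    return slicems * luts_per_slice * bits_per_lut


def bram_depth(block_kb, port_width):
    """Depth of an 18Kb or 36Kb block RAM at a given port width.

    Assumption: the data portion is 8/9 of the nominal block size; the
    parity bits add one extra bit per data byte at widths 9/18/36/72.
    """
    data_bits = block_kb * 1024 * 8 // 9        # 18Kb -> 16384, 36Kb -> 32768
    data_width = port_width - port_width // 9   # 9 -> 8, 36 -> 32, 4 -> 4
    return data_bits // data_width


def uram_column(blocks=16, depth=4096, width=72):
    """Total Kb and cascaded depth of one clock-region column of URAMs."""
    total_kb = blocks * depth * width // 1024
    return total_kb, blocks * depth


print(dram_bits_per_clock_region())           # bits of dRAM per clock region
print(bram_depth(18, 36), bram_depth(36, 72))  # simple dual-port widest modes
print(uram_column())                           # (Kb per column, words deep)
```

Under these assumptions the widest simple dual-port modes come out 512 words deep (512 x 36 and 512 x 72), and a column of 16 URAMs comes out at 4,608 Kb and 64K words deep, which matches the figures listed above.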
Design Flow Recommendations by Xilinx (UG574)
CLB resources are inferred for generic design logic and do not require instantiation. HDL or high-level synthesis (HLS) is sufficient to achieve an efficient implementation. Consider these coding practices when targeting UltraScale architecture CLBs:
- CLB flip-flops have either a set or a reset. Do not use both set and reset on the same element.
- Flip-flops are abundant. Consider pipelining to improve performance.
- Control inputs are shared across multiple resources in a CLB. Minimize the number of unique control inputs required for a design. Control inputs include clock, clock enable, set/reset, and write enable.
- To efficiently implement shift registers in the LUTs, avoid resets on the shift registers.
- For small storage requirements, a 6-input LUT can be used as 64 x 1 memory.
- Standard arithmetic functions are effectively implemented using dedicated carry logic.

The recommended design flow:
- Implement the design using preferred methodologies (HDL, HLS, IP, etc.).
- Evaluate utilization reports to determine resources used. Check to ensure that arithmetic logic, distributed RAM, and SRL are used, when helpful.
- Consider flip-flop usage:
  - Pipeline for performance.
  - Use dedicated flip-flops at the outputs of dedicated resources (block RAM, DSP).
  - Allow shift registers to use SRLs (avoid set/resets).
- Minimize the use of set/resets. The flip-flops are automatically initialized every time the device is powered up.
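
The advice about shared control inputs can be made concrete by counting how many distinct (clock, clock enable, set/reset) combinations a design uses. The sketch below is a toy model only: the flip-flop list and net names are hypothetical, and the grouping is a simplified stand-in for what the tools report as control sets.

```python
from collections import Counter

# Hypothetical flip-flops, each described by the nets driving its
# (clock, clock_enable, set_reset) inputs; None means the input is unused.
flops = [
    ("clk", "ce_a", "rst"),
    ("clk", "ce_a", "rst"),
    ("clk", "ce_b", "rst"),
    ("clk", "ce_b", None),
    ("clk", None, None),
]


def control_sets(flops):
    """Group flip-flops by their unique control-input combination."""
    return Counter(flops)


def without_resets(flops):
    """Same flip-flops with every set/reset removed (relying on power-up init)."""
    return [(clk, ce, None) for clk, ce, _ in flops]


before = control_sets(flops)
after = control_sets(without_resets(flops))
print(len(before), len(after))  # prints "4 3"
```

In this toy case, dropping the resets merges two groups that differed only in their reset net, so more flip-flops can share the same control inputs.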
Resources Examples:
- Artix 7 7a50t
- available Slices = 8,150
- available LUTs as logic (utilizes Slice-Ls & Slice-Ms) = 8150 x 4 = 32,600
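- available LUTs as dRAM (utilizes Slice-Ms only) = 9,600 (from report). This figure counts LUTs, not slices: 9,600 / 4 LUTs per slice = 2,400 Slice-Ms, or about 29% of the 8,150 slices, so comparing it against 8150 x 1 mixes units.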
- available FFs = 8150 x 8 = 65,200
- BRAM Tiles = 75
- RAMB36/FIFO = 75
- RAMB18 = 75 * 2 = 150
- DSPs = 120
- Bonded IOBs = 250
- IBUFDS = 240
- IBUFDS_GTE2 = 2
- ILOGIC=250, OLOGIC=250
- XCUX35-3VSVA1365E (Alveo X3522 A-X3522-P08G-PQ-G)
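
The Artix-7 7a50t figures above are all derived from a handful of per-slice and per-tile ratios. The sketch below recomputes them; the variable names are illustrative, and the only inputs are the slice count, the reported LUTs-as-dRAM count, and the BRAM tile count from the list above.

```python
# Recompute the Artix-7 7a50t resource figures from the numbers in the notes.
SLICES = 8150
LUTS_PER_SLICE = 4
FFS_PER_SLICE = 8
LUTS_AS_DRAM = 9600      # taken from the utilization report
BITS_PER_LUT = 64        # one 6-input LUT used as a 64 x 1 memory
BRAM_TILES = 75          # each tile is one RAMB36 or two RAMB18

luts = SLICES * LUTS_PER_SLICE                  # LUTs usable as logic
ffs = SLICES * FFS_PER_SLICE                    # flip-flops
slicems = LUTS_AS_DRAM // LUTS_PER_SLICE        # slices that can hold dRAM
slicem_share = slicems / SLICES                 # fraction of all slices
dram_kb = LUTS_AS_DRAM * BITS_PER_LUT // 1024   # max distributed RAM in Kb
ramb18 = BRAM_TILES * 2                         # RAMB18 primitives
bram_kb = BRAM_TILES * 36                       # total block RAM in Kb

print(luts, ffs, slicems, round(slicem_share, 3), dram_kb, ramb18, bram_kb)
```

Read this way, the 9,600 figure corresponds to 2,400 Slice-Ms and a maximum of 600 Kb of distributed RAM, alongside 2,700 Kb of block RAM.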