05 Memory (and FIFOs) - alex-aleyan/xilinx GitHub Wiki

Sources:

Terminology:

  • CPU/Mother board level:
    • Memory Channel
    • DIMMs
  • RAM Stick level:
    • Memory Ranks
  • DDR4 IC level:
    • Bank Group
    • Bank Address
    • Row Address
    • Column Address
    • Data Bus/Strobe x2,x4,x8,x16
  • Xilinx
    • distributed RAM (0.6 ~ 103Mb; UG574)
      • FIXME: add notes about the LUT's configurability as a 32x1 FIFO (in addition to the already mentioned 64x1 dRAM).
      • Supports data depths ranging from 16 to 65,536 words
      • Supports data widths ranging from 1 to 1024 bits
      • 7-Series:
        • SLICEMs make up only about 25% of the total slices.
        • 50 CLBs (Configurable Logic Blocks) per CR (clock region), where each CLB has 2 slices, for a total of 100 slices per CR (25 SLICEM and 75 SLICEL).
        • 25 SLICEMs * 4 LUTs per SLICEM * 64 bits of storage per LUT (64x1) = 6,400 bits (roughly 6Kb) of distributed RAM storage per clock region.
      • Ultrascale:
        • 60 CLBs per CR, plus 24 DSP48E2 slices and 12 BRAMs per CR; 52 I/Os per bank and 4 gigabit transceivers (GTs) are pitch-matched to the CRs.
        • CLB-M and CLB-L (what's the percentage distribution? Still 25M/75L?)
        • Can be configured as (UG574, "Distributed RAM (SLICEM Only)" section):
          • Single-port 32 x (1 to 16)-bit RAM
          • Single-port 64 x (1 to 8)-bit RAM
          • Single-port 128 x (1 to 4)-bit RAM
          • Single-port 256 x (1 to 2)-bit RAM
          • Single-port 512 x 1-bit RAM
          • Dual-port 32 x (1 to 8)-bit RAM
          • Dual-port 64 x (1 to 4)-bit RAM
          • Dual-port 128 x 2-bit RAM
          • Dual-port 256 x 1-bit RAM
          • Simple dual-port 32 x (1 to 14)-bit RAM
          • Simple dual-port 64 x (1 to 7)-bit RAM
          • Quad-port 32 x (1 to 4)-bit RAM
          • Quad-port 64 x (1 to 2)-bit RAM
          • Quad-port 128 x 1-bit RAM
          • Octal-port 64 x 1-bit RAM
    • Block RAM (0.8Mb ~ 174Mb; UG573)
      • 7-Series:
      • Ultrascale:
        • 36Kb single RAM, or two 18Kb RAMs, or one 18Kb RAM + one 18Kb FIFO.
        • Up to 12 36Kb BRAM blocks can be cascaded per clock region.
        • Adjacent blocks combine to 64K x 1 without extra logic.
        • Can be configured as (FIXME: replace the below with an HTML table):
          • Single-port 18k: 16k x 1 bit RAM
          • Single-port 18k: 8k x 2 bit RAM
          • Single-port 18k: 4k x 4 bit RAM
          • Single-port 18k: 2k x 9 bit RAM
          • Single-port 18k: 1k x 18 bit RAM
          • Single-port 36k: 32k x 1 bit RAM
          • Single-port 36k: 16k x 2 bit RAM
          • Single-port 36k: 8k x 4 bit RAM
          • Single-port 36k: 4k x 9 bit RAM
          • Single-port 36k: 2k x 18 bit RAM
          • Single-port 36k: 1k x 36 bit RAM
          • Dual-port 18k: 16k x 1 bit RAM
          • Dual-port 18k: 8k x 2 bit RAM
          • Dual-port 18k: 4k x 4 bit RAM
          • Dual-port 18k: 2k x 9 bit RAM
          • Dual-port 18k: 1k x 18 bit RAM
          • Dual-port 36k: 32k x 1 bit RAM
          • Dual-port 36k: 16k x 2 bit RAM
          • Dual-port 36k: 8k x 4 bit RAM
          • Dual-port 36k: 4k x 9 bit RAM
          • Dual-port 36k: 2k x 18 bit RAM
          • Dual-port 36k: 1k x 36 bit RAM
          • Simple Dual-port 18k: 16k x 1 bit RAM
          • Simple Dual-port 18k: 8k x 2 bit RAM
          • Simple Dual-port 18k: 4k x 4 bit RAM
          • Simple Dual-port 18k: 2k x 9 bit RAM
          • Simple Dual-port 18k: 1k x 18 bit RAM
          • Simple Dual-port 18k: 512 x 36 bit RAM
          • Simple Dual-port 36k: 32k x 1 bit RAM
          • Simple Dual-port 36k: 16k x 2 bit RAM
          • Simple Dual-port 36k: 8k x 4 bit RAM
          • Simple Dual-port 36k: 4k x 9 bit RAM
          • Simple Dual-port 36k: 2k x 18 bit RAM
          • Simple Dual-port 36k: 1k x 36 bit RAM
          • Simple Dual-port 36k: 512 x 72 bit RAM
      • Ultrascale+:
    • UltraRAM (6.8Mb ~ 717Mb; UG573)
      • 288Kb memory blocks (72b wide x 4K word deep).
      • 16 URAMs per column per clock region (16 x 288Kb = 4,608Kb per column, i.e. 64K words deep x 72 bits wide).
        • 90~150 columns per UltraScale+ device.
      • Dual port, single clock, optional output register, 64-bit ECC/Hamming-Code/SECDED
      • Built-in hard cascade (data, address, and control).
        • Does not utilize interconnect resources when cascaded within a single column of URAMs.
        • Cascading between columns utilizes minimal logic resources at the entry/exit points of the columns.
    • High Bandwidth Memory (HBM; 4GB ~ 16GB in UltraScale+ and 4GB ~ 32GB in Versal Adaptive SoC)
    • Hard interface for external DDR4 (2133~2666 MT/s, i.e. mega-transfers per second, equivalent to Mb/s per data pin).
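
The capacity figures in the list above can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original notes: the function names are hypothetical, and it assumes 1 Kb = 1,024 bits and the 25-SLICEM-per-clock-region figure quoted above.

```python
KBIT = 1024  # assumption: 1 Kb = 1,024 bits


def dram_bits_per_clock_region(slicems=25, luts_per_slice=4, bits_per_lut=64):
    """Distributed RAM bits per 7-series clock region, using the counts quoted above."""
    return slicems * luts_per_slice * bits_per_lut


def fits_in_bram(depth, width, block_kbits):
    """True if a depth x width aspect ratio fits in a block of block_kbits Kb."""
    return depth * width <= block_kbits * KBIT


def uram_column(blocks=16, block_depth=4 * 1024, width=72):
    """Depth, width, and size in Kb of 16 cascaded URAM blocks in one column."""
    depth = blocks * block_depth
    return depth, width, depth * width // KBIT


print(dram_bits_per_clock_region())    # 6400
print(fits_in_bram(16 * 1024, 1, 18))  # True (16,384 bits; parity bits unused)
print(fits_in_bram(512, 36, 18))       # True (exactly 18,432 bits)
print(fits_in_bram(512, 72, 36))       # True (exactly 36,864 bits)
print(uram_column())                   # (65536, 72, 4608)
```

The narrow BRAM aspect ratios (x1, x2, x4) use only the data bits, while the x9, x18, x36, and x72 widths also use the parity bits, which is why those configurations fill the block exactly.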

Design Flow Recommendations by Xilinx (UG574)

CLB resources are inferred for generic design logic and do not require instantiation. HDL or high-level synthesis (HLS) is sufficient to achieve an efficient implementation. Xilinx recommends considering the following coding practices when targeting UltraScale architecture CLBs:

  • CLB flip-flops have either a set or a reset. Do not use both set and reset on the same element.
  • Flip-flops are abundant. Consider pipelining to improve performance.
  • Control inputs are shared across multiple resources in a CLB. Minimize the number of unique control inputs required for a design. Control inputs include clock, clock enable, set/reset, and write enable.
  • To efficiently implement shift registers in the LUTs, avoid resets on the shift registers.
  • For small storage requirements, a 6-input LUT can be used as 64 x 1 memory.
  • Standard arithmetic functions are effectively implemented using dedicated carry logic.

The recommended design flow:

  1. Implement the design using preferred methodologies (HDL, HLS, IP, etc.).
  2. Evaluate utilization reports to determine resources used. Check to ensure that arithmetic logic, distributed RAM, and SRL are used, when helpful.
  3. Consider flip-flop usage:
    • Pipeline for performance.
    • Use dedicated flip-flops at the outputs of dedicated resources (block RAM, DSP).
    • Allow shift registers to use SRLs (avoid set/resets).
  4. Minimize the use of set/resets. The flip-flops are automatically initialized every time the device is powered up.
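
Two of the practices above (using a 6-input LUT as a 64 x 1 memory, and keeping resets off shift registers so they can map to SRLs) may be easier to picture with a small behavioral model. The sketch below is plain Python, not synthesizable HDL and not taken from UG574; the class names `LutRam64x1` and `Srl32` are hypothetical. It assumes synchronous write with asynchronous read for the LUT RAM, and a 32-deep shift register with an addressable tap.

```python
class LutRam64x1:
    """Behavioral model of a LUT used as 64 x 1 RAM (assumed: sync write, async read)."""

    def __init__(self):
        self.cells = [0] * 64

    def read(self, addr):
        # Asynchronous read: the output follows the address with no clock edge.
        return self.cells[addr & 0x3F]

    def clock(self, addr, data, we):
        # Synchronous write: storage changes only on a clock edge with WE high.
        if we:
            self.cells[addr & 0x3F] = data & 1


class Srl32:
    """Behavioral model of an SRL-style shift register: clock enable, no reset."""

    def __init__(self):
        self.bits = [0] * 32

    def clock(self, d, ce=True):
        # Shift by one position on each enabled clock edge.
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def tap(self, addr):
        # Addressable tap: addr 0 is the most recently shifted-in bit.
        return self.bits[addr & 0x1F]


ram = LutRam64x1()
ram.clock(addr=5, data=1, we=True)
print(ram.read(5), ram.read(6))  # 1 0

srl = Srl32()
for bit in (1, 0, 0, 0):
    srl.clock(bit)
print(srl.tap(3), srl.tap(0))    # 1 0
```

`Srl32` deliberately has no reset method: the only way to clear it is to shift in 32 zeros, which mirrors why a shift register described with a reset generally cannot be packed into an SRL.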

Resources Examples:

  • Artix 7 7a50t
    • available Slices = 8,150
      • available LUTs as logic (utilizes Slice-Ls & Slice-Ms) = 8150 x 4 = 32,600
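      • available LUTs as dRAM (utilizes Slice-Ms only) = 9,600 (from report). A flat 25% SLICEM share would give 8150 x 0.25 x 4 = 8,150 LUTs, so the extra 1,450 LUTs imply a higher share: 9,600 / 4 = 2,400 SLICEMs, about 29% of the slices.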
      • available FFs = 8150 x 8 = 65,200
    • BRAM Tiles = 75
      • RAMB36/FIFO = 75
      • RAMB18 = 75 * 2 = 150
    • DSPs = 120
    • Bonded IOBs = 250
      • IBUFDS = 240
      • IBUFDS_GTE2 = 2
      • ILOGIC=250, OLOGIC=250
  • XCUX35-3VSVA1365E (Alveo X3522 A-X3522-P08G-PQ-G)
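
As a rough cross-check of the Artix 7 7a50t numbers above, the sketch below recomputes the derived totals from the reported slice, dRAM-LUT, and BRAM-tile counts. It is illustrative Python with hypothetical variable names; it assumes 4 LUTs and 8 flip-flops per slice, 64 bits per dRAM LUT, 36Kb per BRAM tile, and 1 Kb = 1,024 bits.

```python
SLICES = 8150      # available slices, from the report
DRAM_LUTS = 9600   # LUTs usable as distributed RAM, from the report
BRAM_TILES = 75    # 36Kb BRAM tiles, from the report

luts = SLICES * 4                                # LUTs usable as logic
ffs = SLICES * 8                                 # flip-flops
slicems = DRAM_LUTS // 4                         # SLICEMs implied by the dRAM LUT count
slicem_share = round(100 * slicems / SLICES, 1)  # percent of all slices
dram_kbits = DRAM_LUTS * 64 // 1024              # total distributed RAM in Kb
ramb18 = BRAM_TILES * 2                          # each tile splits into two RAMB18
bram_kbits = BRAM_TILES * 36                     # total block RAM in Kb

print(luts, ffs)              # 32600 65200
print(slicems, slicem_share)  # 2400 29.4
print(dram_kbits)             # 600
print(ramb18, bram_kbits)     # 150 2700
```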