Computer Architect Tips

Courses

Reading Materials

In 1974, Robert Dennard observed that if a chip's feature size shrinks by a factor of S and its frequency rises by a factor of S, power per unit area stays constant as long as the supply voltage is also lowered by a factor of S. At each new process node, transistor density grows by S^2 and frequency rises by S, yet power per unit area remains unchanged.
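A quick check of the arithmetic using the standard dynamic-power relation (a sketch assuming constant-field scaling):

```latex
% Dynamic power of a gate: P = C V^2 f.
% Constant-field scaling: C and V shrink by S, f grows by S.
P' = \frac{C}{S}\left(\frac{V}{S}\right)^{2}(Sf) = \frac{C V^{2} f}{S^{2}}
% Per-gate power falls by S^2, and gate area also falls by S^2,
% so power per unit area P'/A' = P/A is unchanged.
```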

To overcome the power wall, chip designs after 2005 widely adopted the idea of "dark silicon": multi-core and heterogeneous architectures limit the region of the chip that runs at full speed (is "lit up") at any one time, keeping the chip within its power budget.

For chips with huge die areas such as GPGPUs, another wall imposed by semiconductor manufacturing is already close at hand: the lithography wall, i.e., the maximum die area that lithography equipment can expose in a single shot (the reticle limit).

Predicate Register

Scratchpad Memory

| Cache Type | Addressing Scheme | Management Scheme |
| --- | --- | --- |
| Transparent cache | Transparent | Implicit (by cache) |
| Software-managed cache | Transparent | Explicit (by application) |
| Self-managed scratch-pad | Non-transparent | Implicit (by cache) |
| Scratch-pad memory | Non-transparent | Explicit (by application) |

Memory-level parallelism


Cache

Types of Caches

  • Direct mapping: a memory value can only be placed at a single corresponding location in the cache.
  • Set-associative mapping: a memory value can be placed in any location of a set in the cache.
  • Fully-associative mapping: a memory value can be placed anywhere in the cache.

SA Cache
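
A minimal sketch of how a set-associative lookup splits an address; the geometry (64-byte lines, 256 sets) is assumed for illustration:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* log2(64-byte line) */
#define INDEX_BITS  8   /* log2(256 sets)     */

int main(void) {
    uint64_t addr   = 0x7fffdeadbeefULL;
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* The index selects a set, the tag is compared against every way
       in that set, and the offset selects a byte within the line. */
    printf("tag=%#llx index=%#llx offset=%#llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```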

Three (or Four) Cs (Cache Miss Terms)

  • Compulsory Misses:
    • Cold-start misses: the cache holds no valid data when the program starts
  • Capacity Misses:
    • The cache cannot hold the program's entire working set
    • Remedy: increase cache size
  • Conflict Misses:
    • Too many blocks map to the same set
    • Remedy: increase cache size and/or associativity; associative caches reduce conflict misses
  • Coherence Misses:
    • Occur in multiprocessor systems when another processor invalidates a shared line

3Cs Absolute Miss Rate (SPEC92)

  • Compulsory misses are a tiny fraction of the overall misses
  • Capacity misses reduce with increasing sizes
  • Conflict misses reduce with increasing associativity

Reducing Conflict Misses

  • Set associative (SA) cache

    • multiple possible locations in a set
  • Fully associative (FA) cache

    • any location in the cache
  • Hardware and speed overhead

    • Comparators
    • Multiplexors
    • Data selection only after Hit/Miss determination (i.e., after tag comparison)

Cache Write Policy

  • Write through: The value is written both to the cache line and to the lower-level memory.

  • Write back: The value is written only to the cache line. The modified cache line is written to main memory only when it has to be replaced.

On Write Miss

  • Write allocate
    • The line is allocated on a write miss, followed by the write-hit actions above.
    • Write misses first act like read misses.
  • No write allocate
    • Write misses do not disturb the cache.
    • The line is modified only in the lower-level memory.
    • Mostly used with write-through caches.
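
A toy model combining the two decisions above (write-through vs. write-back on a hit, write-allocate vs. no-write-allocate on a miss). The single-line "cache", the 1 KB memory, and all names are illustrative, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static uint8_t mem[1024];   /* toy backing memory; addresses < 1024 */
static struct { bool valid, dirty; uint64_t tag; uint8_t data[64]; } line;

static void write_back_if_dirty(void) {
    if (line.valid && line.dirty)
        memcpy(&mem[line.tag * 64], line.data, 64);
}

void cache_write(uint64_t addr, uint8_t v,
                 bool write_through, bool write_allocate) {
    uint64_t tag = addr / 64, off = addr % 64;
    if (!line.valid || line.tag != tag) {               /* write miss   */
        if (!write_allocate) { mem[addr] = v; return; } /* bypass cache */
        write_back_if_dirty();                          /* evict old    */
        memcpy(line.data, &mem[tag * 64], 64);          /* fill line    */
        line.valid = true; line.dirty = false; line.tag = tag;
    }
    line.data[off] = v;                   /* write-hit actions          */
    if (write_through) mem[addr] = v;     /* keep memory in sync        */
    else               line.dirty = true; /* write back: defer to evict */
}
```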

Write buffers

  • A small cache that can hold a few values waiting to go to main memory.
  • To avoid stalling on writes, many CPUs use a write buffer.
  • It absorbs short bursts of writes so the CPU can keep running.
  • It does not eliminate stalls entirely: the buffer can still fill if a burst of writes is larger than the buffer.
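
A sketch of a small FIFO write buffer (4 entries assumed; all names hypothetical). The CPU retires a store as soon as it enters the buffer and stalls only when the buffer is full:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4
typedef struct { uint64_t addr; uint32_t data; } WBEntry;
static WBEntry buf[WB_ENTRIES];
static int head, count;

/* CPU side: returns false when the buffer is full (CPU must stall). */
bool wb_push(uint64_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;
    buf[(head + count++) % WB_ENTRIES] = (WBEntry){ addr, data };
    return true;   /* store completes from the CPU's point of view */
}

/* Memory side: drains the oldest entry; caller checks count > 0. */
WBEntry wb_drain(void) {
    WBEntry e = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return e;
}
```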

How to improve cache performance?

  • Reducing cache misses by more flexible placement of blocks
  • Reducing the miss penalty using multilevel caches

Cache Replacement Policy

  • Random
    • Replace a randomly chosen line
  • FIFO
    • Replace the oldest line
  • LRU (Least Recently Used)
    • Replace the least recently used line
  • NRU (Not Recently Used)
    • Replace one of the lines that is not recently used
    • Used in the Itanium 2 L1 D-cache and its L2 and L3 caches
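
A recency-stack sketch of LRU for a single set (4 ways and all names are illustrative), which also sets up the insertion-policy variants below:

```c
#include <string.h>

/* stack[0] is the MRU way, stack[WAYS-1] the LRU way. */
#define WAYS 4
static int stack[WAYS] = {0, 1, 2, 3};

static void move_to_mru(int pos) {
    int way = stack[pos];
    memmove(&stack[1], &stack[0], pos * sizeof(int)); /* shift down */
    stack[0] = way;
}

void on_hit(int pos) { move_to_mru(pos); }

int on_miss(void) {                /* returns the way to evict       */
    int victim = stack[WAYS - 1];  /* LRU way is replaced...         */
    move_to_mru(WAYS - 1);         /* ...and the new line enters MRU */
    return victim;
}
```

LIP (next section) changes only the miss path: the incoming line is left at the LRU position (the `move_to_mru` call is skipped) instead of entering at MRU.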

LIP: LRU Insertion Policy

  • With LRU insertion, an incoming block occupies the MRU position for a long time even if it is never reused; with LIP, the incoming block is placed at the LRU position, and becomes the victim at the next eviction if it is not referenced again.

    • Useless block: evicted at the next eviction
    • Useful block: moved to the MRU position

BIP: Bimodal Insertion Policy

  • LIP may not age older lines
  • Infrequently insert lines in MRU position
  • Let e = Bimodal throttle parameter
```c
if ((double)rand() / RAND_MAX < e)  /* rare case                  */
    insert_at_mru(line);            /* plain LRU insertion        */
else
    insert_at_lru(line);            /* common case: LIP insertion */
/* In both cases, a line that is reused is promoted to MRU. */
```

DIP: Dynamic Insertion Policy

  • Two types of workloads: LRU-friendly or BIP-friendly

  • DIP can be implemented by:

    • Monitor both policies (LRU and BIP)
    • Choose the best-performing policy
    • Apply the best policy to the cache
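
One common realization of this monitoring is set dueling, proposed together with DIP: a few leader sets always use LRU, a few always use BIP, and a saturating counter steers all remaining follower sets. A sketch with illustrative constants:

```c
#include <stdbool.h>

#define PSEL_MAX 1023                 /* 10-bit saturating counter */
static int psel = PSEL_MAX / 2;

static bool is_lru_leader(int set) { return set % 64 == 0; }
static bool is_bip_leader(int set) { return set % 64 == 1; }

void on_miss(int set) {               /* count misses per policy   */
    if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
    if (is_bip_leader(set) && psel > 0)        psel--;
}

bool use_bip(int set) {               /* policy for a given set    */
    if (is_lru_leader(set)) return false;
    if (is_bip_leader(set)) return true;
    return psel > PSEL_MAX / 2;       /* LRU misses more => use BIP */
}
```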

Not Recently Used (NRU)

  • Employ an NRU bit per cache line to indicate usage

    • 0: the line has been re-referenced
    • 1: the line has not been referenced for a while
  • Cache hit (or first insertion): set NRU := 0

  • Eviction: the victim is a line with NRU == 1

    • Left-to-right priority when there are multiple candidates
  • If no victim is found, set every line's NRU := 1 and search again
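
A sketch of the victim search described above (8 ways assumed; names illustrative):

```c
#include <stdint.h>

#define WAYS 8
static uint8_t nru[WAYS];   /* 1 = not referenced for a while */

void on_hit_or_insert(int way) { nru[way] = 0; }

int pick_victim(void) {
    for (;;) {
        for (int i = 0; i < WAYS; i++)  /* left-to-right priority */
            if (nru[i] == 1) return i;
        for (int i = 0; i < WAYS; i++)  /* no candidate: age all  */
            nru[i] = 1;
    }
}
```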

RRIP

Virtual Memory

  • Virtual memory – separation of logical memory from physical memory.

    • Only a part of the program needs to be in memory for execution. Hence, logical address space can be much larger than physical address space.
    • Allows address spaces to be shared by several processes (or threads).
    • Allows for more efficient process creation.
  • Virtual memory can be implemented via:

    • Demand paging
    • Demand segmentation
  • The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management

    • Virtual address – generated by the CPU
    • Physical address – seen by the memory
  • Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; virtual and physical addresses differ in execution-time address-binding schemes

Advantages of Virtual Memory

  • Translation:
    • Program can be given consistent view of memory, even though physical memory is scrambled
    • Only the most important part of program (“Working Set”) must be in physical memory.
    • Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
  • Protection:
    • Different threads (or processes) protected from each other.
    • Different pages can be given special behavior
      • (Read Only, Invisible to user programs, etc).
    • Kernel data protected from User programs
    • Very important for protection from malicious programs => Far more “viruses” under Microsoft Windows
  • Sharing:
    • Can map same physical page to multiple users (“Shared memory”)

Paging

  • Divide physical memory into fixed-size blocks (e.g., 4KB) called frames
  • Divide logical memory into blocks of same size (4KB) called pages
  • To run a program of size n pages, the OS needs to find n free frames and load the program
  • Set up a page table to map page addresses to frame addresses (operating system sets up the page table)
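
A sketch of the resulting translation for 4 KB pages (the address and frame number are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4 KB pages: low 12 bits are the offset */

int main(void) {
    uint64_t vaddr = 0x7f1234567abcULL;
    uint64_t vpn   = vaddr >> PAGE_SHIFT;          /* virtual page number */
    uint64_t off   = vaddr & ((1ULL << PAGE_SHIFT) - 1);
    uint64_t pfn   = 0x42;          /* hypothetical page-table result     */
    uint64_t paddr = (pfn << PAGE_SHIFT) | off;    /* frame base + offset */
    printf("vpn=%#llx off=%#llx -> paddr=%#llx\n",
           (unsigned long long)vpn, (unsigned long long)off,
           (unsigned long long)paddr);
    return 0;
}
```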

Inverted Page Table

  • One entry for each real page of memory
  • Shared by all active processes
  • Entry consists of the virtual address of the page stored in that real memory location, with Process ID information
  • Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
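
A sketch of the lookup: because the table is indexed by physical frame, translation must search for the entry matching the (PID, virtual page) pair. A linear scan is shown for clarity; real designs hash into the table:

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAMES 1024
typedef struct { bool used; int pid; uint64_t vpn; } IPTEntry;
static IPTEntry ipt[FRAMES];   /* one entry per physical frame */

long translate(int pid, uint64_t vpn) {
    for (long f = 0; f < FRAMES; f++)
        if (ipt[f].used && ipt[f].pid == pid && ipt[f].vpn == vpn)
            return f;          /* the index itself is the frame number */
    return -1;                 /* page fault */
}
```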

Fast Address Translation

  • Use Translation Lookaside Buffer (TLB)

    • Instruction-TLB & Data-TLB
    • Essentially a cache (tag array = VPN, data array = PPN)
    • Small (32 to 256 entries are typical)
    • Typically fully associative (implemented as a content-addressable memory, CAM) or highly associative to minimize conflicts; a lookup sketch follows this list
  • Several Design Alternatives

    • VIVT: Virtually-indexed Virtually-tagged Cache
    • VIPT: Virtually-indexed Physically-tagged Cache
    • PIVT: Physically-indexed Virtually-tagged Cache
      • Not generally useful; the MIPS R6000 is the only known design to have used it.
    • PIPT: Physically-indexed Physically-tagged Cache
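
A software model of the fully associative TLB lookup mentioned above (64 entries assumed; in hardware the VPN comparison happens in parallel across all CAM entries):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
typedef struct { bool valid; uint64_t vpn, pfn; } TLBEntry;
static TLBEntry tlb[TLB_ENTRIES];

bool tlb_lookup(uint64_t vpn, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;   /* hit: translation without a walk */
            return true;
        }
    return false;                /* miss: walk the page table */
}
```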