Computer Architect Tips

Courses

Reading Materials

In 1974, Robert Dennard observed that if a chip's feature size shrinks by a factor of S and its frequency rises by a factor of S, power per unit area stays constant as long as the supply voltage is also lowered by a factor of S. At each new process node, transistor density grows by S^2 and frequency rises by S, yet power per unit area remains unchanged.
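A quick check of the arithmetic using the standard dynamic-power relation (a sketch assuming constant-field scaling):

```latex
% Dynamic power of a gate: P = C V^2 f.
% Constant-field scaling: C and V shrink by S, f grows by S.
P' = \frac{C}{S}\left(\frac{V}{S}\right)^{2}(Sf) = \frac{C V^{2} f}{S^{2}}
% Per-gate power falls by S^2, and gate area also falls by S^2,
% so power per unit area P'/A' = P/A is unchanged.
```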

To overcome the power wall, chip designs after 2005 widely adopted the idea of "dark silicon": multi-core and heterogeneous architectures limit the region of the chip that runs at full speed (is "lit up") at any one time, keeping the chip within its power budget.

For chips with huge die areas such as GPGPUs, another wall imposed by semiconductor manufacturing is already close at hand: the lithography wall, i.e., the maximum die area that lithography equipment can expose in a single shot (the reticle limit).

Predicate Register

Scratchpad Memory

| Cache Type | Addressing Scheme | Management Scheme |
| --- | --- | --- |
| Transparent cache | Transparent | Implicit (by cache) |
| Software-managed cache | Transparent | Explicit (by application) |
| Self-managed scratch-pad | Non-transparent | Implicit (by cache) |
| Scratch-pad memory | Non-transparent | Explicit (by application) |

Memory-level parallelism


Cache

Types of Caches

  • Direct mapping: a memory value can only be placed at a single corresponding location in the cache.
  • Set-associative mapping: a memory value can be placed in any location of a set in the cache.
  • Fully-associative mapping: a memory value can be placed anywhere in the cache.

SA Cache
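
A minimal sketch of how a set-associative lookup splits an address; the geometry (64-byte lines, 256 sets) is assumed for illustration:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* log2(64-byte line) */
#define INDEX_BITS  8   /* log2(256 sets)     */

int main(void) {
    uint64_t addr   = 0x7fffdeadbeefULL;
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* The index selects a set, the tag is compared against every way
       in that set, and the offset selects a byte within the line. */
    printf("tag=%#llx index=%#llx offset=%#llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```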

Three (or Four) Cs (Cache Miss Terms)

  • Compulsory Misses:
    • Cold-start misses: the cache holds no valid data when the program starts
  • Capacity Misses:
    • The cache cannot hold the program's entire working set
    • Remedy: increase cache size
  • Conflict Misses:
    • Too many blocks map to the same set
    • Remedy: increase cache size and/or associativity; associative caches reduce conflict misses
  • Coherence Misses:
    • Occur in multiprocessor systems when another processor invalidates a shared line

3Cs Absolute Miss Rate (SPEC92)

  • Compulsory misses are a tiny fraction of the overall misses
  • Capacity misses reduce with increasing sizes
  • Conflict misses reduce with increasing associativity

Reducing Conflict Misses

  • Set associative (SA) cache

    • multiple possible locations in a set
  • Fully associative (FA) cache

    • any location in the cache
  • Hardware and speed overhead

    • Comparators
    • Multiplexors
    • Data selection only after Hit/Miss determination (i.e., after tag comparison)

Cache Write Policy

  • Write through: The value is written both to the cache line and to the lower-level memory.

  • Write back: The value is written only to the cache line. The modified cache line is written to main memory only when it has to be replaced.

On Write Miss

  • Write allocate
    • The line is allocated on a write miss, followed by the write-hit actions above.
    • Write misses first act like read misses.
  • No write allocate
    • Write misses do not disturb the cache.
    • The line is modified only in the lower-level memory.
    • Mostly used with write-through caches.
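
A toy model combining the two decisions above (write-through vs. write-back on a hit, write-allocate vs. no-write-allocate on a miss). The single-line "cache", the 1 KB memory, and all names are illustrative, not a real API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static uint8_t mem[1024];   /* toy backing memory; addresses < 1024 */
static struct { bool valid, dirty; uint64_t tag; uint8_t data[64]; } line;

static void write_back_if_dirty(void) {
    if (line.valid && line.dirty)
        memcpy(&mem[line.tag * 64], line.data, 64);
}

void cache_write(uint64_t addr, uint8_t v,
                 bool write_through, bool write_allocate) {
    uint64_t tag = addr / 64, off = addr % 64;
    if (!line.valid || line.tag != tag) {               /* write miss   */
        if (!write_allocate) { mem[addr] = v; return; } /* bypass cache */
        write_back_if_dirty();                          /* evict old    */
        memcpy(line.data, &mem[tag * 64], 64);          /* fill line    */
        line.valid = true; line.dirty = false; line.tag = tag;
    }
    line.data[off] = v;                   /* write-hit actions          */
    if (write_through) mem[addr] = v;     /* keep memory in sync        */
    else               line.dirty = true; /* write back: defer to evict */
}
```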

Write buffers

  • A small cache that can hold a few values waiting to go to main memory.
  • To avoid stalling on writes, many CPUs use a write buffer.
  • It absorbs short bursts of writes so the CPU can keep running.
  • It does not eliminate stalls entirely: the buffer can still fill if a burst of writes is larger than the buffer.
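
A sketch of a small FIFO write buffer (4 entries assumed; all names hypothetical). The CPU retires a store as soon as it enters the buffer and stalls only when the buffer is full:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4
typedef struct { uint64_t addr; uint32_t data; } WBEntry;
static WBEntry buf[WB_ENTRIES];
static int head, count;

/* CPU side: returns false when the buffer is full (CPU must stall). */
bool wb_push(uint64_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;
    buf[(head + count++) % WB_ENTRIES] = (WBEntry){ addr, data };
    return true;   /* store completes from the CPU's point of view */
}

/* Memory side: drains the oldest entry; caller checks count > 0. */
WBEntry wb_drain(void) {
    WBEntry e = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return e;
}
```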

How to improve cache performance?

  • Reducing cache misses by more flexible placement of blocks
  • Reducing the miss penalty using multilevel caches

Cache Replacement Policy

  • Random
    • Replace a randomly chosen line
  • FIFO
    • Replace the oldest line
  • LRU (Least Recently Used)
    • Replace the least recently used line
  • NRU (Not Recently Used)
    • Replace one of the lines that is not recently used
    • Used in the Itanium 2 L1 D-cache and its L2 and L3 caches
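
A recency-stack sketch of LRU for a single set (4 ways and all names are illustrative), which also sets up the insertion-policy variants below:

```c
#include <string.h>

/* stack[0] is the MRU way, stack[WAYS-1] the LRU way. */
#define WAYS 4
static int stack[WAYS] = {0, 1, 2, 3};

static void move_to_mru(int pos) {
    int way = stack[pos];
    memmove(&stack[1], &stack[0], pos * sizeof(int)); /* shift down */
    stack[0] = way;
}

void on_hit(int pos) { move_to_mru(pos); }

int on_miss(void) {                /* returns the way to evict       */
    int victim = stack[WAYS - 1];  /* LRU way is replaced...         */
    move_to_mru(WAYS - 1);         /* ...and the new line enters MRU */
    return victim;
}
```

LIP (next section) changes only the miss path: the incoming line is left at the LRU position (the `move_to_mru` call is skipped) instead of entering at MRU.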

LIP: LRU Insertion Policy

  • With LRU insertion, an incoming block occupies the MRU position for a long time even if it is never reused; with LIP, the incoming block is placed at the LRU position, and becomes the victim at the next eviction if it is not referenced again.

    • Useless block: evicted at the next eviction
    • Useful block: moved to the MRU position

BIP: Bimodal Insertion Policy

  • LIP may not age older lines
  • Infrequently insert lines in MRU position
  • Let e = Bimodal throttle parameter
```c
if ((double)rand() / RAND_MAX < e)  /* rare case                  */
    insert_at_mru(line);            /* plain LRU insertion        */
else
    insert_at_lru(line);            /* common case: LIP insertion */
/* In both cases, a line that is reused is promoted to MRU. */
```

DIP: Dynamic Insertion Policy

  • Two types of workloads: LRU-friendly or BIP-friendly

  • DIP can be implemented by:

    • Monitor both policies (LRU and BIP)
    • Choose the best-performing policy
    • Apply the best policy to the cache
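
One common realization of this monitoring is set dueling, proposed together with DIP: a few leader sets always use LRU, a few always use BIP, and a saturating counter steers all remaining follower sets. A sketch with illustrative constants:

```c
#include <stdbool.h>

#define PSEL_MAX 1023                 /* 10-bit saturating counter */
static int psel = PSEL_MAX / 2;

static bool is_lru_leader(int set) { return set % 64 == 0; }
static bool is_bip_leader(int set) { return set % 64 == 1; }

void on_miss(int set) {               /* count misses per policy   */
    if (is_lru_leader(set) && psel < PSEL_MAX) psel++;
    if (is_bip_leader(set) && psel > 0)        psel--;
}

bool use_bip(int set) {               /* policy for a given set    */
    if (is_lru_leader(set)) return false;
    if (is_bip_leader(set)) return true;
    return psel > PSEL_MAX / 2;       /* LRU misses more => use BIP */
}
```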

Not Recently Used (NRU)

  • Employ an NRU bit per cache line to indicate usage

    • 0: the line has been re-referenced
    • 1: the line has not been referenced for a while
  • Cache hit (or first insertion): set NRU := 0

  • Eviction: the victim is a line with NRU == 1

    • Left-to-right priority when there are multiple candidates
  • If no victim is found, set every line's NRU := 1 and search again
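
A sketch of the victim search described above (8 ways assumed; names illustrative):

```c
#include <stdint.h>

#define WAYS 8
static uint8_t nru[WAYS];   /* 1 = not referenced for a while */

void on_hit_or_insert(int way) { nru[way] = 0; }

int pick_victim(void) {
    for (;;) {
        for (int i = 0; i < WAYS; i++)  /* left-to-right priority */
            if (nru[i] == 1) return i;
        for (int i = 0; i < WAYS; i++)  /* no candidate: age all  */
            nru[i] = 1;
    }
}
```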

RRIP

Virtual Memory

  • Virtual memory – separation of logical memory from physical memory.

    • Only a part of the program needs to be in memory for execution. Hence, logical address space can be much larger than physical address space.
    • Allows address spaces to be shared by several processes (or threads).
    • Allows for more efficient process creation.
  • Virtual memory can be implemented via:

    • Demand paging
    • Demand segmentation
  • The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management

    • Virtual address – generated by the CPU
    • Physical address – seen by the memory
  • Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; virtual and physical addresses differ in execution-time address-binding schemes

Advantages of Virtual Memory

  • Translation:
    • Program can be given consistent view of memory, even though physical memory is scrambled
    • Only the most important part of program (“Working Set”) must be in physical memory.
    • Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
  • Protection:
    • Different threads (or processes) protected from each other.
    • Different pages can be given special behavior
      • (Read Only, Invisible to user programs, etc).
    • Kernel data protected from User programs
    • Very important for protection from malicious programs => Far more “viruses” under Microsoft Windows
  • Sharing:
    • Can map same physical page to multiple users (“Shared memory”)

Paging

  • Divide physical memory into fixed-size blocks (e.g., 4KB) called frames
  • Divide logical memory into blocks of same size (4KB) called pages
  • To run a program of size n pages, the OS needs to find n free frames and load the program
  • Set up a page table to map page addresses to frame addresses (operating system sets up the page table)
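
A sketch of the resulting translation for 4 KB pages (the address and frame number are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4 KB pages: low 12 bits are the offset */

int main(void) {
    uint64_t vaddr = 0x7f1234567abcULL;
    uint64_t vpn   = vaddr >> PAGE_SHIFT;          /* virtual page number */
    uint64_t off   = vaddr & ((1ULL << PAGE_SHIFT) - 1);
    uint64_t pfn   = 0x42;          /* hypothetical page-table result     */
    uint64_t paddr = (pfn << PAGE_SHIFT) | off;    /* frame base + offset */
    printf("vpn=%#llx off=%#llx -> paddr=%#llx\n",
           (unsigned long long)vpn, (unsigned long long)off,
           (unsigned long long)paddr);
    return 0;
}
```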

Inverted Page Table

  • One entry for each real page of memory
  • Shared by all active processes
  • Entry consists of the virtual address of the page stored in that real memory location, with Process ID information
  • Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
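
A sketch of the lookup: because the table is indexed by physical frame, translation must search for the entry matching the (PID, virtual page) pair. A linear scan is shown for clarity; real designs hash into the table:

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAMES 1024
typedef struct { bool used; int pid; uint64_t vpn; } IPTEntry;
static IPTEntry ipt[FRAMES];   /* one entry per physical frame */

long translate(int pid, uint64_t vpn) {
    for (long f = 0; f < FRAMES; f++)
        if (ipt[f].used && ipt[f].pid == pid && ipt[f].vpn == vpn)
            return f;          /* the index itself is the frame number */
    return -1;                 /* page fault */
}
```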

Fast Address Translation

  • Use Translation Lookaside Buffer (TLB)

    • Instruction-TLB & Data-TLB
    • Essentially a cache (tag array = VPN, data array = PPN)
    • Small (32 to 256 entries are typical)
    • Typically fully associative (implemented as a content-addressable memory, CAM) or highly associative to minimize conflicts; a lookup sketch follows this list
  • Several Design Alternatives

    • VIVT: Virtually-indexed Virtually-tagged Cache
    • VIPT: Virtually-indexed Physically-tagged Cache
    • PIVT: Physically-indexed Virtually-tagged Cache
      • Not generally useful; the MIPS R6000 is the only known design to have used it.
    • PIPT: Physically-indexed Physically-tagged Cache
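
A software model of the fully associative TLB lookup mentioned above (64 entries assumed; in hardware the VPN comparison happens in parallel across all CAM entries):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
typedef struct { bool valid; uint64_t vpn, pfn; } TLBEntry;
static TLBEntry tlb[TLB_ENTRIES];

bool tlb_lookup(uint64_t vpn, uint64_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;   /* hit: translation without a walk */
            return true;
        }
    return false;                /* miss: walk the page table */
}
```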