Computer Architect Tips
Courses
- 18-447 Introduction to Computer Architecture – Spring 2015
- CMPT 295: Introduction to Computer Systems
- CS 758 Advanced Topics in Computer Architecture Fall 2019 Section 1
Reading Materials
In 1974, Robert Dennard observed that when a chip's feature size shrinks by a factor of S and its frequency rises by a factor of S, power density stays constant as long as the operating voltage is lowered by a factor of S. At each new process node the number of transistors per unit area grows by S^2 and the frequency rises by S, yet power per unit area remains unchanged.
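A quick derivation, using the textbook dynamic-power approximation P ≈ C·V²·f (the formula itself is an assumption not stated in the note):
```latex
% Per-transistor dynamic power: P \approx C V^2 f.
% Dennard scaling by S: C -> C/S, V -> V/S, f -> S f, so
\[
P' = \frac{C}{S}\left(\frac{V}{S}\right)^{2} (S f)
   = \frac{C V^{2} f}{S^{2}}
   = \frac{P}{S^{2}} .
\]
% Transistor density grows by S^2, hence power per unit area
% stays constant: S^2 \times (P / S^2) = P.
```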
To get past the power wall, chip designs after 2005 widely adopted the "dark silicon" approach: through multi-core and heterogeneous architectures, only a limited region of the chip runs at full speed ("lit up") at any time, which keeps the chip within its power budget.
For very large chips such as GPGPUs, another wall imposed by semiconductor process constraints is already close at hand: the lithography wall, i.e., the maximum die area that lithography equipment can expose in a single shot (the reticle limit).
Predicate Register
Scratchpad Memory
- https://en.wikipedia.org/wiki/Scratchpad_memory
- https://www.sciencedirect.com/topics/computer-science/scratchpad-memory
| Cache Type | Addressing Scheme | Management Scheme |
| --- | --- | --- |
| Transparent cache | Transparent | Implicit (by cache) |
| Software-managed cache | Transparent | Explicit (by application) |
| Self-managed scratch-pad | Non-transparent | Implicit (by cache) |
| Scratch-pad memory | Non-transparent | Explicit (by application) |
Memory-level parallelism
Cache
Types of Caches
- Direct mapping: a memory value can only be placed at a single corresponding location in the cache.
- Set-associative mapping: a memory value can be placed in any location of a set in the cache.
- Fully-associative mapping: a memory value can be placed anywhere in the cache.
SA Cache
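A minimal sketch of how a set-associative lookup splits an address into tag, set index, and block offset (the 32 KiB, 4-way, 64-byte-line parameters are illustrative, not from the note):
```c
#include <stdint.h>
#include <stdio.h>

/* 32 KiB, 4-way set-associative cache with 64-byte lines:
 * sets = 32768 / (4 * 64) = 128, so 7 index bits and 6 offset bits. */
#define LINE_SIZE    64u
#define NUM_WAYS     4u
#define CACHE_SIZE   (32u * 1024u)
#define NUM_SETS     (CACHE_SIZE / (NUM_WAYS * LINE_SIZE))
#define OFFSET_BITS  6u   /* log2(LINE_SIZE) */
#define INDEX_BITS   7u   /* log2(NUM_SETS) */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & (LINE_SIZE - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* The block may reside in any of the NUM_WAYS lines of set `index`;
     * the tag is compared against all ways of that set in parallel. */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```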
Three (or Four) Cs (Cache Miss Terms)
- Compulsory misses: cold-start misses; the cache has no valid data at the start of the program.
- Capacity misses: the working set does not fit in the cache; mitigation: increase the cache size.
- Conflict misses: too many blocks map to the same set; mitigation: increase cache size and/or associativity (associative caches reduce conflict misses).
- Coherence misses: occur in multiprocessor systems, when lines are invalidated by other processors.
3Cs Absolute Miss Rate (SPEC92)
- Compulsory misses are a tiny fraction of the overall misses
- Capacity misses reduce with increasing sizes
- Conflict misses reduce with increasing associativity
Reducing Conflict Misses
- Set-associative (SA) cache: multiple possible locations within a set.
- Fully-associative (FA) cache: any location in the cache.
- Hardware and speed overhead:
  - comparators
  - multiplexors
  - data selection only after hit/miss determination (i.e., after tag comparison)
Cache Write Policy
- Write through: the value is written to both the cache line and to the lower-level memory.
- Write back: the value is written only to the cache line. The modified cache line is written to main memory only when it has to be replaced.
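A minimal sketch of the two write-hit paths (the line layout and the `memory_write` helper are hypothetical stand-ins, not from the note):
```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;            /* only meaningful for write-back */
    uint8_t  data[64];
} cache_line_t;

/* Stand-in for a write to the lower-level memory. */
static void memory_write(uint32_t addr, uint8_t value) {
    (void)addr; (void)value;
}

/* Write through: update the line AND the lower-level memory at once. */
void write_through_hit(cache_line_t *line, uint32_t addr, uint8_t value) {
    line->data[addr & 63] = value;
    memory_write(addr, value);
}

/* Write back: update only the line and mark it dirty; the dirty line
 * is written to main memory later, when it is replaced. */
void write_back_hit(cache_line_t *line, uint32_t addr, uint8_t value) {
    line->data[addr & 63] = value;
    line->dirty = true;
}
```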
On Write Miss
- Write allocate:
  - The line is allocated on a write miss, followed by the write-hit actions above.
  - Write misses first act like read misses.
- No write allocate:
  - Write misses do not allocate a line in the cache; the line is modified only in the lower-level memory.
  - Mostly used with write-through caches.
Write buffers
- To avoid stalling on writes, many CPUs use a write buffer: a small queue that holds a few values waiting to go to main memory.
- The buffer helps when writes are clustered in bursts.
- It does not eliminate stalls entirely: the buffer can fill up if a burst is larger than the buffer, as sketched below.
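A minimal ring-buffer sketch of that behavior (the 4-entry depth and the entry layout are illustrative):
```c
#include <stdbool.h>
#include <stdint.h>

#define WB_DEPTH 4u

typedef struct { uint32_t addr; uint32_t data; } wb_entry_t;

typedef struct {
    wb_entry_t entries[WB_DEPTH];
    unsigned   head, tail, count;
} write_buffer_t;

/* CPU side: returns false (stall) only when the buffer is full,
 * i.e., when a write burst is larger than the buffer. */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_DEPTH) return false;       /* full: CPU must stall */
    wb->entries[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;                                   /* CPU continues at once */
}

/* Memory side: drains one entry per available memory-write slot. */
bool wb_drain_one(write_buffer_t *wb, wb_entry_t *out) {
    if (wb->count == 0) return false;
    *out = wb->entries[wb->head];
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}
```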
How to improve cache performance?
- Reducing cache misses by more flexible placement of blocks
- Reducing the miss penalty using multilevel caches
Cache Replacement Policy
- Random: replace a randomly chosen line.
- FIFO: replace the oldest line.
- LRU (Least Recently Used): replace the least recently used line.
- NRU (Not Recently Used): replace one of the lines that has not been used recently (used in the Itanium 2 L1 D-cache and in its L2 and L3 caches).
LIP: LRU Insertion Policy
- With LRU, an incoming block occupies the MRU position for a long time even if it is never reused; with LIP, the incoming block is inserted at the LRU position, and if it is not referenced before the next eviction it becomes the victim.
- Useless block: evicted at the next eviction. Useful block: moved to the MRU position on a hit.
BIP: Bimodal Insertion Policy
- LIP may not age older lines
- Fix: infrequently insert incoming lines at the MRU position.
- Let e = the bimodal throttle parameter (a small probability).
```c
/* BIP insertion (sketch): e is the bimodal throttle parameter; the
   rand() normalization is added so the probability test is valid C. */
if ((double)rand() / RAND_MAX < e)
    insert_at_MRU(line);   /* rare case: same insertion as LRU */
else
    insert_at_LRU(line);   /* common case: same insertion as LIP */
/* Promote the line to MRU if it is reused. */
```
DIP: Dynamic Insertion Policy
- Two types of workloads: LRU-friendly or BIP-friendly.
- DIP can be implemented by:
  - monitoring both policies (LRU and BIP),
  - choosing the best-performing policy,
  - applying the best policy to the cache (see the set-dueling sketch below).
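One common realization is set dueling, the mechanism used in the original DIP proposal: a few "leader" sets always insert with LRU, a few always with BIP, and a saturating counter records which group misses less; all remaining sets follow the winner. A minimal sketch (the counter width and leader-set assignment are illustrative):
```c
#include <stdbool.h>

#define PSEL_MAX 1023          /* 10-bit saturating policy-selection counter */
static unsigned psel = PSEL_MAX / 2;

/* A few leader sets always use LRU insertion, a few always use BIP;
 * assignment by simple modulo here, purely for illustration. */
static bool is_lru_leader(unsigned set) { return set % 32 == 0; }
static bool is_bip_leader(unsigned set) { return set % 32 == 1; }

/* On a miss in a leader set, bias PSEL against that set's policy. */
void dip_on_miss(unsigned set) {
    if (is_lru_leader(set) && psel < PSEL_MAX) psel++;   /* LRU missed */
    else if (is_bip_leader(set) && psel > 0)   psel--;   /* BIP missed */
}

/* Follower sets adopt whichever policy is currently winning. */
bool dip_use_bip(unsigned set) {
    if (is_lru_leader(set)) return false;
    if (is_bip_leader(set)) return true;
    return psel > PSEL_MAX / 2;   /* high PSEL: LRU misses more, use BIP */
}
```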
Not Recently Used (NRU)
- Employ an NRU bit for each cache line to indicate usage:
  - 0: the line has been re-referenced recently
  - 1: the line has not been referenced for a while
- Cache hit (or first insertion): set the line's NRU bit to 0.
- Eviction: the victim is a line with NRU == 1 (left-to-right priority among multiple candidates).
- If no victim is found, set every line's NRU bit to 1 and search again (as sketched below).
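A minimal sketch of that victim search over one set (the fixed 8-way set is illustrative):
```c
#include <stdint.h>

#define WAYS 8u

/* nru[w] == 1 means "not recently used". */
typedef struct { uint8_t nru[WAYS]; } nru_set_t;

/* Cache hit or first insertion: mark the line as recently used. */
void nru_touch(nru_set_t *set, unsigned way) {
    set->nru[way] = 0;
}

/* Eviction: pick the leftmost line with NRU == 1; if none exists,
 * set every line's bit to 1 and retry (guaranteed to succeed). */
unsigned nru_victim(nru_set_t *set) {
    for (;;) {
        for (unsigned w = 0; w < WAYS; w++)
            if (set->nru[w] == 1) return w;   /* left-to-right priority */
        for (unsigned w = 0; w < WAYS; w++)
            set->nru[w] = 1;                  /* age everyone, then retry */
    }
}
```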
RRIP (Re-Reference Interval Prediction)
- Generalizes NRU's single bit to a multi-bit re-reference prediction value per line.
Virtual Memory
- Virtual memory: separation of logical memory from physical memory.
  - Only a part of the program needs to be in memory for execution, so the logical address space can be much larger than the physical address space.
  - Allows address spaces to be shared by several processes (or threads).
  - Allows for more efficient process creation.
- Virtual memory can be implemented via:
  - demand paging
  - demand segmentation
- The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management:
  - virtual address: generated by the CPU
  - physical address: seen by the memory
- Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; they differ in execution-time address-binding schemes.
Advantages of Virtual Memory
- Translation:
  - A program can be given a consistent view of memory, even though physical memory is scrambled.
  - Only the most important part of the program (the "working set") must be in physical memory.
  - Contiguous structures (like stacks) use only as much physical memory as necessary, yet can grow later.
- Protection:
  - Different threads (or processes) are protected from each other.
  - Different pages can be given special behavior (read-only, invisible to user programs, etc.).
  - Kernel data is protected from user programs.
  - Very important for protection from malicious programs (hence the historically far greater number of "viruses" under Microsoft Windows).
- Sharing:
  - The same physical page can be mapped into multiple processes ("shared memory").
Paging
- Divide physical memory into fixed-size blocks (e.g., 4KB) called frames.
- Divide logical memory into blocks of the same size (4KB) called pages.
- To run a program of size n pages, find n free frames and load the program.
- Set up a page table to map page addresses to frame addresses (the operating system sets up the page table); the translation arithmetic is sketched below.
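A minimal sketch of that translation for 4KB pages (the 16-entry table and the example mapping are illustrative):
```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u
#define PAGE_BITS  12u            /* log2(PAGE_SIZE) */
#define NUM_PAGES  16u

static uint32_t page_table[NUM_PAGES];   /* page number -> frame number */

/* Assumes vpn < NUM_PAGES and the page is resident. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;        /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* unchanged by paging */
    uint32_t pfn    = page_table[vpn];           /* frame from the page table */
    return (pfn << PAGE_BITS) | offset;
}

int main(void) {
    page_table[2] = 7;                    /* map page 2 -> frame 7 */
    /* 0x2ABC lies in page 2 at offset 0xABC, so it maps to 0x7ABC. */
    printf("0x%x\n", (unsigned)translate(0x2ABCu));
    return 0;
}
```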
Inverted Page Table
- One entry for each real page of memory
- Shared by all active processes
- Entry consists of the virtual address of the page stored in that real memory location, with Process ID information
- Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
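A minimal sketch of that trade-off: one entry per physical frame, tagged with (pid, vpn). Real designs hash (pid, vpn) rather than scanning, but the linear search below shows where the extra lookup time comes from:
```c
#include <stdint.h>

#define NUM_FRAMES 1024u

typedef struct {
    uint32_t pid;     /* owning process */
    uint32_t vpn;     /* virtual page stored in this frame */
    uint8_t  valid;
} ipt_entry_t;

static ipt_entry_t ipt[NUM_FRAMES];   /* one entry per physical frame */

/* Returns the frame number holding (pid, vpn), or -1 on a page fault. */
int ipt_lookup(uint32_t pid, uint32_t vpn) {
    for (unsigned frame = 0; frame < NUM_FRAMES; frame++)
        if (ipt[frame].valid && ipt[frame].pid == pid && ipt[frame].vpn == vpn)
            return (int)frame;   /* the entry's index IS the frame number */
    return -1;                   /* not resident: page fault */
}
```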
Fast Address Translation
- Use a Translation Lookaside Buffer (TLB):
  - separate instruction-TLB and data-TLB
  - essentially a cache (tag array = VPN, data array = PPN)
  - small (32 to 256 entries are typical)
  - typically fully associative (implemented as a content-addressable memory, CAM) or highly associative to minimize conflicts
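A minimal software model of the fully associative lookup (entry count and layout are illustrative; real hardware compares all tags in parallel in a CAM, which a loop can only approximate):
```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64u

typedef struct {
    uint32_t vpn;     /* tag:  virtual page number */
    uint32_t ppn;     /* data: physical page number */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the PPN; a miss falls back to
 * the page-table walk (not shown), after which the TLB is refilled. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn) {
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {   /* "parallel" tag compare */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;   /* TLB miss */
}
```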
- Several design alternatives:
  - VIVT: virtually-indexed, virtually-tagged cache
  - VIPT: virtually-indexed, physically-tagged cache
  - PIVT: physically-indexed, virtually-tagged cache (not generally useful; the MIPS R6000 is the only processor known to have used it)
  - PIPT: physically-indexed, physically-tagged cache