













# **12 Advanced Cache Optimizations**

### **Reducing Hit Time**

- 1. Small and simple caches
- 2. Way prediction
- 3. Trace caches

### **Increasing Cache Bandwidth**

- 4. Pipelined caches
- 5. Multi-banked caches
- 6. Non-blocking caches

### **Reducing Miss Penalty**

- 7. Critical word first
- 8. Merging write buffers

### **Reducing Miss Rate**

- 9. Victim Cache
- 10. Hardware prefetching
- 11. Compiler prefetching
- 12. Compiler Optimizations







# **2. Fast Hit Time: Way Prediction**

- How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Way prediction: keep a few extra bits in the cache to predict the "way," or block within the set, of the next cache access
  - Multiplexor is set early to select the desired block; only 1 tag comparison is performed during the first clock cycle, in parallel with reading the cache data
  - Miss in the first cycle $\Rightarrow$ check the other blocks for matches in the next clock cycle
  - Accuracy: > 90% for two-way; > 80% for four-way; I-cache > D-cache
  - Power consumption: lower, as multiple block checking is avoided on a hit
- First used on the MIPS R10000 in the mid-90s; now in the ARM Cortex-A8
- Drawback: hard to tune the CPU pipeline if hit time varies between 1 and 2 cycles

(Figure: access timeline showing Hit Time, Way-Miss Hit Time, and Miss Penalty.)
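The first-cycle and second-cycle behaviour described above can be sketched with a toy model. This is a minimal illustration, not a description of any real design: the class name `WayPredictedCache`, the "predict the last way that hit or was filled" policy, and the round-robin replacement are all assumptions made for the example.

```python
class WayPredictedCache:
    """Toy set-associative cache with one predicted way per set (hypothetical model)."""

    def __init__(self, num_sets=4, num_ways=2):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # tags[set][way] holds the tag stored there, or None if the way is empty
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # predicted_way[set] is the way the multiplexor is pre-set to
        self.predicted_way = [0] * num_sets
        # Assumption: simple round-robin choice of the way to fill on a miss
        self.next_victim = [0] * num_sets

    def access(self, block_addr):
        """Return 'hit' (1 cycle), 'way-miss hit' (2 cycles), or 'miss'."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.tags[index]
        guess = self.predicted_way[index]
        # Cycle 1: only the predicted way's tag is compared
        if ways[guess] == tag:
            return "hit"
        # Cycle 2: check the remaining ways of the set
        for way in range(self.num_ways):
            if way != guess and ways[way] == tag:
                self.predicted_way[index] = way  # retrain the predictor
                return "way-miss hit"
        # Real miss: fill a victim way and predict it for the next access
        victim = self.next_victim[index]
        ways[victim] = tag
        self.predicted_way[index] = victim
        self.next_victim[index] = (victim + 1) % self.num_ways
        return "miss"
```

With 4 sets and 2 ways, block addresses 0 and 4 map to the same set, so alternating between them makes the predictor guess wrong even though both blocks are resident. That is the case that produces the variable hit time named as the drawback.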









# **5. Increase Cache Bandwidth: Multibanked Caches**

| Bank 0 | Bank 1 | Bank 2 | Bank 3 |
|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      |
| 4      | 5      | 6      | 7      |
| 8      | 9      | 10     | 11     |
| 12     | 13     | 14     | 15     |

Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these block addresses would be multiplied by 64 to get byte addressing.

- Organize the cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - i7 supports 4 banks for …
- Works best when accesses naturally spread themselves across banks; the mapping of addresses to banks affects the behavior of the memory system
- Also reduces power consumption
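The mapping in the figure (block address modulo the number of banks, often called sequential interleaving) can be sketched as follows. The 64-byte block size and the four banks come from the slide; the function names `bank_of` and `banks_layout` are hypothetical.

```python
BLOCK_SIZE = 64  # bytes per block, as assumed on the slide
NUM_BANKS = 4    # four-way interleaving, as in the figure


def bank_of(byte_addr):
    """Return the bank holding the block that contains byte_addr."""
    block_addr = byte_addr // BLOCK_SIZE
    return block_addr % NUM_BANKS


def banks_layout(num_blocks):
    """List the block addresses that land in each bank."""
    layout = [[] for _ in range(NUM_BANKS)]
    for block_addr in range(num_blocks):
        layout[block_addr % NUM_BANKS].append(block_addr)
    return layout
```

Calling `banks_layout(16)` reproduces the four columns of the figure, and consecutive blocks always fall in different banks, which is why sequential access streams spread well across banks.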
















# **11. Reducing Misses by Software Prefetching Data**

- Insert prefetch instructions to request data before the processor needs it
- Data prefetch
  - Register prefetch: load data into a register (HP PA-RISC loads)
  - Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings from reduced misses?
  - Wider superscalar issue makes the extra issue bandwidth less of a problem
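The cost question above is a simple break-even comparison. The sketch below makes it concrete; the function name and all four parameters are hypothetical, and it assumes every prefetch costs the same number of issue cycles and every avoided miss saves the full miss penalty.

```python
def prefetch_net_savings(num_prefetches, issue_cost, misses_avoided, miss_penalty):
    """Net cycles saved: miss penalty avoided minus cycles spent issuing prefetches.

    A positive result means prefetching pays off under this simplified model.
    """
    return misses_avoided * miss_penalty - num_prefetches * issue_cost
```

For example, with made-up numbers, 1000 prefetches at 1 cycle each that avoid 400 misses of 100 cycles give a net saving of 39000 cycles, while 1000 prefetches at 2 cycles each that avoid only 10 such misses lose 1000 cycles.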

## **12. Reducing Misses by Compiler Optimizations**

- McFarling [1989] reduced cache misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks
- Instructions
  - Reorder procedures in memory to reduce conflict misses
  - Profiling to look at conflicts (using custom tools)
- Data
  - Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
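Loop interchange and blocking are easiest to see as loop structures. The sketch below uses Python only to show the loop order: nested Python lists do not guarantee the contiguous row-major layout of C arrays, so the cache benefit itself is assumed rather than demonstrated. The function names and the block size of 2 are hypothetical.

```python
def scale_before_interchange(x, factor):
    """Inner loop walks down a column: large stride through row-major data."""
    rows, cols = len(x), len(x[0])
    for j in range(cols):
        for i in range(rows):
            x[i][j] = factor * x[i][j]
    return x


def scale_after_interchange(x, factor):
    """Loops swapped: inner loop walks along a row, in the order stored in memory."""
    rows, cols = len(x), len(x[0])
    for i in range(rows):
        for j in range(cols):
            x[i][j] = factor * x[i][j]
    return x


def matmul_naive(a, b):
    """Reference n x n multiply; the inner loop goes down a whole column of b."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Blocked multiply: work on block x block submatrices of b so they are reused."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):
        for kk in range(0, n, block):
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    partial = 0
                    for k in range(kk, min(kk + block, n)):
                        partial += a[i][k] * b[k][j]
                    c[i][j] += partial  # accumulate across kk blocks
    return c
```

Both transformations change only the order in which elements are visited, so they return the same results as the original loops. The block size would normally be chosen so that the submatrices being reused fit in the cache.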


# **Summary: Advanced Cache Optimizations**

| Technique                                        | Hit<br>time | Bandwidth      | Miss<br>penalty | Miss<br>rate | Power<br>consumption | Hardware cost/<br>complexity | Comment                                                                                                              |
|--------------------------------------------------|-------------|----------------|-----------------|--------------|----------------------|------------------------------|----------------------------------------------------------------------------------------------------------------------|
| Small and simple<br>caches                       | +           |                |                 | -            | +                    | 0                            | Trivial; widely used                                                                                                 |
| Way-predicting caches                            | +           |                |                 |              | +                    | 1                            | Used in Pentium 4                                                                                                    |
| Pipelined cache access                           | -           | +              |                 |              |                      | 1                            | Widely used                                                                                                          |
| Nonblocking caches                               |             | +              | +               |              |                      | 3                            | Widely used                                                                                                          |
| Banked caches                                    |             | +              |                 |              | +                    | 1                            | Used in L2 of both i7 and<br>Cortex-A8                                                                               |
| Critical word first<br>and early restart         |             |                | +               |              |                      | 2                            | Widely used                                                                                                          |
| Merging write buffer                             |             |                | +               |              |                      | 1                            | Widely used with write through                                                                                       |
| Compiler techniques to reduce cache misses       |             |                |                 | +            |                      | 0                            | Software is a challenge, but<br>many compilers handle<br>common linear algebra<br>calculations                       |
| Hardware prefetching<br>of instructions and data |             |                | +               | +            | -                    | 2 instr.,<br>3 data          | Most provide prefetch<br>instructions; modern high-<br>end processors also<br>automatically prefetch in<br>hardware. |
| Compiler-controlled<br>prefetching               |             |                | +               | +            |                      | 3                            | Needs nonblocking cache;<br>possible instruction overhead;<br>in many CPUs                                           |



# Real World Example #1: AMD Opteron Memory Hierarchy

- 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz
- 48-bit virtual and 40-bit physical addresses
- I and D cache: 64 KB, 2-way set associative, 64-B block, LRU
- L2 cache: 1 MB, 16-way, 64-B block, pseudo LRU
- Data and L2 caches use write back, write allocate
- L1 caches are virtually indexed and physically tagged
- L1 I TLB and L1 D TLB: fully associative, 40 entries
  - 32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages
- L2 I TLB and L2 D TLB: 4-way, 512 entries of 4 KB pages
- Memory controller allows up to 10 outstanding cache misses
  - 8 from D cache and 2 from I cache
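The cache parameters listed above determine the number of sets and the address bit fields. A minimal sketch, assuming power-of-two sizes; the function name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (number of sets, block-offset bits, index bits) for a cache.

    Assumes size, associativity and block size are all powers of two.
    """
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return sets, offset_bits, index_bits
```

For the L1 caches (64 KB, 2-way, 64-B blocks), `cache_geometry(64 * 1024, 2, 64)` gives 512 sets, 6 offset bits and 9 index bits. Those 15 bits exceed the 12-bit offset of a 4 KB page, which is consistent with the L1 caches being indexed with virtual address bits. For the L2 (1 MB, 16-way, 64-B blocks) the same calculation gives 1024 sets and 10 index bits.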



### Pentium 4 vs. Opteron Memory Hierarchy

| CPU               | Pentium 4 (3.2 GHz*)                                 | Opteron (2.8 GHz*)                                   |
|-------------------|------------------------------------------------------|------------------------------------------------------|
| Instruction cache | Trace cache (8K micro-ops)                           | 2-way associative, 64 KB, 64B block                  |
| Data cache        | 8-way associative, 16 KB, 64B block, inclusive in L2 | 2-way associative, 64 KB, 64B block, exclusive to L2 |
| L2 cache          | 8-way associative, 2 MB, 128B block                  | 16-way associative, 1 MB, 64B block                  |
| Prefetch          | 8 streams to L2                                      | 1 stream to L2                                       |
| Memory            | 200 MHz x 64 bits                                    | 200 MHz x 128 bits                                   |

\*Clock rate for this comparison in 2005; faster versions existed





# Intel Core i7 Memory Hierarchy

| Characteristic     | L1                  | L2         | L3                                                 |
|--------------------|---------------------|------------|----------------------------------------------------|
| Size               | 32 KB I/32 KB D     | 256 KB     | 2 MB per core                                      |
| Associativity      | 4-way I/8-way D     | 8-way      | 16-way                                             |
| Access latency     | 4 cycles, pipelined | 10 cycles  | 35 cycles                                          |
| Replacement scheme | Pseudo-LRU          | Pseudo-LRU | Pseudo-LRU but with an ordered selection algorithm |

- 3-level cache hierarchy
  - L1 is virtually indexed and physically tagged, while the L2 and L3 caches are physically indexed
  - L1 and L2 are separate for each core; L3 is shared (max. 2 MB/core)
  - All three caches are nonblocking and allow multiple outstanding writes
  - A merging write buffer is used for the L1 cache, which holds data in the event that the line is not present in L1 when it is written. (That is, an L1 write miss does not cause the line to be allocated.)
  - L3 is inclusive of L1 and L2
  - Replacement is by a variant on pseudo-LRU; in the case of L3 the block replaced is always the lowest numbered way whose access bit is turned off. This is not quite random but is easy to compute.
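The L3 victim selection described above ("lowest numbered way whose access bit is turned off") can be sketched as below. The slide does not say when access bits are cleared, so the reset rule in `touch` (clear every other bit once all bits would be set) is an assumption borrowed from common not-recently-used schemes, and both function names are hypothetical.

```python
def touch(access_bits, way):
    """Mark `way` as recently accessed.

    Assumption: if that would leave every bit set, clear all the other bits
    so that a victim can always be found.
    """
    access_bits[way] = True
    if all(access_bits):
        for other in range(len(access_bits)):
            access_bits[other] = (other == way)


def choose_victim(access_bits):
    """Return the lowest-numbered way whose access bit is off."""
    for way, bit in enumerate(access_bits):
        if not bit:
            return way
    raise ValueError("no way has its access bit off")
```

The selection is a single priority scan over one bit per way, which is what makes it cheap to compute compared with tracking a full LRU order across 16 ways.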







