

















# **Bell's Law of Computer Classes**

#### Definition

Roughly every decade, a new, lower-priced computer class forms, based on a new programming platform, network, and interface, resulting in new usage and the establishment of a new industry.

#### Evolution

- mainframes (1960s)
- minicomputers (1970s); essentially replaced by clusters of PCs
- personal computers and workstations evolving into a network enabled by Local Area Networking or Ethernet (1980s)
- web browser client-server structures enabled by the Internet (1990s)
- cloud computing, e.g., Amazon Web Services or Microsoft's Azure (2000s)
- small form-factor devices such as cell phones and other cell-phone-sized devices, e.g., smartphones (c. 2000)
- Wireless Sensor Networks (WSN), Internet of Things (IoT) (after c. 2005)























## What is Computer Architecture?

The gap between applications and physics is too large to bridge in one step (but there are exceptions, e.g., the magnetic compass).

| Abstraction layer (highest to lowest) |
|---------------------------------------|
| Application                           |
| Algorithm                             |
| Programming Language                  |
| Operating System/Virtual Machine      |
| Instruction Set Architecture (ISA)    |
| Microarchitecture                     |
| Gates/Register-Transfer Level (RTL)   |
| Circuits                              |
| Devices                               |
| Physics                               |

At each abstraction layer, optimizations can be done to impact:

- Performance
- Power
- Area
- Cost
- Reliability
- Security
- ...

In its broadest definition, **computer architecture** is the design of the abstraction layers that allow us to implement information processing **applications** efficiently using available **manufacturing technologies**.



- 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
- 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers
- 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
- 2000s: Computer Architecture Course: Multi-core design, on-chip networking, parallel programming, power reduction, instruction level parallelism
- ~2013: Computer Architecture Course: Data and thread level parallelism? Self adapting systems? Security and reliability? Warehouse scale computing?

# Outline

- What is a computer architect, and what does one do?
- Classes of computers
- Trends in technology
- Defining computer architecture
- New definition of computer architecture

#### **Crossroads: Conventional Wisdom in Comp. Arch.**

- Old Conventional Wisdom (CW): Power is free, transistors expensive
- New CW: "Power wall": power expensive, transistors free (can put more on chip than can afford to turn on)
- Old CW: Sufficiently increasing Instruction Level Parallelism (ILP) via compilers, innovation (out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall": law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, memory access is fast
- New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power wall + ILP wall + Memory wall = "Brick wall"
  - Uniprocessor performance now 2X / 5(?) yrs
  - Change in chip design: multiple cores (2X processors per chip / ~2 years)
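To put the two uniprocessor doubling rates side by side, the sketch below converts "2X every N years" into an annual growth factor and a ten-year growth factor. It assumes smooth compounding, and the 5-year figure is the one the slide itself marks as uncertain; the function names are illustrative only.

```python
# Assumption: performance compounds smoothly, doubling every `doubling_years` years.
def annual_growth(doubling_years: float) -> float:
    """Growth factor per year implied by one doubling every `doubling_years`."""
    return 2.0 ** (1.0 / doubling_years)


def growth_over(years: float, doubling_years: float) -> float:
    """Total growth factor accumulated over `years`."""
    return 2.0 ** (years / doubling_years)


old_annual = annual_growth(1.5)     # old CW: 2X / 1.5 yrs
new_annual = annual_growth(5.0)     # new CW: 2X / 5(?) yrs
old_decade = growth_over(10, 1.5)
new_decade = growth_over(10, 5.0)

print(f"old CW: {old_annual:.2f}x per year, {old_decade:.0f}x per decade")
print(f"new CW: {new_annual:.2f}x per year, {new_decade:.0f}x per decade")
```

Under these assumptions the old rate is roughly 59% per year and the new rate roughly 15% per year, which over a decade is the difference between about 100X and 4X.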



# Part 2: Computer Architecture – Design Principles and Analysis













$$
\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
$$

|              | Inst Count | CPI | Cycle time |
|--------------|------------|-----|------------|
| Program      | X          |     |            |
| Compiler     | X          | (X) |            |
| Inst. Set    | X          | X   |            |
| Organization |            | X   | X          |
| Technology   |            |     | X          |
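A minimal sketch of the CPU time equation follows. All numbers are hypothetical and chosen only to show how a change that lowers instruction count but raises CPI (the kind of trade-off a compiler can make) plays out in the product.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 1e9 instructions, CPI of 2.0, 1 ns cycle time (1 GHz).
baseline = cpu_time_seconds(1e9, 2.0, 1e-9)

# Hypothetical compiler change: 10% fewer instructions, slightly higher CPI.
optimized = cpu_time_seconds(0.9e9, 2.1, 1e-9)

speedup = baseline / optimized
print(f"baseline {baseline:.2f} s, optimized {optimized:.2f} s, speedup {speedup:.3f}x")
```

Because the three factors multiply, an improvement in one term only pays off if it is not cancelled by a regression in another.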







|                | CDC Wren I, 1983 | Seagate 373453, 2003            | Improvement |
|----------------|------------------|---------------------------------|-------------|
| Rotation speed | 3600 RPM         | 15000 RPM                       | 4X          |
| Capacity       | 0.03 GBytes      | 73.4 GBytes                     | 2500X       |
| Tracks/Inch    | 800              | 64,000                          | 80X         |
| Bits/Inch      | 9,550            | 533,000                         | 60X         |
| Platters       | Three 5.25"      | Four 2.5" (in 3.5" form factor) |             |
| Bandwidth      | 0.6 MBytes/sec   | 86 MBytes/sec                   | 140X        |
| Latency        | 48.3 ms          | 5.7 ms                          | 8X          |
| Cache          | none             | 8 MBytes                        |             |
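As a rough check on the improvement factors in the disk comparison above, the sketch below recomputes them from the raw figures. The dictionary keys and the `gain` helper are illustrative names, not part of the original material.

```python
# Figures copied from the CDC Wren I (1983) vs. Seagate 373453 (2003) comparison.
wren_1983 = {"rpm": 3600, "capacity_gb": 0.03, "bandwidth_mb_s": 0.6, "latency_ms": 48.3}
seagate_2003 = {"rpm": 15000, "capacity_gb": 73.4, "bandwidth_mb_s": 86.0, "latency_ms": 5.7}


def gain(metric: str, lower_is_better: bool = False) -> float:
    """Improvement factor from the 1983 drive to the 2003 drive."""
    old, new = wren_1983[metric], seagate_2003[metric]
    return old / new if lower_is_better else new / old


bandwidth_gain = gain("bandwidth_mb_s")
latency_gain = gain("latency_ms", lower_is_better=True)
capacity_gain = gain("capacity_gb")

print(f"capacity  {capacity_gain:7.0f}x")
print(f"bandwidth {bandwidth_gain:7.0f}x")
print(f"latency   {latency_gain:7.1f}x")
```

Over the same twenty years bandwidth improved by more than the square of the latency improvement, which is the pattern the "Latency Lags Bandwidth" slides go on to explain.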















## 6 Reasons for "Latency Lags Bandwidth"

### 1. Moore's Law helps BW more than latency

- Faster transistors, more transistors, more pins help bandwidth
  - MPU transistors: 0.130 vs. 42 M xtors (300X)
  - DRAM transistors: 0.064 vs. 256 M xtors (4000X)
  - MPU pins: 68 vs. 423 pins (6X)
  - DRAM pins: 16 vs. 66 pins (4X)
- Smaller, faster transistors, but latency has not reduced as dramatically with successive generations
  - Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
  - MPU die size: 35 vs. 204 mm<sup>2</sup> (ratio sqrt $\Rightarrow$ 2X)
  - DRAM die size: 47 vs. 217 mm<sup>2</sup> (ratio sqrt $\Rightarrow$ 2X)

### 2. Distance limits latency

- Size of DRAM block  $\Rightarrow$  long bit and word lines  $\Rightarrow$  most of DRAM access time
- Reasons 1 and 2 together explain linear latency vs. square BW

### 3. Bandwidth easier to sell ("bigger=better")

- E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 μsec latency Ethernet
- 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
- Even if it is just marketing, customers are now trained
- Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance





### 5. Bandwidth hurts latency

- Queues help bandwidth, hurt latency (queuing theory)
- Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency

### 6. Operating System overhead hurts latency more than bandwidth

- Long messages amortize overhead; overhead is a bigger part of short messages
- It takes longer to create and to send a long message, which is needed instead of a short message to lessen the average cost per data byte of fixed-size message overhead
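The amortization argument can be sketched with a simple cost model: each message pays a fixed overhead plus a transfer time proportional to its size. The model and every number below (10 µs overhead, 1000 bytes/µs peak bandwidth) are assumptions for illustration, not measurements.

```python
def transfer_time_us(message_bytes: float, overhead_us: float, peak_bytes_per_us: float) -> float:
    """Latency of one message: fixed overhead plus size-proportional transfer time."""
    return overhead_us + message_bytes / peak_bytes_per_us


def effective_bandwidth(message_bytes: float, overhead_us: float, peak_bytes_per_us: float) -> float:
    """Delivered bytes per microsecond once the fixed overhead is included."""
    return message_bytes / transfer_time_us(message_bytes, overhead_us, peak_bytes_per_us)


OVERHEAD_US = 10.0   # hypothetical per-message software overhead
PEAK = 1000.0        # hypothetical peak bandwidth in bytes per microsecond

short_latency = transfer_time_us(100, OVERHEAD_US, PEAK)
long_latency = transfer_time_us(1_000_000, OVERHEAD_US, PEAK)
short_bw = effective_bandwidth(100, OVERHEAD_US, PEAK)
long_bw = effective_bandwidth(1_000_000, OVERHEAD_US, PEAK)

print(f"100 B message: {short_latency:.1f} us, {short_bw:.1f} B/us")
print(f"1 MB message:  {long_latency:.1f} us, {long_bw:.1f} B/us")
```

In this model the long message reaches about 99% of peak bandwidth but takes roughly a hundred times longer to arrive, while the short message arrives quickly but spends almost all of its time in overhead.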







# Performance: What to Measure

- Usually rely on benchmarks vs. real workloads
- To increase predictability, collections of benchmark applications, called benchmark suites, are popular
- SPEC CPU: popular desktop benchmark suite

# CINT2006 for Opteron X4 2356

| Name           | Description                   | IC×10<sup>9</sup>  | CPI   | Tc (ns) | Exec time (s) | Ref time (s) | SPECratio |
|----------------|-------------------------------|--------------------|-------|---------|-----------|----------|-----------|
| perl           | Interpreted string processing | 2,118              | 0.75  | 0.4     | 637       | 9,777    | 15.3      |
| bzip2          | Block-sorting compression     | 2,389              | 0.85  | 0.4     | 817       | 9,650    | 11.8      |
| gcc            | GNU C Compiler                | 1,050              | 1.72  | 0.4     | 724       | 8,050    | 11.1      |
| mcf            | Combinatorial optimization    | 336                | 10.00 | 0.4     | 1,345     | 9,120    | 6.8       |
| go             | Go game (AI)                  | 1,658              | 1.09  | 0.4     | 721       | 10,490   | 14.6      |
| hmmer          | Search gene sequence          | 2,783              | 0.80  | 0.4     | 890       | 9,330    | 10.5      |
| sjeng          | Chess game (AI)               | 2,176              | 0.96  | 0.4     | 837       | 12,100   | 14.5      |
| libquantum     | Quantum computer simulation   | 1,623              | 1.61  | 0.4     | 1,047     | 20,720   | 19.8      |
| h264avc        | Video compression             | 3,102              | 0.80  | 0.4     | 993       | 22,130   | 22.3      |
| omnetpp        | Discrete event simulation     | 587                | 2.94  | 0.4     | 690       | 6,250    | 9.1       |
| astar          | Games/path finding            | 1,082              | 1.79  | 0.4     | 773       | 7,020    | 9.1       |
| xalancbmk      | XML parsing                   | 1,058              | 2.70  | 0.4     | 1,143     | 6,900    | 6.0       |
| Geometric mean |                               |                    |       |         |           |          | 11.7      |
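The columns of the CINT2006 table are tied together by two relations: execution time is IC × CPI × Tc, and SPECratio is reference time divided by execution time. The sketch below recomputes both for two rows and then takes the geometric mean of the table's SPECratio column. Because the published CPI values are rounded, recomputed times can differ from the table by a few seconds.

```python
import math

# (name, IC in billions, CPI, Tc in ns, reference time in s), copied from the table.
rows = [
    ("perl", 2118, 0.75, 0.4, 9777),
    ("mcf", 336, 10.00, 0.4, 9120),
]

exec_time = {}
spec_ratio = {}
for name, ic_billions, cpi, tc_ns, ref_s in rows:
    # (IC x 1e9 instructions) x CPI x (Tc x 1e-9 s): the powers of ten cancel.
    exec_time[name] = ic_billions * cpi * tc_ns
    spec_ratio[name] = ref_s / exec_time[name]
    print(f"{name}: exec {exec_time[name]:.0f} s, SPECratio {spec_ratio[name]:.1f}")

# SPECratio column of the table, in row order.
table_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

# Geometric mean computed through logarithms.
geo_mean = math.exp(sum(math.log(r) for r in table_ratios) / len(table_ratios))
print(f"geometric mean: {geo_mean:.1f}")
```

The geometric mean is the usual summary for ratios because the ranking of two machines by it does not depend on which machine is chosen as the reference.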

## How to Mislead with Performance Reports

- 1. Select pieces of workload that work well on your design, ignore others
- 2. Use unrealistic data set sizes for application (too big or too small)
- 3. Report throughput numbers for a latency benchmark
- 4. Report latency numbers for a throughput benchmark
- 5. Report performance on a kernel and claim it represents an entire application
- 6. Use 16-bit fixed-point arithmetic (because it's fastest on your system) even though application requires 64-bit floating-point arithmetic
- 7. Use a less efficient algorithm on the competing machine
- 8. Report speedup for an inefficient algorithm (bubblesort)
- 9. Compare hand-optimized assembly code with unoptimized C code
- 10. Compare your design using next year's technology against a competitor's year-old design (1% performance improvement per week)
- 11. Ignore the relative cost of the systems being compared
- 12. Report averages and not individual results
- 13. Report speedup over unspecified base system, not absolute times
- 14. Report efficiency not absolute times
- 15. Report MFLOPS not absolute times (use inefficient algorithm)

[David Bailey, "Twelve ways to fool the masses when giving performance results for parallel supercomputers"]



