

# Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel

Presented by Bryan Youse

**Department of Computer & Information Sciences** 

University of Delaware



- Dense Matrix "common" matrix
- Sparse Matrix few non-zero entries
- Sparsity typically expressed as the number o<sup>-</sup> non-zero entries per row



CISC 879 · Software Support for Multicore Architectures



# Why care?

- SpMV one of the most heavily used kernels in scientific computing
- Other applications
  - Economic modeling
  - Information retrieval
  - Network theory
- Historically (currently!) terrible performance:
  - Current algorithms run at 10% or less of machine peak performance on a single-core, cache based processor
  - Dense matrix kernels >>> similar Sparse kernels



#### Problems with sparse kernels

- data structure woes; needs to both:
  - exploit properties of the sparse matrix
  - fit machine architecture well
- run-time information needed
  - · this is the major disadvantage to the dense kernels



CISC 879 · Software Support for Multicore Architectures



### Compressed Sparse Row (CSR) is the standard



Many optimization tricks can be used to exploit the format



- 3 categories of optimizations:
  - Low-level code optimizations
  - Data structure optimizations
    - this includes the requisite code changes
  - Parallelization optimizations
- Note that the first two largely affect single core performance
- Goal: As much autotuning as possible





- Thread Blocking
  - Thread-level parallelism split matrix up (by rows, cols, or blocks)
- Cache Blocking
  - Problem: Very large matrices cannot fit the entire source or answer vectors in cache
  - Solution: Split matrix up into cache-sized tiles (~1k x 1K is common)



#### TLB Blocking

- TLB misses can vary by an order of magnitude depending on the blocking strategy
- Register Blocking
  - Group adjacent non-zeros into rectangular tiles
- Key point to take from blocking: SpMV is a memory-bound application; reducing memory footprint is more important than anything else



- Index Size Selection
  - 16b integers to reduce memory traffic
- SIMDization
- Software Prefetching
  - Get the data we know we'll need soon in cache
- Architecture Specific Kernels
  - through auto-tuning



- Loop Optimizations, of course!
  - CSR format enables the core kernel loop to go fron

Old & Busted: Nested loop with two loop variables

#### New Hotness: Remove inner loop variable

| Code Optimization                    |                 |                   |              | Data Structure Optimization |                |                | Parallelization Optimizatio |                  |                |                |
|--------------------------------------|-----------------|-------------------|--------------|-----------------------------|----------------|----------------|-----------------------------|------------------|----------------|----------------|
|                                      | x86             | N2                | Cell         |                             | x86            | N2             | Cell                        |                  | x86            | N2             |
| SIMDization                          | $\checkmark$    | N/A               | $\checkmark$ | BCSR                        | $\checkmark$   | $\checkmark$   |                             | Row Threading    | $\checkmark^4$ | $\checkmark^4$ |
| Software Pipelining                  |                 |                   | $\checkmark$ | BCOO                        | $\checkmark$   | $\checkmark$   | $\checkmark$                | Process Affinity | $\checkmark^6$ | $\checkmark^8$ |
| Branchless                           | ✓ <sup>10</sup> |                   | $\checkmark$ | 16-bit indices              | $\checkmark$   | $\checkmark$   | $\checkmark$                | Memory Affinity  | √6             | N/A            |
| Pointer Arithmetic                   |                 | $\checkmark^{10}$ |              | 32-bit indices              | $\checkmark$   | $\checkmark$   | N/A                         |                  |                |                |
| PF/DMA <sup>1</sup> Values & Indices | $\checkmark$    | $\checkmark$      | $\checkmark$ | Register Blocking           | $\checkmark$   | $\checkmark$   | √9                          | ]                |                |                |
| PF/DMA <sup>1</sup> Pointers/Vectors |                 |                   | $\checkmark$ | Cache Blocking              | $\checkmark^2$ | $\checkmark^2$ | $\sqrt{3}$                  |                  |                |                |
| inter-SpMV data structure caching    | $\checkmark$    | $\checkmark$      |              | TLB Blocking                | $\checkmark$   | $\checkmark$   |                             | ]                |                |                |



Several leading multicore platforms were tested

|              | (intel)             |         | IBM                 | Sun Sun microsystems |
|--------------|---------------------|---------|---------------------|----------------------|
| Name         | Clovertwn           | Opteron | Cell                | Niagara 2            |
| Chips*Cores  | 2*4 = 8             | 2*2 = 4 | 1*8 = 8             | 1*8 = 8              |
| Architecture | 4-/3-issu<br>000, c |         | 2-VLIW,<br>SIMD,RAM | 1-issue,<br>MT,cache |
| Clock Rate   | 2.3 GHz             | 2.2 GHz | 3.2 GHz             | 1.4 GHz              |
| Peak MemBW   | 21 GB/s             | 21 GB/s | 26 GB/s             | 41 GB/s              |
| Peak GFLOPS  | 74.6 GF             | 17.6 GF | 14.6 GF             | 11.2 GF              |



## **Testing Suite**

| spyplot      | Name                 | Dimensions     | Nonzeros<br>(nnz/row) | Description                             |
|--------------|----------------------|----------------|-----------------------|-----------------------------------------|
|              | Dense                | 2K x 2K        | 4.0M<br>(2K)          | Dense matrix in<br>sparse format        |
| X            | Protein              | 36K x 36K      | 4.3 <b>M</b><br>(119) | Protein data<br>bank 1HYS               |
|              | FEM /<br>Spheres     | 83K x 83K      | 6.0M<br>(72)          | FEM concentric<br>spheres               |
| $\backslash$ | FEM /<br>Cantilever  | 62K x 62K      | 4.0M<br>(65)          | FEM cantilever                          |
|              | Wind<br>Tunnel       | 218K x 218K    | 11.6M<br>(53)         | Pressurized<br>wind tunnel              |
| $\sum$       | FEM /<br>Harbor      | 47K x 47K      | 2.37M<br>(50)         | 3D CFD of<br>Charleston harbor          |
|              | QCD                  | 49K x 49K      | 1.90M<br>(39)         | Quark propagators<br>(QCD/LGT)          |
| $\searrow$   | FEM/Ship             | 141K x 141K    | 3.98M<br>(28)         | FEM Ship<br>section/detail              |
| $\square$    | Economics            | 207K x 207K    | 1.27M<br>(6)          | Macroeconomic<br>model                  |
| $\square$    | Epidemiology         | 526K x 526K    | 2.1M<br>(4)           | 2D Markov model<br>of epidemic          |
| X            | FEM /<br>Accelerator | 121K x 121K    | 2.62M<br>(22)         | Accelerator cavity<br>design            |
| X            | Circuit              | 171K x 171K    | 959K<br>(6)           | Motorola circuit<br>simulation          |
| N. K.        | webbase              | $1M \times 1M$ | 3.1M<br>(3)           | Web connectivity<br>matrix              |
|              | LP                   | 4K x 1.1M      | 11.3M<br>(2825)       | Railways set cover<br>Constraint matrix |

Actually, the dense matrix provides the performance upper bound: • SpMV is limited by memory throughput • Dense case supports arbitrary

register blocks (no added zeros)
Loops are long running -> more
CPU time vs. Memory fetch time



# **Promising Initial Results**

|             | Sustained Memory Bandwidth in GB/s (% of configuration peak bandwidth)<br>Sustained Performance in GFlop/s (% of configuration peak computation) |         |             |         |            |         |              |         |  |
|-------------|--------------------------------------------------------------------------------------------------------------------------------------------------|---------|-------------|---------|------------|---------|--------------|---------|--|
|             | one socket,                                                                                                                                      |         | one socket, |         | one socl   | cet,    | all sockets, |         |  |
|             | one core,                                                                                                                                        |         | one core,   |         | all core   | es,     | all cores,   |         |  |
| Machine     | one thread                                                                                                                                       |         | all threads |         | all threa  | ads     | all threads  |         |  |
| AMD X2      | 5.24  GB/s                                                                                                                                       | (49.2%) | 5.24  GB/s  | (49.2%) | 6.73GB/s   | (63.0%) | 13.35GB/s    | (62.6%) |  |
|             | 1.31  GF/s                                                                                                                                       | (29.7%) | 1.31  GF/s  | (29.7%) | 1.68 GF/s  | (19.1%) | 3.31  GF/s   | (18.8%) |  |
| Clovertown  | 3.82  GB/s                                                                                                                                       | (35.8%) | 3.82  GB/s  | (35.8%) | 5.37GB/s   | (50.3%) | 11.24GB/s    | (52.7%) |  |
| Clovertown  | 0.95GF/s                                                                                                                                         | (10.2%) | 0.95GF/s    | (10.2%) | 1.33  GF/s | (3.6%)  | 2.79  GF/s   | (3.7%)  |  |
| Niagara2    | 0.66GB/s                                                                                                                                         | (1.5%)  | 3.79GB/s    | (8.9%)  | 23.28GB/s  | (54.6%) | 23.28GB/s    | (54.6%) |  |
|             | 0.16GF/s                                                                                                                                         | (11.7%) | 0.94GF/s    | (67.3%) | 5.80 GF/s  | (51.8%) | 5.80 GF/s    | (51.8%) |  |
| Cell(PS3)   | 4.76GB/s                                                                                                                                         | (18.6%) | 4.76GB/s    | (18.6%) | 21.16GB/s  | (82.6%) | 21.16GB/s    | (82.6%) |  |
| con(roo)    | 1.15GF/s                                                                                                                                         | (62.9%) | 1.15GF/s    | (62.9%) | 5.12  GF/s | (46.6%) | 5.12  GF/s   | (46.6%) |  |
| Cell(Blade) | 4.75GB/s                                                                                                                                         | (18.6%) | 4.75GB/s    | (18.6%) | 24.73GB/s  | (96.6%) | 47.29GB/s    | (92.4%) |  |
| cen(Blade)  | 1.15GF/s                                                                                                                                         | (62.9%) | 1.15GF/s    | (62.9%) | 5.96 GF/s  | (40.7%) | 11.35GF/s    | (38.8%) |  |

Table 3: Sustained bandwidth and computational rate for a dense matrix stored in sparse format, in GB/s (and percentage configuration's peak bandwidth) and GFlop/s (and percentage of configuration's peak performance).

 Remember, outside of this project, we typically expect only 10% of peak performance









CISC 879 · Software Support for Multicore Architectures





- Cell's SpMV times are dramatically faster
- All architectures showed good results
  - motivating the need for these optimizations!