

## **High-Performance** Computation on **Spatially Sparse Data Structures**

**Yuanming Hu<sup>1</sup>** Tzu-Mao Li<sup>2</sup> Luke Anderson<sup>1</sup> Jonathan Ragan-Kelley<sup>2</sup> Fredo Durand<sup>1</sup>

<sup>1</sup>MIT CSAIL <sup>2</sup>UC Berkeley

## Contended of the second sec Programming Language







3 million particles simulated with MLS-MPM; rendered with path tracing. Using programs written in Taichi.







3 million particles simulated with MLS-MPM; rendered with path tracing. Using programs written in Taichi.











### Bounding Volume





### Bounding Volume



### Region of Interest



### Bounding Volume

### Spatial Sparsity: Regions of interest only occupy a small fraction of the bounding volume.















## VDB [Museth 2013]



### **Shallow Multi-Level Sparse Voxel Grids**





## SPGrid



### Even shallower sparse grid system

- Virtual Memory
- Morton Coding
- Bitmasks

[Setaluri, Aanjaneya, Bauer, and Sifakis, SIGGRAPH Asia 2014] SPGrid: A sparse paged grid structure applied to adaptive smoke simulation



### Particles







0



### 1x1x1















### Particles







0



### 1x1x1















## Using Sparse Data Structures is Hard

**Boundary Conditions** 

Maintaining Topology

**Parallelization & Load Balancing** 

Memory Management

**Data Structure Overhead** 



## Using Sparse Data Structures is Hard

**Boundary Conditions** 

Maintaining Topology

**Parallelization & Load Balancing** 

Memory Management

**Data Structure Overhead** 





### 10%

### **Essential Computation Data Structure Overhead**

90%





90%



### **Essential Computation Data Structure Overhead**



## In reality...





99%

1 %

## In reality...





99%

1 %

## In reality...

1%

### 99%

## In reality...

### Hash table lookup: 10s of clock cycles **Indirection:** cache/TLB misses Node allocation: locks, atomics, barriers **Branching:** misprediction / warp divergence

• • •



1 %

### 99%

## In reality...

### Hash table lookup: 10s of clock cycles **Indirection:** cache/TLB misses Node allocation: locks, atomics, barriers **Branching:** misprediction / warp divergence

• • •

Low-level engineering reduces data structure overhead, but harms productivity and couples algorithms and data structures, making it difficult to explore different data structure designs and find the optimal one.





# Data structure access: 50 clock cycles / element Simple Stencil Computation: 0.5 clock cycle / element



# Data structure access: 50 clock cycles / element Simple Stencil Computation: 0.5 clock cycle / element

## Sparse data structure overhead can be **100x** higher than essential computation



### **Data structure access:** • 50 clock cycles / element **Simple Stencil Computation:** 0.5 clock cycle / element

Sparse data structure overhead can be

Fun fact: without low-level engineering, dense data structures are often faster for problems with >10% sparsity

### Sparse Data Structure



## **100x** higher than essential computation





### **Dense Data Structure**





### **Dense Data Structure**





### **Dense Data Structure**

































# Data Accesses Drawn Proportionally...



#### **Sparse Data Structure**



## Traditional Sparse Computation Workflow

- **1.** Choose a sparse data structure library
- 2. Implement the algorithm on that sparse data structure
- 3. Do low-level engineering to optimize for performance • Code is complex and coupled with the data structure library



### Traditional Sparse Computation Workflow

- 1. Choose a sparse data structure library
- 2. Implement the algorithm on that sparse data structure
- 3. Do low-level engineering to optimize for performance • Code is complex and coupled with the data structure library

  - "Oh no, this data structure isn't really optimal for this algorithm"

### start over



# Ideal Sparse Computing Workflow

- 1. Implement the algorithm as if all grids are dense
- 2. Describe your data structure
- 3. The compiler optimizes performance
- Benchmark performance, and try different data structures

# Ideal Sparse Computing Workflow

- 1. Implement the algorithm as if all grids are dense
- 2. Describe your data structure
- 3. The compiler optimizes performance
  - Benchmark performance, and try different data structures

e.g., Halide [Ragan-Kelley, Adams, Paris, Levoy, Amarasinghe, Durand. SIGGRAPH 2012]

- Related work: split languages,



#### **Decouple** computation from data structures 1)

| Computational Kernels | (Sparse) Data Structur |
|-----------------------|------------------------|
|                       |                        |
|                       |                        |
|                       |                        |

| res |
|-----|
|-----|

#### **Decouple** computation from data structures 1)

#### **Computational Kernels**

(Sparse) Data Structur

```
Kernel(laplace).def([&]() {
  For(u, [&](Expr i, Expr j){
    auto c = 1.0f / (dx * dx);
    u[i, j] = c * (4 * v[i, j] - v[i+1, j]
- v[i-1, j] - v[i, j+1] - v[i, j-1]);
 });
});
                            2D Laplace operator
```

2) Imperative computation language

| res |
|-----|
|-----|

#### **Decouple** computation from data structures 1)

#### **Computational Kernels**

```
Kernel(laplace).def([&]() {
  For(u, [&](Expr i, Expr j){
   auto c = 1.0f / (dx * dx);
   u[i, j] = c * (4 * v[i, j] - v[i+1, j])
             - v[i-1, j] - v[i, j+1] - v[i, j-1]);
 });
});
                        2D Laplace operator
```

#### (Sparse) Data Structures

Global(u, f32); Global(v, f32); layout([&]() { auto ij = Indices(0, 1); root.dense(ij, {128, 128}).pointer() .dense(ij, {8, 8}).place(u, v); **});** 

1024<sup>2</sup> sparse grid with 8<sup>2</sup>

2) Imperative computation language

3) Hierarchical data structure description language

#### **Decouple** computation from data structures 1)

#### **Computational Kernels**

```
Kernel(laplace).def([&]() {
  For(u, [&](Expr i, Expr j){
    auto c = 1.0f / (dx * dx);
    u[i, j] = c * (4 * v[i, j] - v[i+1, j]
- v[i-1, j] - v[i, j+1] - v[i, j-1]);
 });
});
                            2D Laplace operator
```

#### (Sparse) Data Structures

Global(u, f32); Global(v, f32); layout([&]() { auto ij = Indices(0, 1); root.dense(ij, {128, 128}).pointer() .dense(ij, {8, 8}).place(u, v); **});** 

1024<sup>2</sup> sparse grid with 8<sup>2</sup>

2) Imperative computation language

3) Hierarchical data structure description language





4) Intermediate representation (IR) & data structure access optimizations

#### **Decouple** computation from data structures 1)

#### **Computational Kernels**

```
Kernel(laplace).def([&]() {
  For(u, [&](Expr i, Expr j){
    auto c = 1.0f / (dx * dx);
    u[i, j] = c * (4 * v[i, j] - v[i+1, j]
- v[i-1, j] - v[i, j+1] - v[i, j-1]);
 });
});
                            2D Laplace operator
```

#### (Sparse) Data Structures

Global(u, f32); Global(v, f32); layout([&]() { auto ij = Indices(0, 1); root.dense(ij, {128, 128}).pointer() .dense(ij, {8, 8}).place(u, v); **});** 

1024<sup>2</sup> sparse grid with 8<sup>2</sup>

2) Imperative computation language

3) Hierarchical data structure description language





4) Intermediate representation (IR) & data structure access optimizations **Runtime System** 

5) Auto parallelization, memory management, ...



#### **Decouple** computation from data structures 1)

#### **Computational Kernels**

```
Kernel(laplace).def([&]() {
  For(u, [&](Expr i, Expr j){
    auto c = 1.0f / (dx * dx);
    u[i, j] = c * (4 * v[i, j] - v[i+1, j]
- v[i-1, j] - v[i, j+1] - v[i, j-1]);
 });
});
                            2D Laplace operator
```

#### (Sparse) Data Structures

Global(u, f32); Global(v, f32);  $layout([\&]() {$ auto ij = Indices(0, 1); root.dense(ij, {128, 128}).pointer() .dense(ij, {8, 8}).place(u, v); **});** 

1024<sup>2</sup> sparse grid with 8<sup>2</sup>

2) Imperative computation language

3) Hierarchical data structure description language



4) Intermediate representation (IR) & data structure access optimizations



#### **1) Decouple** *computation* from *data structures*

#### **Computational Kernels**

#### (Sparse) Data Structures

Global(u, f32); Global(v, f32);
layout([&]() {
 auto ij = Indices(0, 1);
 root.dense(ij, {128, 128}).pointer()
 .dense(ij, {8, 8}).place(u, v);
});

1024<sup>2</sup> sparse grid with 8<sup>2</sup>

2) Imperative computation language

3) Hierarchical data
 structure description
 language 2



4) Intermediate
 representation (IR) &
 data structure
 access optimizations



# Defining Computation

$$u_{i,j} = \frac{1}{\Delta x^2} (4v_{i,j} - v_{i+1,j} - v_{i-1,j} - v_{i,j+1} - v_{i,j-1})$$

$$\downarrow$$
**Taichi Kernel**

- @ti.kernel def laplace(): for i, j in u: 3 c = 1 / (dx \* dx)4 5 6

### **Finite Difference Stencil**

#### u[i, j] = c \* (4.0 \* v[i, j] - v[i-1, j] - v[i+1, j]-v[i, j-1] - v[i, j+1])

Program on **sparse** data structures as if they are **dense**; **Parallel** for-loops (Single-Program-Multiple-Data, like CUDA/ispc); Loop over only active elements in the sparse data structure; Complex control flows (e.g. lf, While) supported.



| Sample/pixel/sec: 7.1 |
|-----------------------|
| depth_limit: 20       |
| density_scale: 400.00 |
| max_density: 724.000  |
| ground_y: 0.029       |
| light_phi: 0.419      |
| light_theta: 0.218    |
| light_smoothness: 0.0 |
| light_ambient: 0.150  |
| exposure: 0.567       |
| gamma: 0.500          |
| file_id: 0            |
| output_samples: 10    |
| grid_level: 1         |
| video_step: 1         |
| Save                  |
| Render All            |
|                       |

| 7.126 |
|-------|
|       |
| .000  |
| 000   |
|       |
|       |
|       |
| 0.050 |
| 50    |
|       |
|       |
|       |
| 0     |
|       |
|       |
|       |
| แ     |
|       |
|       |



| Sample/pixel/sec: 7.1 |
|-----------------------|
| depth_limit: 20       |
| density_scale: 400.00 |
| max_density: 724.000  |
| ground_y: 0.029       |
| light_phi: 0.419      |
| light_theta: 0.218    |
| light_smoothness: 0.0 |
| light_ambient: 0.150  |
| exposure: 0.567       |
| gamma: 0.500          |
| file_id: 0            |
| output_samples: 10    |
| grid_level: 1         |
| video_step: 1         |
| Save                  |
| Render All            |
|                       |

| 7.126 |
|-------|
|       |
| .000  |
| 000   |
|       |
|       |
|       |
| 0.050 |
| 50    |
|       |
|       |
|       |
| 0     |
|       |
|       |
|       |
| แ     |
|       |
|       |

dense: A fixed-length contiguous array.

hash: Use a hash table to maintain the mapping from active coordinates to data address in memory. Suitable for high sparsity.

dynamic: Variable-length array, with a predefined maximum length. It serves the role of std::vector, and can be used to maintain objects (e.g. particles) contained by a block.

Structural Nodes

morton: Reorder the data in memory using a Z-order curve (Morton coding), for potentially higher spatial locality. For dense only. bitmasked: Use a mask to maintain sparsity information, one bit per child. For dense only.

pointer: Store pointers instead of the whole structure to save memory and maintain sparsity. For dense and dynamic.

Node Decorators



dense: A fixed-length contiguous array.

hash: Use a hash table to maintain the mapping from active coordinates to data address in memory. Suitable for high sparsity.

dynamic: Variable-length array, with a predefined maximum length. It serves the role of std::vector, and can be used to maintain objects (e.g. particles) contained by a block.

Structural Nodes

morton: Reorder the data in memory using a Z-order curve (Morton coding), for potentially higher spatial locality. For dense only. bitmasked: Use a mask to maintain sparsity information, one bit per child. For dense only.

pointer: Store pointers instead of the whole structure to save memory and maintain sparsity. For dense and dynamic.

Node Decorators





dense: A fixed-length contiguous array.

hash: Use a hash table to maintain the mapping from active coordinates to data address in memory. Suitable for high sparsity.

dynamic: Variable-length array, with a predefined maximum length. It serves the role of std::vector, and can be used to maintain objects (e.g. particles) contained by a block.

### Structural Nodes

root.hash(ijk, 32).dense(ijk, 16).pointer() .dense(ijk, 8).place(u, v, w);

### VDB [Museth 2013]

morton: Reorder the data in memory using a Z-order curve (Morton coding), for potentially higher spatial locality. For dense only. bitmasked: Use a mask to maintain sparsity information, one bit per child. For dense only.

pointer: Store pointers instead of the whole structure to save memory and maintain sparsity. For dense and dynamic.

Node Decorators

root.dense(ijk, 512).morton().bitmasked() .dense(ijk, {8, 4, 4}).place(flags, u, v, w);

SPGrid [Setaluri et al. 2014]



dense: A fixed-length contiguous array.

hash: Use a hash table to maintain the mapping from active coordinates to data address in memory. Suitable for high sparsity.

dynamic: Variable-length array, with a predefined maximum length. It serves the role of std::vector, and can be used to maintain objects (e.g. particles) contained by a block.

### Structural Nodes

root.hash(ijk, 32).dense(ijk, 16).pointer() .dense(ijk, 8).place(u, v, w);

### VDB [Museth 2013]

root.hash(ijk, 512).dense(ijk, 512).morton().bitmasked().dense(ijk, {8, 4, 4}).place(flags, u, v, w);

morton: Reorder the data in memory using a Z-order curve (Morton coding), for potentially higher spatial locality. For dense only. bitmasked: Use a mask to maintain sparsity information, one bit per child. For dense only.

pointer: Store pointers instead of the whole structure to save memory and maintain sparsity. For dense and dynamic.

### Node Decorators

root.dense(ijk, 512).morton().bitmasked() .dense(ijk, {8, 4, 4}).place(flags, u, v, w);

### SPGrid [Setaluri et al. 2014]

#### "SPVDB"









### Unbounded sparse grid structure



### Unbounded sparse grid structure



# Access Simplification based on computation and data structure info

# Access Simplification



### Root2leaf (end2end) data access

Access lowering

micro-access instructions

"Common subexpression elimination"

### Simplified micro-access instructions

# Access Simplification

### Removing redundant data structure traversals

#### **Unoptimized Accesses**



### More optimizations:

• • •

- shared memory utilization on GPUs;
- avoid unnecessary activation checks;
- better vectorized loads on CPUs;

### **Optimized Accesses**



on on GPUs; ation checks; on CPUs;



# **Results** 10.0x shorter code 4.55x higher performance

High-Performance CPU/GPU Kernels<br/>Ours v.s. State-of-the-art:MLS-MPM13x shorter code, 1.2x fasterFEM Kernel13x shorter code, 14.5x fasterMGPCG7x shorter code, 1.9x fasterSparse CNN9x shorter code, 13x faster



### Patterns: Particle scatter/gather

### Particle to Grid (P2G)

### Grid to Particle (G2P)









### Patterns: Particle scatter/gather

### Particle to Grid (P2G)

### Grid to Particle (G2P)







| Particle Layout | Ord |
|-----------------|-----|
| SOA             | 3.5 |
| AOS             | 3.1 |

### **AOS much faster than SOA for random access!** No sorting needed.

### ered Randomly Shuffled

- 52ms 21.23 ms
- $4.28 \mathrm{ms}$ 5ms

**Reproduce:** ti mpm\_benchmark particle\_soa=[true/false] initial\_shuffle=[true/false]

### **Benchmarks: MLS-MPM** The use of scratch pad memory [NVIDIA shared memory]

| Grou | p particles<br>& scatter/g | into blocks<br>ather |
|------|----------------------------|----------------------|
|      | GPU-SPM                    | GPU+SPM              |
| P2G  | 5.102ms                    | 2.011ms              |
| G2P  | 1.975ms                    | 0.722ms              |

**Reproduce:** ti mpm\_benchmark use\_cache=[true/false]

### **Benchmarks: MLS-MPM** The use of scratch pad memory [NVIDIA shared memory]

| Group partic<br>& scat |
|------------------------|
| GPU-S                  |
| P2G 5.102              |
| G2P 1.975              |
|                        |

**Reproduce:** ti mpm\_benchmark use\_cache=[true/false]

- cles into blocks ter/gather
- PM **GPU+SPM**
- 2.011ms 2ms
- ims 0.722 ms

### 2-3x faster using shared memory

Index analysis for scratchpad memory size inference





### Compared with baseline [Gao et al.]: [GPU] 1.2x faster 13x shorter code

**Reproduce:** ti mpm\_benchmark particle\_soa=[true/false] initial\_shuffle=[true/false]

### Benchmarks: FEM Kernel

# $c \in C(i) j \in \mathcal{V}(c)$



 $\underline{\mathbf{f}} = \sum (\mu^{(c)} \cdot \mathbf{K}^{\mu} + \underline{\lambda}^{(c)} \cdot \mathbf{K}^{\lambda})_{i^{(c)} j^{(c)}} \cdot \underline{\mathbf{u}}_{j}.$ 



### Benchmarks: FEM Kernel

# 3 channels $c \in C(i) j \in \mathcal{V}(c)$ 8 elements 8 elements

**Patterns: Stencils with very high arithmetic intensity (compute bound)** 

3 channels  $\underline{\mathbf{f}} = \sum_{i=1}^{\infty} (\mu^{(c)} \cdot \mathbf{K}^{\mu} + \underline{\lambda}^{(c)} \cdot \mathbf{K}^{\lambda})_{i^{(c)} i^{(c)}} \cdot \underline{\mathbf{u}}_{i}.$ 

### 3x8x8x3x2=1152 FLOPs/vertex



# Benchmarks: FEM Kernel



## 4x4-Blocked Sparse Grid

# Benchmarks: FEM Kernel



## **5-Point Stencil** (Scalar)

## (Simplified: actual stencil is much larger)

## Benchmarks: FEM Kernel **5-Point Stencil** (4-wide Vectorized) Taichi compiler merges 4 addressing into 1, and then do a vectorized load (more details later)



# Benchmarks: FEM Kernel



## **5-Point Stencil** (4-wide Vectorized)



| Benc | hmark | KS: F | ΕN |
|------|-------|-------|----|
|      |       |       |    |

| Benchmarks: FEM Kerne                   |  |  |  |  |  |  |  |
|-----------------------------------------|--|--|--|--|--|--|--|
| Ablation CPU Time GPU Time              |  |  |  |  |  |  |  |
| No multithreading 73.43ms               |  |  |  |  |  |  |  |
| No vectorization 83.54ms                |  |  |  |  |  |  |  |
| No vectorized load instructions 22.69ms |  |  |  |  |  |  |  |
| No simplification I 17.01ms 2.13 m      |  |  |  |  |  |  |  |
| No access lowering 182.19ms 6.046 m     |  |  |  |  |  |  |  |
| No simplification II 85.51ms 11.784 m   |  |  |  |  |  |  |  |
| AOS instead of SOA 136.03ms 20.992 m    |  |  |  |  |  |  |  |
| All optimizations on 17.16ms 2.11 m     |  |  |  |  |  |  |  |

**Reproduce:** ti fem gpu=[t/f] simp1=[t/f] vec=[t/f] threads=[1-8] lower\_access=[t/f] simp2=[t/f] vec\_load\_cpu=[t/f] block\_soa=[t/f]



| Benchmarks: FEM Kerne |          |           |  |  |
|-----------------------|----------|-----------|--|--|
| Ablation              | CPU Time | GPU Time  |  |  |
|                       |          |           |  |  |
|                       |          |           |  |  |
|                       |          |           |  |  |
| No simplification I   | 17.01ms  | 2.13 ms   |  |  |
| No access lowering    | 182.19ms | 6.046 ms  |  |  |
| No simplification II  | 85.51ms  | 11.784 ms |  |  |
| AOS instead of SOA    | 136.03ms | 20.992 ms |  |  |
| All ontimizations on  | 17 16ma  | 2 11 mg   |  |  |

### All optimizations on

Without access lowering, the backend compiler (gcc/clang/nvcc) fails to discover potential vectorized loads and reduce data access

17.16ms 2.11 ms

**Reproduce:** ti fem gpu=[t/f] simp1=[t/f] vec=[t/f] threads=[1-8] lower\_access=[t/f] simp2=[t/f] vec\_load\_cpu=[t/f] block\_soa=[t/f]







| Benchmarks: FEM Kernel |          |           |  |  |  |
|------------------------|----------|-----------|--|--|--|
| Ablation               | CPU Time | GPU Time  |  |  |  |
|                        |          |           |  |  |  |
|                        |          |           |  |  |  |
|                        |          |           |  |  |  |
|                        |          |           |  |  |  |
|                        |          |           |  |  |  |
|                        |          |           |  |  |  |
| AOS instead of SOA     | 136.03ms | 20.992 ms |  |  |  |
| All optimizations on   | 17.16ms  | 2.11 ms   |  |  |  |

**Reproduce:** ti fem  $gpu=[t/f] simp1=[t/f] vec=[t/f] threads=[1-8] lower_access=[t/f] simp2=[t/f] vec_load_cpu=[t/f] block_soa=[t/f]$ 

### **AOS is really bad** in this case since no vectorized ld/st low cacheline util.





## Benchmarks: FEM Kernel

### Ablation

### CPU Time

### All optimizations on

17.16ms 2.11 ms



**Reproduce:** ti fem  $gpu=[t/f] simp1=[t/f] vec=[t/f] threads=[1-8] lower_access=[t/f] simp2=[t/f] vec_load_cpu=[t/f] block_soa=[t/f]$ 



## Vectorized FEM Access Optimization Initial IR for i in range(0, n, step 4): %1 = load voxel 1 from root %2 = load voxel 2 from root %3 = load voxel 3 from root %4 = load voxel 4 from root %9 = make vector(%1,%2,%3,%4) 2 3



### Vectorized FEM Access Optimization After access lowering: for i in range(0, n, step 4): %1 = get block for voxel 1 %2 = get voxel 1 from %1 %3 = get block for voxel 2 %4 = get voxel 2 from %3 %5 = get block for voxel 3 %6 = get voxel 3 from %5 %7 = get block for voxel 4 3 2 4 %8 = get voxel 4 from %7 %9 = make\_vector(%2,%4,%6,%8)



### Vectorized FEM Access Optimization Index analysis for i in range(0, n, step 4): %1 = get block for voxel i+0 %2 = get voxel i+0 from %1 %3 = get block for voxel i+1%4 = get voxel i+1 from %3 %5 = get block for voxel i+2 %6 = get voxel i+2 from %5 %7 = get block for voxel i+32 3 4 %8 = get voxel i+3 from %7 %9 = make\_vector(%2,%4,%6,%8)











### Vectorized FEM Access Optimization With data structure info (block size=16) for i in range(0, n, step 4): %1 = get block (i+0)/16%2 = get voxel i+0 from %1 %3 = get block (i+1)/16%4 = get voxel i+1 from %3 %5 = get block (i+2)/16%6 = get voxel i+2 from %5 %7 = get block (i+3)/16 2 3 4 %8 = get voxel i+3 from %7 %9 = make\_vector(%2,%4,%6,%8)



### Vectorized FEM Access Optimization Index analysis (i % 4 == 0) & and integer division property: for i in range(0, n, step 4): %1 = get block(i+0)/16%2 = get voxel i+0 from %1 %3 = get block(i+0)/16%4 = get voxel i+1 from %3 %5 = get block(i+0)/16%6 = get voxel i+2 from %5 %7 = get block (i+0)/16 2 3 4 %8 = get voxel i+3 from %7 %9 = make\_vector(%2,%4,%6,%8)



### Vectorized FEM Access Optimization **Index analysis (i % 4 == 0) & simplification** for i in range(0, n, step 4): %1 = get block(i+0)/16%2 = get voxel i+0 from %1 %4 = get voxel i+1 from %1 %6 = get voxel i+2 from %1 %8 = get voxel i+3 from %1 $%9 = make_vector(%2,%4,%6,%8)$ 3 2







## Vectorized FEM Access Optimization



Index analysis + data structure info

for i in range(0, n, step 4): %1 = get block i/16

- %2 = get voxel i+0 within %1
- %3 = get 1st voxel right to %2
- %4 = get 2nd voxel right to %2
- %5 = get 3rd voxel right to %2
- $%9 = make_vector(%2, %3, %4, %5)$





## Vectorized FEM Access Optimization



Index analysis + data structure info

for i in range(0, n, step 4): %1 = get block i/16

- %2 = get voxel i within %1
- %3 = vector\_load(%2, width=4)





# **Reasons for Performance**

# Why can't traditional compilers do the optimizations?

# Index analysis Instruction granularity Data access semantics

## The Granularity Spectrum

### x[i, j] access1(i,j) access2(i,j) = false sp = get child [S3->S sl0 = bit\_extract(\$2 sl1 = linearized(ind sl2 = [S2][dense]::lo activate = false scivate = false



\$4 = [S4][root]::lookup(root, \$3) coord = {\$2} activate = false \$5 = get child [S4->S3] \$4 \$6 = bit\_extract(\$2 + 0, 7~14) \$7 = linearized(ind {\$6}, stride {128}) \$8 = [S3][dense]::lookup(\$5, \$7) coord = {\$2} a = false \$9 = get child [S3->S2] \$8 \$10 = bit\_extract(\$2 + 0, 0~7) \$11 = linearized(ind {\$10}, stride {128}) \$12 = [S2][dense]::lookup(\$9, \$11) coord = {\$2}

### Taichi IR

```
%63 = lshr i32 %62, 0
%64 = and i32 %63, 255
\%65 = add i32 \%37, 0
%66 = lshr i32 %65, 0
%67 = and i32 %66, 255
\%68 = add i32 0, \%64
%69 = mul i32 %68, 256
%70 = add i32 %69, %67
%71 = bitcast %struct.DenseMeta* %5 tc
call void @StructMeta_set_snode_id(%st
call void @StructMeta_set_element_siz@
call void @StructMeta_set_max_num_eler
call void @StructMeta_set_lookup_elem@
_element)
call void @StructMeta_set_is_active(%:
call void @StructMeta_set_get_num_eler
lements)
call void @StructMeta_set_from_parent_
```

LLVM IR

movl \$0, %eax addl %eax, %ebx popl %eax looptop: imul %edx andl \$0xFF, %eax cmpl \$100, %eax jb looptop leal 4(%esp), %ebp movl %esi, %edi subl \$8, %edi shrl %cl, %ebx movw bx, -2(bp)movl \$0, %eax addl %eax, %ebx popl %eax looptop: imul %edx andl \$0xFF, %eax cmpl \$100, %eax jb looptop leal 4(%esp), %ebp movl %esi, %edi subl \$8, %edi shrl %cl, %ebx movw bx, -2(bp)

### Machine code



### Hidden Optimization Opportunities

### End2end access Level-wise Access Taichi IR LLVM IR Machine code

### Coarser

### Analysis Difficulty





4

## Data Access Semantics

- + (Seemingly trivial) assumptions that enables compiler optimization:

  - accesses of form **sparse\_grid**[indices]
  - Read access does not modify anything
    - No memory allocation
    - No exception if out of ranges (element does not exist)

• No pointer aliasing:  $\mathbf{a}[x, y]$  and  $\mathbf{b}[i, j]$  never overlaps if a != b• All memory accesses are done through **sparse\_grid**[indices] • The only way data structures get modified, is through write



### Performance



### Performance



### Performance

data structure library +

general-purpose compiler

### 1) data structure abstraction



*Taichi:***10.0x** shorter code**4.55x** higher performance

3) algorithm data structure decoupling

1) data structure abstraction

data structure library + general-purpose compiler

Performance

## DiffTaichi: **Differentiable Programming for Physical Simulation**



## End2end optimization of neural network controllers with gradient descent

(Advertisement)







## DiffTaichi: **Differentiable Programming for Physical Simulation**



## End2end optimization of neural network controllers with gradient descent

(Advertisement)







Source code: <u>https://github.com/yuanming-hu/taichi</u>

All performance numbers from our system are reproducible (commit dc162e11) with a single command.

# Thank you!

# The End

## pip3 install taichi-nightly

