# The Teraflux approach for massive parallel processing on-die

# Prof. Avi Mendelson, Technion & Microsoft R&D, Israel

2nd Workshop on Future Architecture Support for Parallel Programming (FASPP12)









# Teraflux in a nutshell

- An EU research project (FET).
- Assumes 1000s of processors on die
- Connected through a NoC
- No system-wide support for HW coherency
- HW components can become faulty
  - Transient errors
  - Stuck-at faults
- SW needs to make sure it works transparently with respect to potential faults
- Resource allocation and scheduling should be distributed
- Disclaimer: The project examines different potential solutions; this presentation presents my approach





# Fundamental approach (General)

## General purpose

- Targets running any program with reasonable performance and power consumption
- Workloads are mostly assumed to be latency sensitive
- Uses "reverse engineering" (e.g., branch prediction) to uncover the internal structure of the program


## Special purpose

- Targets a specific class of applications
  - Applications that don't fit into this class may not run, or may run very inefficiently.
- Usually uses SW/HW co-design
- Can be an order of magnitude more efficient than general-purpose architectures for its specific class of applications

# Fundamental approach (Teraflux)

- The system is dynamically partitioned between cores that run general-purpose applications and cores that run "special purpose" accelerator code, a.k.a. Teraflux cores.
- The code for the Teraflux cores is based on a special branch of the dataflow paradigm called task parallelism (similar to actors).
- The Teraflux core subsystem is built as a SW/HW co-design.
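To make the dataflow idea concrete, the sketch below shows a generic data-driven firing rule: a task runs only once all of its inputs have arrived, and its result is forwarded to its consumers. This is an illustration of the general paradigm, not the Teraflux implementation; the names `Task`, `connect`, and `run_dataflow` are hypothetical and chosen only for this example.

```python
from collections import deque


class Task:
    """A dataflow task: it may run once all of its inputs have arrived."""

    def __init__(self, name, func, num_inputs):
        self.name = name
        self.func = func
        self.num_inputs = num_inputs
        self.inputs = {}      # input slot -> value received so far
        self.consumers = []   # (consumer task, input slot) pairs

    def is_ready(self):
        return len(self.inputs) == self.num_inputs


def connect(producer, consumer, slot):
    """Route the producer's result into one input slot of the consumer."""
    producer.consumers.append((consumer, slot))


def run_dataflow(sources):
    """Fire tasks in data-driven order; `sources` are tasks with no inputs."""
    ready = deque(sources)
    results = {}
    while ready:
        task = ready.popleft()
        args = [task.inputs[i] for i in range(task.num_inputs)]
        value = task.func(*args)
        results[task.name] = value
        # Forward the result to every consumer; a consumer becomes
        # ready exactly when its last missing input arrives.
        for consumer, slot in task.consumers:
            consumer.inputs[slot] = value
            if consumer.is_ready():
                ready.append(consumer)
    return results


# Example graph: (a + b) * (a - b) with a = 7 and b = 3.
a = Task("a", lambda: 7, 0)
b = Task("b", lambda: 3, 0)
add = Task("add", lambda x, y: x + y, 2)
sub = Task("sub", lambda x, y: x - y, 2)
mul = Task("mul", lambda x, y: x * y, 2)

connect(a, add, 0)
connect(b, add, 1)
connect(a, sub, 0)
connect(b, sub, 1)
connect(add, mul, 0)
connect(sub, mul, 1)

results = run_dataflow([a, b])
print(results["mul"])  # 40
```

Note that `add` and `sub` have no dependency on each other, so a parallel runtime would be free to run them on different cores; the sequential loop here only demonstrates the ordering constraint.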























# How it works

- The compiler generates dataflow (DF) code from sequential code (e.g., C) or from programming languages that support parallelism (e.g., OpenMP, Java, Scala).
- Execution always starts on the service cores, which generate the tasks (tokens) and send them to the different clusters.
- All tasks sent to a cluster are kept in a "safe memory" queue and are scheduled onto cores by the TSU.
- After a task finishes executing, and assuming no fault happened, its results are written to the task memory and the TSU is notified that the results can be written back to main memory. After the global memory has been updated successfully, the task is removed from the cluster's queue.
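The commit-then-remove ordering described above can be sketched as follows. This is a minimal software model under stated assumptions, not the actual TSU design: the class `Cluster`, its methods, and the random fault injection are all hypothetical, and only transient faults are modeled (a stuck-at fault would presumably require rescheduling on a different core, which this sketch does not cover).

```python
import random
from collections import deque


class Cluster:
    """Toy model of one cluster: a task queue plus a scheduler loop."""

    def __init__(self, fault_rate=0.0, seed=0):
        self.queue = deque()               # stands in for the "safe memory" queue
        self.fault_rate = fault_rate       # assumed probability of a transient fault
        self.rng = random.Random(seed)
        self.executions = 0                # counts attempts, including failed ones

    def submit(self, task_id, func, args):
        """Called by a service core to send a task to this cluster."""
        self.queue.append((task_id, func, args))

    def _run_on_core(self, func, args):
        """Run a task on a core; return (ok, result). Faults are simulated."""
        self.executions += 1
        if self.rng.random() < self.fault_rate:
            return False, None
        return True, func(*args)

    def run(self, main_memory):
        while self.queue:
            task_id, func, args = self.queue[0]    # peek: the task stays queued
            ok, task_memory = self._run_on_core(func, args)
            if not ok:
                continue                           # fault: re-execute the same task
            main_memory[task_id] = task_memory     # write results back to main memory
            self.queue.popleft()                   # remove only after the commit


# A service core generates the tasks and sends them to the cluster.
main_memory = {}
cluster = Cluster(fault_rate=0.3, seed=1)
for i in range(4):
    cluster.submit(i, lambda x: x * x, (i,))
cluster.run(main_memory)
print(main_memory)  # {0: 0, 1: 1, 2: 4, 3: 9}
```

Because a task leaves the queue only after its results reach main memory, a fault during execution loses nothing: the task is still queued and is simply run again. This relies on tasks having no side effects outside their own task memory until the commit step.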




























