

University of Siena



Barcelona Supercomputing Center TERA<sup>F</sup>LUX.EU

Exploiting Dataflow Parallelism in Teradevice Computing



### THALES



University of Augsburg



Roberto Giorgi – University of Siena (coordinator)

Cagliari, Italy - Computing Frontiers

16/05/2012



University of Cyprus



1



FUNDING OPPORTUNITIES from the

FUTURE & EMERGING TECHNOLOGIES scheme



University of Manchester

### What is **TERAFLUX** about

Architecture+Programmability+Reliability of Future (single chip) Many-cores (targeting 1000+ cores)



### Future Scenarios == 3D stacking, 8nm, 3D transistors, Graphene



### Fundamental approach: DATAFLOW

A Scheme of Computation in which an activity is initiated by presence of the data it needs to perform its function (Jack Dennis)



# Recent Projects/Efforts towards DATAFLOW

- Maxeler (UK) selling "dataflow computer" to J.P. Morgan → about 350x speedup vs. standard x86 cores
- DARPA funding 25M\$ for UPHC program, encompassing:
  - Gao's dataflow execution model (codelet based) – SWARM by ETI
  - Intel's Runnamede project

The Intel-lead UHPC team intends to develop new circuit topologies, new chip and system architectures, and new programming techniques to reduce the amount of energy required per computation by between 100x and 1000x compared to today's computing systems. Such dramatic reduction in energy consumption will allow these future systems to take full advantage of the increasing transistor budgets afforded by the steady advances in Moore's Law.





J.P. Morgan Deploys Maxeler Dataflow Supercomputer for Fixed Income Trading December 15, 2011

UPHC=Ubiquitus High-Performance Computing

# TERA<sup>F</sup>LUX – Future Many-cores

Key Challenges: Architecture+Programmability+Reliability



### tera<sup>f</sup>lux



### TERA<sup>F</sup>LUX.EU Working Hypothesis

- 1000 Billion- or 1 TERAdevice computing platforms pose new challenges:
  - (at least)
     programmability,
     complexity of design,
     reliability
- TERAFLUX context:
  - High performance computing and applications (not necessarily embedded)
  - TERAFLUX scope:

•

Exploiting a less
 exploited path
 (DATAFLOW) at each
 level of abstraction

Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu



#### **TERAFLUX** Architectural template

#### LEGENDA:

- n = # of nodes
- m = # of cores per node
- u= # of DRAM controllers insisting on the Unified Physical Address Space
- z = # of I/O Hubs

Nk = k-th Node (k=1..n) NI = Network Interface NoC = Network on Chip



Cj = j-th core (j=1..m) MC = Memory Controller DTSU = Distributed Thread-Scheduler Unit DFDU = Distributed Fault-Detection Unit LL\$H = Last Level Cache Hierarchy



CL\$H = Core Level Cache Hierarchy PU = Processing Unit LTSU = Local Thread-Scheduler Unit LFDU = Local Fault-Detection Unit



## Our pillars

- FIXED and MOST-USED ISA (x86)
- MANYCORE FULL SYSTEM SIMULATOR (COTSon)
- REAL WORLD APPLICATIONS (e.g. GROMACS)
- SYNCHRONIZATION: TRANSACTIONAL MEMORY
- GCC based TOOL-CHAIN
- OFF-THE-SHELF COMPONENTS FOR CORES, OS, NOC, MEMORY HIERARCHY
- FDU AND TSU (Fault Detection Unit and Thread Scheduling Unit)

### A REVIEW OF RECENT MANY-CORES

TERA<sup>F</sup>LUX

### HotChips 2011

 Hot Chips papers suggest that the rest of the world is moving in a different direction: large numbers of relatively simple CPUs. But the trend is reinforcing a long-appreciated set of questions—as the number of cores grows, **how** do you deal scalability with interconnect, memory hierarchy, coherency, and intra-thread synchronization? Answers to these questions depend on the size of the design, the application space, and the heritage of the design team.

tera<sup>F</sup>lux

### SARC Architecture



Figure 1. Schematic of the SARC architecture. The number of masters, workers, level-2 (L2) blocks, and memory interface controllers is implementation dependent, as is their on-chip layout.

Table 1. Baseline SARC simulation parameters.

| Parameter                                     | Value                                                   |
|-----------------------------------------------|---------------------------------------------------------|
| Clock frequency                               | 3.2 GHz                                                 |
| Memory controllers                            | $4 \times 2$ DDR3 channels                              |
| Channel bandwidth                             | 12.8 Gbytes per second (GBps)<br>(DDR3-1600)            |
| Memory latency                                | Real DDR3-1600                                          |
| Memory interface controllers<br>(MICs) policy | Closed-page, in-order processing                        |
| Shared L2 cache                               | 128 Mbytes (32 blocks æ 4 Mbytes),<br>4-way associative |
| L2 cache latency                              | 40 cycles                                               |
| Local store                                   | 256 Kbytes, 6 cycles                                    |
| L0 cache                                      | 32 Kbytes, 3 cycles                                     |
| Interconnection links                         | 8 bytes/cycle (25.6 GBps)                               |
| Intracluster network on chip (NoC)            | 2-bus (51.2 GBps)                                       |
| Global NoC                                    | 16-bus (409.6 GBps)                                     |

More recently (20110908), Dimitris Nikoloupos confirmed me that there won't be anymore the LS as it will be integrated in the L2.

#### The SARC architecture, IEEE micro, Oct. 2010, vol. 30, n. 5, pp. 16-29 TERA LUX Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu

### IBM BlueGene/Q



Rather than replicate more than 18 of these large cores, IBM chose to give each core hardware support for four concurrent threads, so under ideal circumstances the chip can behave almost as a 64-CPU system

 Ruud Haring, The IBM Blue Gene/Q Compute chip+SIMD floating-point unit, HotChips Symposium, Aug 2011.
 TERAFLUX

### BlueGene/Q module and system



The BlueGene/Q module with DDR3 memory, five links and a water cooling system.

Thanks to the 64-bit support, the modules can now run 8 or 16 GB of DDR3 memory. Five links (2 GB/s per direction) connect each module to its neighbors, making it possible to create different 5D topologies. Half a rack with 8192 BlueGene/Q cores has already proven its capabilities in the Linpack benchmark. With 65.3 Tflops, the test system from the Thomas J. Watson Research Center scored 115th place in the new Top500 list, its power consumption of 38.8 kW represented a new record value for energy efficiency at close to 1700 Mflops/watt. The Sequoia is supposed to get 96 fully equipped racks, which are supposed to deliver 20 Pflops of theoretical peak performance at the end of 2012. **TERASLINX** — translation from the original article from German in c't by Marcel Sieslack

### Cavium – Octeon II CN6880



Cavium relies primarily on locks for synchronization, assisted by facilities in the scheduler hardware

**TERA** OCTEON II CN6880 Multi-Core MIPS64 Processor, HotChips Symposium, Aug. 2011. Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu

# Octeon II (2)



#### TERA<sup>F</sup>LUX



Fig.1. Overview of Godson-T.

Cui HM, Wang L, Fang DR *et al.* Landing stencil code on Godson-T. JOURNAL OF COMPUTER SCIENCE ANDTECHNOLOGY 25(4): 886–894 July 2010.

#### TERA<sup>F</sup>LUX

### SPARC64<sup>™</sup> VIIIfx Chip Overview



### Architecture Features

- 8 cores
- Shared 5 MB L2\$
- Embedded Memory Controller
- 2 GHz

### Fujitsu 45nm CMOS

- 22.7mm x 22.6mm
- 760M transistors
- 1271 signal pins
- Performance (peak)
  - 128GFlops
  - 64GB/s memory throughput
- Power
  - 58W (TYP, 30°C)
  - Water Cooling Low leakage power and High reliability

# **K-Computer**

### Compute nodes and Network



Mitsuo Yokokawa, Fumiyoshi Shoji, Atsuya Uno, Motoyoshi Kurokawa, and Tadashi Watanabe. 2011. The K computer: Japanese next-generation supercomputer development project. In Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design (ISLPED '11).



# K-Computer (2)

796mm

### Packaging of the system

 A rack consists of 24 system boards, 6 IO boards, power supply units, system storages, and diagnostic processors.

✓ A hose pipe is connected to the water loop under the floor.



• S. Fumiyoshi, The K Computer: Project Overview **TERA<sup>F</sup>LUX** 

### **IBM Cyclops-64**



Figure 1: IBM Cyclops-64 (C64) Many-Core Architecture: The architecture consists of 80 processors (Processor 0 -79). Each processor has two Thread Units (TUs) called TU 0 and TU 1. Both share one Floating-Point Unit (FPU) and one crossbar port (MPG). Each TU is connected to a SRAM bank, which can be accessed by all other TUs via the crossbar. Ten TUs share one Instruction Cache (IC). The system has four on-chip DDR2 memory controllers to access off-chip memory. The A-Switch is used to connect to the six surrounding neighbors in a 3D-mesh network.

TERATE Divutzka, Yuhei Hayashi, Joseph B. Manzano, and Guang R. Gao. 2011. The elephant and the mice: the role of non-strict fine-grain synchronization for modern many-core architectures. In *Proceedings of the international conference on Supercomputing* (ICS '11). ACM, New York, NY, USA, 338-347.

### **TERAFLUX RESEARCH OVERVIEW**



# WP2 - APPLICATIONS

- MPI applications
- Measures with executions with 16, 32, 64, 128 procs.
  - Linear regression projections
- Off-chip Memory Bandwidth
- 1K cores will need more than 256GB/s sustained bandwidth
- 3x than current DDR3
- Total memory footprint for MPI applications
- Total memory footprint increases with the number of processors
- Manycores with more than 100 cores will require a few dozens GBs of main memory
- Alternative programming models are required to deal with memory requirements for scalability





Overall memory bandwidth projections

### WP3 – PROGRAMMING MODEL

Transactions and Dataflow additions to Scala

- Modified to the Scala compiler to include transactional constructs surveyed other possibilities using closures
- Runtime STM support
- Statically Typed Dataflow Library
- Reimplementation of the Scala parallel collection using dataflow plus transactions
- Analysis for Lee-TM of benefits of Dataflow plus transactions

### http://apt.cs.man.ac.uk/projects/TERAFLUX/MUTS/

Daniel Goodman, Behram Khan, Salman Khan, Chris Kirkham, Mikel Lujan and Ian Watson. MUTS: Native Scala Constructs for Software Transactional Memory. In: Scala Days Workshop, Stanford, California, June 2-3, 2011.

Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu

TERA<sup>F</sup>LUX

### WP4 – COMPILATION TOOLS Compilation for Dataflow Threads Automatic DF Thread Extraction



TERA<sup>F</sup>LUX

### WP5 - RELIABILITY

- DOUBLE EXECUTION detects control flow AND data errors
- Runs each thread twice, once as a leading thread *t* and second time as a trailing thread *t'*
- The duplicated threads can run on the same core or on different cores
   of the same node/cluster



- Each execution generates signature of output results
- At completion compare the two signatures, if consistent, the D-TSU writes its results to subsequent thread frames
- If not, no commitment and recovery

Sebastian Weis, Arne Garbade, Julian Wolf, Bernhard Fechner, Avi Mendelson, Roberto Giorgi, and Theo Ungerer. A Fault Detection and Recovery Architecture for a Teradevice Dataflow System. In Data-Flow Execution Models for Extreme Scale Computing (DFM) 2011 Workshop Proceedings. IEEE Computer Society, 2011

### tera<sup>F</sup>lux

### WP6 - ARCHITECTURE



TERAFLUX

#### .x86 assembly (parallel)

| main: | movq<br>movq<br>cmpq<br>TSCHEDULE<br>TSCHEDULE<br>TSCHEDULE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE<br>TWRITE | \$4,<br>\$4,<br>\$1,<br>\$add,<br>\$mult,<br>\$div,<br>\$R8,<br>%R9,<br>%R12,<br>%R8,<br>%R9,<br>%R12, | %R8<br>%R9<br>%RAX<br>\$3,<br>\$2,<br>%R10,<br>%R10,<br>%R10,<br>%R11,<br>%R11,<br>%R11, | %R10<br>%R11<br>%R12<br>\$2<br>\$3<br>\$4<br>\$2<br>\$3<br>\$4<br>\$2<br>\$3<br>\$4 | add: | TREAD<br>TREAD<br>TREAD<br>movq<br>addq<br>TWRITE<br>TDESTRO | \$2,<br>\$3,<br>\$4,<br>%R8,<br>%R9,<br>%R11,     | %R8<br>%R9<br>%R10<br>%R11<br>%R11<br>%R10, | \$2 |
|-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|------|--------------------------------------------------------------|---------------------------------------------------|---------------------------------------------|-----|
| mult: | TREAD\$2,TREAD\$3,TREAD\$4,movq%R8,mulq%R9movq%RAX,TWRITE%R11,TDESTROY                                                                                          | %R8<br>%R9<br>%R10<br>%RAX<br>%R11<br>%R10,                                                            | \$3                                                                                      |                                                                                     | div: | TREAD<br>TREAD<br>movq<br>movq<br>divq<br>movq<br>TDESTRO    | \$2,<br>\$3,<br>\$0,<br>%R9,<br>%R8<br>%RAX,<br>Y | %R8<br>%R9<br>%RDX<br>%RAX<br>%R10          |     |

TERA<sup>F</sup>LUX

┛

### T\* (or T86) ISE: TSCHEDULE/TDESTROY

|                 | T* INSTRUCTIONS                                                                                                                                                                                                                                                                                                                                                                                              | IMPLIED COMPILER TARGET                                                |  |  |  |
|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|--|--|--|
| <u>Synopsis</u> | <u>TSCHEDULE RS1, RS2, RD</u>                                                                                                                                                                                                                                                                                                                                                                                | TSCHEDULE( <ip>, <sc>, &amp;<frame_pointer>)</frame_pointer></sc></ip> |  |  |  |
| Description     | This instruction allocates the resources (a DF-frame of size RS2 words and a corresponding entry in the Distributed Thread Scheduler – or DTS) for a new DF-thread and returns its Frame Pointer (FP) in RD. RS1 specifies the Instruction Pointer (IP) of the first instruction of the code of this DF-thread and RS2 specifies the Synchronization Count (SC). Finally the zero flag is logically negated. |                                                                        |  |  |  |
| Notes           | The allocated DF-thread is not executed until its SC reaches 0. The TSCHEDULE can be conditional or non-conditional based on the value stored in the zero flag. If the zero flag is set to 1 then the TSCHEDULE will take effect, otherwise it is ignored. Two subsequent TSCHEDULE implement an if-then-else.                                                                                               |                                                                        |  |  |  |
| <u>Synopsis</u> | <u>TDESTROY</u>                                                                                                                                                                                                                                                                                                                                                                                              | <u>TDESTROY</u>                                                        |  |  |  |
| Description     | The thread that invokes TDESTROY finishes and its DF-frame is freed, (the corresponding entry in the Thread Scheduling Unit is also freed).                                                                                                                                                                                                                                                                  |                                                                        |  |  |  |
| Notes           | -                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                        |  |  |  |

**TERA** LUX cores", *HIPEAC ACACES-2011*, ISBN:978 90 382 17987, Fiuggi, Italy, July 2011, pp. 277-280.

# T\* (or T86) ISE: TWRITE/TREAD

|                 | T* INSTRUCTIONS                                                                                                                                                                                  | IMPLIED COMPILER TARGET                                                                                                       |  |  |  |
|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| <u>Synopsis</u> | <u>TWRITE RS, RD, offset</u>                                                                                                                                                                     | <pre>*(<frame_pointer> + <offset>) = (<source_register>)</source_register></offset></frame_pointer></pre>                     |  |  |  |
| Description     | The data in RS is stored into the DF-frame pointed to by RD at the specified offset.                                                                                                             |                                                                                                                               |  |  |  |
| Notes           | Side Effect: The Distributed Thread Scheduler decrements the SC of the corresponding DF-thread entry (located through the FP): $SC_{FP} = SC_{FP}-1$                                             |                                                                                                                               |  |  |  |
| <u>Synopsis</u> | <u>TREAD offset, RD</u>                                                                                                                                                                          | <pre>(<destination_register>) = *(<self_frame_pointer> + <offset>)</offset></self_frame_pointer></destination_register></pre> |  |  |  |
| Description     | Loads the data indexed by 'offset' from the self (current thread) DF-frame into RD.                                                                                                              |                                                                                                                               |  |  |  |
| Notes           | Assumption: the DTS has to load into the register implicitly used by TREAD the value <self_frame_pointer>. In a x86-64 implementation, we can reserve RAX for this purpose.</self_frame_pointer> |                                                                                                                               |  |  |  |

#### TERA<sup>F</sup>LUX

# T\* (or T86) ISE: TALLOC/TFREE

|                 | T* INSTRUCTIONS                                                                                                                      | IMPLIED COMPILER TARGET                                     |  |  |  |
|-----------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|--|--|--|
| <u>Synopsis</u> | TALLOC RS1, RS2, RD                                                                                                                  | <pointer> = TALLOC (<size>, <type>)</type></size></pointer> |  |  |  |
| Description     | Allocates a block of memory of RS1 words. The pointer to it is stored in RD. RS2 specifies the special purpose memory type.          |                                                             |  |  |  |
| <u>Notes</u>    | The Distributed Thread Scheduler tracks the memory allocated. An implementation can code <type> in the 2 LSB of <size></size></type> |                                                             |  |  |  |
| <u>Synopsis</u> | <u>TFREE (RS)</u>                                                                                                                    | TFREE( <pointer>)</pointer>                                 |  |  |  |
| Description     | Frees memory pointed to by RS.                                                                                                       |                                                             |  |  |  |
| Notes           | The Thread Scheduling Unit tracks the memory deallocated.                                                                            |                                                             |  |  |  |

### DTS – Distributed Thread Scheduler

(formerly called TSU)



### TERA<sup>F</sup>LUX

Distributed Thread Scheduler

A 2-node x 4-core example



- Core (Cjk)
  - Off-the-shelf cores (may include L1, L2slice)
  - Minimal ISA extension
- Local Thread Scheduling Unit (LTSU)
  - Manages threads on this Core
- Distributed Thread Scheduling Unit (DTSU)
  - Distributes threads among Nodes
- Node
  - Groups Cores+Resources

# Scheduling Example



- LTSU11  $\rightarrow$  DTSU1:
  - I need a new frame
- DTSU1  $\rightarrow$  LTSU13:
  - You're available, give one frame to LTSU11
- LTSU13  $\rightarrow$  LTSU11:
  - Here is a frame you can use

Fast communication  $\rightarrow$  low overhead What if every PE in the cluster is busy?

- LTSU14  $\rightarrow$  DTSU1:
  - "I need a new frame"
- DTSU1  $\rightarrow$  DTSU2:
  - LSE14 needs a frame, I don't have it
- DTSU2  $\rightarrow$  LTSU22:
  - Give a frame to LSE14
- − LTSU22  $\rightarrow$  LTSU14:
  - Here is a frame you can use

### Fine-Grain Thread Scheduling

|              | Plurality        | CUDA                | TFlux              | SSI               | DTA                |
|--------------|------------------|---------------------|--------------------|-------------------|--------------------|
| HW/SW        | HW + SW          | HW + SW             | SW                 | HW                | HW                 |
| Prog. Model  | Custom C         | Custom C            | Custom C           | Standard thread-  | Standard thread-   |
|              | language         | language            | preprocessor       | oriented C        | oriented C         |
|              | extensions       | extensions          | macros             | libraries         | libraries          |
|              |                  |                     | (#pragma)          | (pthreads)        | (pthreads)         |
| Exec. Model  | - Pool of RISC   | - Each thread       | High-level         | Subset static     | - Decoupling       |
|              | processors       | block is split into | threads are        | interleaved       | memory and         |
|              | - Uniform shared | warps (thread       | mapped to OS       | scheduling of     | execution activity |
|              | memory           | block.              | threads, using the | fine-grained /    | of non-blocking    |
|              | - Hardware       | - Each thread       | standard OS        | coarse-grained    | threads.           |
|              | scheduler,       | block is executed   | programming        | threads           | - Threads are      |
|              | synchronizer and | by only one         | interfaces as      | performed by a    | synchronized and   |
|              | load balancer    | multiprocessor.     | backend.           | hardware Task     | communicate        |
|              |                  | - A multi-          |                    | Scheduling Unit   | each other in a    |
|              |                  | processor can       |                    | (TSU)             | producer-          |
|              |                  | execute several     |                    |                   | consumer           |
|              |                  | blocks              |                    |                   | fashion.           |
|              |                  | concurrently.       |                    |                   |                    |
| Architecture | Complex          | SIMD, on-chip       | Everything is      | Hardware          | Thread             |
|              | hardware         | shared memory       | implemented in     | scheduler (TSU)   | management and     |
|              | subsystem:       |                     | software: can run  | + extensions to a | frame memory       |
|              | synchronizer,    |                     | on any general     | VLIW prototype    | management         |
|              | scheduler, load  |                     | purpose            |                   | implemented in     |
|              | balancer         |                     | architecture       |                   | hardware           |

SSI Subset Static Interleaved

DTA=Decoupled Threaded Architecture

### fib(21): number of threads



# WP7: Evaluating a MANY-CORE chip of the future (2020), i.e., 1000+ cores on a chip



### AMD SIMnow and COTSon



### **COTSon Overview**

Time Synchronization, Simulation Parallelization, Network Instrumentation, Network Statistics, ...



The ambition of TERAFLUX is however to be able of *changing* such machine in a flexible way, while tackling research challenges on programmability, architectural design and reliability. Therefore, we have the need to stress the COTSon platform, in order to being able to simulate 1000 cores.



Comparison among different approaches for doing research related to 1000-core computing system (Information revised from data of the RAMP project)

|                           | SMP                     | Cluster                 | FPGA                     | Emulator                 | Simulator                  |
|---------------------------|-------------------------|-------------------------|--------------------------|--------------------------|----------------------------|
| Scalability<br>(1K cores) | С                       | A                       | A                        | A                        | A                          |
| Cost (1K cores)           | F(€40M)                 | С                       | B(€0.1-0.2M)             | A+ (€0.01M)              | A+(€0.01M)                 |
| Power/Space (Kw, racks)   | D (120 kw, 12<br>racks) | D (120 kw, 12<br>racks) | A (1.5 kw, 0.3<br>racks) | A+ (0.1 kw,<br>0.1racks) | A+ (0.1 kw, 0.1<br>racks ) |
| Observability             | D                       | С                       | A+                       | A+                       | A+                         |
| Reproducibility           | В                       | D                       | A+                       | A+                       | A+                         |
| Reconfigurability         | D                       | С                       | A+                       | A+                       | A+                         |
| Credibility               | A+                      | A+                      | B+/A-                    | F/D                      | С                          |
| Development time          | В                       | В                       | С                        | A+                       | A+                         |
| Performance (clock)       | A (2GHz)                | A(3GHz)                 | C (0.1 GHz)              | B(≈ 0.9 of original)     | C(1/10 to 1/1000<br>SMP)   |
| x86-64 ISA                | A+                      | A+                      | F                        | A+                       | A+                         |
| Modifiable                | F                       | F                       | В                        | А                        | А                          |
| GPA                       | D                       | D                       | B+/A-                    | В                        | А                          |

### FUNCTIONAL/TIMING SIMULATION



Source: Multifacet Project (<u>www.cs.wisc.edu/multifacet</u>) -[Mauer02-sigmetrics-Full\_System Timing\_First Simulation] Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu

accuracy

### COTSon: FUNCTIONAL-DIRECTED

- A variant of "functional first"
  - Adds timing feedback at coarse granularity (100s – 1000s of instructions)
- Applications see an approximation of time
  - May miss some fine-grain timing interaction
- Compatible with fast (caching) emulators and samplers



### **OTHER RECENT X64 SIMULATORS**

|                            | Sniper | Graphite | Gem5                  | COTSon     | MARSSx86 |
|----------------------------|--------|----------|-----------------------|------------|----------|
| Timing-directed/integrated |        |          | Х                     |            |          |
| Func-directed              | Х      | Х        |                       | Х          | Х        |
|                            |        |          |                       |            |          |
| User-level                 | Х      | Х        | Х                     | Х          | Х        |
| Full-system                |        |          | Х                     | Х          | Х        |
|                            |        |          |                       |            |          |
| Archs Supported            | x64    | x64      | x64<br>Alpha<br>SPARC | x64        | x64      |
| Parallel (in-node)         | Х      | Х        |                       | Multi-node |          |
| Shared caches              | Х      |          | Х                     | Х          | Х        |

Heirman120401-ISPASS Tutorial The SNIPER multi-core simulator

#### Simulation booting up 1024 cores. (1) COTSon execution of 32 SimNow instances. (2) Each instance manages 32 cores. Host: 48 cores, 256 GB memory



TERAFLUX

Note: the simulation is PARALLEL at GUEST NODE-LEVEL and it's also possible to distribute the simulation on several HOST NODES 45

### **INITIAL EXPERIMENTAL RESULTS**

The proposed simulation framework has been validated running applications and benchmarks on a target machine with up to 1024 cores, operating in accordance with dataflow principle on standard cores

We run several applications and benchmarks based on well established programming models (mainly OpenMP and MPI):



- NAS Parallel Benchmark (NPB)
- Graph500 and HPL 2.0 Linpack
- Sequential Recursive Fibonacci

### Major Technical Innovations in TERAFLUX

- Fragmenting the Applications in Finer grained DF-threads:
  - DF-threads allow an easy way to decouple memory accesses, therefore hiding memory latencies, balancing the load, managing fault, temperature information without fine grain intervention of the software.
- Possibility to repeat the execution of a DF-thread in case this thread happened to be on a core later discovered as faulty
- Taking advantage of a "direct" dataflow communication of the data (through what we call DF-frames).
- Synchronizing threads while taking advantage of native dataflow mechanism (e.g. several threads can be synchronized at a barrier)
  - DF-threads allow (atomic ) Transactional semantics (DF meets TM)
- A Thread Scheduling Unit would allow fast thread switching and scheduling, besides the OS scheduler; scalable and distributed
- A Fault Detection Unit works in conjunction with TSU

### TERAFLUX SIMULATOR (COTSon)

# http://cotson.sf.net

### HP-Labs COTSon is **OPEN-SOURCE**



Roberto Giorgi – giorgi@unisi.it --- http://teraflux.eu



# TERA<sup>F</sup>LUX

#### **Exploiting dataflow parallelism in Teradevice Computing**

**PROJECT NUMBER: 249013** 

http://teraflux.eu



### **BACKUP SLIDES**



### TERAFLUX TOOLCHAIN (Jan. 2011)





- FreeFrameTable (CID Core ID, FFN number of free frames)
  - Keeps track of the occupancy of processors inside a cluster
  - Updated on each TDestroy and accepted TSchedule
- TCRQ ThreadScheduleRequestQueue
  - Holds unserved ThreadScheduleRequest messages
  - Message is pushed into queue when there are no free local resources
  - Message is popped from the queue when either TDestroyor BroadcastResponse arrives
- Control logic
  - Responsible for both inter and intra node communication and updating the messages inside a scheduler



- Map table (V Valid, M Mapped, ThID thread ID, Fptr Frame pointer)
  - Keeps track of the issued resource requests for the execution of new threads
  - a ThID is assigned to a thread when new Tschedule instruction occurs; it is used just inside that core
  - a Fptr is assigned when a TScheduleResponse message arrives; it is unique globally in the system
  - Can be cleared on the thread completion
- Store buffer (V valid, ThID thread ID, offset for storing in a frame, DATA data to store)
  - Keeps track of the TStores issued for the threads that didn't receive a TScheduleResponse yet (those kept in Map table and still not mapped)
  - On each TStore for the new thread that still doesn't have resources assigned, a new entry is created
  - When TScheduleResponse arrives, all entries are checked and TStore messages are sent (entry invalidated) if there is any matching
  - If TStore occurs for a thread that already has its resources assigned, there is no need to use the buffer

## **Distributed Thread Scheduler Unit**

- On new TScheduleRequestMessage checks the availability in local node
  - If yes forwards it
  - If no put the message in FRQ and send broadcast
- Message is removed from FRQ when FfreeMessage or BroadcastResponse arrive
- Other messages are just forwarded



# Local Thread Scheduling Unit

- On TScheduleRequestMessage
  - Choose a free frame for execution of the new thread
  - Send TScheduleResponseMessage to the issuing processor
- On TScheduleResponseMessage simply update the continuation with the frame identifier
- On store send DataStore message (group them if the destination is the same)



### Non-blocking resource assignment (1)

- Avoid waiting from the distributed scheduler by introducing Virtual frame pointers
- Two additional structures map table and store buffer



Blocking frame assignment

Non-blocking frame assignment

Even if we don't speed-up the starting time of new threads, execution time is shorter and processor becomes free earlier

### One Physical Machine running two Virtual Machine instances that communicate through the Virtual Network (Mediator).

#### ADVANTAGES

This setup can run both on a real machines (at least at smal scale for tests) AND on the COTSon simulator It allows us to modify system parameters like e.g. number o. cores in each simulated instance.

It allows for a parallelization of the simulation (the several instances are running in parallel on the available cores – load balancing automatically provided by the Host OS scheduler). Possible to avoid copying buffers among instances because

they reside in the Host Shared Memory Network Possibility to take advantage of RVI/VT-x virtualization mechanisms across different Physical Machines (under development).

The communication and synchronization among the simulation instances adds up to the Application traffic, but could bypass TCP/IP and avoid using the Physical Interconnection Network. No need to use the Physical Network.



#### DISADVANTAGES

- Taking into account that we aim to flexibly change the programming model and architecture (e.g. the dataflow based execution model and architecture), this setup may end up in poor performance when N (number of nodes) increases.
- Tightens the Application to the Machine, which is exactly the opposite direction that we follow globally in TERAFLUX: we aim to decouple the Application (WP2) from the Machine with appropriate Programming Models (WP3), Compilation Tools (WP4) and Execution Models (WP6).
- The MPI run-time is constantly involved to appropriately schedule the ready tasks/threads on the available nodes.
- The Physical architecture that is more natural to model is a Distributed Machine not like the general one we aim in TERAFLUX.



### VM instances governed by a Single Source Image (SSI) OS

#### ADVANTAGES

- Allows us to run Shared Memory applications like OpenMP ones (can still run MPI as if it was a single big node).
- Can run both on a real machines (at least at small scale for tests) and on the COTSon simulator.
- It allows us to modify system parameters like e.g. number of cores in each simulated instance.
- It allows for a parallelization of the simulation (the several instances are running in parallel on the available cores – load balancing automatically provided by the Host OS scheduler).
- Possible to avoid copying buffers among instances because they reside in the Host Shared Memory Network
- Possibility to take advantage of RVI/VT-x virtualization mechanisms across different Physical Machines (under development).
- The communication and synchronization among the simulation instances adds up to the Application traffic, but could bypass TCP/IP and avoid using the Physical Interconnection Network.
- Load Balancing for the Application is managed by the Guest OS
- No need to use the Physical Network.

|            |            | APPLICATION           |                 |        |
|------------|------------|-----------------------|-----------------|--------|
|            |            | OS Guest              |                 |        |
| SIMNow (0) | SIMNow (1) |                       | SIMNow<br>(999) | COTSon |
|            |            | OS Host               |                 |        |
|            |            | CPU (1 or more cores) |                 |        |

#### DISADVANTAGES

This setup requires the use of a Distributed OS as Guest OS (like e.g., Kerrighed [KERRIGHED10], which offers the view of a unique SMP machine on top of a cluster) or in

general a SSI (Single System Image) OS. Relatively poor performance when N (number of nodes) increases;

Partially tightens the Application to the Machine, which is in the opposite direction in respect to what we follow globally in TERAFLUX: we aim to decouple the Application (WP2) from the Machine with appropriate Programming Models (WP3), Compilation Tools (WP4) and Execution Models (WP6).

The underlying Guest Architecture is a "cluster", which is then more naturally mapped to a physical Distributed Machine not a generic one like we aim in TERAFLUX.

## One core aware of all the other cores

#### ADVANTAGES

- Allows us to run Shared Memory applications like OpenMP ones (can still run MPI as if it was a single big node).
- Can run both on a real machines (at least at small scale for tests) AND on the COTSon simulator as provided at the Month-1 of the TERAFLUX project.
- It allows us to modify system parameters like e.g. number of cores in each simulated instance.
- It allows for a parallelization of the simulation (the several instances are running in parallel on the available cores load balancing automatically provided by the Host OS scheduler).
- Possible to avoid copying buffers among instances because they reside in the Host Shared Memory Network.
- Possibility to take advantage of RVI/VT-x virtualization mechanisms across different Physical Machines (under development).
- The communication and synchronization among the simulation instances adds up to the Application traffic, but could bypass TCP/IP and avoid using the Physical Interconnection Network.
- Load Balancing for the Application is managed by the Guest OS
- No need to use the Physical Network.
- No need to use a very different OS like an SSI OS.
- The underlying Guest Architecture is a shared memory machine, however thanks to the availability of a global address space, there is now full possibility of evolving the machine in a more "general one" like the one we aim to evolve during the TERAFLUX project. The TERAFLUX Execution Model can decouple completely the architecture of the machine.



#### DISADVANTAGES

- Relatively poor performance when N (number of nodes) increases; however, as other simulator like COREMU [Wang11] already demonstrated a high speed up in simulations even with 255 cores, we have good confidence that we can improve much the simulation speed going in a similar direction.
- Requires some patches to the Linux OS; however we shall need to patch anyway the Memory Manager and the Scheduler in order to properly support the TERAFLUX threads

# NAS benchmarks running in COTSon

- Machine 37nodes of 4 cores. One node Master and 36 Slaves
- Two examples from the set:

| NAS Parallel Benchmarks 3.3 BT<br>No input file inputbt.data. Using con<br>Class A: Size: 64x 64x 64 |           | Time in seconds | Mop/s total M | lop/s/process |
|------------------------------------------------------------------------------------------------------|-----------|-----------------|---------------|---------------|
| Iterations: 200 dt: 0.0008000                                                                        | BT 1-4    | 4 398.93        | 421.84        | 11.72         |
| Number of active processes: 36                                                                       | BT 2-4    | 422.08          | 398.7         | 11.08         |
|                                                                                                      | BT 4-4    | 398.17          | 422.65        | 11.74         |
| NAS Parallel Benchmarks 3.3 – CG                                                                     |           |                 |               |               |
| Size: 14000<br>Iterations: 15                                                                        | CG 1-4    | 46.26           | 32.35         | 1.01          |
| Number of active processes: 32                                                                       | CG 2-4    | 48.45           | 30.89         | 0.97          |
| Number of nonzeroes per row:<br>Eigenvalue shift: .200E+02                                           | 11 CG 4-4 | 46.65           | 32.08         | 1             |

# NAS benchmarks running in COTSon (cont.)

| NAS Parallel Benchma<br>No input file inputft.da                                                                                                        | $\frac{1}{2} CPU Time = 5.5777, N = 2^{28}$<br>arks 3.3 FT Benchmark<br>ata. Using compiled defaults<br>256x 128 (Class A)                            |
|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                         | 6, Number of processes : 32                                                                                                                           |
|                                                                                                                                                         | 1x 32, Layout type : 1D                                                                                                                               |
| -                                                                                                                                                       | 32, IS Benchmark Completed                                                                                                                            |
| Number of processes:                                                                                                                                    | 32, IS Benchmark Completed<br>ze = 8388608                                                                                                            |
| Number of processes:<br>Class = A, Siz<br>Iterations = 1                                                                                                | 32, IS Benchmark Completed<br>ze = 8388608                                                                                                            |
| Number of processes:<br>Class = A, Siz<br>Iterations = 1                                                                                                | 32, IS Benchmark Completed<br>ze = 8388608<br>0<br>arks 3.3 LU Benchmark                                                                              |
| Number of processes:<br>Class = A, Siz<br>Iterations = 1<br>NAS Parallel Benchma                                                                        | 32, IS Benchmark Completed<br>ze = 8388608<br>0<br>arks 3.3 LU Benchmark<br>lass A),Iterations: 250                                                   |
| Number of processes:<br>Class = A, Siz<br>Iterations = 1<br>NAS Parallel Benchma<br>Size: 64x 64x 64 (C<br>Number of processes:<br>NAS Parallel Benchma | 32, IS Benchmark Completed<br>ze = 8388608<br>0<br>arks 3.3 LU Benchmark<br>lass A),Iterations: 250<br>32<br>arks 3.3 MG Benchmark                    |
| Number of processes:<br>Class = A, Siz<br>Iterations = 1<br>NAS Parallel Benchma<br>Size: 64x 64x 64 (Cl<br>Number of processes:                        | 32, IS Benchmark Completed<br>ze = 8388608<br>0<br>arks 3.3 LU Benchmark<br>lass A),Iterations: 250<br>32<br>arks 3.3 MG Benchmark<br>mpiled defaults |

|         | Time in | Mop/s  | Mop/s/  |
|---------|---------|--------|---------|
|         | seconds | total  | process |
|         |         | 410.53 |         |
| EP 1-4  | 4.9301  | 108.9  | 3.4     |
| EP 2-4  | 5.5777  | 96.25  | 3.01    |
| EP 4-4  | 4.9324  | 108.85 | 3.4     |
| FT 1 -4 | 187.25  | 38.11  | 1.19    |
| FT 2 -4 | 185.43  | 38.49  | 1.2     |
| FT 4 -4 | 201.85  | 35.36  | 1.1     |
| IS 1-4  | 2.59    | 32.39  | 1.01    |
| IS 2 -4 | 2.67    | 31.45  | 0.98    |
| IS 4 -4 | 2.57    | 32.64  | 1.02    |
| LU 1-4  | 188.96  | 631.33 | 19.73   |
| LU 2 -4 | 185.15  | 644.32 | 20.14   |
| LU 4 -4 | 183.93  | 648.6  | 20.27   |
| MG 1-4  | 171.15  | 22.74  | 0.71    |
| MG 2 -4 | 176.11  | 22.1   | 0.69    |

# Plurality

- Plurality: <a href="http://www.plurality.com/profile.html">http://www.plurality.com/profile.html</a>
  - Architecture: general-purpose accelerator for multicore/manycore system-on-chip (SoC)
  - Task-oriented programming model: the programmer has to perform a partitioning of the program into specific tasks (taskmap)
  - The body of each task is a traditional sequential code
  - Each core is a RISC processor
  - Scheduler, synchronization and load balancing among cores are done by a complex hardware subsystem that communicates with all the RISC processors

 ${}_{\text{TERA}^{\text{F}}\text{L}\bar{\text{U}}\text{X}}$  Uniform shared memory access

# CUDA

- Programming model: extensions to standard C language (CUDA libraries)
- DRAM memory addressing + on-chip shared memory
- However a single process must run spread across multiple disjoint memory spaces (???)
- Recursive functions are not supported (must be converted to loops)
- Bus bandwidth and latency between CPU and GPU may be a bottleneck (acceleratore esterno)

## TFlux

- TFlux:
  - Paralell processing system targeted to commodity, Linux-based shared-memory multiprocessor systems
  - Data-Driven multi-threading
  - Programming model: takes as input a C program, argumented with TFlux-specific compiler directives (#pragma's)
  - Everything implemented in software

## Subset Static Interleaved (SSI)

- Interleaved threads
  - Advantage: operations latencies become shorter in terms of executed instructions from the same thread
- Combination of blocked multithreading and static interleaved multithreading:
  - A set of background threads + a set of foreground threads. Foreground threads are interleaved until a stall occurs (e.g. cache miss). When a foreground thread stalls and a background thread is ready for execution we exchange them so that the foreground thread becomes a background thread and vice versa

TSU: task scheduling unit (in hardware)