# Datei:TU Wien-Parallel Computing VU (Träff) - Zusammenfassung Skriptum (V 0.6, 29. Juni).pdf

## Beschreibung

# Parallel Computing Zusammenfassung
Version 0.6, 29. Juni 2020

# 1 Introduction

## Performance of Processors
- *nominal processor performance* is often measured in FLOPS (maximum number of Floating Point Operations per Second)
- performance measured as clock frequency and number of instructions completable per clock cycle (e.g. number of FLOP/cycle)
- number of instructions per cycle determined by architecture
- processor with smaller number of cores is called *multi-core*
- nominal performance is then nominal performance of one core times the number of cores

## Parallel vs Distributed Computing
- *parallel computing*: efficiently utilizing *parallel* resources to solve computational problems
- deals with algorithms, their implementations and the structure of the computer architecture
- *distributed computing*: make *independent*, non-dedicated resources available to solve specific problem complexes
- *concurrent computing*: manage resources and processes that may or may not progress at the same time
- parallel computing problems often require a lot of interaction
- distributed resources may not be directly available, be distributed over a large area, may fail or change

## Sample computational problems
- matrix multiplications
- sequence merging
- graph problems
- sorting problems
- reduction problems (summing or finding maxima...)
- etc.

## The PRAM Model
- stands for *Parallel Random Access Machine*
- unrealistic model (but useful)
- processors are strictly synchronized (lockstep)
- memory may be accessed simultaneously, three variants
- *EREW* (Exclusive Read Exclusive Write), memory cant be accessed simultaneously
- *CREW* (Concurrent Read Exclusive Write), memory cant be written simultaneously
- *CRCW* (Concurrent Read Concurrent Write)
- in *Arbitrary CRCW PRAM* either simultaneous written value may be kept
- in *Priority CRCW PRAM* processors have priority, the highest priority processor writes

## Flynns Taxonomy
- differentiate machines based on instructions and data
- *SISD* (Single-Instruction, Single-Data), sequential computer
- *SIMD* (Single-Instruction, Multiple-Data), single instruction can operate on bigger amount of data (e.g. arrays), vector instructions are example of SIMD
- *MIMD* (Multiple-Instruction, Multiple-Data), general PRAM machine, each processor has own instructions and own data
- *MISD* (Multiple-Instruction, Single-Data), may be a pipline-like system where single data stream passses through different stages

## Sequential and Parallel Time
- *Seq* for sequential algorithms, *Par* for parallel
- $T_{seq}(n)$ is sequential running time in amount of steps, $T_{par}(p, n)$ is parallel time with p processors

## Speed-up
- Absolute speedup: $S_p(n)=\frac{T_{seq}(n)}{T_{par}(n)}$
- measures gain of parallel algorithm over sequential

## Linear speed-up is best possible
- parallel algorithm with $p$ cores can be simulated on single core in $pT_{par}(p, n)$
- if speed-up would be non-linear, the simulation could run faster than the best known sequential algorithm (which would make the algorithm the best known)
- in practice *super-linear speed-up* may happen when work of sequential and parallel algorithms is not the same, e.g. randomized algorithms
- also possible for search algorithms

## Cost and Work
- cost is time of p processor cores occupied with Par
- equals $pT_{par}(p, n)$
- work $W_par(p, n)$ is number of operations of algorithm
- for sequential algorithms, work is same as time
- for parallel algorithms work is total work of all processors
- algorithm is *Cost-optimal* if cost $pT_{par}(p, n)$ is $O(T_{seq}(n))$ (for a best known sequential algorithm)
- algorithm is *Work-optimal* if $W_par(p, n)$ is $O(T_{seq}(n))$ (for a best known *Seq*)
- cost optimal algorithms have linear speed-ups

## Relative Speed-up and Scalability
- relative speed-up is ratio of parallel running time with one processor to running time with p processors
- $\text{SRel}_p(n) = \frac{T_{par}(1, n)}{T_{par}(p, n)}$
- fastest possible running time is $T\infty (n)$
- parallelism: $\frac{T(1,n)}{T\infty(n)}=\frac{T(1, n)}{T(\infty, n)}$
- denotes the largest possible speed up

- parallel algorithms often perform more work than best sequential ones
- excess work is called *overhead*
- can occur through communication, synchronization or extra preprocessing
- if overheads are in bounds of sequential work $O(T_{seq}(n))$ the algorithm can still be work-optimal
- communication intervals of algorithms are called *granularity*
- rare communication: *coarse grained* algorithm
- frequent communication: *fine grained* algorithm
- processors may have different amounts of work
- if $T_{par}(i, n)$ is the time a processor $i$ takes, the *load imbalance* is the biggest difference between the times of two processors, over all processors
- *load balancing* is the problem of making sure the times of all processors are about the same
- *static load balancing* is when the amount of work is divided up front between processors
- *oblivious static load balancing* is the subdivision of the problem only by input size and structure, regardles of actual input
- *dynamic load balancing* is the exchange of work during execution of the algorithm
- problems where input can be statically distributed and no further interaction is needed are called *embarrassingly* or *trivially* or *pleasantly parallel*

## Amdahls Law
- assuming algorithm can be subdivided into a sequential fraction $s$ and a perfectly parallelizable fraction $r = (1-s)$
- maximum speed up by parallelization is $\frac{1}{s+\frac{1-s}{p}}=1/s$, if $p \rightarrow \infty$
- problems with parallel algorithms:
- input, output
- sequential preprocessing
- sequential data structures
- operations that have to be performed sequentially
- good algorithms have the parallel part not be a constant fraction but decrease with n

## Efficiency and Weak Scaling
- *scaled speed up* is when the faster $\frac{T(n)}{t(n)}$ converges, the faster the speed up converges
- with $T(n)$ as the parallelizable term and $t(n)=T\infty(n)$ as the non parallelizable term
- efficiency is the comparison of *Par* against the best possible parallelization of *Seq*
- parallel efficiency: $E_p(n) = \frac{T_{seq}(n)}{pT_{par(p, n)}}=\frac{S_p(n)}{p}$
- linear speed up if $E_p(n)=e$ for some $e$
- iso-efficiency: when efficiency is not constant (constant means it does not change with n) then we have to calculate n as a function of p
- do this by calculating e as a function of n and p and then changing the formula for n
- *Weak Scalability* is when you want to achieve an efficiency $E_p(n) = e$ there is a function $f(p)$ so that n is in $\Omega(f(p))$
- so the efficiency varies with the number of processors if the problem size remains constant
- so a problem may be more efficiently solved with fewer processors
- other definition: if the average work per processor $T_{seq}(n)/p$ is constant at $w$ the running time of the algorithm is constant at $T_{par}(p, n)$
- $f(p)$ is called the iso-efficiency function
- says how n should grow as a function of p, so efficiency remains constant

## Scalability Analysis
- *Strong Scaling* analysis, n is constant, if run time decreases proportionally to p (linear speed up) the algorithm is *strongly scalable*
- *Weak Scaling* analysis, work per processor is constant (through bigger n), if parallel running time is also constant algorithm is *weakly scalable*

### [rest skipped]

# 2 Shared Memory

## Caches
- three different types:
- directly mapped: each block mapped to one distinct position in cache
- fully associative: each block can go anywhere
- (k-way) set associative: like directly mapped cache but space for multiple blocks per position
- chache misses
- cold (compulsory) miss: no data in the cache line where you would look for it (i think)
- capacity miss: all lines are full, one line has to be evicted (only fully associative)
- conflict miss: already element in line (only directly and set associative)
- replacement strategies
- LRU: least recently used
- LFU: least frequently used
- on write either:
- block already in cache: block is updated
- block not in cache: block is allocated (called write-allocate) (may create conflict miss) or memory is written directly (write non-allocate)
- if block is updated:
- write-through cache: block updated and value is written directly to memory too
- write back: written to memory when line is evicted
- locality of access (for applications and algorithms)
- temporal locality: memory address is reused frequently (will not be evicted)
- spatial locality: addresses of the same block are also used

## Matrix-Matrix Multiplication and Cache Performance
- $n\times l$ Matrix $A$ multiplied with $l\times m$ Matrix $B$
- sequential algorithm in $O(nml)$ or $O(n^3)$ (for squares)
- three loops (for $n, m, l$) can be parallelized, order of parallelization matters
- differences because matrices are accessable in row order
- different index order causes more or fewer cache misses

## Recursive Matrix-Matrix Multiplication Algorithm
- recursively split $A$ and $B$ in smaller submatrices until very small then iterative solution

## Blocked Matrix-Matrix Multiplication
- split matrices into blocks at start
- then multiply with 6 nested loops
- block size can be chosen depending on cache size
- called *cache-aware algorithm* (opposite is *cache-oblivious*)

## Multi-core Caches
- cache system has several dimensions
- L1 (lowest level) to L3 (or more)
- L1 is closest to processor, smallest (few KB) and fastest
- L3 is *Last Level Cache* LLC, several MB big
- L1 has data and instruction cache
- *Translation Lookaside Buffer* for memory management (pages)
- lowest level caches are private (only for one processor)
- higher level caches are shared
- *cache coherence problem* when one block is updated in one cache and also stored in another cache of another processor
- *coherent* cache system if line of other cache will be updated at some point in time (can also just mean it is invalidated in the cache)
- *non coherent* if it will never be updated
- problem solved through *cache coherence protocol*
- may cause a lot of *cache coherence traffic*
- *false sharing* is when two caches have the same blocks of memory stored
- update to one adress of block will cause the whole line of other to be replaced

## The Memory System
- cache system part of *memory hierarchy*
- from fast low levels (caches) to slower bigger higher levels
- *write buffer* for memory writes (FIFO)
- for multi-core CPUs not every processor has direct connection to main memory
- instead connected to *memory controller*
- some processors closer to memory controller than others $\rightarrow$ address access times are not uniform

## Super-linear Speed-up through the Memory System
- through memory hierarchy of large caches, working sets of each processor get smaller and eventually can fit into the fastest caches

## Application Performance and the Memory Hierarchy
- *memory-bound*: when execution (and associated reading and writing from memory) of instructions is faster than reading or writing the instruction in memory
- *compute-bound*: other way around, intructions are read faster than they are executed

## Memory Consistency
- when process 1 sets a certain flag in memory and process 2 checks if that flag is set it may happen that the flag is still in the write buffer and process 2 may read a wrong value
- frameworks like pthreads or OpenMP help

# $\texttt{pthreads}$
- *thread* is smallest execution unit that can be scheduled

- are symmetric peers, thread can wait for any other thread to complete
- have same program (SPMD) but can have different traces (MIMD)
- are scheduled by the OS
- no implicit synchronisation of threads, they progress independently from each other
- share the same memory
- constructs for coordination and synchronization are provided

## pthreads in C
- use #include <pthread.h>

## Race Conditions
- when two threads update variable at same time, outcome may be write of either thread
- result is *non-deterministic*
- special race condition is called *data race*
- two ore more threads access shared memory, one of accesses is a write

## Critical Sections, MutEx, Locks
- lock is programming model to guarantee mutual exclusion on a critical section
- threads try to acquire lock, if granted they enter critical section, release lock when done
- threads are blocked till lock is acquired
- locks have to be deadlock free
- lock is starvation free if thread will not be starved
- threads at lock will *serialize*, one thread after another will pass critical section
- locks where many threads are competing are called *contended*
- *try-locks* allow execution of some code when lock could not be acquired
- in *spin-locks* threads test by busy waiting
- in *blocking-locks* waiting threads are blocked and freed by the OS
- condition variables are associated with mutex
- thread waits on variable till a *signal* occurs
- concepts with conditions with signals and waits are called *monitor*
- *barriers* are similar to mutex with conditions
- *concurrent initialization* is where first thread executes initialization code before other threads begin

## Locks in data structures
- trivially make data structures work with parallel algorithms by using a single lock for all operations on structures
- other way is to construct *concurrent data structures*

## Problems with Locks
- deadlocks can happen easily with locks
- can also happen when the same thread tries to acquire same lock again
- locks around long critical section cause harmful serialization

## Atomic Operations
- atomic instruction carry out instructions that cannot be interfered with by other threads
- a = a+27 can be implemented as atomic instruction through fetch-and-add
- instructions are *wait-free* because they always execute in $O(1)$
- instructions are *lock-free* if any thread will be able to execute instruction in a bounded amount of time (not always given)

# OpenMP

## Programming Model
- all threads execute same program (SPMD)
- each thread has uníque id
- more threads than processors are possible
- shared and private variables in threads are possible
- synchronization constructs to prevent race conditions

## OpenMP in C
- has to be compiled in gcc with -fopenmp
- has to be included with #include <omp.h>

## Parallel Regions
- forking starts at parallel region
- designated by #pragma omp parallel [...]
- number of threads in parallel region cannot be changed after start
- thread number set by the runtime environment or library call or num_threads() pragma

## Library Calls
- omp_get_thread_num(void) returns thread id
- omp_get_num_threads(void) returns number of threads
- omp_get_max_threads(void) returns maximum possible threads
- omp_set_num_threads(int num_threads) sets maximum

- omp_get_wtime(void) returns wall clock time in seconds
- omp_get_wtick(void) returns tick resolution of timer

## Sharing variables
- default is all vairables before region are shared, all variables declared in region are private
- sharing of variables set by clause in pragma
- private(vars...) makes uninitialized copies of variables
- firstprivate(varis...) makes copies and initializes value to value before parallel region
- shared(vars...) declares variables as globally shared
- default(shared|none) makes variables shared or not shared by default
- use like #pragma omp parallel private(a, b, c) shared(d) default(none)
- good practice to set no variables to shared per default (none)
- shared variables make *data races* possible

## Work Sharing: Master and Single
#### #pragma omp master
- declares statement to be executed by master thread only (thread id of 0)
- other threads will simply skip

#### #pragma omp single
- statement is executed by either one of the threads
- other threads will wait at end of statement until all threads have reached end
- can be eliminated by nowait clause after single
- might cause race conditions
- allows making variables private or firstprivate (unlike master construct)

## explicit Barrier
#### #pragma omp barrier

## Sections
- code is split into small pieces that can be executed in parallel by available threads

##### #pragma omp sections
- outer structure
- end of block is implicit barriers
- nowait possible
- variables can be designated (first)private

#### #pragma omp section
- marks an independent code section
- which thread executes which section is designated by runtime system
- best case: each thread executes a section, all threads run in parallel

## Loops of Independent Iterations
- *loop scheduling* is assignment of iteration blocks to threads
- each iteration has to be executed exactly once
- data races have to be avoided

#### #pragma omp for [clauses...] for (loop range...)
- all threads must have same start and end values for i and same step
- loop ranges have to be finite and determined (no while loops possible)
- end condition has to be in the form of: i<n, i<=n, i>n, i>=n, i!=n, with n as an expression or value
- steps have to be in form of i++, i--, i+=inc, i=i+inc, i-=inc, i=i-inc
- these kinds of loops are in *canonical form*

#### #pragma omp parallel for [clauses...]
- shorthand for parallel region with one loop
- next statement is for loop
- always barrier after

## Loop Scheduling
- how is each thread assigned to each iteration
- loop range is divided in consecutive chunks
- static schedule: chunks have same size and are assigned like round-robin (each thread gets chunk one after another)
- computation of chunk assignments is very fast (low overhead)
- dynamic: each thread dynamically grabs next chunk it can
- guided: dynamic assignment but chunk size changes, determined by number of iterations divided by threads
- set like schedule(static|dynamic|guided[, chunksize])
- for guided chunksize is minimum size
- if no size given then default size is used
- also possible schedule(auto|runtime)
- runtime: scheduling is set at runtime
- auto: OpenMP compiler determines

## Collapsing nested loops
- transform multiple loops into one

#### #pragma omp parallel for collapse(depth) [clauses...]
- depth is the amount of nested loops to be parallelized

## Reductions
#### #pragma omp for reduction(operator: variables)
- is clause of for
- operators are + - * & | ^ && || min max

#### #pragma omp task [clauses...]
- task is like a function call
- completion of task may not happen immediately
- at latest when completion is requested (e.g. at end of parallel region)
- if those variables are changed in the task data races may occur

#### #pragma omp taskwait [depend(...)]
- wait for completion of generated tasks
- only dependency clause allowed

## Mutual Exclusion Constructs
#### pragma omp critical [(name)]
- threads will wait and only one thread will execute code of critical section
- section can have name
- shared variables can be updated by task in critical region

#### #pragma omp atomic [read|write|update|capture]
- for simple critical sections
- allow fetch and add (FAA) atomic operations
- update is x++, x = x op ..., ...
- capture is y = x++, ...

## Locks
- locks don't have condition variables
- recursive (nested) locks are possible in OMP

## Special Loops
- OMP can try to utilize vector operations (SIMD)

#### #pragma omp simd [clauses...]
- followed by canonical for loop
- one thread executes loop but with SIMD instructions

#### #pragma omp for simd [clauses...]
- followed by canonical for loop
- loop executed by multiple threads
- each chunk executed with SIMD instructions

#### #pragma omp parallel for simd [clauses...]
- same as previous but with parallel region declared

#### #pragma omp taskloop [clauses...]
- recursively break iteration range into smaller ranges
- smaller ranges are then executed as tasks
- should be initiated by a single thread
- size of range can be adjusted by grainsize()

## Loops with Hopeless Dependencies
#### #pragma omp ordered
- loops with dependency patterns that cannot be handled normally can be marked as ordered
- only one possible ordered block in a parallel loop
- other parts of iteration can still be performed in parallel

## Cilk
- OMP inspired by Cilk
- since 2018 not supported anymore
- supports constructs cilk_spawn (like task), cilk_sync (like taskwait) and cilk_for (like taskloop)
- executes threads through *work-stealing algorithm*

# 3 Distributed Memory Parallel Systems and MPI

## Network Properties: Structure and Topology
- distributed memory introduces interconnection network (or *interconnect*)
- entities (cores, multi-core CPUs or larger systems) are connected through *links* (in any form)
- some entities are simply switches
- interconnect where processors are also communication elements (and without switches) is called *direct network*
- interconnect with switches is called *indirect network*
- *topology* of network can be modeled as an unweighted graph
- each vertice is an network communication element
- each arc denotes a direct link between two elements
- graphs are mostly undirected
- $\text{diam}(G)$ graph *diameter* is longest shortest path between two nodes
- is a lower bound on communication steps between two elements
- $\text{degree}(G)$ graph *degree* is maximum degree of all nodes
- $\text{bisec}(G)$ *bisection width* is minimum number of edges to remove so that graph is partitioned into two equally large subsets
- worst possible networks are *linear processor array* and *processor ring*
- *tree networks* are binary or k-ary trees
- have $\text{bisec}(T)=1$
- *d-dimensional mesh networks*
- processors are identified by their integer vectors
- special case is *torus network* where edges wrap around
- *hypercube network* is special case of torus network
- modern systems often built as torus networks with 3 to 6 dimensions, called *multi-stage networks*

## Communication algorithms in networks
- *unidirectional* communication if only one direction
- *bidirectional* if in both directions can be communicated
- most modern systems
- broadcast problem: how to transmit data from root to every other node in minimal communication steps
- lower bound is $\lceil\log_{k+1}(p)\rceil$ in $k$-degree, network with $p$ nodes
- algorithm to solve partitions network in $k+1$ smaller networks, root sends data to "virtual roots" of these networks, solve problem for smaller subnetworks

## Communication costs
- linear transmission cost assumes time of $\alpha+\beta m$, where $\alpha$ is start up latency and $\beta$ is time per unit of data

## Routing and Switching
- if network is not fully connected, routs make path between two nodes possible

## Hierarchical, Distributed Memory Systems
- communication networks with different levels
- processors may have different characteristics and thus live on different compute nodes

## Programming Models
- concrete network properties are abstracted
- model of fully connected network
- processes are not synchronized and communicate with others through explicit or implicit transmission
- message transmission is assumed to be deadlock free and correct
- distributed system programming models either:
- *data distribution* centric: data structures distributed according to rules, if one processes changes data structure of another process then through communication
- *communication* centric: focuses on explicit communication and synchronisation instead of data structures (MPI)

## Message-passing Interface (MPI)
- message-passing programming model is to structure parallel processes through sending and receiving messages
- proccesses are *Communicating Sequential Processes*
- cannot have data races
- MPI characteristics:
- finite processes
- each process identified by rank in domain
- more than one domain possible
- data is local
- communication is reliable

## MPI in C
- header #include <mpi.h>
- functions in MPI_ "name space" (illegal and punishable by law to use prefix for own functions)
- MPI_SUCCESS return value for success (obviously)
- mpicc compiler to compile programs

### Initializing MPI
- MPI_Init to initialize
- end with MPI_Finalize
- MPI_Abort forces termination

### Error Checking
- MPI only has limited error checking because it is expensive

### Communicators
- MPI_Comm_size to get amount of processes in communicator
- MPI_Comm_rank to get rank in communicator
- all processes are in communication domain MPI_COMM_WORLD
- data type MPI_COMM
- MPI_Comm_split splits processes of communicator into smaller groups with own communicators
- MPI_Comm_create also creates new communicators
- both functions are *collective* (have to be called by all processes)
- MPI_Comm_free to free communicators

### Organizing Processes
- communicator with grid structure is called *Cartesian communicator*
- created through MPI_Cart_create
- reorder flag to reorder ranks so neighboring processes are close on grid
- MPI_Cart_coords translate rank to coordinate vector
- MPI_Cart_rank vice versa

### Objects and Handles
- a *distributed object* is an object for which all processes that reference it can access it
- distributed objects
- MPI_Comm for communicator object
- MPI_Win for cummincation windows
- local objects
- MPI_Datatype for local layout and structure of data
- MPI_Group for ordered sets of processes
- MPI_Status for communication
- MPI_Request for open (not yet completed) communication
- MPI_Op for binary operators
- MPI_Info for additional info when creating objects

### Process Groups
- objects of type MP_Group
- groups used locally by processes to order processes
- MPI_Comm_group returns an ordered group from a communicator
- MPI_Group_(size|rank) like MPI_Comm_(size|rank)
- set-like operations on groups possible, MPI_Group_(union|intersection|difference|incl|excl...)

### Point-to-point Communication
- MPI_Send to send data from process to dest
- MPI_Recv to receive data from source
- combined operations MPI_Sendrecv[_replace] have argument for source and dest
- MPI_Get_(count|elements) to figure how much data was sent

### Semantic terms
- *blocking* operations when function call returns when operation has been locally completed (irregardless of success of if data has been sent out etc.)
- *non-blocking* when function call returns immediately (specified by capital I in function name, e.g. MPI_Irecv)

### Specifying Data
- buffer specifies starting adress
- count specifies number of elements
- datatype specifies MPI type
- e.g. MPI_(CHAR|INT|LONG|FLOAT|DOUBLE)

### One-sided Communication
- one process alone initiates communication
- also specifies actions on both ends
- involved processes are called *origin* and *target*

### Collective Communication
- processes communicating collectively
- MPI_Barrier blocks until all have reached routine
- MPI_Bcast to send data from root to all nodes
- MPI_Scatter distributes data from process root evenly to all other
- MPI_Scatterv if data is not distributed evenly
- MPI_Gather gathers data from all other processes in root
- MPI_Gatherv if not receiving the same number of elements from each process
- MPI_Allgather like Gather but all processes have all data
- MPI_Allgatherv similar to MPI_Gatherv
- MPI_Reduce reduction of data in root with operator MPI_Op op
- operations are assumed to be associative and commutative
- MPI_Allreduce like reduce but all processes receive reduction
- MPI_Reduce_Scatter first reduce vector in processes, then scatter vector in segments across processes
- MPI_Alltoall (could be named Allscatter) every process scatters their data to all other
- MPI_Scan performs a reduction over a group, where each subsequent process stores the result of the reduction of the current value and the previous values
- e.g. Scan with SUM reduction
processes with values in brackets: 1(1), 2(2), 3(3), 4(4)
after Scan: 1(1), 2(3), 3(6), 4(10)
- MPI_Exscan like Scan but value of process is not used in reduction
- same example with Exscan: 1(0), 2(1), 3(3), 4(6)


## File history

Click on a date/time to view the file as it appeared at that time.

Date/TimeDimensionsUserComment
current12:02, 13 July 2020 (273 KB)Richie (talk | contribs)
revert11:58, 13 July 2020 (273 KB)0.0.0.0 (talk)# Parallel Computing # 1 Introduction ## Performance of Processors - *nominal processor performance* is often measured in FLOPS (maximum number of Floating Point Operations per Second) - performance measured as clock frequency and number of instructions completable per clock cycle (e.g. number of FLOP/cycle) - number of instructions per cycle determined by architecture - processor with smaller number of cores is called *multi-core* - nominal performance is then nominal performance of one co…

There are no pages that use this file.