TU Wien:Parallel Computing VU (Träff)/Zusammenfassung

From VoWi
Jump to navigation Jump to search


FLoating point OPerations per Second
Parallel Computing
Efficiently utilizing dedicated parallel resources for solving given computation problems.
Distributed Computing
Making independent, non-dedicated resources available and cooperate toward solving specified problem complexes.
Concurrent Computing
Managing and reasoning about interacting processes that may or may not progress simultaneously.
Random Access Machine
Parallel Random Access Machine
  • Assumes as large as needed memory
  • Processors can read/write into memory in unit time
  • Number of processors is variable
  • Processors execute their own program
  • Processors are strictly synchronized (operating in lock-step)

The PRAM model defines what happens when multiple processors access the same memory word in the same step:

  • EREW (Exclusive Read Exclusive Write)
  • CREW (Concurrent Read Exclusive Write)
  • CRCW (Concurrent Read Concurrent Write)
Theorem 1
The maximum of n numbers stored in an array can be found in parallel time steps, using processors (and performing operations) on a Common CRCW PRAM.
Theorem 2
The maximum of n numbers stored in an array can be found in parallel time steps, using a maximum of n/2 processors (but performing only operations) on a CREW PRAM.
Theorem 3
Two n × n matrices can be multiplied into a matrix C in time steps and operations on an EREW PRAM.

Flynn's taxonomy looks at the instruction stream(s) and the data stream(s) of the system:

  • SISD (Single-Instruction, Single-Data)
  • SIMD (Single-Instruction, Multiple-Data)
  • MIMD (Multiple-Instruction, Multiple-Data)
  • MISD (Multiple-Instruction, Single-Data)

Performance and Objectives

absolute speed-up
The absolute speed-up of parallel algorithm Par over sequential algorithm Seq for input of size O(n) on p processor-cores is the ratio of sequential to parallel running time, i.e.,
relative speed-up
The relative speed-up of a parallel algorithm Par is the ratio of the running time with one processor-core to the parallel running time with p processor-cores, i.e.,
Amdahl's Law
Assume that the work performed by sequential algorithm Seq can be divided into a strictly sequential fraction s, 0 < s ≤ 1, independent of n, that cannot be parallelized at all, and a fraction r = ( 1 − s ) that can be perfectly parallelized. The parallelized algorithm is Par. The maximum speed-up that can be achieved by Par over Seq is 1/s.
strongly scalable
The parallel time decreases proportionally to p (linear speed-up), while keeping n constant.
weakly scalable
The parallel running time remains constant, while keeping the average work per processor constant and increasing n.

Patterns and Paradigms

Work Law
Depth Law

Merging and Prefix Sums

Shared Memory

A naive, parallel shared-memory system model:

  • A (fixed) number of processor-cores p connected to a large (but finite) shared memory.
  • Every core can read/write every location in memory, but memory access is significantly more expensive than performing operations in the processor-core.
  • memory accesses are not uniform, from each processors point of view some locations can be accessed (much) faster than other locations.
  • Processors are not synchronized.

All these assumptions are in stark contrast to those made for the idealized PRAM.


  • three different types:
    • directly mapped: each block mapped to one distinct position in cache
    • fully associative: each block can go anywhere
    • (k-way) set associative: like directly mapped cache but space for multiple blocks per position
  • chache misses
    • cold (compulsory) miss: no data in the cache line where you would look for it (i think)
    • capacity miss: all lines are full, one line has to be evicted (only fully associative)
    • conflict miss: already element in line (only directly and set associative)
  • replacement strategies
    • LRU: least recently used
    • LFU: least frequently used
  • on write either:
    • block already in cache: block is updated
    • block not in cache: block is allocated (called write-allocate) (may create conflict miss) or memory is written directly (write non-allocate)
  • if block is updated:
    • write-through cache: block updated and value is written directly to memory too
    • write back: written to memory when line is evicted
  • locality of access (for applications and algorithms)
    • temporal locality: memory address is reused frequently (will not be evicted)
    • spatial locality: addresses of the same block are also used

Matrix-Matrix Multiplication and Cache Performance

  • Matrix multiplied with Matrix
  • sequential algorithm in or (for squares)
  • three loops (for ) can be parallelized, order of parallelization matters
    • differences because matrices are accessable in row order
    • different index order causes more or fewer cache misses

Recursive Matrix-Matrix Multiplication Algorithm

  • recursively split and in smaller submatrices until very small then iterative solution

Blocked Matrix-Matrix Multiplication

  • split matrices into blocks at start
  • then multiply with 6 nested loops
  • block size can be chosen depending on cache size
    • called cache-aware algorithm (opposite is cache-oblivious)

Multi-core Caches

  • cache system has several dimensions
  • L1 (lowest level) to L3 (or more)
    • L1 is closest to processor, smallest (few KB) and fastest
    • L3 is Last Level Cache LLC, several MB big
    • L1 has data and instruction cache
  • Translation Lookaside Buffer for memory management (pages)
  • lowest level caches are private (only for one processor)
  • higher level caches are shared
  • cache coherence problem when one block is updated in one cache and also stored in another cache of another processor
    • coherent cache system if line of other cache will be updated at some point in time (can also just mean it is invalidated in the cache)
    • non coherent if it will never be updated
  • problem solved through cache coherence protocol
    • may cause a lot of cache coherence traffic
  • false sharing is when two caches have the same blocks of memory stored
    • update to one adress of block will cause the whole line of other to be replaced

The Memory System

  • cache system part of memory hierarchy
    • from fast low levels (caches) to slower bigger higher levels
  • write buffer for memory writes (FIFO)
  • for multi-core CPUs not every processor has direct connection to main memory
    • instead connected to memory controller
  • some processors closer to memory controller than others address access times are not uniform

Super-linear Speed-up through the Memory System

  • through memory hierarchy of large caches, working sets of each processor get smaller and eventually can fit into the fastest caches

Application Performance and the Memory Hierarchy

  • memory-bound: when execution (and associated reading and writing from memory) of instructions is faster than reading or writing the instruction in memory
  • compute-bound: other way around, intructions are read faster than they are executed

Memory Consistency

  • when process 1 sets a certain flag in memory and process 2 checks if that flag is set it may happen that the flag is still in the write buffer and process 2 may read a wrong value
  • frameworks like pthreads or OpenMP help


A thread is the smallest execution unit that can be scheduled.

The main characteristics of the pthreads programming model are:

  1. Fork-join parallelism. A thread can spawn any number of new threads (up to system limitations) and wait for completion of threads. Threads are referenced by thread identifiers.
  2. Threads are symmetric peers, any threads can wait for completion of any other thread via the thread identifier.
  3. One initial main thread
  4. SPMD but can be MIMD
  5. Threads are scheduled by the OS
  6. No implicit synchronization
  7. Threads in a process share global information
  8. Synchronization and updates to shared objects is done with coordination constructs

pthreads in C

  • use #include <pthread.h>

Race Conditions

  • when two threads update variable at same time, outcome may be write of either thread
    • result is non-deterministic
  • special race condition is called data race
    • two ore more threads access shared memory, one of accesses is a write

Critical Sections, MutEx, Locks

  • lock is programming model to guarantee mutual exclusion on a critical section
  • threads try to acquire lock, if granted they enter critical section, release lock when done
    • threads are blocked till lock is acquired
  • locks have to be deadlock free
  • lock is starvation free if thread will not be starved
  • threads at lock will serialize, one thread after another will pass critical section
  • locks where many threads are competing are called contended
  • try-locks allow execution of some code when lock could not be acquired
  • in spin-locks threads test by busy waiting
  • in blocking-locks waiting threads are blocked and freed by the OS
  • condition variables are associated with mutex
    • thread waits on variable till a signal occurs
    • a thread can signal one (maybe arbitrary) thread only
    • a broadcast signals all threads at once
  • concepts with conditions with signals and waits are called monitor
  • barriers are similar to mutex with conditions
  • concurrent initialization is where first thread executes initialization code before other threads begin

Locks in data structures

  • trivially make data structures work with parallel algorithms by using a single lock for all operations on structures
  • other way is to construct concurrent data structures

Problems with Locks

  • deadlocks can happen easily with locks
    • can also happen when the same thread tries to acquire same lock again
  • locks around long critical section cause harmful serialization
  • threads crashing during critical section also cause deadlocks
  • threads can starve
  • priority threads and locks can lead to lower priority threads preventing higher priority threads from continuing

Atomic Operations

  • atomic instruction carry out instructions that cannot be interfered with by other threads
  • a = a+27 can be implemented as atomic instruction through fetch-and-add
  • crashing threads during atomic instructions will not cause deadlocks
  • instructions are wait-free because they always execute in
  • instructions are lock-free if any thread will be able to execute instruction in a bounded amount of time (not always given)


The main characteristics of the OpenMP programming model are:

  1. Parallelism is (mostly) implicit through work sharing. All threads execute the same program (SPMD).
  2. Fork-join parallelism: Master thread implicitly spawns threads through parallel region construct, threads join at the end of parallel region.
  3. Each thread in a parallel region have a unique integer thread id, and threads are consecutively numbered from0to the number of threads minus one in the region.
  4. The number of threads can exceed number of the processors/cores. Threads intended to be executed in parallel by available cores/processors.
  5. Constructs for sharing work across threads. Work is expressed as loops of independent iterations and task graphs.
  6. Threads can share variables; shared variables are shared among all threads. Threads can have private variables.
  7. Unprotected, parallel updates of shared variables lead to data races, and are erroneous.
  8. Synchronization constructs for preventing race conditions.
  9. Memory is in a consistent state after synchronization operations.

OpenMP in C

  • has to be compiled in gcc with -fopenmp
  • has to be included with #include <omp.h>

Parallel Regions

  • forking starts at parallel region
  • designated by #pragma omp parallel [...]
  • number of threads in parallel region cannot be changed after start
  • thread number set by the runtime environment or library call or num_threads() pragma

Library Calls

  • omp_get_thread_num(void) returns thread id
  • omp_get_num_threads(void) returns number of threads
  • omp_get_max_threads(void) returns maximum possible threads
  • omp_set_num_threads(int num_threads) sets maximum
  • omp_get_wtime(void) returns wall clock time in seconds
  • omp_get_wtick(void) returns tick resolution of timer

Sharing variables

  • default is all vairables before region are shared, all variables declared in region are private
  • sharing of variables set by clause in pragma
    • private(vars...) makes uninitialized copies of variables
    • firstprivate(varis...) makes copies and initializes value to value before parallel region
    • shared(vars...) declares variables as globally shared
    • default(shared|none) makes variables shared or not shared by default
  • use like #pragma omp parallel private(a, b, c) shared(d) default(none)
  • good practice to set no variables to shared per default (none)
  • shared variables make data races possible

Work Sharing: Master and Single

#pragma omp master
  • declares statement to be executed by master thread only (thread id of 0)
    • other threads will simply skip
#pragma omp single
  • statement is executed by either one of the threads
    • other threads will wait at end of statement until all threads have reached end
    • can be eliminated by nowait clause after single
      • might cause race conditions
    • allows making variables private or firstprivate (unlike master construct)

explicit Barrier

#pragma omp barrier
  • no thread can continue until all threads have reached barrier


  • code is split into small pieces that can be executed in parallel by available threads
#pragma omp sections
  • outer structure
  • end of block is implicit barriers
    • nowait possible
  • variables can be designated (first)private
#pragma omp section
  • marks an independent code section
  • which thread executes which section is designated by runtime system
  • best case: each thread executes a section, all threads run in parallel

Loops of Independent Iterations

  • loop scheduling is assignment of iteration blocks to threads
  • each iteration has to be executed exactly once
  • data races have to be avoided
#pragma omp for [clauses...] for (loop range...)
  • all threads must have same start and end values for i and same step
  • loop ranges have to be finite and determined (no while loops possible)
  • end condition has to be in the form of: i<n, i<=n, i>n, i>=n, i!=n, with n as an expression or value
  • steps have to be in form of i++, i--, i+=inc, i=i+inc, i-=inc, i=i-inc
  • these kinds of loops are in canonical form
#pragma omp parallel for [clauses...]
  • shorthand for parallel region with one loop
  • next statement is for loop
  • always barrier after

Loop Scheduling

  • how is each thread assigned to each iteration
  • loop range is divided in consecutive chunks
  • static schedule: chunks have same size and are assigned like round-robin (each thread gets chunk one after another)
    • computation of chunk assignments is very fast (low overhead)
  • dynamic: each thread dynamically grabs next chunk it can
  • guided: dynamic assignment but chunk size changes, determined by number of iterations divided by threads
  • set like schedule(static|dynamic|guided[, chunksize])
    • for guided chunksize is minimum size
    • if no size given then default size is used
  • also possible schedule(auto|runtime)
    • runtime: scheduling is set at runtime
    • auto: OpenMP compiler determines

Collapsing nested loops

  • transform multiple loops into one
#pragma omp parallel for collapse(depth) [clauses...]
  • depth is the amount of nested loops to be parallelized


#pragma omp for reduction(operator: variables)
  • is clause of for
  • operators are `+ - * & | ^ && || min max`

Tasks and Task Graphs

#pragma omp task [clauses...]
  • task is like a function call
  • completion of task may not happen immediately
    • at latest when completion is requested (e.g. at end of parallel region)
  • tasks can access any variables of the thread
    • if those variables are changed in the task data races may occur
  • task can be designated final and thus not generate any additional tasks
#pragma omp taskwait [depend(...)]
  • wait for completion of generated tasks
  • only dependency clause allowed

Mutual Exclusion Constructs

pragma omp critical [(name)]
  • threads will wait and only one thread will execute code of critical section
  • section can have name
  • shared variables can be updated by task in critical region
#pragma omp atomic [read|write|update|capture]
  • for simple critical sections
  • allow fetch and add (FAA) atomic operations
  • update is x++, x = x op ..., ...
  • capture is y = x++, ...


  • locks don’t have condition variables
  • recursive (nested) locks are possible in OMP

Special Loops

  • OMP can try to utilize vector operations (SIMD)
#pragma omp simd [clauses...]
  • followed by canonical for loop
  • one thread executes loop but with SIMD instructions
#pragma omp for simd [clauses...]
  • followed by canonical for loop
  • loop executed by multiple threads
  • each chunk executed with SIMD instructions
#pragma omp parallel for simd [clauses...]
  • same as previous but with parallel region declared
#pragma omp taskloop [clauses...]
  • recursively break iteration range into smaller ranges
  • smaller ranges are then executed as tasks
  • should be initiated by a single thread
  • size of range can be adjusted by grainsize()

Loops with Hopeless Dependencies

#pragma omp ordered
  • loops with dependency patterns that cannot be handled normally can be marked as ordered
  • only one possible ordered block in a parallel loop
  • other parts of iteration can still be performed in parallel


  • OMP inspired by Cilk
  • since 2018 not supported anymore
  • supports constructs cilk_spawn (like task), cilk_sync (like taskwait) and cilk_for (like taskloop)
  • executes threads through work-stealing algorithm
    • each thread has local task queue
    • when threads run out of local tasks, they steal tasks from other threads until there are no more tasks to steal

Distributed Memory Systems

Network Properties: Structure and Topology

  • distributed memory introduces interconnection network (or interconnect)
  • entities (cores, multi-core CPUs or larger systems) are connected through links (in any form)
    • some entities are simply switches
  • interconnect where processors are also communication elements (and without switches) is called direct network
  • interconnect with switches is called indirect network
  • topology of network can be modeled as an unweighted graph
    • each vertice is an network communication element
    • each arc denotes a direct link between two elements
  • graphs are mostly undirected
  • graph diameter is longest shortest path between two nodes
    • is a lower bound on communication steps between two elements
  • graph degree is maximum degree of all nodes
  • bisection width is minimum number of edges to remove so that graph is partitioned into two equally large subsets
  • worst possible networks are linear processor array and processor ring
  • tree networks are binary or k-ary trees
    • have
  • d-dimensional mesh networks
    • processors are identified by their integer vectors
    • special case is torus network where edges wrap around
  • hypercube network is special case of torus network
  • modern systems often built as torus networks with 3 to 6 dimensions, called multi-stage networks

Communication algorithms in networks

  • unidirectional communication if only one direction
  • bidirectional if in both directions can be communicated
    • most modern systems
  • broadcast problem: how to transmit data from root to every other node in minimal communication steps
    • lower bound is in -degree, network with nodes
    • algorithm to solve partitions network in smaller networks, root sends data to “virtual roots” of these networks, solve problem for smaller subnetworks

Communication costs

  • linear transmission cost assumes time of , where is start up latency and is time per unit of data

Routing and Switching

  • if network is not fully connected, routs make path between two nodes possible

Hierarchical, Distributed Memory Systems

  • communication networks with different levels
  • processors may have different characteristics and thus live on different compute nodes

Programming Models

  • concrete network properties are abstracted
    • model of fully connected network
  • processes are not synchronized and communicate with others through explicit or implicit transmission
    • message transmission is assumed to be deadlock free and correct
  • distributed system programming models either:
    • data distribution centric: data structures distributed according to rules, if one processes changes data structure of another process then through communication
    • communication centric: focuses on explicit communication and synchronisation instead of data structures (MPI)

Message-Passing Interface

  • message-passing programming model is to structure parallel processes through sending and receiving messages
  • processes are Communicating Sequential Processes (CSP). CSP programs cannot have data races.

The main characteristics of the MPI message-passing programming model are:

  1. Finite sets of processes (in communication domains) that can communicate.
  2. Processes identified by rank in communication domain.
  3. Ranks successive ; p number of processes in domain (size).
  4. More than one communication domain possible; created relative to default domain of all started processes. A process can belong to several communication domains.
  5. Processes operate on local data, all communication explicit.
  6. Communication reliable and ordered.
  7. Network oblivious
  8. Three basic communication models:
    1. Point-to-point communication, different modes, non-local and local completion semantics
    2. One-sided communication, different synchronization mechanisms, local completion mechanisms
    3. Collective operations, non-local (and local) completion semantics.
  9. Structure of communicated data orthogonal to communication model and mode.
  10. Communication domains may reflect physical topology.

The rest of this section is about MPI in C:

  • header #include <mpi.h>
  • functions in MPI_ “name space” (illegal and punishable by law to use prefix for own functions)
  • MPI_SUCCESS return value for success (obviously)
  • mpicc compiler to compile programs

Initializing MPI

  • MPI_Init to initialize
  • end with MPI_Finalize
    • MPI_Abort forces termination

Error Checking

  • MPI only has limited error checking because it is expensive


  • MPI_Comm_size to get amount of processes in communicator
  • MPI_Comm_rank to get rank in communicator
  • all processes are in communication domain MPI_COMM_WORLD
    • data type MPI_COMM
  • MPI_Comm_split splits processes of communicator into smaller groups with own communicators
  • MPI_Comm_create also creates new communicators
  • both functions are collective (have to be called by all processes)
  • MPI_Comm_free to free communicators

Organizing Processes

  • communicator with grid structure is called Cartesian communicator
  • created through MPI_Cart_create
    • reorder flag to reorder ranks so neighboring processes are close on grid
  • MPI_Cart_coords translate rank to coordinate vector
    • MPI_Cart_rank vice versa

Objects and Handles

  • a distributed object is an object for which all processes that reference it can access it
  • distributed objects
    • MPI_Comm for communicator object
    • MPI_Win for cummincation windows
  • local objects
    • MPI_Datatype for local layout and structure of data
    • MPI_Group for ordered sets of processes
    • MPI_Status for communication
    • MPI_Request for open (not yet completed) communication
    • MPI_Op for binary operators
    • MPI_Info for additional info when creating objects

Process Groups

  • objects of type MP_Group
  • groups used locally by processes to order processes
  • MPI_Comm_group returns an ordered group from a communicator
  • MPI_Group_(size|rank) like MPI_Comm_(size|rank)
  • set-like operations on groups possible, MPI_Group_(union|intersection|difference|incl|excl...)

Point-to-point Communication

  • MPI_Send to send data from process to dest
  • MPI_Recv to receive data from source
  • combined operations MPI_Sendrecv[_replace] have argument for source and dest
  • MPI_Get_(count|elements) to figure how much data was sent

Semantic terms

  • blocking operations when function call returns when operation has been locally completed (irregardless of success of if data has been sent out etc.)
  • non-blocking when function call returns immediately (specified by capital I in function name, e.g. MPI_Irecv)

Specifying Data

  • buffer specifies starting adress
  • count specifies number of elements
  • datatype specifies MPI type

One-sided Communication

  • one process alone initiates communication
    • also specifies actions on both ends
  • involved processes are called origin and target

Collective Communication

  • processes communicating collectively
  • MPI_Barrier blocks until all have reached routine
  • MPI_Bcast to send data from root to all nodes
  • MPI_Scatter distributes data from process root evenly to all other
    • MPI_Scatterv if data is not distributed evenly
  • MPI_Gather gathers data from all other processes in root
    • MPI_Gatherv if not receiving the same number of elements from each process
  • MPI_Allgather like Gather but all processes have all data
    • MPI_Allgatherv similar to MPI_Gatherv
  • MPI_Reduce reduction of data in root with operator MPI_Op op
    • operations are assumed to be associative and commutative
  • MPI_Allreduce like reduce but all processes receive reduction
  • MPI_Reduce_Scatter first reduce vector in processes, then scatter vector in segments across processes
  • MPI_Alltoall (could be named Allscatter) every process scatters their data to all other
  • MPI_Scan performs a reduction over a group, where each subsequent process stores the result of the reduction of the current value and the previous values
    • e.g. Scan with SUM reduction processes with values in brackets: 1(1), 2(2), 3(3), 4(4)

after Scan: 1(1), 2(3), 3(6), 4(10)

  • MPI_Exscan like Scan but value of process is not used in reduction
    • same example with Exscan: 1(0), 2(1), 3(3), 4(6)