TU Wien:Parallel Computing VU (Träff)/Zusammenfassung
Introduction
Processor performance is defined as the maximum number of operations that can be carried out per unit of time. Often FLoating point OPerations per Second (FLOPS).
Parallel Computing is intimately related to distributed and concurrent computing:
- Parallel Computing
- Efficiently utilizing dedicated parallel resources for solving given computation problems.
- Distributed Computing
- Making independent, non-dedicated resources available and cooperate toward solving specified problem complexes.
- Concurrent Computing
- Managing and reasoning about interacting processes that may or may not progress simultaneously.
For designing and analyzing parallel algorithms, a suitable model is needed. A good model is a bridging model (if an algorithm performs better in the model, a faithful implementation of it also performs better).
A bridging model for sequential computing is the RAM, Random Access Machine.
PRAM model
While the Parallel Random Access Machine (PRAM) model is unrealistic, it is extremely useful.
- Assumes as large as needed memory
- Processors can read/write into memory in unit time (uniform memory access, UMA)
- Number of processors can be chosen as convenient
- Processors execute their own program
- Processors are strictly synchronized (operating in lock-step)
The PRAM model defines what happens when multiple processors access the same memory word in the same step:
- EREW (Exclusive Read Exclusive Write)
- CREW (Concurrent Read Exclusive Write)
- CRCW (Concurrent Read Concurrent Write)
- Common CRCW: writing processors must write the same value
- Arbitrary CRCW: either of the written values will survive
- Priority CRCW: processor with the highest priority will successfully write
- Theorem 1
  - The maximum of n numbers stored in an array can be found in O(1) parallel time steps, using O(n²) processors (and performing O(n²) operations) on a Common CRCW PRAM.
- Theorem 2
  - The maximum of n numbers stored in an array can be found in O(log n) parallel time steps, using a maximum of n/2 processors (but performing only O(n) operations) on a CREW PRAM; a shared-memory sketch of the underlying pairwise reduction follows after this list.
- Theorem 3
  - Two n × n matrices can be multiplied into a matrix C in O(log n) time steps and O(n³) operations on an EREW PRAM.
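As an illustration of the pairwise reduction behind Theorem 2, here is a minimal shared-memory sketch. It is written with OpenMP (introduced only later in these notes) and merely approximates the strict PRAM lock-step by a barrier per round; it is not the lecture's PRAM formulation.

```c
#include <omp.h>

/* Pairwise (tree) maximum: ceil(log2 n) rounds; in round with distance d
   the surviving elements are combined pairwise. The implicit barrier of
   the loop construct stands in for the PRAM lock-step; a[] is overwritten. */
int max_reduce(int n, int a[])
{
    for (int d = 1; d < n; d *= 2) {
        #pragma omp parallel for
        for (int i = 0; i + d < n; i += 2 * d)
            if (a[i + d] > a[i])
                a[i] = a[i + d];
    }
    return a[0];   /* the maximum ends up at position 0 */
}
```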
Flynn's taxonomy
Flynn's taxonomy looks at the instruction stream(s) and the data stream(s) of the system:
- SISD (single instruction, single data), e.g. sequential computer
- SIMD (single instruction, multiple data), e.g. single instruction operates on a larger batch of data
- MISD (multiple instruction, single data), e.g. pipelined system
- MIMD (multiple instruction, multiple data), e.g. PRAM machine
SPMD (same program, multiple data) is a subcase of MIMD.
Performance and objectives
- absolute speed-up
- The absolute speed-up of parallel algorithm Par over sequential algorithm Seq for input of size n on p processor-cores is the ratio of sequential to parallel running time, i.e., SU_p(n) = T_seq(n) / T_par(p, n).
- relative speed-up
- The relative speed-up of parallel algorithm Par is the ratio of its running time with one processor-core to the parallel running time with p processor-cores, i.e., SU_rel(p, n) = T_par(1, n) / T_par(p, n).
- Amdahl's Law
- Assume that the work performed by sequential algorithm Seq can be divided into a strictly sequential fraction s, 0 < s ≤ 1, independent of n, that cannot be parallelized at all, and a fraction r = ( 1 − s ) that can be perfectly parallelized. The parallelized algorithm is Par. The maximum speed-up that can be achieved by Par over Seq is 1/s.
- strongly scalable
- The parallel time decreases proportionally to p (linear speed-up), while keeping n constant.
- weakly scalable
- The parallel running time remains constant, while keeping the average work per processor constant and increasing n.
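A short worked example of Amdahl's Law above: with a sequential fraction s = 0.05 the speed-up can never exceed 1/s = 20, regardless of p; with p = 10 processor-cores and perfect parallelization of the remaining fraction, the achievable speed-up is 1/(0.05 + 0.95/10) ≈ 6.9.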
Patterns and paradigms
The work required to solve some problem on given input size n by some algorithm can be described as a set of smaller tasks to be executed in some order. The dependencies between these tasks can be modeled as directed edges in a directed acyclic graph (DAG).
Let t_i denote the number of sequential operations required for task i. Ignoring communication costs, the total work is T_1 = Σ_i t_i.
T_∞, the depth (length of a longest path) of the DAG, is the sequential part of any DAG: the fastest possible parallel execution time with an unlimited number of processors.
- Work Law: T_p ≥ T_1 / p
- Depth Law: T_p ≥ T_∞
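For example, a task DAG with total work T_1 = 100 and depth T_∞ = 10 can never be executed in fewer than 10 steps, no matter how many processors are used (Depth Law); with p = 5 processors the Work Law gives T_5 ≥ 100/5 = 20, so the speed-up with 5 processors is at most 5.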
Merging problem
Merge two ordered sequences.
- merging by ranking
- merging by co-ranking
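A minimal sketch of the ranking idea (an assumed formulation, not necessarily the lecture's exact code): rank(x, B, m) returns the number of elements of the sorted array B that are smaller than x, found by binary search. Element A[i] of the other sorted sequence then goes to output position i + rank(A[i], B, m), and all output positions can be computed independently, hence in parallel.

```c
/* Number of elements of sorted array B[0..m-1] strictly smaller than x
   (the rank of x in B), computed by binary search in O(log m).
   Assumes distinct elements across the two sequences for simplicity. */
int rank(int x, const int B[], int m)
{
    int lo = 0, hi = m;              /* invariant: B[0..lo) < x <= B[hi..m) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (B[mid] < x) lo = mid + 1;
        else            hi = mid;
    }
    return lo;
}
```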
Prefix sums
Given an array x of n elements and an associative operator ⊕, compute
- the inclusive prefix sums y_i = x_0 ⊕ x_1 ⊕ … ⊕ x_i for all i, 0 ≤ i < n,
- or the exclusive prefix sums y_i = x_0 ⊕ x_1 ⊕ … ⊕ x_{i−1} for all i, 0 < i < n.
- load balancing with prefix-sums
- recursive prefix-sums
- solving recurrences with the Master Theorem
- iterative prefix-sums
- non work-optimal, faster prefix-sums
- blocking
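A minimal OpenMP sketch of the blocking approach from the list above (the block partition and integer data are assumptions for illustration): each thread first scans its own consecutive block, one thread then scans the block totals, and finally every thread adds the offset of all preceding blocks.

```c
#include <stdlib.h>
#include <omp.h>

/* Inclusive prefix sums of x[0..n-1] with the three-phase blocking scheme:
   1) each thread scans its own consecutive block,
   2) one thread computes exclusive prefix sums of the block totals,
   3) each thread adds the total of all preceding blocks to its block. */
void prefix_sums(int x[], long n)
{
    int p = omp_get_max_threads();
    int *bsum = malloc((p + 1) * sizeof(int));
    bsum[0] = 0;

    #pragma omp parallel num_threads(p)
    {
        int  t  = omp_get_thread_num();
        long lo = t * n / p, hi = (t + 1) * n / p;

        /* phase 1: local inclusive scan of block [lo, hi) */
        for (long i = lo + 1; i < hi; i++) x[i] += x[i - 1];
        bsum[t + 1] = (hi > lo) ? x[hi - 1] : 0;

        #pragma omp barrier
        #pragma omp single
        for (int i = 1; i <= p; i++)      /* phase 2: scan of block totals */
            bsum[i] += bsum[i - 1];
        /* implicit barrier at the end of single */

        for (long i = lo; i < hi; i++)    /* phase 3: add block offset */
            x[i] += bsum[t];
    }
    free(bsum);
}
```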
Shared-memory systems
A naive, parallel shared-memory system model:
- A (fixed) number of processor-cores p connected to a large (but finite) shared memory.
- Every core can read/write every location in memory, but memory access is significantly more expensive than performing operations in the processor-core.
- Memory accesses are not uniform (NUMA): from each processor's point of view, some locations can be accessed (much) faster than other locations.
- Processors are not synchronized.
All these assumptions are in stark contrast to those made for the idealized PRAM model.
Caches
The cache system of a standard processor does not work on the granularity of single words in memory, but on larger blocks of memory addresses. The memory can be thought of as being segmented into small blocks (typically 64 bytes) and each block can be mapped to some cache line.
There are different ways how memory blocks can be mapped to cache lines:
- directly mapped: each block is mapped to one predetermined cache line
- fully associative: each block can be mapped to any cache line
- k-way set associative: each block can be mapped to some predetermined, small set of cache lines (set size = k)
When a processor reads a word and the memory block to which the word belongs is in the cache, we have a cache hit. Otherwise we have a cache miss, and the block has to be read from slow memory. There are three types of cache misses:
- compulsory (cold) miss: the block is accessed for the first time (e.g. the cache is still empty)
- capacity miss: the cache is full, the working set does not fit
- conflict miss: all cache lines where the block can be placed are occupied (can only happen for directly mapped and set-associative caches)
In a k-way set associative cache, any of the k cache lines of the set can be evicted upon a conflict miss. Typically used replacement policies are least recently used (LRU) and least frequently used (LFU).
On a write to memory:
- if the block is cached, the cache must be updated
- if the block is not cached, it can be cached (write allocate) or not (write no-allocate)
When the cache is updated, the memory can be updated immediately (write-through), or the updating of the memory can be postponed until the cache line is evicted (write back).
Such a cache system can benefit applications/algorithms in two ways:
- temporal locality: memory address is reused frequently (will not be evicted)
- spatial locality: addresses of the same block are also used
Matrix-matrix multiplication
Access locality matters; a standard, and highly illustrative example application is the matrix-matrix multiplication.
- Matrix A multiplied with matrix B gives matrix C
- sequential algorithm in O(n·m·k) operations, or O(n³) for square n × n matrices
- the three loops (over i, j, k) can be parallelized; the order of parallelization matters
- differences arise because matrices are stored (and thus accessible) in row-major order
- a different index order causes more or fewer cache misses
- Recursive matrix-matrix multiplication algorithm
- recursively split A and B into smaller submatrices until they are very small, then use the iterative solution
- Blocked matrix-matrix multiplication
- split matrices into blocks at start
- then multiply with 6 nested loops
- block size can be chosen depending on cache size
- called cache-aware algorithm (opposite is cache-oblivious)
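A small sketch of the loop-order point made above (the function and array layout are illustrative): with row-major storage, the i-k-j order keeps the innermost accesses to B and C consecutive in memory (spatial locality), and parallelizing the outermost loop gives each thread its own rows of C, so there are no data races.

```c
/* C = A * B for n x n matrices stored in row-major order.
   i-k-j loop order: the innermost loop walks B and C row-wise;
   the outermost loop is parallelized, so each thread writes
   disjoint rows of C. */
void matmul(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```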
Multi-core caches
- the cache system has several levels
- L1 (lowest level) to L3 (or more)
- L1 is closest to processor, smallest (few KB) and fastest
- L3 is Last Level Cache LLC, several MB big
- L1 has data and instruction cache
- Translation Lookaside Buffer for memory management (pages)
- lowest level caches are private (only for one processor)
- higher level caches are shared
Caches in parallel multi-core systems pose new problems:
- cache coherence problem
- when one block is updated in one cache and also stored in another cache of another processor
- coherent cache system if line of other cache will be updated at some point in time (can also just mean it is invalidated in the cache)
- non coherent if it will never be updated
- problem solved through cache coherence protocol, may cause a lot of cache coherence traffic
- false sharing
  - occurs when two caches hold the same memory block while their threads update different addresses within that block
  - an update to one address of the block causes the whole cache line in the other cache to be invalidated/replaced, producing coherence traffic although no data is logically shared (see the sketch below)
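A hypothetical illustration of false sharing (the counter layout and the 64-byte line size are assumptions): each thread updates only its own counter, but in the unpadded array several counters share a cache line, so every update invalidates that line in the other cores' caches; padding gives each counter its own line.

```c
#include <omp.h>

#define NTHREADS 8

long counters_bad[NTHREADS];   /* adjacent longs: several share one cache line */

struct padded { long value; char pad[64 - sizeof(long)]; };
struct padded counters_good[NTHREADS];   /* one (assumed 64-byte) line each */

void count(long iterations)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < iterations; i++) {
            counters_bad[t]++;         /* false sharing: heavy coherence traffic */
            counters_good[t].value++;  /* private cache line: no false sharing   */
        }
    }
}
```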
The memory system
- cache system part of memory hierarchy
- from fast low levels (caches) to slower bigger higher levels
- write buffer for memory writes (FIFO)
- for multi-core CPUs not every processor has direct connection to main memory
- instead connected to memory controller
- some processors are closer to the memory controller than others, so access times are not uniform
Super-linear speed-up
- through the memory hierarchy of large caches: with more processors the working set of each processor gets smaller and may eventually fit into the fastest caches, which can yield more than linear speed-up
Application performance
- memory-bound: the operations to be performed per memory access take less time than the memory access
- compute-bound: the operations to be performed per memory access take more time than the memory access
Memory consistency
- when process 1 sets a certain flag in memory and process 2 checks whether that flag is set, it may happen that the flag is still in the write buffer of process 1, so process 2 reads a stale value
- programming interfaces like pthreads and OpenMP provide constructs to ensure a well-defined memory state at certain points
pthreads
A thread is the smallest execution unit that can be scheduled.
The main characteristics of the pthreads programming model are:
- Fork-join parallelism. A thread can spawn any number of new threads (up to system limitations) and wait for completion of threads. Threads are referenced by thread identifiers.
- Threads are symmetric peers: any thread can wait for the completion of any other thread via its thread identifier.
- One initial main thread
- SPMD but can be MIMD
- Threads are scheduled by the OS
- No implicit synchronization
- Threads in a process share global information
- Coordination constructions for synchronization and updates to shared objects
pthreads in C
- include header #include <pthread.h>
- compile with gcc flag -pthread
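A minimal fork-join sketch (thread count and function names are illustrative): the main thread spawns a few peer threads and waits for each of them via its thread identifier.

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

void *work(void *arg)
{
    long rank = (long)arg;
    printf("Hello from thread %ld\n", rank);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, work, (void *)t);  /* fork */
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);                      /* join */
    return 0;
}
```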
Race conditions
- race condition
- When e.g. two threads update a variable at the same time, the outcome may be the value written by either thread → non-deterministic result.
- data race
- Two or more threads access a shared object concurrently without synchronization, and at least one of those accesses is a write.
Critical sections, MutEx, locks
A lock is a programming-model construct to guarantee mutual exclusion for a critical section.
- threads try to acquire lock, if granted they enter critical section, release lock when done
- threads are blocked till lock is acquired
- deadlock free
- If any number of threads are trying to acquire the lock, eventually one thread must succeed.
- starvation free
- Any specific thread trying to acquire the lock will eventually acquire it, no matter which other threads are also trying to acquire the lock.
In pthreads a lock is called a mutex. pthreads mutexes guarantee mutual exclusion and are deadlock free, but not starvation free.
- threads at lock will serialize, one thread after another will pass critical section
- locks where many threads are competing are called contended
- try-locks allow execution of some code when lock could not be acquired
Waiting for a lock can be implemented in two ways:
- with a spin-lock the core executing the blocked thread actively keeps testing (spinning) for the lock to become free
- with a blocking lock the blocked thread is suspended by the OS, the core is free to do something else
(may be advantageous when the shared-memory system is oversubscribed)
In pthreads, spinning behavior can be requested explicitly by using spin locks.
- condition variables are associated with a mutex
- a thread waits on a condition variable until a signal occurs
- a signal wakes up one (arbitrary) waiting thread
- a broadcast signals all waiting threads at once
- the concept of a lock together with condition variables, waits and signals is called a monitor
- barriers are similar to a mutex with condition variables
- concurrent (one-time) initialization lets the first thread execute the initialization code before the other threads begin
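A small sketch of the mutex plus condition-variable (monitor) pattern described above; the flag and function names are made up for illustration.

```c
#include <pthread.h>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;                 /* shared state protected by mtx */

void wait_for_ready(void)
{
    pthread_mutex_lock(&mtx);
    while (!ready)                     /* re-check: wakeups may be spurious */
        pthread_cond_wait(&cond, &mtx); /* releases mtx while waiting */
    pthread_mutex_unlock(&mtx);
}

void set_ready(void)
{
    pthread_mutex_lock(&mtx);
    ready = 1;
    pthread_cond_signal(&cond);        /* or pthread_cond_broadcast for all */
    pthread_mutex_unlock(&mtx);
}
```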
Locks in data structures
- trivially make data structures work with parallel algorithms by using a single lock for all operations on structures
- other way is to construct concurrent data structures
Problems with locks
- deadlocks can happen easily with locks
- can also happen when the same thread tries to acquire same lock again
- locks around long critical section cause harmful serialization
- threads crashing during critical section also cause deadlocks
- threads can starve
- priority threads and locks can lead to lower priority threads preventing higher priority threads from continuing
Atomic operations
- atomic instructions carry out operations that cannot be interfered with by other threads
- e.g. a = a+27 can be implemented as an atomic instruction through fetch-and-add
- crashing threads during atomic instructions will not cause deadlocks
- atomic instructions are wait-free because they always complete in a bounded (constant) number of steps
- instructions are lock-free if any thread will be able to execute instruction in a bounded amount of time (not always given)
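The a = a+27 update from above, written as a fetch-and-add with C11 atomics (one possible realization; gcc's __atomic builtins or OpenMP's atomic construct, see later, would work as well).

```c
#include <stdatomic.h>

atomic_long a;                    /* shared counter */

void add27(void)
{
    atomic_fetch_add(&a, 27);     /* indivisible read-modify-write, wait-free */
}
```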
OpenMP
The main characteristics of the OpenMP programming model are:
- Parallelism is (mostly) implicit through work sharing. All threads execute the same program (SPMD).
- Fork-join parallelism: Master thread implicitly spawns threads through parallel region construct, threads join at the end of parallel region.
- Each thread in a parallel region has a unique integer thread id, and threads are consecutively numbered from 0 to the number of threads minus one in the region.
- The number of threads can exceed the number of processors/cores. Threads are intended to be executed in parallel by the available cores/processors.
- Constructs for sharing work across threads. Work is expressed as loops of independent iterations and task graphs.
- Threads can share variables; shared variables are shared among all threads. Threads can have private variables.
- Unprotected, parallel updates of shared variables lead to data races, and are erroneous.
- Synchronization constructs for preventing race conditions.
- Memory is in a consistent state after synchronization operations.
The OpenMP specifications are available online (but you only need to know what is discussed in the lecture).
OpenMP in C
- include header #include <omp.h>
- compile with gcc flag -fopenmp
Parallel regions
The parallel construct starts parallel execution.
There is an implicit barrier at the end of the parallel construct.
#pragma omp parallel [clauses] { // threads }
where clauses is a comma-separated list of:
- num_threads(t)
  - If this clause is not given, the default number of threads is used. The default can be set with the omp_set_num_threads(t) library function or the OMP_NUM_THREADS environment variable.
- one or multiple sharing clauses (see below)
Sharing clauses
Per default, all variables declared before a parallel region are shared by the threads in the region. Variables declared in the structured block of the parallel region are private.
Sharing of variables can be controlled with the following clauses:
- default(shared|none): whether variables are shared by default or not
- private(<vars>): make uninitialized private copies of the listed variables
- firstprivate(<vars>): make private copies of the listed variables, each initialized to the value the variable had before the parallel region
- shared(<vars>): explicitly share the listed variables

It is good practice to not share variables by default by setting default(none) and explicitly listing the variables to be shared with the shared() clause.
Shared variables make data races possible.
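A small example combining the constructs above (thread count and variable names are illustrative): default(none) forces every variable to be listed, n is copied into each thread via firstprivate, and tid is private because it is declared inside the region.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 100;
    #pragma omp parallel num_threads(4) default(none) firstprivate(n)
    {
        int tid = omp_get_thread_num();   /* declared inside: private */
        printf("thread %d sees n = %d\n", tid, n);
    }                                     /* implicit barrier */
    return 0;
}
```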
Worksharing constructs
Loop construct
#pragma omp for [clauses...] for (loop range...)
- loop scheduling is assignment of iteration blocks to threads
- each iteration has to be executed exactly once
- data races have to be avoided
For loops must be in canonical form:
- all threads must have same start and end values for i and same step
- loop ranges have to be finite and determined (no while loops possible)
- end condition has to be in the form of:
i<n, i<=n, i>n, i>=n, i!=n
, with n as an expression or value - steps have to be in form of
i++, i--, i+=inc, i=i+inc, i-=inc, i=i-inc
#pragma omp parallel for [clauses...]
- shorthand for parallel region with one loop
- next statement is for loop
- always barrier after
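A minimal worksharing example (array and size are assumed to be set up elsewhere): the iterations of the canonical loop are independent, so they can be distributed over the threads without data races.

```c
void vector_add(int n, const double *b, const double *c, double *a)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    /* implicit barrier at the end of the parallel loop */
}
```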
Loop scheduling
- determines which thread executes which iterations
- loop range is divided in consecutive chunks
- static schedule: chunks have same size and are assigned like round-robin (each thread gets chunk one after another)
- computation of chunk assignments is very fast (low overhead)
- dynamic: each thread dynamically grabs next chunk it can
- guided: dynamic assignment, but the chunk size changes; it is determined by the number of remaining iterations divided by the number of threads
- set like schedule(static|dynamic|guided[, chunksize])
  - for guided, chunksize is the minimum chunk size
  - if no chunksize is given, a default size is used
- also possible: schedule(auto|runtime)
  - runtime: the schedule is determined at run time
  - auto: the OpenMP compiler/runtime determines the schedule
Collapsing nested loops
- transform multiple loops into one
#pragma omp parallel for collapse(depth) [clauses...]
- depth is the number of nested loops to be parallelized
sections construct
The code is split into small pieces that can be executed in parallel by available threads.
#pragma omp sections
- outer structure
- there is an implicit barrier at the end of the block
- nowait possible
- variables can be designated (first)private
#pragma omp section
- marks an independent code section
- which thread executes which section is decided by the runtime system
- best case: each thread executes a section, all threads run in parallel
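A small sketch of the sections construct; the two functions are placeholders for independent pieces of work.

```c
extern void init_part_a(void);
extern void init_part_b(void);

void setup(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        init_part_a();      /* executed by some thread */

        #pragma omp section
        init_part_b();      /* possibly by another thread, concurrently */
    }                       /* implicit barrier */
}
```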
task construct
Tasks are very much like OpenMP Sections, but Sections are static, that is, the number of sections is set when you write the code, whereas Tasks can be created anytime, and in any number, under control of your program's logic.[1]
#pragma omp task [clauses...]
- task is like a function call
- completion of task may not happen immediately
- at latest when completion is requested (e.g. at end of parallel region)
- tasks can access any variables of the thread
- if those variables are changed in the task data races may occur
- task can be designated final and thus not generate any additional tasks
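A common task sketch (the list type and process() are assumptions for illustration): one thread generates a task per list element, the team's threads execute them, and taskwait (see below) waits for their completion.

```c
typedef struct node { struct node *next; int data; } node_t;
extern void process(node_t *p);

void traverse(node_t *head)
{
    #pragma omp parallel
    #pragma omp single            /* only one thread generates the tasks */
    {
        for (node_t *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)
            process(p);           /* executed by some thread, eventually */
        }
        #pragma omp taskwait      /* wait for all generated tasks */
    }
}
```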
single construct
#pragma omp single
- the statement is executed by exactly one (arbitrary) thread
- the other threads wait at the implicit barrier at the end of the construct
- the barrier can be eliminated by the nowait clause after single
- might cause race conditions
- allows making variables private or firstprivate (unlike master construct)
Master and synchronization constructs
master construct
#pragma omp master
- declares statement to be executed by master thread only (thread id of 0)
- other threads will simply skip
barrier construct
#pragma omp barrier
- no thread can continue until all threads have reached barrier
critical construct
#pragma omp critical [(name)]
- threads will wait and only one thread will execute code of critical section
- section can have name
- shared variables can safely be updated (also by tasks) inside a critical region
atomic construct
#pragma omp atomic [read|write|update|capture]
- for simple critical sections
- allow fetch and add (FAA) atomic operations
- update is x++, x = x op expr, ...
- capture is v = x++, ...
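A minimal capture example (the counter is illustrative): fetch the old value of a shared counter and increment it in one indivisible step, i.e. a fetch-and-add.

```c
int counter = 0;    /* shared */

int next_index(void)
{
    int old;
    #pragma omp atomic capture
    old = counter++;        /* read old value and increment atomically */
    return old;
}
```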
taskwait construct
#pragma omp taskwait [depend(...)]
- wait for completion of generated tasks
- only dependency clause allowed
ordered construct
#pragma omp ordered
- loops with dependency patterns that cannot be handled normally can be marked as ordered
- only one possible ordered block in a parallel loop
- other parts of iteration can still be performed in parallel
Library calls
- omp_get_thread_num(void): returns the thread id
- omp_get_num_threads(void): returns the number of threads
- omp_get_max_threads(void): returns the maximum possible number of threads
- omp_set_num_threads(int num_threads): sets the maximum number of threads
- omp_get_wtime(void): returns wall-clock time in seconds
- omp_get_wtick(void): returns the tick resolution of the timer
Reductions
#pragma omp for reduction(operator: variables)
- is a clause of the for construct
- operators are `+ - * & | ^ && || min max`
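A minimal reduction example: every thread accumulates into a private copy of sum, and the private copies are combined with + at the end of the loop.

```c
double array_sum(int n, const double *x)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += x[i];           /* each thread updates its private copy */
    return sum;                /* copies combined at the implicit barrier */
}
```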
Locks
- locks don't have condition variables
- recursive (nested) locks are possible in OMP
Special loops
- OMP can try to utilize vector operations (SIMD)
#pragma omp simd [clauses...]
- followed by canonical for loop
- one thread executes loop but with SIMD instructions
#pragma omp for simd [clauses...]
- followed by canonical for loop
- loop executed by multiple threads
- each chunk executed with SIMD instructions
#pragma omp parallel for simd [clauses...]
- same as previous but with parallel region declared
#pragma omp taskloop [clauses...]
- recursively break iteration range into smaller ranges
- smaller ranges are then executed as tasks
- should be initiated by a single thread
- the size of the ranges can be adjusted with the grainsize() clause
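A small sketch of the combined constructs above (function and names are illustrative): the loop is split across the threads, and each chunk is executed with vector instructions.

```c
void scale(int n, double a, double *x)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        x[i] *= a;      /* chunks per thread, SIMD within each chunk */
}
```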
Distributed memory systems
Network properties: structure and topology
- distributed memory introduces interconnection network (or interconnect)
- entities (cores, multi-core CPUs or larger systems) are connected through links (in any form)
A network without switches is called a direct network, a network with switches an indirect network.
- the topology of network can be modeled as an unweighted graph
- each node is a network communication element
- each edge denotes a direct link between two elements
- communication networks are mostly bidirected (represented as undirected graph)
- graph diameter is longest shortest path between two nodes
- is a lower bound on communication steps between two elements
- graph degree is maximum degree of all nodes
- bisection width is minimum number of edges to remove so that graph is partitioned into two equally large subsets
- worst possible networks are linear processor array and processor ring
- tree networks are binary or k-ary trees
  - have small diameter (O(log n)) but bisection width 1
- d-dimensional mesh networks
- processors are identified by their integer vectors
- special case is torus network where edges wrap around
- hypercube network is special case of torus network
- modern systems are often built as torus networks with 3 to 6 dimensions, or as indirect multi-stage networks
Communication algorithms in networks
- unidirectional communication if only one direction
- bidirectional if in both directions can be communicated
- most modern systems
- broadcast problem: how to transmit data from a root to every other node in as few communication steps as possible
  - a lower bound of ⌈log_{k+1}(p)⌉ communication rounds holds in a network with p nodes where each node can send to at most k neighbors per step, since the number of nodes that have the data can grow by at most a factor of k+1 per round
  - an algorithm to solve it partitions the network into smaller subnetworks, the root sends the data to "virtual roots" of these subnetworks, and the problem is then solved recursively within the subnetworks
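For example, with p = 8 nodes and one send per informed node per round: after round 1 two nodes have the data, after round 2 four, after round 3 all eight, so ⌈log₂ 8⌉ = 3 rounds suffice (a binomial-tree broadcast).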
Communication costs
- linear transmission cost assumes a time of α + βm for sending m units of data, where α is the start-up latency and β is the time per unit of data
Routing and switching
- if the network is not fully connected, routes (chosen by a routing algorithm) make a path between two nodes possible
Hierarchical, distributed memory systems
- communication networks with different levels
- communication between processors may have different characteristics depending on whether they live on the same or on different compute nodes
Programming models
- concrete network properties are abstracted
- model of fully connected network
- processes are not synchronized and communicate with others through explicit or implicit transmission
- message transmission is assumed to be deadlock free and correct
- distributed system programming models either:
- data distribution centric: data structures are distributed according to rules; if one process changes a part of the data structure owned by another process, this happens through communication
- communication centric: focuses on explicit communication and synchronization instead of data structures (MPI)
Message-Passing Interface
- message-passing programming model is to structure parallel processes through sending and receiving messages
- a classical formalization of the message-passing model is Communicating Sequential Processes (CSP); CSP programs cannot have data races
The main characteristics of the MPI message-passing programming model are:
- Finite sets of processes (in communication domains) that can communicate.
- Processes identified by rank in communication domain.
- Ranks are successive, 0, 1, …, p−1; p is the number of processes in the domain (its size).
- More than one communication domain possible; created relative to default domain of all started processes. A process can belong to several communication domains.
- Processes operate on local data, all communication explicit.
- Communication reliable and ordered.
- Network oblivious
- Three basic communication models:
- Point-to-point communication, different modes, non-local and local completion semantics
- One-sided communication, different synchronization mechanisms, local completion mechanisms
- Collective operations, non-local (and local) completion semantics.
- Structure of communicated data orthogonal to communication model and mode.
- Communication domains may reflect physical topology.
The MPI specifications are available online (but you only need to know what is discussed in the lecture).
The rest of this section is about MPI in C:
- header #include <mpi.h>
- functions are in the MPI_ "name space" (illegal and punishable by law to use the prefix for own functions)
- MPI_SUCCESS return value for success (obviously)
- mpicc compiler to compile programs
Initializing MPI
- MPI_Init to initialize
- end with MPI_Finalize
- MPI_Abort forces termination in case of emergency
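A minimal MPI program as a sketch (it already uses the rank and size calls introduced in the Communicators section below): initialize, query rank and size in MPI_COMM_WORLD, finalize; compile with mpicc, start with mpirun/mpiexec.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* own rank in the domain */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes    */
    printf("process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```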
Error checking
- MPI only has limited error checking because it is expensive
Communicators
Communication domains are called communicators in MPI.
A communicator is a distributed object that can be operated upon by all processes belonging to the communicator. A communicator is referenced by a handle of type MPI_Comm.
- MPI_Comm_size to get the number of processes in a communicator
- MPI_Comm_rank to get the own rank in a communicator
- all started processes are in the communication domain MPI_COMM_WORLD
- MPI_Comm_split splits the processes of a communicator into smaller groups, each with its own communicator
- MPI_Comm_create also creates new communicators
- both functions are collective (have to be called by all processes)
- MPI_Comm_free to free communicators
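A small sketch of MPI_Comm_split (splitting by rank parity is just an illustrative choice of color): all processes with the same color end up in the same new communicator, renumbered according to the key argument.

```c
#include <mpi.h>

void split_even_odd(void)
{
    int rank;
    MPI_Comm evenodd;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2 /* color */, rank /* key */,
                   &evenodd);
    /* ... collective operations on evenodd ... */
    MPI_Comm_free(&evenodd);
}
```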
Organizing processes
- communicator with grid structure is called Cartesian communicator
- created through MPI_Cart_create
- reorder flag to reorder ranks so that neighboring processes are close on the grid
- MPI_Cart_coords translates a rank into a coordinate vector
- MPI_Cart_rank vice versa
Objects and handles
- a distributed object is an object that can be accessed by all processes that reference it
- distributed objects
  - MPI_Comm for communicator objects
  - MPI_Win for communication windows
- local objects
  - MPI_Datatype for the local layout and structure of data
  - MPI_Group for ordered sets of processes
  - MPI_Status for communication status
  - MPI_Request for open (not yet completed) communication
  - MPI_Op for binary operators
  - MPI_Info for additional info when creating objects
Process groups
- objects of type MPI_Group
- groups are used locally by processes to order processes
- MPI_Comm_group returns an ordered group from a communicator
- MPI_Group_(size|rank) like MPI_Comm_(size|rank)
- set-like operations on groups are possible: MPI_Group_(union|intersection|difference|incl|excl...)
Point-to-point communication
- MPI_Send to send data to process dest
- MPI_Recv to receive data from process source
- the combined operations MPI_Sendrecv[_replace] have arguments for both source and dest
- MPI_Get_(count|elements) to figure out how much data was received
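A minimal point-to-point sketch (message size and tag are arbitrary): rank 0 sends 100 ints to rank 1, which receives them and queries the status for the actual count.

```c
#include <mpi.h>

void ping(void)
{
    int rank, count, data[100] = { 0 };
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(data, 100, MPI_INT, 1 /* dest */, 0 /* tag */,
                 MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 100, MPI_INT, 0 /* source */, 0,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count); /* elements actually received */
    }
}
```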
Semantic terms
- blocking operations: the function call returns only when the operation has been locally completed (regardless of whether the data has actually been delivered to the destination yet)
- non-blocking operations: the function call returns immediately (indicated by a capital I in the function name, e.g. MPI_Irecv)
Specifying data
- buffer specifies the starting address
- count specifies the number of elements
- datatype specifies the MPI datatype
  - e.g. MPI_(CHAR|INT|LONG|FLOAT|DOUBLE)
One-sided communication
- one process alone initiates communication
- also specifies actions on both ends
- involved processes are called origin and target
Collective communication
If some process calls a collective operation C on a communicator, then all other processes in the communicator must also eventually call C and no other collective before C.
| | Regular | Irregular |
|---|---|---|
| symmetric, no-data | MPI_Barrier | |
| Rooted (asymmetric) | MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Reduce | MPI_Gatherv, MPI_Scatterv |
| Non-rooted (symmetric) | MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Reduce_scatter_block, MPI_Scan, MPI_Exscan | MPI_Allgatherv, MPI_Alltoallv, MPI_Alltoallw, MPI_Reduce_scatter |
- symmetric vs. non-symmetric
- all processes have the same role in collective vs. one/some process (root) is special
- regular vs. irregular
- each process contributes or receives the same amount of data vs. different pairs of processes may exchange different amounts of data
- MPI_Barrier blocks until all processes have reached the routine
- MPI_Bcast sends data from root to all other processes
- MPI_Scatter distributes data from process root evenly to all others
- MPI_Gather gathers data from all other processes in root
- MPI_Allgather like Gather, but afterwards all processes have all the data
- MPI_Reduce reduction of data in root with operator MPI_Op op
  - operations are assumed to be associative and commutative
- MPI_Allreduce like Reduce, but all processes receive the reduction result
- MPI_Reduce_scatter first reduces the vectors of the processes, then scatters the result vector in segments across the processes
- MPI_Alltoall (could be named Allscatter) every process scatters its data to all others
- MPI_Scan performs a reduction over the group, where each process stores the reduction of its own value and the values of all preceding processes
  - e.g. Scan with SUM reduction, 4 processes with their values in brackets: 1(1), 2(2), 3(3), 4(4)
  - after Scan: 1(1), 2(3), 3(6), 4(10)
- MPI_Exscan like Scan, but the process's own value is not used in the reduction
  - same example with Exscan: 1(0), 2(1), 3(3), 4(6)
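A small sketch matching the Scan example above: every process contributes rank + 1, and after MPI_Scan each process holds the sum over itself and all lower-ranked processes.

```c
#include <mpi.h>

void scan_example(void)
{
    int rank, val, prefix;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank + 1;                       /* values 1, 2, 3, ... */
    MPI_Scan(&val, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* rank 0: 1, rank 1: 3, rank 2: 6, rank 3: 10, ... */
}
```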