TU Wien:Parallel Computing VU (Träff)/Zusammenfassung


Introduction

Processor performance is defined as the maximum number of operations that can be carried out per unit of time, often measured as FLoating point OPerations per Second (FLOPS).

Parallel Computing is intimately related to distributed and concurrent computing:

Parallel Computing
Efficiently utilizing dedicated parallel resources for solving given computation problems.
Distributed Computing
Making independent, non-dedicated resources available and letting them cooperate toward solving specified problem complexes.
Concurrent Computing
Managing and reasoning about interacting processes that may or may not progress simultaneously.

For designing and analyzing parallel algorithms, a suitable model is needed. A good model is a bridging model (if an algorithm performs better in the model, a faithful implementation of it also performs better).

A bridging model for sequential computing is the RAM, Random Access Machine.

PRAM model

While the Parallel Random Access Machine (PRAM) model is unrealistic it is extremely useful.

  • Assumes an arbitrarily large memory (as large as needed)
  • Processors can read/write into memory in unit time (uniform memory access, UMA)
  • Number of processors can be chosen as convenient
  • Processors execute their own program
  • Processors are strictly synchronized (operating in lock-step)

The PRAM model defines what happens when multiple processors access the same memory word in the same step:

  • EREW (Exclusive Read Exclusive Write)
  • CREW (Concurrent Read Exclusive Write)
  • CRCW (Concurrent Read Concurrent Write)
    • Common CRCW: writing processors must write the same value
    • Arbitrary CRCW: any one of the written values will survive
    • Priority CRCW: processor with the highest priority will successfully write
Theorem 1
The maximum of n numbers stored in an array can be found in O(1) parallel time steps, using O(n²) processors (and performing O(n²) operations) on a Common CRCW PRAM.
Theorem 2
The maximum of n numbers stored in an array can be found in O(log n) parallel time steps, using a maximum of n/2 processors (but performing only O(n) operations) on a CREW PRAM.
Theorem 3
Two n × n matrices can be multiplied into a matrix C in O(log n) time steps and O(n³) operations on an EREW PRAM.

Flynn's taxonomy

Flynn's taxonomy looks at the instruction stream(s) and the data stream(s) of the system:

  • SISD (single instruction, single data), e.g. sequential computer
  • SIMD (single instruction, multiple data), e.g. single instruction operates on a larger batch of data
  • MISD (multiple instruction, single data), e.g. pipelined system
  • MIMD (multiple instruction, multiple data), e.g. PRAM machine

SPMD (same program, multiple data) is a subcase of MIMD.

Performance and objectives

absolute speed-up
The absolute speed-up of parallel algorithm Par over sequential algorithm Seq for input of size n on p processor-cores is the ratio of sequential to parallel running time, i.e., S_p(n) = T_seq(n) / T_par(p, n).
relative speed-up
The relative speed-up of a parallel algorithm Par is the ratio of the running time with one processor-core to the parallel running time with p processor-cores, i.e., SRel_p(n) = T_par(1, n) / T_par(p, n).
Amdahl's Law
Assume that the work performed by sequential algorithm Seq can be divided into a strictly sequential fraction s, 0 < s ≤ 1, independent of n, that cannot be parallelized at all, and a fraction r = (1 − s) that can be perfectly parallelized. Let Par be the resulting parallelized algorithm. The maximum speed-up that can be achieved by Par over Seq is 1/s (e.g. with s = 0.1 the speed-up can never exceed 10, no matter how many processor-cores are used).
strongly scalable
The parallel time decreases proportionally to p (linear speed-up), while keeping n constant.
weakly scalable
The parallel running time remains constant while n is increased such that the average work per processor stays constant.

Patterns and paradigms

The work required to solve some problem on given input size n by some algorithm can be described as a set of smaller tasks to be executed in some order. The dependencies between these tasks can be modeled as directed edges in a directed acyclic graph (DAG).

t_i: number of sequential operations required for task i.

Ignoring communication costs, the total work is T_1 = Σ_i t_i (the time needed by a single processor executing all tasks).

T_∞ is the depth of the DAG (the length of a longest path): the fastest possible parallel execution time with an unlimited number of processors.

Work Law: T_p ≥ T_1 / p
Depth Law: T_p ≥ T_∞

Merging problem

Merge two ordered sequences.

  • merging by ranking
  • merging by co-ranking

Prefix sums

Given an array x_0, x_1, ..., x_{n-1} and an associative operator ⊕, compute either the

inclusive prefix-sums y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_i,
or the exclusive prefix-sums y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_{i-1}.
  • load balancing with prefix-sums
  • recursive prefix-sums
  • solving recurrences with the Master Theorem
  • iterative prefix-sums
  • non work-optimal, faster prefix-sums
  • blocking
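
As a point of reference for the parallel variants listed above, here is a minimal sequential C sketch of inclusive and exclusive prefix sums with + as the associative operator (the function and array names are illustrative, not from the lecture):

#include <stdio.h>

// Inclusive prefix sums: y[i] = x[0] + ... + x[i]
void inclusive_prefix(const int *x, int *y, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += x[i];
        y[i] = sum;
    }
}

// Exclusive prefix sums: y[i] = x[0] + ... + x[i-1], y[0] = 0 (neutral element of +)
void exclusive_prefix(const int *x, int *y, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        y[i] = sum;
        sum += x[i];
    }
}

int main(void) {
    int x[] = {1, 2, 3, 4};
    int y[4];
    inclusive_prefix(x, y, 4);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]); // prints: 1 3 6 10
    return 0;
}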

Shared memory

A naive, parallel shared-memory system model:

  • A (fixed) number of processor-cores p connected to a large (but finite) shared memory.
  • Every core can read/write every location in memory, but memory access is significantly more expensive than performing operations in the processor-core.
  • Memory accesses are not uniform; from each processor's point of view some locations can be accessed (much) faster than other locations.
  • Processors are not synchronized.

All these assumptions are in stark contrast to those made for the idealized PRAM model above.

Caches

The cache system of a standard processor does not work on the granularity of single words in memory, but on larger blocks of memory addresses. The memory can be thought of as being segmented into small blocks (typically 64 bytes) and each block can be mapped to some cache line.

There are different ways how memory blocks can be mapped to cache lines:

  • directly mapped: each block is mapped to one predetermined cache line
  • fully associative: each block can be mapped to any cache line
  • k-way set associative: each block can be mapped to some predetermined, small set of cache lines (set size = k)

When a processor reads a word and the memory block to which the word belongs is in the cache, we have a cache hit. Otherwise we have a cache miss, and the block has to be read from the slower main memory. There are three types of cache misses:

  • compulsory (cold) miss: the block is accessed for the first time (the cache is still empty/cold)
  • capacity miss: the block had to be evicted earlier because the cache is too small for the working set
  • conflict miss: all cache lines to which the block can be mapped are occupied (can only happen for directly mapped and set associative caches)

In a k-way set associative cache, any of the k cache lines of the set can be evicted upon a conflict miss. Typically used replacement policies are least recently used (LRU) and least frequently used (LFU).

On a write to memory:

  • if the block is cached, the cache must be updated
  • if the block is not cached, it can be cached (write allocate) or not (write no-allocate)

When the cache is updated, the memory can be updated immediately (write-through), or the updating of the memory can be postponed until the cache line is evicted (write back).

Such a cache system can benefit applications/algorithms in two ways:

  • temporal locality: the same memory address is reused frequently (and will therefore likely not have been evicted yet)
  • spatial locality: nearby addresses in the same block are used as well (so a loaded block is reused)

Matrix-matrix multiplication

Access locality matters; a standard, and highly illustrative, example application is matrix-matrix multiplication.

  • An m × k matrix A is multiplied with a k × n matrix B into an m × n matrix C.
  • The sequential algorithm runs in O(m·k·n), or O(n³) for square matrices.
  • The three nested loops (over i, j, k) can be parallelized and interchanged; the loop order matters:
    • differences arise because the C language stores matrices in row-major order, so rows can be traversed with good spatial locality
    • a different index order causes more or fewer cache misses
Recursive matrix-matrix multiplication algorithm
  • recursively split A and B into smaller submatrices until they are very small, then use the iterative algorithm
Blocked matrix-matrix multiplication
  • split the matrices into blocks at the start
  • then multiply with 6 nested loops (3 over the blocks, 3 within a block)
  • the block size can be chosen depending on the cache size
    • such an algorithm is called cache-aware (the opposite is cache-oblivious)
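
A hedged sketch (not the lecture's code; the 1D row-major layout and function names are illustrative) of how the loop order alone changes cache behavior for square n × n matrices:

// i-j-k order: the innermost loop strides through B column-wise -> poor spatial locality.
void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}

// i-k-j order: the innermost loop walks rows of B and C -> far fewer cache misses.
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n*n; i++) C[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i*n + k];
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];
        }
}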

Multi-core caches

  • the cache system has several levels
  • L1 (lowest level) to L3 (or more)
    • L1 is closest to the processor, smallest and fastest
    • L3 is the Last Level Cache (LLC), several MB in size
    • L1 is split into a data and an instruction cache
  • a Translation Lookaside Buffer (TLB) caches address translations for memory management (pages)
  • the lowest level caches are private (used by one core only)
  • higher level caches are shared among cores

Caches in parallel multi-core systems pose new problems:

cache coherence problem
arises when a block that is stored in several cores' caches is updated in one of them
  • the cache system is coherent if the copies in the other caches will be updated at some point in time (this can also just mean they are invalidated)
  • it is non-coherent if they are never updated
the problem is solved by a cache coherence protocol, which may cause a lot of cache coherence traffic
false sharing
two cores access different addresses that happen to lie in the same memory block, each holding the block in its cache;
an update to one address causes the whole cache line in the other core's cache to be invalidated/replaced, producing coherence traffic although no data is actually shared

The memory system

  • the cache system is part of the memory hierarchy
    • from fast, small, low levels (caches) to slower, bigger, higher levels (main memory)
  • writes to memory go through a write buffer (FIFO)
  • in multi-core CPUs not every core has a direct connection to main memory
    • instead cores are connected to a memory controller
  • some cores are closer to a given memory controller than others → access times are not uniform (NUMA)

Super-linear speed-up

  • possible due to the memory hierarchy of large caches: with more processors the working set of each processor gets smaller and may eventually fit into the fastest caches, so each processor works faster than the single processor running the sequential algorithm

Application performance

  • memory-bound: the operations to be performed per memory access take less time than the memory access (memory is the bottleneck)
  • compute-bound: the operations to be performed per memory access take more time than the memory access (the processor is the bottleneck)

Memory consistency

  • when thread 1 sets a certain flag in memory and thread 2 checks whether the flag is set, it may happen that the write is still sitting in the write buffer, so thread 2 reads a stale value
  • programming interfaces like pthreads and OpenMP provide constructs (synchronization, flushes/memory barriers) to ensure a well-defined memory state at certain points

pthreads

A thread is the smallest execution unit that can be scheduled.

The main characteristics of the pthreads programming model are:

  1. Fork-join parallelism. A thread can spawn any number of new threads (up to system limitations) and wait for completion of threads. Threads are referenced by thread identifiers.
  2. Threads are symmetric peers; any thread can wait for the completion of any other thread via its thread identifier.
  3. One initial main thread
  4. SPMD but can be MIMD
  5. Threads are scheduled by the OS
  6. No implicit synchronization
  7. Threads in a process share global information
  8. Coordination constructions for synchronization and updates to shared objects

pthreads in C

  • include header #include <pthread.h>
  • compile with gcc flag -pthread
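
A minimal fork-join sketch (names and the thread count are illustrative, not from the lecture); compile with gcc -pthread as noted above:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4   // chosen arbitrarily for this sketch

// Each thread gets its id as argument and prints it.
void *hello(void *arg) {
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        if (pthread_create(&tid[i], NULL, hello, (void *)i) != 0) exit(1);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);   // wait for completion (fork-join)
    return 0;
}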

Race conditions

race condition
When e.g. two threads update a variable at the same time, the outcome may be the write of either thread → non-deterministic result.
data race
Two or more threads access a shared object concurrently and without synchronization, and at least one of those accesses is a write.

Critical sections, MutEx, locks

A lock is a programming construct that guarantees mutual exclusion on a critical section.

  • threads try to acquire lock, if granted they enter critical section, release lock when done
  • threads are blocked till lock is acquired
deadlock free
If any number of threads are trying to acquire the lock, eventually one thread must succeed.
starvation free
Any specific thread trying to acquire the lock will eventually acquire it, no matter which other threads are also trying to acquire the lock.

In pthreads a lock is called a mutex. pthreads mutexes guarantee mutual exclusion and are deadlock free, but not starvation free.

  • threads at lock will serialize, one thread after another will pass critical section
  • locks where many threads are competing are called contended
  • try-locks allow execution of some code when lock could not be acquired

Waiting for a lock can be implemented in two ways:

  • with a spin-lock the core executing the blocked thread actively keeps testing (spinning) for the lock to become free
  • with a blocking lock the blocked thread is suspended by the OS, the core is free to do something else
    (may be advantageous when the shared-memory system is oversubscribed)

In pthreads, spinning behavior can be requested explicitly by using spin locks.
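
A minimal sketch of a mutex-protected critical section (counter, thread count and iteration count are illustrative values, not from the lecture):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCREMENTS 1000000

static long counter = 0;                                    // shared object
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    // enter critical section
        counter++;                    // protected update, no data race
        pthread_mutex_unlock(&lock);  // leave critical section
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter); // deterministically NTHREADS*INCREMENTS
    return 0;
}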

  • condition variables are associated with a mutex
    • a thread waits on a condition variable until a signal occurs (the mutex is released while waiting and re-acquired on wake-up)
    • a signal wakes up (at least) one arbitrary waiting thread
    • a broadcast wakes up all waiting threads at once
  • the combination of a lock with condition variables (waits and signals) is called a monitor
  • barriers can be built from a mutex and a condition variable
  • concurrent (one-time) initialization: the first thread executes the initialization code exactly once before the other threads proceed
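
A minimal condition-variable sketch (the flag name and the structure are illustrative): one thread waits until another thread sets a flag and signals.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;                       // the condition the waiter is interested in

void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    while (!ready)                          // always re-check the condition (spurious wake-ups)
        pthread_cond_wait(&c, &m);          // releases m while waiting, re-acquires on wake-up
    printf("condition fulfilled\n");
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    pthread_mutex_lock(&m);
    ready = 1;                              // change the state under the mutex ...
    pthread_cond_signal(&c);                // ... then wake one waiting thread
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return 0;
}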

Locks in data structures

  • data structures can trivially be made safe for parallel use by protecting all operations on the structure with a single lock (coarse-grained locking)
  • the alternative is to construct concurrent data structures (e.g. with finer-grained locking of parts of the structure)

Problems with locks

  • deadlocks can happen easily with locks
    • a deadlock also occurs when a thread tries to acquire a (non-recursive) lock it already holds
  • locks around long critical sections cause harmful serialization
  • a thread crashing inside a critical section also causes a deadlock
  • threads can starve
  • with thread priorities, a lower-priority thread holding a lock can prevent a higher-priority thread from continuing (priority inversion)

Atomic operations

  • atomic instructions carry out small read-modify-write operations that cannot be interfered with by other threads
  • a = a+27 can be implemented as an atomic instruction through fetch-and-add
  • a thread crashing during an atomic instruction will not cause a deadlock (there is no lock to hold)
  • atomic instructions are wait-free because every thread completes them in a bounded number of steps (O(1))
  • an operation is lock-free if at any time at least one thread can complete it in a bounded number of steps (a weaker guarantee than wait-free, and not always given)
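
A minimal sketch of the fetch-and-add idea using C11 atomics (the counter and thread/iteration counts are illustrative); this achieves the same result as the mutex example above without any lock:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCREMENTS 1000000

static atomic_long counter = 0;   // shared counter, updated atomically

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&counter, 1);  // fetch-and-add: no lock, no data race
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
    printf("counter = %ld\n", atomic_load(&counter));  // always NTHREADS*INCREMENTS
    return 0;
}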

OpenMP

The main characteristics of the OpenMP programming model are:

  1. Parallelism is (mostly) implicit through work sharing. All threads execute the same program (SPMD).
  2. Fork-join parallelism: Master thread implicitly spawns threads through parallel region construct, threads join at the end of parallel region.
  3. Each thread in a parallel region has a unique integer thread id, and threads are consecutively numbered from 0 to the number of threads minus one in the region.
  4. The number of threads can exceed the number of processors/cores. Threads are intended to be executed in parallel by the available cores/processors.
  5. Constructs for sharing work across threads. Work is expressed as loops of independent iterations and task graphs.
  6. Threads can share variables; shared variables are shared among all threads. Threads can have private variables.
  7. Unprotected, parallel updates of shared variables lead to data races, and are erroneous.
  8. Synchronization constructs for preventing race conditions.
  9. Memory is in a consistent state after synchronization operations.

The OpenMP specifications are available online (but you only need to know what is discussed in the lecture).

OpenMP in C

  • include header #include <omp.h>
  • compile with gcc flag -fopenmp

Parallel regions

The parallel construct starts parallel execution. There is an implicit barrier at the end of the parallel construct.

#pragma omp parallel [clauses]
{
   // threads
}

where clauses is a comma-separated list of:

  • num_threads(t)
If this clause is not given the default number of threads is used. The default can be set with the omp_set_num_threads(t) library function or the OMP_NUM_THREADS environment variable.
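
A minimal sketch of a parallel region (the thread count 4 is only an example value); compile with gcc -fopenmp as noted above:

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Start a parallel region with (at most) 4 threads.
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();       // unique id 0 .. #threads-1
        int p  = omp_get_num_threads();      // number of threads in this region
        printf("Thread %d of %d\n", id, p);
    }   // implicit barrier here
    return 0;
}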

Sharing clauses

Per default, all variables declared before a parallel region are shared by the threads in the region. Variables declared in the structured block of the parallel region are private.

Sharing of variables can be controlled with the following clauses:

  • default(shared|none)
    Whether variables are shared by default or not.
  • private(<vars>)
    Make uninitialized copies of the listed variables.
  • firstprivate(<vars>)
    Makes copies of the listed variables and initializes each copy to the value the variable had before the parallel region.
  • shared(<vars>)
    Share the listed variables among all threads of the region.

It is good practice not to share variables by default, i.e. to set default(none) and explicitly list the variables to be shared with the shared() clause.

Shared variables make data races possible.
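
A minimal sketch of the sharing clauses (variable names and values are illustrative, not from the lecture):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 10;        // listed as shared (read-only here)
    int offset = 100;  // each thread gets an initialized private copy (firstprivate)
    int scratch;       // each thread gets an uninitialized private copy (private)

    #pragma omp parallel default(none) shared(n) firstprivate(offset) private(scratch) num_threads(2)
    {
        scratch = omp_get_thread_num();   // must be assigned before use (uninitialized copy)
        offset += scratch;                // modifies only the thread's own copy
        printf("thread %d: n=%d offset=%d\n", scratch, n, offset);
    }
    // offset is still 100 here: the private copies are discarded after the region
    printf("after region: offset=%d\n", offset);
    return 0;
}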

Worksharing constructs

Loop construct

#pragma omp for [clauses...]
for (loop range...)

  • loop scheduling is assignment of iteration blocks to threads
  • each iteration has to be executed exactly once
  • data races have to be avoided

For loops must be in canonical form:

  • all threads must have same start and end values for i and same step
  • loop ranges have to be finite and determined (no while loops possible)
  • end condition has to be in the form of: i<n, i<=n, i>n, i>=n, i!=n, with n as an expression or value
  • steps have to be in form of i++, i--, i+=inc, i=i+inc, i-=inc, i=i-inc

#pragma omp parallel for [clauses...]

  • shorthand for a parallel region containing a single loop
  • the next statement must be a for loop
  • there is always an implicit barrier at the end
Loop scheduling

  • determines how the iterations are assigned to the threads
  • the loop range is divided into consecutive chunks
  • static schedule: chunks have the same size and are assigned round-robin to the threads
    • computation of the chunk assignment is very fast (low overhead)
  • dynamic: each thread dynamically grabs the next free chunk
  • guided: dynamic assignment, but the chunk size decreases over time (proportional to the remaining iterations divided by the number of threads)
  • set with schedule(static|dynamic|guided[, chunksize])
    • for guided, chunksize is the minimum chunk size
    • if no size is given, a default size is used
  • also possible: schedule(auto|runtime)
    • runtime: the schedule is chosen at run time
    • auto: the OpenMP compiler/runtime decides
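
A minimal sketch of a worksharing loop with an explicit schedule (array size and chunk size are illustrative values):

#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    // Independent iterations in canonical form, divided into chunks of 100
    // iterations that idle threads grab dynamically.
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
    // implicit barrier here: all iterations are done

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}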
Collapsing nested loops

  • transforms multiple nested loops into one larger iteration space

#pragma omp parallel for collapse(depth) [clauses...]

  • depth is the number of nested loops to be combined and parallelized

sections construct

The code is split into independent pieces (sections) that can be executed in parallel by the available threads.

#pragma omp sections

  • outer construct enclosing the individual sections
  • there is an implicit barrier at the end of the sections block
    • can be removed with nowait
  • variables can be designated (first)private

#pragma omp section

  • marks one independent code section inside a sections block
  • which thread executes which section is decided by the runtime system
  • best case: each thread executes one section and all sections run in parallel (see the sketch below)
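
A minimal sections sketch (the two printed "parts" stand in for any independent pieces of work):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections            // implicit barrier at the end (no nowait)
        {
            #pragma omp section
            { printf("part A done by thread %d\n", omp_get_thread_num()); }

            #pragma omp section
            { printf("part B done by thread %d\n", omp_get_thread_num()); }
        }
    }
    return 0;
}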

task construct

Tasks are very much like OpenMP Sections, but Sections are static, that is, the number of sections is set when you write the code, whereas Tasks can be created anytime, and in any number, under control of your program's logic.[1]

#pragma omp task [clauses...]

  • a task is like a deferred function call
  • execution of the task may not happen immediately
    • it happens at the latest when completion is required (e.g. at a taskwait or at the end of the parallel region)
  • a task can access the variables of the generating thread
    • if those variables are changed concurrently, data races may occur
  • a task can be designated final and thus not generate any additional deferred tasks (see the sketch after this list)
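
A minimal recursive-task sketch (array size, cutoff and function name are illustrative, not from the lecture): each half of the array becomes a task, and taskwait joins the children.

#include <omp.h>
#include <stdio.h>

long sum(const long *x, int n) {
    if (n < 1000) {                       // cutoff: small ranges are summed sequentially
        long s = 0;
        for (int i = 0; i < n; i++) s += x[i];
        return s;
    }
    long s1, s2;
    #pragma omp task shared(s1)
    s1 = sum(x, n / 2);
    #pragma omp task shared(s2)
    s2 = sum(x + n / 2, n - n / 2);
    #pragma omp taskwait                  // wait for both child tasks
    return s1 + s2;
}

int main(void) {
    static long x[100000];
    for (int i = 0; i < 100000; i++) x[i] = 1;
    long s;
    #pragma omp parallel
    #pragma omp single                    // only one thread generates the top-level tasks
    s = sum(x, 100000);
    printf("sum = %ld\n", s);             // 100000
    return 0;
}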

single construct

#pragma omp single

  • the statement is executed by exactly one (arbitrary) thread of the team
    • the other threads wait at an implicit barrier at the end of the single construct
    • the barrier can be eliminated with the nowait clause
      • might cause race conditions
    • allows making variables private or firstprivate (unlike the master construct)

Master and synchronization constructs

master construct

#pragma omp master

  • declares statement to be executed by master thread only (thread id of 0)
    • other threads will simply skip

barrier construct

#pragma omp barrier

  • no thread can continue until all threads have reached barrier

critical construct

#pragma omp critical [(name)]

  • threads wait, and only one thread at a time executes the code of the critical section
  • the section can have a name; critical sections with the same name exclude each other
  • shared variables can be safely updated inside a critical region

atomic construct

#pragma omp atomic [read|write|update|capture]

  • for simple critical sections consisting of a single memory update
  • allows fetch-and-add (FAA) style atomic operations
  • update covers statements like x++, x = x op expr, ...
  • capture additionally captures the value, e.g. y = x++, ...
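
A minimal sketch contrasting atomic and critical (the sums are illustrative; both variants produce the same result, atomic being the cheaper choice for a single update):

#include <stdio.h>

int main(void) {
    long sum_atomic = 0, sum_critical = 0;

    #pragma omp parallel for
    for (int i = 1; i <= 1000; i++) {
        #pragma omp atomic update       // cheap: a single atomic memory update
        sum_atomic += i;

        #pragma omp critical            // general: arbitrary code, one thread at a time
        { sum_critical += i; }
    }
    printf("%ld %ld\n", sum_atomic, sum_critical);  // both 500500
    return 0;
}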

taskwait construct

#pragma omp taskwait [depend(...)]

  • waits for the completion of the tasks generated so far (by the encountering task)
  • the depend clause is the only clause allowed

ordered construct

#pragma omp ordered

  • a loop with a cross-iteration dependency that cannot be handled otherwise can be marked ordered; the ordered block is then executed in sequential iteration order
  • only one ordered block is possible per parallel loop
  • the other parts of each iteration can still be performed in parallel

Library calls

  • omp_get_thread_num(void) returns the thread id
  • omp_get_num_threads(void) returns the number of threads in the current team
  • omp_get_max_threads(void) returns the maximum number of threads a parallel region would use
  • omp_set_num_threads(int num_threads) sets that number
  • omp_get_wtime(void) returns the wall-clock time in seconds
  • omp_get_wtick(void) returns the tick resolution of the timer

Reductions

#pragma omp for reduction(operator: variables)

  • reduction is a clause of the for (and parallel) constructs
  • each thread accumulates into a private copy of the variable; the copies are combined with the operator at the end
  • operators are `+ - * & | ^ && || min max`
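
A minimal reduction sketch (array size and contents are illustrative):

#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    // Each thread accumulates into a private copy of sum;
    // the copies are combined with + when the loop finishes.
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   // 1000.0
    return 0;
}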

Locks

  • locks don't have condition variables
  • recursive (nested) locks are possible in OMP

Special loops

  • OMP can try to utilize vector operations (SIMD)

#pragma omp simd [clauses...]

  • followed by canonical for loop
  • one thread executes loop but with SIMD instructions

#pragma omp for simd [clauses...]

  • followed by canonical for loop
  • loop executed by multiple threads
  • each chunk executed with SIMD instructions

#pragma omp parallel for simd [clauses...]

  • same as previous but with parallel region declared
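
A minimal sketch of the combined parallel for simd construct just described (array size, the constant a and the axpy-style loop are illustrative):

#include <stdio.h>

#define N 1000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0f; }

    float a = 2.0f;
    // The chunks of the parallelized loop are additionally vectorized (SIMD).
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}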

#pragma omp taskloop [clauses...]

  • recursively breaks the iteration range into smaller ranges
  • the smaller ranges are then executed as tasks
  • should be initiated by a single thread (e.g. inside a single construct)
  • the size of the ranges can be adjusted with the grainsize clause

Distributed memory systems

Network properties: structure and topology

  • distributed memory introduces interconnection network (or interconnect)
  • entities (cores, multi-core CPUs or larger systems) are connected through links (in any form)

A network without switches is called a direct network, a network with switches an indirect network.

  • the topology of the network can be modeled as an unweighted graph
    • each node is a network communication element
    • each edge denotes a direct link between two elements
  • communication networks are mostly bidirectional (represented as an undirected graph)
  • the graph diameter is the longest shortest path between any two nodes
    • it is a lower bound on the number of communication steps between two elements
  • the graph degree is the maximum degree over all nodes
  • the bisection width is the minimum number of edges that must be removed to partition the graph into two (roughly) equally large node sets
  • the weakest networks (large diameter, small bisection width) are the linear processor array and the processor ring
  • tree networks are binary or k-ary trees
    • they have a small diameter (O(log n)), but a bisection width of only 1 (the root is a bottleneck)
  • d-dimensional mesh networks
    • processors are identified by their integer coordinate vectors
    • a special case is the torus network, where the edges wrap around
  • the hypercube network is a special case of a torus network (d dimensions with 2 nodes per dimension)
  • modern large systems are often built as torus networks with 3 to 6 dimensions, or as indirect multi-stage networks

Communication algorithms in networks

  • unidirectional communication: a link can be used in only one direction at a time
  • bidirectional communication: a link can be used in both directions simultaneously
    • the case in most modern systems
  • broadcast problem: transmit data from a root node to every other node in as few communication steps as possible
    • the lower bound is ⌈log_{k+1} n⌉ communication rounds in a network with n nodes where each node can communicate with at most k other nodes per round (per round, the number of nodes holding the data can grow by at most a factor of k+1)
    • an algorithm to solve it partitions the network into smaller subnetworks, the root sends the data to “virtual roots” of these subnetworks, and the problem is solved recursively for the subnetworks

Communication costs

  • the linear transmission cost model assumes a time of T(m) = α + β·m for sending m units of data, where α is the start-up latency and β the time per unit of data

Routing and switching

  • if the network is not fully connected, routing establishes a path between two nodes that are not directly connected

Hierarchical, distributed memory systems

  • the communication network has different levels
  • communication between processors has different characteristics depending on whether they live on the same or on different compute nodes

Programming models

  • concrete network properties are abstracted away
    • model of a fully connected network
  • processes are not synchronized and communicate with each other through explicit or implicit message transmission
    • message transmission is assumed to be deadlock free and correct
  • distributed-memory programming models are either:
    • data distribution centric: data structures are distributed according to rules; if one process changes a part of a data structure owned by another process, this happens through communication
    • communication centric: focuses on explicit communication and synchronization instead of the data structures (MPI)

Message-Passing Interface

  • the idea of the message-passing programming model is to structure a parallel computation as processes that interact solely by sending and receiving messages
  • a classical formal model for message passing is Communicating Sequential Processes (CSP); pure message-passing (CSP) programs cannot have data races, since there is no shared state

The main characteristics of the MPI message-passing programming model are:

  1. Finite sets of processes (in communication domains) that can communicate.
  2. Processes identified by rank in communication domain.
  3. Ranks are successive, 0, 1, …, p−1; p is the number of processes in the domain (its size).
  4. More than one communication domain possible; created relative to default domain of all started processes. A process can belong to several communication domains.
  5. Processes operate on local data, all communication explicit.
  6. Communication reliable and ordered.
  7. Network oblivious
  8. Three basic communication models:
    1. Point-to-point communication, different modes, non-local and local completion semantics
    2. One-sided communication, different synchronization mechanisms, local completion mechanisms
    3. Collective operations, non-local (and local) completion semantics.
  9. Structure of communicated data orthogonal to communication model and mode.
  10. Communication domains may reflect physical topology.

The MPI specifications are available online (but you only need to know what is discussed in the lecture).

The rest of this section is about MPI in C:

  • header #include <mpi.h>
  • functions live in the MPI_ “name space” (using the prefix for your own functions is forbidden)
  • MPI_SUCCESS return value for success (obviously)
  • mpicc compiler to compile programs

Initializing MPI

  • MPI_Init to initialize
  • end with MPI_Finalize
  • MPI_Abort forces termination in case of emergency
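
A minimal MPI "hello world" sketch; compile with mpicc and launch with your MPI launcher (e.g. mpirun -np 4):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                        // must be called before other MPI calls

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);          // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);          // number of processes
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                                // no MPI calls allowed afterwards
    return 0;
}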

Error checking

  • MPI does only limited error checking, because complete checking would be too expensive

Communicators

Communication domains are called communicators in MPI. A communicator is a distributed object that can be operated upon by all processes belonging to the communicator. A communicator is referenced by a handle of type MPI_Comm.

  • MPI_Comm_size to get the number of processes in the communicator
  • MPI_Comm_rank to get rank in communicator
  • all processes are in communication domain MPI_COMM_WORLD
  • MPI_Comm_split splits processes of communicator into smaller groups with own communicators
  • MPI_Comm_create also creates new communicators
  • both functions are collective (have to be called by all processes)
  • MPI_Comm_free to free communicators
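
A minimal sketch of splitting MPI_COMM_WORLD into two sub-communicators (the even/odd split is only an example of a color function):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Collective call: all processes of MPI_COMM_WORLD must participate.
    MPI_Comm comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2 /* color */, rank /* key: keep order */, &comm);

    int newrank, newsize;
    MPI_Comm_rank(comm, &newrank);
    MPI_Comm_size(comm, &newsize);
    printf("world rank %d -> rank %d of %d in sub-communicator\n", rank, newrank, newsize);

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}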

Organizing processes

  • communicator with grid structure is called Cartesian communicator
  • created through MPI_Cart_create
    • reorder flag to reorder ranks so neighboring processes are close on grid
  • MPI_Cart_coords translate rank to coordinate vector
    • MPI_Cart_rank vice versa

Objects and handles

  • a distributed object is an object that exists across processes; all processes that hold a reference (handle) to it can access it
  • distributed objects
    • MPI_Comm for communicator objects
    • MPI_Win for communication windows (one-sided communication)
  • local objects
    • MPI_Datatype for the local layout and structure of data
    • MPI_Group for ordered sets of processes
    • MPI_Status for status information about a completed communication
    • MPI_Request for open (not yet completed) communications
    • MPI_Op for binary operators (used in reductions)
    • MPI_Info for additional info when creating objects

Process groups

  • objects of type MPI_Group
  • groups are used locally by processes to describe ordered sets of processes
  • MPI_Comm_group returns the (ordered) group of a communicator
  • MPI_Group_(size|rank) analogous to MPI_Comm_(size|rank)
  • set-like operations on groups are possible: MPI_Group_(union|intersection|difference|incl|excl|...)

Point-to-point communication

  • MPI_Send to send data from the calling process to a destination process dest
  • MPI_Recv to receive data from a source process source
  • the combined operations MPI_Sendrecv[_replace] have arguments for both source and dest
  • MPI_Get_(count|elements) on the status object to find out how much data was actually received
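
A minimal send/receive sketch (tag, message size and contents are illustrative; run with at least 2 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    if (rank == 0) {
        int data[4] = {1, 2, 3, 4};
        MPI_Send(data, 4, MPI_INT, 1, TAG, MPI_COMM_WORLD);          // to rank 1
    } else if (rank == 1) {
        int buf[4];
        MPI_Status status;
        MPI_Recv(buf, 4, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);  // from rank 0
        int count;
        MPI_Get_count(&status, MPI_INT, &count);                     // how many elements arrived
        printf("rank 1 received %d ints, last = %d\n", count, buf[3]);
    }
    MPI_Finalize();
    return 0;
}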

Semantic terms

  • an operation is blocking when the function call returns only once the operation has been locally completed, i.e. the buffers can be reused (regardless of whether the data has actually been delivered yet)
  • an operation is non-blocking when the function call returns immediately (indicated by a capital I in the function name, e.g. MPI_Irecv); completion must be checked or enforced separately

Specifying data

  • buffer specifies the starting address
  • count specifies the number of elements
  • datatype specifies the MPI type of the elements
    • e.g. MPI_(CHAR|INT|LONG|FLOAT|DOUBLE)

One-sided communication

  • one process alone initiates communication
    • also specifies actions on both ends
  • involved processes are called origin and target

Collective communication

If some process calls a collective operation C on a communicator, then all other processes in the communicator must also eventually call C, and no other collective on that communicator before C.

                          Regular                          Irregular
symmetric, no-data        MPI_Barrier                      (none)
Rooted (asymmetric)       MPI_Bcast, MPI_Gather,           MPI_Gatherv, MPI_Scatterv
                          MPI_Scatter, MPI_Reduce
Non-rooted (symmetric)    MPI_Allgather, MPI_Alltoall,     MPI_Allgatherv, MPI_Alltoallv,
                          MPI_Allreduce,                   MPI_Alltoallw, MPI_Reduce_scatter
                          MPI_Reduce_scatter_block,
                          MPI_Scan, MPI_Exscan


symmetric vs. non-symmetric
all processes have the same role in collective vs. one/some process (root) is special
regular vs. irregular
each process contributes or receives the same amount of data vs. different pairs of processes may exchange different amounts of data


  • MPI_Barrier blocks until all processes have reached the routine
  • MPI_Bcast sends data from the root to all other processes
  • MPI_Scatter distributes the data of the root process evenly over all processes
  • MPI_Gather gathers data from all processes at the root
  • MPI_Allgather like Gather, but afterwards all processes have all the data
  • MPI_Reduce reduces the data of all processes at the root with the operator MPI_Op op
    • the operations are assumed to be associative and commutative
  • MPI_Allreduce like Reduce, but all processes receive the reduction result
  • MPI_Reduce_scatter first reduces the vectors of all processes, then scatters the result vector in segments across the processes
  • MPI_Alltoall (could be named Allscatter): every process scatters its data to all other processes
  • MPI_Scan performs a prefix reduction over the processes: each process receives the reduction of its own value and the values of all lower-ranked processes (inclusive)
    • e.g. Scan with SUM reduction over processes with values in brackets: 1(1), 2(2), 3(3), 4(4)
      after Scan: 1(1), 2(3), 3(6), 4(10)
  • MPI_Exscan like Scan, but the process's own value is not included in its result (exclusive)
    • same example with Exscan: 1(0), 2(1), 3(3), 4(6)
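
A minimal sketch combining a few of these collectives (the value 42 and the use of the rank as input data are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Root broadcasts a value to everyone (all processes call MPI_Bcast).
    int n = (rank == 0) ? 42 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Sum of all ranks, result at root 0.
    int sum;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    // Inclusive prefix sum over the ranks (cf. the MPI_Scan example above).
    int prefix;
    MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: n=%d prefix=%d\n", rank, n, prefix);
    if (rank == 0) printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}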