TU Wien:High Performance Computing VU (Träff)/Zusammenfassung2023W Collectives
Binomial Tree:[Bearbeiten | Quelltext bearbeiten]
A binomial tree is recursively created by attaching B(i-1), B(i-2), … , B(0) as leaves to a node.
Some facts: Has total 2ⁱ nodes. A binomial tree B(i) has choose(i,k) nodes at level k Not useful for parallelisation
Fibonacci Tree[Bearbeiten | Quelltext bearbeiten]
A constant degree tree (two). Constructed by taking a root and making F(i-1) and F(i-2) as children. Fi has Fib(i+3)-1 nodes. Useful for parallelisation
Gather:[Bearbeiten | Quelltext bearbeiten]
- Binomial Tree: α log(p) + β m (p-1)/p Optimal!
- Binomial Tree like (divide-and-conquer): Harmful algorithmic latency
Reduce:[Bearbeiten | Quelltext bearbeiten]
- Binomial Tree: α log(p) + log(p) β m (Not optimal in beta)
- Binomial Tree like (divide-and-conquer): Harmful algorithmic latency
- Doubled, Mirrored Binary Tree: (2ceil(log((p+1)/4))α+ 2√(2(ceil(log((p+1)/4))αβm) + βm. Abuse bidirectional communication! 2-Coloring on tree is needed: Can be computed in O(log(p)) by recursive function Best known log-latency algorithm. Optimal in β
Broadcast:[Bearbeiten | Quelltext bearbeiten]
- Binomial Tree: α log(p) + log(p) β m (Not optimal in beta)
- Pipelined Binary tree: Long expression, not-optimal in β, log in α
- Pipelined Fibonacci Tree:
- Pipelined Hypercube: (log p-1)α + 2√((log p-1)αβm) + βm Optimal! NEXTBIT… Calculates distance to next 1 bit (with wraparaound) This is again possible by bit fiddling: i … processor rank q … log(p) j … round number s(i,j) = j-q+(1-bit(i,j))NEXTBIT(i)[j mod q] t(i,j) = j-q+ (bit(i,j))NEXTBIT(i)[j mod q] Pipelined Hypercube extend scheme by “doubling” some processors: One reperesentative processor receives and sends to other processor (last round). Other processor sends. Issue if other processor needs to send last rounds block: Switch roles On first switch, there is missing info for the switched in node. (What was received previous round.). This will be communicated in upcoming rounds and a single final round needs to be added.
Scatter:[Bearbeiten | Quelltext bearbeiten]
- Binomial Tree: α log(p) + β m (p-1)/p Optimal!
- Binomial Tree like (divide-and-conquer): α log(p) + β m (p-1)/p Optimal!
Gather V (variable length):[Bearbeiten | Quelltext bearbeiten]
3 log(p) α + β ∑ mᵢ + β ∑(only H’) mᵢ
Dynamic Binomial Tree on extended hypercube (2 d-1 hypercubes with connected edges) Complicated construction: * Fixed binomial tree for each hypercube * Both hypercubes have fixed root b, they know their size and the root r’ that is optimal to collect at for this subtree * communicate between fixed roots such that each optimal root r’ knows other optimal root r’’ * This forms the next level hypercube with a new root r’’
Allgather:[Bearbeiten | Quelltext bearbeiten]
- Butterfly/Hypercube: α log(p) + (p-1)/p β m
- Circulant graph/Fully connected: α log(p) + (p-1)/p β m
Allreduce:[Bearbeiten | Quelltext bearbeiten]
- Butterfly/Hypercube: Trivial version only with Idempotent Operators
- Smart version: Bitfiddling and maintaining two sums (with and without own result)
Scan/Exscan:[Bearbeiten | Quelltext bearbeiten]
- Hillis-Steele (Hypercube i believe): log(p)(α + β m)
Reducescatter (Each process knows only one result of the reduced vector):[Bearbeiten | Quelltext bearbeiten]
- Butterfly: log(p) α + (p-1)/p β m (Permutation needed such that each process gets correct part of reduced vector)
All to all[Bearbeiten | Quelltext bearbeiten]
- Fully connected: (p-1) α + (p-1)/p β m (Different Variants, Each process sends to all “in order”, 1-factoring the graph (deg-1 components))
- Hypercube (pipelined in some sense): log p * (α+β m/2): Send own data in chunks (half message sizes) to neighbors which further distribute it to correct processes.
- Circulant graph (works for n != 2^i): log p * (α+β m/2) + O(m): Very similar to hypercube, execution a bit more tricky: Either rotate array before and after sending, or be smart with calculating strides and start positions of blocks during sending.