TU Wien:High Performance Computing VU (Träff)/Zusammenfassung2023W Collectives

Aus VoWi
Zur Navigation springen Zur Suche springen

Binomial Tree:[Bearbeiten | Quelltext bearbeiten]

A binomial tree is recursively created by attaching B(i-1), B(i-2), … , B(0) as leaves to a node.

Some facts: Has total 2ⁱ nodes. A binomial tree B(i) has choose(i,k) nodes at level k Not useful for parallelisation

Fibonacci Tree[Bearbeiten | Quelltext bearbeiten]

A constant degree tree (two). Constructed by taking a root and making F(i-1) and F(i-2) as children. Fi has Fib(i+3)-1 nodes. Useful for parallelisation

Gather:[Bearbeiten | Quelltext bearbeiten]

  • Binomial Tree: α log(p) + β m (p-1)/p Optimal!
  • Binomial Tree like (divide-and-conquer): Harmful algorithmic latency

Reduce:[Bearbeiten | Quelltext bearbeiten]

  • Binomial Tree: α log(p) + log(p) β m (Not optimal in beta)
  • Binomial Tree like (divide-and-conquer): Harmful algorithmic latency
  • Doubled, Mirrored Binary Tree: (2ceil(log((p+1)/4))α+ 2√(2(ceil(log((p+1)/4))αβm) + βm. Abuse bidirectional communication! 2-Coloring on tree is needed: Can be computed in O(log(p)) by recursive function Best known log-latency algorithm. Optimal in β

Broadcast:[Bearbeiten | Quelltext bearbeiten]

  • Binomial Tree: α log(p) + log(p) β m (Not optimal in beta)
  • Pipelined Binary tree: Long expression, not-optimal in β, log in α
  • Pipelined Fibonacci Tree:
  • Pipelined Hypercube: (log p-1)α + 2√((log p-1)αβm) + βm Optimal! NEXTBIT… Calculates distance to next 1 bit (with wraparaound) This is again possible by bit fiddling: i … processor rank q … log(p) j … round number s(i,j) = j-q+(1-bit(i,j))NEXTBIT(i)[j mod q] t(i,j) = j-q+ (bit(i,j))NEXTBIT(i)[j mod q] Pipelined Hypercube extend scheme by “doubling” some processors: One reperesentative processor receives and sends to other processor (last round). Other processor sends. Issue if other processor needs to send last rounds block: Switch roles On first switch, there is missing info for the switched in node. (What was received previous round.). This will be communicated in upcoming rounds and a single final round needs to be added.

Scatter:[Bearbeiten | Quelltext bearbeiten]

  • Binomial Tree: α log(p) + β m (p-1)/p Optimal!
  • Binomial Tree like (divide-and-conquer): α log(p) + β m (p-1)/p Optimal!

Gather V (variable length):[Bearbeiten | Quelltext bearbeiten]

3 log(p) α + β ∑ mᵢ + β ∑(only H’) mᵢ

Dynamic Binomial Tree on extended hypercube (2 d-1 hypercubes with connected edges) Complicated construction: * Fixed binomial tree for each hypercube * Both hypercubes have fixed root b, they know their size and the root r’ that is optimal to collect at for this subtree * communicate between fixed roots such that each optimal root r’ knows other optimal root r’’ * This forms the next level hypercube with a new root r’’

Allgather:[Bearbeiten | Quelltext bearbeiten]

  • Butterfly/Hypercube: α log(p) + (p-1)/p β m
  • Circulant graph/Fully connected: α log(p) + (p-1)/p β m

Allreduce:[Bearbeiten | Quelltext bearbeiten]

  • Butterfly/Hypercube: Trivial version only with Idempotent Operators
  • Smart version: Bitfiddling and maintaining two sums (with and without own result)

Scan/Exscan:[Bearbeiten | Quelltext bearbeiten]

  • Hillis-Steele (Hypercube i believe): log(p)(α + β m)

Reducescatter (Each process knows only one result of the reduced vector):[Bearbeiten | Quelltext bearbeiten]

  • Butterfly: log(p) α + (p-1)/p β m (Permutation needed such that each process gets correct part of reduced vector)

All to all[Bearbeiten | Quelltext bearbeiten]

  • Fully connected: (p-1) α + (p-1)/p β m (Different Variants, Each process sends to all “in order”, 1-factoring the graph (deg-1 components))
  • Hypercube (pipelined in some sense): log p * (α+β m/2): Send own data in chunks (half message sizes) to neighbors which further distribute it to correct processes.
  • Circulant graph (works for n != 2^i): log p * (α+β m/2) + O(m): Very similar to hypercube, execution a bit more tricky: Either rotate array before and after sending, or be smart with calculating strides and start positions of blocks during sending.