TU Wien:Dynamic Compilation Dynamische Übersetzer VU (Krall)/Summary 2022


• Cacao

  • Java VM based on a stack language
  • Cacao compiler - three passes
    • Determine basic blocks
    • Generate the static stack structure and register requirements
      • Each instruction contains the stack structure before and after its execution
      • Allows for copy elimination
    • Allocation of temporary registers and machine code generation
    • Easy because javac already colours local variables and stack variables have implicit live ranges
    • Variables that must survive across method invocations are allocated to callee-saved registers
    • Fixed register interface between basic blocks (stack depth is usually bounded)
  • Just-In-Time compilation during method invocation
  • Type checks at runtime using relative tree numbering
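
A minimal sketch of such a relative-numbering subtype test (class and field names are illustrative, not Cacao's actual data structures): a depth-first walk assigns each class an interval [low, high], and the runtime check is two comparisons.

```java
class ClassDesc {
    final ClassDesc[] subclasses;
    int low, high;                       // DFS number and largest DFS number in the subtree

    ClassDesc(ClassDesc... subclasses) { this.subclasses = subclasses; }
}

class RelativeNumbering {
    private int counter = 0;

    // Performed once per class hierarchy (and redone when new classes are loaded).
    void number(ClassDesc c) {
        c.low = counter++;
        for (ClassDesc sub : c.subclasses) number(sub);
        c.high = counter - 1;
    }

    // Constant-time runtime type check (e.g. for checkcast/instanceof).
    static boolean isSubtypeOf(ClassDesc c, ClassDesc superClass) {
        return superClass.low <= c.low && c.low <= superClass.high;
    }
}
```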

• Jalapeno JVM

  • Baseline compiler (simple, only slightly better than interpretation) plus an optimising compiler with three levels
    • Level 0: Constant, non-null and copy propagation, constant folding, arithmetic simplification, unreachable code elimination, etc.
    • Level 1: CSE, redundant load elimination, simple inlining, global flow analysis
    • Level 2: SSA-based flow-sensitive optimization
  • Java threads are run on dedicated AIX threads, yield points return control to scheduler
  • Compiler is invoked when
    • an unresolved reference leads to a class load,
    • executing code calls a method that has not been compiled yet, or
    • the adaptive optimisation system suggests recompiling a method at a higher optimisation level
  • Runtime Measurement Subsystem
    • Generates raw performance data (hardware performance monitors, memory usage, etc.)
    • e.g. the hot method listener records the currently executing method at each thread switch; the hot method organiser declares a method hot once it reaches a certain relative threshold
    • Data is processed by separate organisers
    • May instruct the Recompilation Subsystem to change the profiling strategy or apply more optimisations
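
A minimal sketch of this kind of sample-based hot-method detection (class names and the threshold handling are illustrative, not the actual Jalapeño code):

```java
import java.util.HashMap;
import java.util.Map;

class HotMethodProfiler {
    private final Map<String, Integer> samples = new HashMap<>();
    private int totalSamples = 0;
    private final double relativeThreshold;   // e.g. 0.01 = 1% of all samples

    HotMethodProfiler(double relativeThreshold) {
        this.relativeThreshold = relativeThreshold;
    }

    // Called by the "listener" at each thread switch (yield point).
    void recordSample(String currentMethod) {
        samples.merge(currentMethod, 1, Integer::sum);
        totalSamples++;
    }

    // Called periodically by the "organiser".
    boolean isHot(String method) {
        int count = samples.getOrDefault(method, 0);
        return totalSamples > 0 && (double) count / totalSamples >= relativeThreshold;
    }
}
```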

  • Recompilation Subsystem
    • Consumes a compilation plan: optimisation plan, profiling data and instrumentation plan
  • AOS database
    • Tracks decisions, compilations and other measurements to adapt future decisions
    • The controller decides whether it is profitable to recompile a method using a cost-benefit analysis
      • The controller considers the estimated time spent in the method if it is not recompiled, the time spent on compilation, and the time spent in the method if it is recompiled at a higher optimisation level
      • The controller expects the program to run twice as long as it has run so far: if 10 % of the time was spent in a certain method, the controller expects the same amount of time to be spent there in the future
      • Costs of recompilation are statically based on an offline-tuned database
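
A simplified sketch of such a cost-benefit decision, assuming the "program runs twice as long" estimate and offline-tuned compile costs (all names and numbers are placeholders, not the actual AOS model):

```java
class RecompilationController {
    // speedup[i] = relative speed of code compiled at optimisation level i
    private final double[] speedup;
    // compileCost[i] = estimated compile time per bytecode at level i (offline tuned)
    private final double[] compileCost;

    RecompilationController(double[] speedup, double[] compileCost) {
        this.speedup = speedup;
        this.compileCost = compileCost;
    }

    // Returns the best level to recompile at, or currentLevel if recompilation
    // is not expected to pay off.
    int chooseLevel(int currentLevel, double timeSpentInMethod, int bytecodeSize) {
        double futureTime = timeSpentInMethod;        // "runs as long again" assumption
        double bestTotal = futureTime;                // cost of doing nothing
        int bestLevel = currentLevel;
        for (int j = currentLevel + 1; j < speedup.length; j++) {
            double recompileCost = compileCost[j] * bytecodeSize;
            double futureAtJ = futureTime * speedup[currentLevel] / speedup[j];
            if (recompileCost + futureAtJ < bestTotal) {
                bestTotal = recompileCost + futureAtJ;
                bestLevel = j;
            }
        }
        return bestLevel;
    }
}
```
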
  • Feedback-directed inlining
    • Uses an edge listener to record caller, callee and call site
    • Once a certain threshold is reached, the method is inlined during the next recompilation
    • The threshold is more conservative in the beginning but decreases over time until a fixed value

• Java HotSpot Virtual Machine

  • Consists of an interpreter and Just-In-Time compilers
  • The VM can recompile a method during its invocation -> On-Stack Replacement
  • Deoptimisation in case assumptions leading to an optimisation are invalidated

  • HIR in SSA-form
    • Control-flow graph whose nodes are basic blocks
    • Instructions within a block are kept in a singly-linked list
    • Used for constant folding, value numbering, method inlining, etc.
  • HIR translated to LIR
    • Platform-dependent (but reuses many data structures from the HIR)
    • Operates on virtual and physical registers, memory and stack locations
    • Linear scan register allocation (instead of graph colouring as used by the server compiler)
  • Methods are statically inlined as long as they are below 35 bytes of bytecode
    • Only applied if the compiler can prove that (currently) only one implementation of a certain method exists -> optimistic inlining
    • If this assumption is later invalidated, the method is deoptimised again
    • Stack frames of compiled code and of the interpreter differ, so a virtual stack frame is constructed to populate the stack frame according to the interpreter's needs
  • Further improvements:

    • Object inlining - copy closely related child objects into the parent object
      • Saves indirect loads
    • Array Bounds Check Elimination

• Virtual Machines

  • Selective Optimizations
    • Requires a profiling mechanism, a decision-making component and a dynamic optimising compiler executing the decision
    • Profiling can occur either via counting or via sampling (see the sketch after this list)
      • Counting: each method invocation/loop iteration increments a counter
      • Sampling: execution is halted after a certain interval and the currently executing method is recorded
    • Partial compilation: only frequently executed basic blocks are compiled
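
A minimal sketch contrasting the two profiling styles (all names are illustrative): counting instruments every invocation, sampling records whatever method is running when a periodic tick fires.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class Profiler {
    private final ConcurrentHashMap<String, AtomicLong> invocationCounts = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, AtomicLong> sampleCounts = new ConcurrentHashMap<>();

    // Counting: called at every method entry (and similarly at loop back-edges).
    void countInvocation(String method) {
        invocationCounts.computeIfAbsent(method, m -> new AtomicLong()).incrementAndGet();
    }

    // Sampling: called from a timer interrupt / sampling thread every few ms
    // with the method that was executing when the sample was taken.
    void takeSample(String currentlyExecutingMethod) {
        sampleCounts.computeIfAbsent(currentlyExecutingMethod, m -> new AtomicLong()).incrementAndGet();
    }

    boolean exceedsCountThreshold(String method, long threshold) {
        AtomicLong c = invocationCounts.get(method);
        return c != null && c.get() >= threshold;
    }
}
```
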
  • Feedback-directed optimisation
    • Runtime service monitoring, hardware performance counters, sampling
    • Profiles unoptimised code
      • Profiling code is removed in the optimised code
      • Hence does not adapt to behavioural changes
  • Profile Stability

    • A change in the program's behaviour might have an effect on the optimisations used; phase shifting is an open problem according to the paper
  • Inlining

    • Cost-Benefit Model

  • Multiversioning
    • Generate multiple versions of a code segment
    • Example: optimistic inlining plus a fallback version in case the virtual call resolves to a different class (a sketch follows below)
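
A hand-written sketch of what such a guarded fast path with a fallback version could look like (the Shape/Circle classes and the guard are illustrative, not generated JIT code):

```java
interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

class MultiversionExample {
    // Optimised version: the virtual call shape.area() is speculatively inlined
    // for the expected receiver class, guarded by a cheap type check; the slow
    // path keeps the original virtual dispatch as the fallback version.
    static double areaOptimised(Shape shape) {
        if (shape instanceof Circle) {              // guard on the expected class
            Circle c = (Circle) shape;
            return Math.PI * c.r * c.r;             // inlined fast path
        }
        return shape.area();                        // fallback: generic virtual call
    }
}
```
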
  • Memory

    • Object splitting: separates an object into a hot and a cold area; the cold area is only reachable indirectly via a pointer (see the sketch below)
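
A small illustrative sketch of object splitting (the class and its fields are made up):

```java
class Customer {
    // hot part: accessed on every request, stays small and cache-friendly
    int id;
    long lastAccess;
    ColdData cold;          // pointer to the rarely used fields

    static class ColdData { // cold part: only touched on the slow path
        String billingAddress;
        byte[] profilePicture;
    }
}
```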

  • Adaptive Inlining and OSR
    • Code repository - manages generated machine code
    • Method database - assumptions and properties about methods
    • Approach

      • Countdown traps for method invocations and loops
      • If a countdown trap is triggered, the code is recompiled with instrumentation support
        • Includes execution counts for each basic block
      • If the instrumented code triggers recompilation, the compiler produces an optimised version without instrumentation code (see the sketch below)
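
A simplified sketch of this staged recompilation with countdown traps (counter values and class names are illustrative):

```java
class MethodState {
    int countdown = 1000;            // countdown trap for invocations/loop back-edges
    int[] blockExecCounts;           // filled in only by the instrumented version
    boolean instrumented = false;
    boolean optimised = false;
}

class AdaptiveRecompiler {
    // Called at every invocation / loop back-edge of the baseline version.
    void onCountdownTick(MethodState m) {
        if (!m.instrumented && --m.countdown == 0) {
            recompileWithInstrumentation(m);              // stage 1 -> stage 2
        }
    }

    // Called by the instrumented version once its own counters run out.
    void onInstrumentedTrigger(MethodState m) {
        if (!m.optimised) {
            recompileOptimised(m, m.blockExecCounts);     // stage 2 -> stage 3, no instrumentation
        }
    }

    private void recompileWithInstrumentation(MethodState m) { m.instrumented = true; /* ... */ }
    private void recompileOptimised(MethodState m, int[] profile) { m.optimised = true; /* ... */ }
}
```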

    • Inlining Mechanism
      • Aggressive depth-first
        • If a valid call site is found, the method is inlined until a certain depth or code size is reached

      • Aggressive breadth-first
        • Inlines all calls in the root method
        • Continues until a certain depth has been reached
      • Knapsack heuristic (see the sketch below)
        • Has a certain inlining budget
        • Picks the call sites in the root method with the highest cost-benefit ratio
        • Call sites of an inlined function are added to the call-site list
        • Calculating the costs and benefits is the largest problem
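
A sketch of such a greedy knapsack-style inliner (the CallSite fields and the benefit/cost estimates are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class KnapsackInliner {
    static class CallSite {
        double benefit;              // e.g. estimated execution frequency
        int cost;                    // e.g. callee size in bytecodes
        List<CallSite> calleeSites;  // call sites inside the callee
        CallSite(double b, int c, List<CallSite> inner) { benefit = b; cost = c; calleeSites = inner; }
        double ratio() { return benefit / cost; }
    }

    // Greedily inline the best-ratio call sites until the budget is exhausted.
    List<CallSite> selectForInlining(List<CallSite> rootSites, int budget) {
        PriorityQueue<CallSite> worklist =
            new PriorityQueue<>((a, b) -> Double.compare(b.ratio(), a.ratio()));
        worklist.addAll(rootSites);
        List<CallSite> inlined = new ArrayList<>();
        while (!worklist.isEmpty() && budget > 0) {
            CallSite best = worklist.poll();
            if (best.cost > budget) continue;            // does not fit into the remaining budget
            inlined.add(best);
            budget -= best.cost;
            worklist.addAll(best.calleeSites);           // newly exposed call sites become candidates
        }
        return inlined;
    }
}
```
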
    • OSR
      • Execution state - stack frame as well as register values of the optimised code
      • Source state - state independent from any optimisation
      • Replacement point - location in the code where the execution state can be used to reconstruct the source state
  • Escape Analysis
    • Allocate objects on the stack instead of the heap
    • Requires that the object does not outlive its creation site (i.e. it is not returned by the function or stored elsewhere)
    • Static escape analysis assumes all paths can be taken; a dynamic approach considers the actual execution
    • Objects are split into global (heap) and local (stack-frame) regions
    • If an object in a small region is reachable from another object in a larger region, the object is promoted to the larger region.
      • If reachable from a global variable, thrown as an exception or passed to a native function => Global
        • Native functions: a hardcoded table lists where the object does not escape
    • Cacao implementation
      • Uses the control-flow and exception control-flow graph
      • Call-stack analysis is bounded
      • States describe the object's escape state and define an ordering from None to Global:
      • None: Only accessible in its creating method. Can be replaced by scalars
      • Method: Does not escape thread because the object is only passed to its callee. Can be allocated on the stack and does not require synchronisation specific code.
      • Method_return: Object is returned to its caller. This case is not handled by Cacao.
      • Global: see above; no optimisations implemented
      • Variables are grouped into equi-escape sets (EES, Steensgaard-style analysis); see the sketch below
        • Sets are adjusted at copy or phi functions
        • Models a unidirectional relationship between objects; dependency lists allow lookups in both directions
      • The effect of the algorithm improves with method inlining
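
A sketch of equi-escape sets maintained with union-find, following the None-to-Global ordering above (names are illustrative, not Cacao's implementation):

```java
import java.util.HashMap;
import java.util.Map;

class EquiEscapeSets {
    enum EscapeState { NONE, METHOD, METHOD_RETURN, GLOBAL }

    private final Map<String, String> parent = new HashMap<>();        // union-find forest over variables
    private final Map<String, EscapeState> state = new HashMap<>();    // escape state per set representative

    String find(String v) {
        parent.putIfAbsent(v, v);
        state.putIfAbsent(v, EscapeState.NONE);
        String p = parent.get(v);
        if (!p.equals(v)) { p = find(p); parent.put(v, p); }            // path compression
        return p;
    }

    // Called at copy and phi instructions: both variables now share one set,
    // and the merged set takes the worse (higher) of the two escape states.
    void merge(String a, String b) {
        String ra = find(a), rb = find(b);
        if (ra.equals(rb)) return;
        EscapeState worse = state.get(ra).ordinal() >= state.get(rb).ordinal() ? state.get(ra) : state.get(rb);
        parent.put(ra, rb);
        state.put(rb, worse);
    }

    // Called when a variable escapes further, e.g. it is stored into a static field.
    void raise(String v, EscapeState s) {
        String r = find(v);
        if (s.ordinal() > state.get(r).ordinal()) state.put(r, s);
    }
}
```
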
  • Linear Scan
    • Fast Single-Pass Register Allocation Algorithm compared to Bin Packing and Graph Coloring
    • Tracks live ranges of variables
    • Algorithm iterates over start and end points of live ranges
      • At a start point, make the variable active (= put it in a register)
        • If no register is available, spill the variable with the longest remaining live range to memory
      • At an end point, let the variable expire
    • Original algorithm does not consider lifetime holes
    • Original algorithm cannot put same variable in different registers
    • Preallocation of registers required in case certain instructions expect specific registers
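
A minimal sketch of the original linear-scan algorithm as described above (data structures simplified, no lifetime holes or fixed registers):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LinearScan {
    static class Interval {
        int start, end;          // live range [start, end]
        int reg = -1;            // assigned register, -1 = spilled
        Interval(int s, int e) { start = s; end = e; }
    }

    void allocate(List<Interval> intervals, int numRegs) {
        intervals.sort(Comparator.comparingInt((Interval i) -> i.start));   // visit intervals by start point
        List<Interval> active = new ArrayList<>();                          // kept sorted by increasing end point
        boolean[] regUsed = new boolean[numRegs];

        for (Interval cur : intervals) {
            // Expire old intervals whose end point lies before the current start point.
            active.removeIf(a -> { if (a.end < cur.start) { regUsed[a.reg] = false; return true; } return false; });

            int free = freeRegister(regUsed);
            if (free >= 0) {
                cur.reg = free;                                    // make the variable active
                regUsed[free] = true;
                insertSortedByEnd(active, cur);
            } else {
                Interval last = active.get(active.size() - 1);     // active interval ending last
                if (last.end > cur.end) {                          // spill it and steal its register
                    cur.reg = last.reg;
                    last.reg = -1;
                    active.remove(active.size() - 1);
                    insertSortedByEnd(active, cur);
                }                                                  // else: spill the current interval (reg stays -1)
            }
        }
    }

    private static int freeRegister(boolean[] used) {
        for (int r = 0; r < used.length; r++) if (!used[r]) return r;
        return -1;
    }

    private static void insertSortedByEnd(List<Interval> active, Interval iv) {
        int i = 0;
        while (i < active.size() && active.get(i).end <= iv.end) i++;
        active.add(i, iv);
    }
}
```
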
  • Second-Chance Binpacking

    • Spill decisions only affect future allocations of a variable
    • If a spilled variable is used again, another variable is evicted and the spilled variable gets its register; it then remains in that register until a higher-priority candidate evicts it again -> lifetime splitting
    • Since a variable might end up in different registers, there are potential conflicts between basic blocks (resolved as sketched below):
    • Case 1 (variable in register at end of block but in memory at beginning of successor block): Insert store instruction
    • Case 2 (variable in memory at end of block but in register at beginning of successor block): Insert load instruction
    • Case 3 (variable in two different registers): Insert move instruction
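
A sketch of the resolution step implied by the three cases above (the Location type and emitter methods are illustrative):

```java
class Resolver {
    enum Kind { REGISTER, MEMORY }

    static class Location {
        Kind kind; int index;          // register number or stack slot
        Location(Kind k, int i) { kind = k; index = i; }
    }

    // Called for every variable live across the edge from a block to its successor.
    void resolveEdge(Location atEndOfPred, Location atStartOfSucc) {
        if (atEndOfPred.kind == Kind.REGISTER && atStartOfSucc.kind == Kind.MEMORY) {
            emitStore(atEndOfPred.index, atStartOfSucc.index);   // case 1: register -> memory
        } else if (atEndOfPred.kind == Kind.MEMORY && atStartOfSucc.kind == Kind.REGISTER) {
            emitLoad(atEndOfPred.index, atStartOfSucc.index);    // case 2: memory -> register
        } else if (atEndOfPred.kind == Kind.REGISTER && atStartOfSucc.kind == Kind.REGISTER
                   && atEndOfPred.index != atStartOfSucc.index) {
            emitMove(atEndOfPred.index, atStartOfSucc.index);    // case 3: register -> other register
        }                                                        // otherwise: locations already agree
    }

    void emitStore(int reg, int slot) { /* store reg into stack slot */ }
    void emitLoad(int slot, int reg)  { /* load stack slot into reg */ }
    void emitMove(int from, int to)   { /* register-to-register move */ }
}
```
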
  • For calling conventions, try to move register into available callee-saved register (Early second chance)
  • Move coalescing via peephole optimisation
  • Optimised Interval Splitting
    • Includes use information telling the allocator whether a variable must necessarily reside in a register; this allows memory operands, as they exist on x86, to be modelled
    • The algorithm handles each lifetime interval separately; all other intervals fall into one of four sets:
      • Unhandled: start after the start position of the current interval
      • Active: cover the start position and have a physical register assigned
      • Inactive: start before and end after the start of the current interval, but do not cover it because of a lifetime hole
      • Handled: end before the current interval or are spilled to memory
    • Spilling
      • Spill the active or inactive interval with the furthest use position
      • Spill positions are optimised in the presence of loops and control flow
    • Register hints are used to allow coalescing of two lifetime intervals
    • Spill store elimination if it can be proven that the value has not changed
  • Dynamo
    • Binary-to-binary JIT
    • Interprets code until a hot trace is found (see the sketch below)
    • The hot trace is compiled into an optimised fragment
    • A trace begins and ends at a backward branch (except if the edge points to an already existing fragment)
    • All branches are flattened; additional exit conditions might be required
    • Allows for optimisations such as the elimination of full redundancies
    • Fragment linking
      • Fragments are directly linked to other existing fragments to circumvent the interpreter
    • The fragment cache is cleared if a spike in hot-trace detection occurs
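
A sketch of Dynamo-style hot-trace selection with counters at backward-branch targets (threshold and data structures are illustrative, not Dynamo's actual implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TraceSelector {
    private static final int HOT_THRESHOLD = 50;
    private final Map<Long, Integer> counters = new HashMap<>();   // backward-branch target -> count
    private final Map<Long, long[]> fragments = new HashMap<>();   // trace head -> compiled fragment (as PCs)

    private boolean recording = false;
    private long traceHead;
    private final List<Long> currentTrace = new ArrayList<>();

    // Called by the interpreter whenever a backward branch to 'target' is taken.
    void onBackwardBranch(long target) {
        if (recording) {                                    // trace ends at the next backward branch
            fragments.put(traceHead, compile(currentTrace));
            recording = false;
            return;
        }
        if (fragments.containsKey(target)) return;          // fragment exists: jump into it instead
        int c = counters.merge(target, 1, Integer::sum);
        if (c >= HOT_THRESHOLD) {                           // target became hot: start recording
            recording = true;
            traceHead = target;
            currentTrace.clear();
        }
    }

    // Called by the interpreter for every executed instruction while recording.
    void onInstruction(long pc) {
        if (recording) currentTrace.add(pc);
    }

    private long[] compile(List<Long> trace) {
        return trace.stream().mapToLong(Long::longValue).toArray();   // stand-in for real code generation
    }
}
```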

  • DynamoRIO
    • No interpreter used

    • Uses basic blocks as the base unit
    • Multiple basic blocks can be combined into a trace
    • Provides an API to hook into Dynamo routines, allowing for custom optimisations
      • Examples: redundant load removal, strength reduction, indirect branch dispatch, etc.

  • Linear Scan on SSA
    • When resolving phi functions in the LIR, move instructions in an arbitrary order are required -> lifetime analysis does not know about this
    • Phi functions represent parallel moves
    • Linear Scan may decide which interval to work on
    • Lifetime holes can be eliminated
  • PyPy
    • Provides a tracing JIT compiler for a subset of Python
      • The subset must allow type inference
    • Focus on optimising interpreter code
    • Similar to Dynamo traces: most parts are interpreted, but hot loops are aggressively compiled
      • New feature: loops might contain conditional branches, which are not handled by classical traces
      • The programmer can provide hints to help the JIT understand the code structure
      • Green variables represent position keys
        • If the position key has a certain fixed value, the trace continues
        • Identifying green variables also helps the JIT to aggressively optimise the code
      • Escape analysis and redundant guard elimination

  • JIT for dynamic languages
    • Dynamic typing is difficult for a JIT
    • Hot loops can be compiled by speculating on the types (see the sketch after this block)
    • Nested loops are problematic
      • Solved by recording nested trace trees
      • The inner loop is traced first
      • If the outer loop is reached and the inner loop's trace tree type-matches, the inner loop trace is integrated into the outer loop's trace
    • Side exits also traceable -> Useful for branches inside a loop
    • Type map states what types for the used variables are expected in the trace
    • Uses SSA LIR
    • The trace record also includes the object representation
      • Simplifies access to property maps
      • Guards ensure that the property map remains the same
    • JavaScript numbers
      • TraceMonkey prefers integers as much as possible
      • If an integer unexpectedly turns into a double, this is recorded by an oracle that is responsible for tracking whether a variable is type-stable
    • TraceMonkey uses trace trees
      • Side exits can be patched later if they are found to be hot
      • Blacklisting is used for traces that repeatedly fail recording
    • Uses a simple greedy register allocation
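
A sketch of type speculation with a guard and side exit, in the spirit of such trace JITs (the Value/Tag representation and the hand-written trace body are assumptions for illustration):

```java
class TraceTypeGuard {
    enum Tag { INT, DOUBLE, OBJECT }

    static class Value {
        Tag tag; long intVal; double dblVal;
    }

    static class SideExit extends RuntimeException {}   // transfers control back to the interpreter

    // Compiled trace body for "sum += a[i]" where the type map recorded INT for a[i].
    static long tracedAdd(long sum, Value elem) {
        if (elem.tag != Tag.INT) {
            throw new SideExit();        // guard failed: leave the trace, interpreter takes over
        }
        return sum + elem.intVal;        // fast path: unboxed integer arithmetic
    }
}
```
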
  • HotPath-VM
    • Designed for memory-constrained devices
    • Differences to the Dynamo approach
      • The hash table for hotness counters is limited in size; collisions are deliberately accepted
      • Guards for control paths and virtual calls
      • Traces are aborted if the code path hits an exception or a native function call
    • Code generation
      • Stack deconstruction: convert the Java stack into SSA form; stack indices are tracked via a renaming table
      • Code analysis: loop-invariant code motion, CSE
        • Loop variables receive fixed registers
        • Live-range analysis
      • Code generation: both register allocation and code generation run backwards
    • Loops with memory allocation are ignored
    • A side exit might lead to child trace generation
      • Registers are allocated according to the parent trace
  • Retrofitting a method-based JIT
    • A trace is aborted when a native call is encountered, except for a couple of special functions
    • Identified that dead store elimination is unsound in trace JITs
      • Solved by introducing a helper function analysing the entire method
    • Only one optimisation level
    • The hash lookup for hotness counters uses a shadow array
  • Generalised Just-In-Time Trace Compilation

    • Implements parallel JIT compilation (see the sketch below)
    • The interpreter continues while a farm of JIT threads generates code
      • Traces are scheduled according to their 'freshness'
    • Lock-free implementation
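
A sketch of such a parallel setup with a lock-free work queue and a farm of JIT threads (the freshness policy and all classes are illustrative):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

class ParallelTraceJit {
    static class Trace { long[] pcs; volatile long[] compiledCode; }

    private final ConcurrentLinkedDeque<Trace> queue = new ConcurrentLinkedDeque<>();

    // Interpreter side: non-blocking submission, then execution continues interpreted.
    void submitHotTrace(Trace t) {
        queue.addFirst(t);                 // newest ("freshest") traces go to the front
    }

    // JIT worker threads: repeatedly take the freshest pending trace and compile it.
    void startWorkers(int n) {
        for (int i = 0; i < n; i++) {
            Thread worker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    Trace t = queue.pollFirst();   // lock-free dequeue of the freshest trace
                    if (t == null) { Thread.onSpinWait(); continue; }
                    t.compiledCode = compile(t);   // interpreter picks this up on its next visit
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    private long[] compile(Trace t) { return t.pcs.clone(); } // stand-in for real code generation
}
```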

  • Instruction Set Simulation
    • Mixed-mode interpreter and JIT
    • Basic blocks are compiled; multiple blocks are combined into a region