TU Wien:Dynamic Compilation Dynamische Übersetzer VU (Krall)/Summary 2022
• Cacao
- JavaVM based on a stack language (JVM bytecode)
- Cacao compiler - Three passes
• Determine basic blocks
• Generate static stack structure and register requirements
• Each instruction contains the stack structure before and after its execution
• Allows for copy elimination
• Allocation of temporary registers and machine code generation
- Easy because javac already colours local variables and stack variables have implicit live ranges
- Variables that must survive across method invocations are allocated to callee-saved registers
• Fixed register interface between basic blocks (stack depth often bounded)
- Just-In-Time compilation during method invocation
- Type checks at runtime using relative tree numbering
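A minimal sketch of how such a relative-numbering subtype test can work; the field names, the DFS helper and the range check are illustrative assumptions, not Cacao's actual code:

```java
import java.util.ArrayList;
import java.util.List;

class TypeNode {
    final List<TypeNode> subclasses = new ArrayList<>();
    int baseval;   // pre-order number assigned by a DFS over the class tree
    int diffval;   // size of the subtree minus one
}

class RelativeNumbering {
    // Number the tree so that every subtree occupies a contiguous range.
    static int number(TypeNode n, int next) {
        n.baseval = next++;
        for (TypeNode s : n.subclasses) {
            next = number(s, next);
        }
        n.diffval = next - n.baseval - 1;
        return next;
    }

    // Constant-time test: sub <: sup iff sub's number lies in sup's range.
    static boolean isSubtypeOf(TypeNode sub, TypeNode sup) {
        int d = sub.baseval - sup.baseval;
        return d >= 0 && d <= sup.diffval;
    }
}
```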
• Jalapeno JVM
- No interpreter, only a baseline compiler for a simple, fast translation; the optimising compiler provides three levels:
- Level 0: Constant, non-null and copy propagation, constant folding, arithmetic simplification, unreachable code elimination, etc.
- Level 1: CSE, redundant load elimination, simple inlining, global flow analysis
- Level 2: SSA-based flow-sensitive optimization
- Java threads are multiplexed onto AIX pthreads (virtual processors); yield points return control to the scheduler
- Compiler is invoked when
• An unresolved reference leads to a class load
• Executing code calls a function that is not yet compiled
• The adaptive optimisation system suggests recompiling a method at a higher optimisation level
- Runtime Measurement Subsystems
• Generate raw performance data (hardware performance monitors, memory usage, etc.)
• e.g. the hot method listener tracks the current method at each thread switch. The hot method organiser determines a method to be hot if it reaches a certain relative threshold.
- Processed by separate organisers
- May instruct Recompilation Subsystem to change profiling strategy or apply more optimisations
• Recompilation Subsystem
• Consumes compilation plan: Optimisation plan, profiling data and instrumentation plan
- AOS database
- Tracks decisions, compilation and other measurements to adapt future decisions
- Controller decides whether it is profitable to recompile a method using a cost-benefit analysis (sketched below).
- Controller compares the estimated time spent in the method if it is not recompiled against the compilation time plus the estimated time spent in the method if it is recompiled at a higher optimisation level.
- Controller expects the program to run for as long again as it has run so far. If 10 % of the time was spent in a certain method, the controller expects the same amount of time to be spent there in the future as well.
- Costs of recompilation are statically based on an offline tuned database.
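A minimal sketch of this cost-benefit test; the method and parameter names are hypothetical, and the placeholder numbers stand in for the offline-tuned cost model:

```java
class RecompileController {
    // Heuristic from above: expected future program time == time spent so far.
    static boolean shouldRecompile(double secondsSoFar,
                                   double methodFraction,   // share of samples in this method
                                   double speedup,          // speed of level j relative to current level (> 1)
                                   double compileSeconds) { // taken from an offline-tuned cost model
        double futureTimeCurrent = methodFraction * secondsSoFar;    // keep current code
        double futureTimeRecompiled = futureTimeCurrent / speedup;   // run at level j instead
        return compileSeconds + futureTimeRecompiled < futureTimeCurrent;
    }

    public static void main(String[] args) {
        // Program ran 60 s, 10 % of samples in the method, level-j code assumed
        // 2x faster, compilation estimated at 1 s: 1 + 3 < 6 -> recompile.
        System.out.println(shouldRecompile(60.0, 0.10, 2.0, 1.0));
    }
}
```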
- Feedback-directed inlining
• Uses an edge listener to record callee, caller and call-site
• Once a certain threshold is reached, the method is inlined during the next recompilation phase
• Threshold is more conservative in the beginning but decreases over time until it reaches a fixed value
• Java HotSpot Virtual Machine
• Comprises an interpreter and Just-In-Time compilers
• VM can recompile method during invocation -> On-Stack Replacement
• Deoptimization in case assumptions leading to optimisation are invalidated
- HIR in SSA-form
- Control-flow graph with singly-linked lists of instructions
• Each node represents a basic block
- Used for constant-folding, value numbering, method inlining, etc.
- HIR translated to LIR
• Platform-dependent (but reuses lots of data structures from the HIR)
• Operates on virtual and physical registers, memory and stack locations
• Linear Scan for register allocation (instead of graph colouring as used by the server compiler)
- Methods are statically inlined as long as they are below 35 bytes of bytecode
• Virtual calls are only inlined if the compiler can show that currently only one implementation of the method exists -> Optimistic inlining
• If this assumption is later invalidated, the method is deoptimised
• Stack frames of compiled code and the interpreter differ. Thus, a virtual stack frame is constructed to populate the stack frame according to the interpreter's needs
• Further improvements:
- Object inlining - Copies closely related child objects into the parent object • Saves indirect loads
- Array Bounds Check Elimination
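An illustrative example of the kind of transformation bounds-check elimination enables; the second method shows conceptually what the compiler derives, it is not HotSpot's actual output:

```java
class BoundsCheckExample {
    // Naive form: every a[i] carries an implicit check 0 <= i < a.length.
    static int sum(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    // Conceptually what the JIT derives: one range test up front, then the
    // loop body runs without a per-iteration check in the generated code.
    static int sumChecksHoisted(int[] a, int n) {
        if (n > a.length) throw new ArrayIndexOutOfBoundsException(a.length);
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i]; // here 0 <= i < n <= a.length is already known
        }
        return s;
    }
}
```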
• Virtual Machines
- Selective Optimizations
- Requires a profiling mechanism, a decision-making component and a dynamic optimising compiler executing the decision
- Profiling can either occur via counting or sampling
• Counting: Each method invocation/loop iteration increases a counter
• Sampling: Execution is halted after a certain interval and the currently executed method is recorded
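A minimal sketch of the counting variant, assuming a bytecode interpreter; the method id, threshold and class name are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class HotnessCounters {
    private static final int THRESHOLD = 10_000;  // invented value
    private final Map<String, Integer> counters = new HashMap<>();

    // Called by the interpreter on every method entry and loop back-edge;
    // returns true exactly once, when the method crosses the threshold.
    boolean bumpAndCheck(String methodId) {
        return counters.merge(methodId, 1, Integer::sum) == THRESHOLD;
    }
}
```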
- Partial Compilation • Only frequently executed basic blocks are compiled
- Feedback-directed optimisation
• Runtime service monitoring, hardware performance counters, sampling
• Profiling unoptimised code
• Profiling code is removed in optimised code
• Does not adapt to behavioural changes
- Profile Stability
• A change in the program's behaviour might have an effect on the optimisations used. Phase shifting is an open problem according to the paper.
• Inlining
• Cost-Benefit Model
- Multiversioning
• Generate multiple versions of a code segment
• Example: Optimistic inlining - generate a fallback version in case the virtual call resolves to another class
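A hedged sketch of such multiversioning for optimistic inlining; the classes and the shape of the guard are invented for illustration:

```java
abstract class Shape { abstract double area(); }

final class Circle extends Shape {
    double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

class MultiVersionExample {
    // What the compiler conceptually emits after speculating on Circle:
    static double areaSpecialized(Shape s) {
        if (s.getClass() == Circle.class) {   // guard protecting the fast version
            Circle c = (Circle) s;
            return Math.PI * c.r * c.r;       // inlined fast version
        }
        return s.area();                      // fallback version: original virtual call
    }
}
```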
- Memory
• Object splitting: Separates object into hot and cold areas. Cold area is indirectly accessible via pointer
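An illustrative sketch of object splitting; the field split and the lazy allocation are assumptions, not a specific VM's layout:

```java
class SplitObject {
    // hot part: accessed on every use, kept inline in the object
    int x, y;

    // cold part: reached through one extra indirection only when needed
    static final class Cold {
        String debugName;
        long creationTimestamp;
    }
    Cold cold; // allocated lazily

    Cold cold() {
        if (cold == null) cold = new Cold();
        return cold;
    }
}
```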
• Adaptive Inlining and OSR
• Code Repository - Manages generated machine code
• Method database - Assumptions and properties about methods
• Approach
• Countdown traps for method invocations and loops
• If countdown is triggered, code is recompiled with instrumentation support
• Includes execution counts for each basic block
• If instrumented code triggers recompilation, the compiler produces an optimised version without instrumentation code
• Inlining Mechanism
• Aggressive Depth-First
• If a valid call-site is found, inlines method until a certain depth or code size is reached.
- Aggressive Breadth-First
• Inlines all calls in the root method
• Continues until a certain depth has been reached
- Knapsack heuristic
• Has a certain inlining budget
• Picks the call sites in the root node with the highest cost-benefit ratio
• Call sites of an inlined function are added to the call-site list
• Calculating the costs and benefits is the largest problem
• OSR
- Execution State - Stack frame as well as register values of the optimised code
- Source State - State independent from any optimisation
- Replacement Point - Location in code where execution state can be used to reconstruct source state
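A minimal sketch, with hypothetical types and location strings, of how a replacement point can map the execution state back to the source state:

```java
class OsrDescriptor {
    final int bytecodeIndex;        // replacement point in the source program
    final String[] locationOfLocal; // locationOfLocal[i] = where local i lives, e.g. "rbx" or "[sp+16]"

    OsrDescriptor(int bytecodeIndex, String[] locationOfLocal) {
        this.bytecodeIndex = bytecodeIndex;
        this.locationOfLocal = locationOfLocal;
    }

    interface FrameReader {          // reads registers/stack slots of the optimised frame
        long read(String location);
    }

    // Builds the source state (interpreter locals) from the execution state.
    long[] reconstructSourceState(FrameReader optimisedFrame) {
        long[] locals = new long[locationOfLocal.length];
        for (int i = 0; i < locals.length; i++) {
            locals[i] = optimisedFrame.read(locationOfLocal[i]);
        }
        return locals;
    }
}
```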
- Escape Analysis
• Allocate objects on the stack instead of the heap
• Requires that the object does not outlive its creation site (e.g. by being returned from the function or stored elsewhere)
• Static escape analysis assumes all paths can be taken
• A dynamic approach considers the actual execution
• Objects are split into global (heap) and local (stack-frame) regions
- If an object in a small region is reachable from another object in a larger region, the object is promoted to the larger region.
- If reachable from a global variable, thrown as an exception or passed to a native function => Global
• Native functions: a hardcoded table lists the functions from which the object does not escape
• Cacao implementation
• Uses the control-flow and exception control-flow graph
• Call-stack analysis is bounded
• States describe an object's escape state and define an ordering from None to Global
- None: Only accessible in its creating method. Can be replaced by scalars
- Method: Does not escape thread because the object is only passed to its callee. Can be allocated on the stack and does not require synchronisation specific code.
- Method_return: Object is returned to its caller. This case is not handled by Cacao.
- Global: See above. No optimisations implemented.
• Variables are grouped into equi-escape sets (EES) (Steensgaard-style analysis, sketched below)
• Sets are adjusted at copy or phi functions
• Models a unidirectional relationship between objects
• Dependency lists allow lookups in both directions
• The effect of the algorithm improves with method inlining
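A hedged union-find sketch of equi-escape sets as described above; the method names and lattice encoding are assumptions:

```java
class EquiEscapeSets {
    enum Escape { NONE, METHOD, METHOD_RETURN, GLOBAL }  // ordering from None to Global

    private final int[] parent;
    private final Escape[] state;

    EquiEscapeSets(int numVars) {
        parent = new int[numVars];
        state = new Escape[numVars];
        for (int i = 0; i < numVars; i++) { parent[i] = i; state[i] = Escape.NONE; }
    }

    int find(int v) {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }

    // Called for copies and phi functions: a and b must end up in one EES.
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        parent[ra] = rb;
        // the merged set escapes as far as its worst member
        if (state[ra].ordinal() > state[rb].ordinal()) state[rb] = state[ra];
    }

    // Called when an operation (return, store to a global, native call, ...)
    // forces a variable to at least the given escape state.
    void raise(int v, Escape e) {
        int r = find(v);
        if (e.ordinal() > state[r].ordinal()) state[r] = e;
    }

    Escape stateOf(int v) { return state[find(v)]; }
}
```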
- Linear Scan
- Fast Single-Pass Register Allocation Algorithm compared to Bin Packing and Graph Coloring
- Tracks live ranges of variables
- Algorithm iterates over start and end points of live ranges
- At a start point, make the variable active (= assign it a register)
• If no register is available, spill the variable whose live interval ends last to memory
- At end point, let variable expire
- Original algorithm does not consider lifetime holes
- Original algorithm cannot put same variable in different registers
- Preallocation of registers required in case certain instructions expect specific registers
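A compact sketch of this original linear-scan algorithm; the interval representation and register count are assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinearScan {
    static class Interval {
        final int start, end;     // first and last use position of the variable
        int reg = -1;             // assigned register, -1 once spilled
        boolean spilled = false;
        Interval(int s, int e) { start = s; end = e; }
    }

    static void allocate(List<Interval> intervals, int numRegs) {
        intervals.sort((a, b) -> Integer.compare(a.start, b.start));
        List<Interval> active = new ArrayList<>();   // kept sorted by increasing end point
        boolean[] free = new boolean[numRegs];
        Arrays.fill(free, true);

        for (Interval cur : intervals) {
            // expire: free the registers of intervals that ended before cur starts
            active.removeIf(a -> {
                if (a.end < cur.start) { free[a.reg] = true; return true; }
                return false;
            });
            int r = firstFree(free);
            if (r >= 0) {
                cur.reg = r;
                free[r] = false;
                active.add(cur);
            } else {
                // no free register: spill the interval that ends last
                Interval last = active.get(active.size() - 1);
                if (last.end > cur.end) {
                    cur.reg = last.reg;          // steal its register
                    last.reg = -1;
                    last.spilled = true;
                    active.remove(last);
                    active.add(cur);
                } else {
                    cur.spilled = true;
                }
            }
            active.sort((a, b) -> Integer.compare(a.end, b.end));
        }
    }

    private static int firstFree(boolean[] free) {
        for (int i = 0; i < free.length; i++) if (free[i]) return i;
        return -1;
    }
}
```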
- Second-Chance Binpacking
• Spill decisions only affect future allocations of a variable
• If a spilled variable is used again, another variable is evicted and the spilled variable moves back into a register. It then remains in that register until it is evicted again by a higher-priority variable -> Lifetime splitting
- Since a variable may end up in different locations in different blocks, conflicts can arise at basic-block boundaries (resolved as sketched below)
- Case 1 (variable in register at end of block but in memory at beginning of successor block): Insert store instruction
- Case 2 (variable in memory at end of block but in register at beginning of successor block): Insert load instruction
- Case 3 (variable in two different registers): Insert move instruction
- For calling conventions, try to move register into available callee-saved register (Early second chance)
- Move coalescing via peephole optimisation
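A hedged sketch of the three resolution cases listed above; the location model and emit API are hypothetical:

```java
class EdgeResolver {
    enum Kind { REG, MEM }
    record Location(Kind kind, int regOrSlot) {}

    interface CodeBuffer {
        void emitStore(int reg, int slot);     // case 1: register at exit, memory at entry
        void emitLoad(int slot, int reg);      // case 2: memory at exit, register at entry
        void emitMove(int fromReg, int toReg); // case 3: two different registers
    }

    static void resolve(Location atExit, Location atEntry, CodeBuffer onEdge) {
        if (atExit.equals(atEntry)) return;    // same location: nothing to do
        if (atExit.kind() == Kind.REG && atEntry.kind() == Kind.MEM)
            onEdge.emitStore(atExit.regOrSlot(), atEntry.regOrSlot());
        else if (atExit.kind() == Kind.MEM && atEntry.kind() == Kind.REG)
            onEdge.emitLoad(atExit.regOrSlot(), atEntry.regOrSlot());
        else if (atExit.kind() == Kind.REG && atEntry.kind() == Kind.REG)
            onEdge.emitMove(atExit.regOrSlot(), atEntry.regOrSlot());
        // memory-to-memory with different slots would need a temp; omitted here
    }
}
```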
- Optimised Interval Splitting
- Includes use information telling the allocator whether a variable must be kept in a register at a given use. This allows modelling memory operands as they exist on x86
- The algorithm handles each lifetime interval separately. All other intervals fall into four sets:
• Unhandled: Start after the start position of the current interval
• Active: Cover the start position and have a physical register assigned
• Inactive: Start before and end after the start of the current interval, but do not cover it because of a lifetime hole
• Handled: End before the current interval or are spilled to memory
- Spilling
• Spill the active or inactive interval whose next use position is furthest away
• Spill positions are optimised in the presence of loops and control flow
- Register hints are used to allow coalescing of two lifetime intervals
- Spill Store Elimination in case it can be proven that value has not changed
- Dynamo
• Binary-to-binary JIT
• Interprets code until a hot trace is found
• The hot trace is compiled into an optimised fragment
• A trace begins and ends at a backward branch (except if the edge points to an already existing fragment)
• All branches are flattened; additional exit conditions might be required
• Allows for further optimisations such as removal of full redundancies
• Fragment Linking
• Fragments are directly linked to other existing fragments to circumvent the interpreter
• The fragment cache is cleared if a spike in hot-trace creation is detected
• DynamoRIO
• No interpreter used
• Uses basic blocks as base unit
• Multiple basic blocks can be combined into a trace
• Provides API to hook into Dynamo routines allowing for custom optimisations
• Example optimisations: redundant load removal, strength reduction, indirect branch dispatch, etc.
- Linear Scan on SSA
- When resolving phi functions in the LIR, move instructions in an arbitrary order are required
• -> Lifetime analysis does not know about this
- Phi functions represent parallel moves (see the sketch below)
- Linear Scan may decide which interval to work on
- Lifetime holes can be eliminated
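A minimal sketch (hypothetical register names) of treating phi functions as parallel moves: emit a move only when its destination is no longer needed as a source, and break cycles with a temporary:

```java
import java.util.ArrayList;
import java.util.List;

class ParallelMoves {
    record Move(String src, String dst) {}

    static List<Move> sequentialize(List<Move> pending, String temp) {
        List<Move> out = new ArrayList<>();
        List<Move> work = new ArrayList<>(pending);
        while (!work.isEmpty()) {
            boolean progress = false;
            for (int i = 0; i < work.size(); i++) {
                Move m = work.get(i);
                // safe to emit if no other pending move still reads m.dst
                boolean dstStillNeeded = work.stream()
                        .anyMatch(o -> o != m && o.src().equals(m.dst()));
                if (!dstStillNeeded) {
                    out.add(m);
                    work.remove(i);
                    progress = true;
                    break;
                }
            }
            if (!progress) {
                // only cycles remain: rotate one value through the temporary
                Move m = work.remove(0);
                out.add(new Move(m.src(), temp));
                work.add(new Move(temp, m.dst()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // swap r1 and r2 (a cycle), plus an independent move r3 -> r4
        System.out.println(sequentialize(
            List.of(new Move("r1", "r2"), new Move("r2", "r1"), new Move("r3", "r4")),
            "tmp"));
    }
}
```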
- PyPy
- Provides a tracing JIT compiler for a subset of Python • The subset must allow type inference
- Focus on optimising interpreter code
- Similar to Dynamo Traces, interprets most parts but hot loops are aggressively compiled
- New feature: Loops might contain conditional branches. Not handled by classical traces
- The programmer can provide hints to help the JIT understand the code structure
- Green variables represent position keys
• If the position key has a certain fixed value, the trace continues
• Identification of green variables also helps the JIT to aggressively optimise the code (see the interpreter sketch below)
• Escape analysis and redundant guard elimination
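A hedged sketch of what green variables mean for a tracing JIT of an interpreter: hotness is counted per position in the interpreted program (the pair of program and pc is the position key), not per position in the interpreter itself. The opcode set, threshold and class name are invented:

```java
import java.util.HashMap;
import java.util.Map;

class TinyInterpreter {
    static final int JUMP_BACK = 0, ADD = 1, HALT = 2;
    static final int HOT_THRESHOLD = 1000;

    // key: (identity of the interpreted program, pc) -- the green variables
    private final Map<String, Integer> loopCounters = new HashMap<>();

    int run(int[] program, String programId) {
        int acc = 0;   // "red" variable: ordinary runtime value
        int pc = 0;    // green together with programId
        while (true) {
            switch (program[pc]) {
                case ADD -> { acc += program[pc + 1]; pc += 2; }
                case JUMP_BACK -> {
                    pc = program[pc + 1];
                    String key = programId + ":" + pc;   // position key
                    if (loopCounters.merge(key, 1, Integer::sum) == HOT_THRESHOLD) {
                        // a real tracing JIT would start recording the loop trace here
                    }
                }
                case HALT -> { return acc; }
                default -> throw new IllegalStateException("unknown opcode");
            }
        }
    }
}
```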
- JIT for dynamic languages
• Dynamic typing is difficult for a JIT
• Hot loops can be compiled by speculating on the types
- Nested loops are problematic
• Solved by recording nested trace trees
• The inner loop is traced first
• If the outer loop is traced and the inner loop's trace tree type-matches, the inner-loop trace is integrated into the outer-loop trace
- Side exits also traceable -> Useful for branches inside a loop
- Type map states what types for the used variables are expected in the trace
- Uses SSA LIR
- Trace record also includes object representation • Simplifies access to property maps • Guards ensure the property map remains equal
- JavaScript numbers
• TraceMonkey prefers integers as much as possible
• If an integer unexpectedly turns into a double, the type speculation fails and the variable is treated as a double in future traces
• An oracle is responsible for tracking whether a variable is type-stable
• TraceMonkey uses trace trees
• Side exits can be patched later if they are found to be hot
• Blacklisting is used for traces that repeatedly fail recording
• Uses a simple greedy register allocator
- HotpathVM
• Designed for memory-constrained devices
• Differences to the Dynamo approach
• The hash table for hotness counters is limited in size; collisions are deliberately accepted
• Guards for control paths and virtual calls
• Traces are aborted if the code path hits an exception or a native function call
• Code generation
- Stack deconstruction: Convert Java Stack into SSA form • Indices are tracked via a renaming table
- Code Analysis: Loop Invariant Code Motion, CSE • Loop variables receive fixed registers • Live range analysis
- Code generation: Both register allocation and code generation proceed backwards
- Loops with memory allocation are ignored
- Side exit might lead to child trace generation • Registers are allocated according to the parent trace
- Retrofitting a method-based JIT
• A trace is aborted when encountering a native call, except for a couple of special functions
• Identified that dead-store elimination is unsound in trace JITs
• Solved by introducing a helper function that analyses the entire method
• Only one optimisation level
• The hash lookup for hotness counters uses a shadow array
- Generalised Just-In-Time Trace Compilation
• Implements parallel JIT-compilation
• Interpreter continues while a farm of JIT threads generates code
• Traces are scheduled according to their 'freshness'
• Lock-free implementation
• Instruction Set Simulation
• Mixed-Mode Interpreter and JIT
• Basic Blocks compiled, multiple blocks combined into a region