TU Wien:Dynamic Compilation Dynamische Übersetzer VU (Krall)/Summary 2022
• Cacao
- JavaVM based on a stack language (JVM bytecode)
- Cacao compiler - Three passes
• Determine basic blocks
• Generate static stack structure and register requirements
• Each instruction contains the stack structure before and after its execution
• Allows for copy elimination
• Allocation of temporary registers and machine code generation
- Easy because javac already colours local variables and stack variables have implicit live ranges
- Variables that must survive across method invocations are allocated to callee-saved registers
• Fixed register interface between basic blocks (stack depth often bounded)
- Just-In-Time compilation during method invocation
- Type checks at runtime using relative tree numbering
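A minimal sketch of how such a relative-numbering subtype test can work; the field names, the DFS helper and the range check are illustrative assumptions, not Cacao's actual code:

```java
import java.util.ArrayList;
import java.util.List;

class TypeNode {
    final List<TypeNode> subclasses = new ArrayList<>();
    int baseval;   // pre-order number assigned by a DFS over the class tree
    int diffval;   // size of the subtree minus one
}

class RelativeNumbering {
    // Number the tree so that every subtree occupies a contiguous range.
    static int number(TypeNode n, int next) {
        n.baseval = next++;
        for (TypeNode s : n.subclasses) {
            next = number(s, next);
        }
        n.diffval = next - n.baseval - 1;
        return next;
    }

    // Constant-time test: sub <: sup iff sub's number lies in sup's range.
    static boolean isSubtypeOf(TypeNode sub, TypeNode sup) {
        int d = sub.baseval - sup.baseval;
        return d >= 0 && d <= sup.diffval;
    }
}
```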
• Jalapeno JVM
- No interpreter, only a baseline compiler for a simple, fast translation; the optimising compiler provides three levels:
- Level 0: Constant, non-null and copy propagation, constant folding, arithmetic simplification, unreachable code elimination, etc.
- Level 1: CSE, redundant load elimination, simple inlining, global flow analysis
- Level 2: SSA-based flow-sensitive optimization
- Java threads are multiplexed onto AIX pthreads (virtual processors); yield points return control to the scheduler
- Compiler is invoked when
• An unresolved reference leads to a class load
• Executing code calls a function that is not yet compiled
• The adaptive optimisation system suggests recompiling a method at a higher optimisation level
- Runtime Measurement Subsystems
• Generate raw performance data (hardware performance monitors, memory usage, etc.)
• e.g. the hot method listener tracks the current method at each thread switch. The hot method organiser determines a method to be hot if it reaches a certain relative threshold.
- Processed by separate organisers
- May instruct Recompilation Subsystem to change profiling strategy or apply more optimisations
• Recompilation Subsystem
• Consumes compilation plan: Optimisation plan, profiling data and instrumentation plan
- AOS database
- Tracks decisions, compilation and other measurements to adapt future decisions
- Controller decides whether it is profitable to recompile a method using a cost-benefit analysis (sketched below).
- Controller compares the estimated time spent in the method if it is not recompiled against the compilation time plus the estimated time spent in the method if it is recompiled at a higher optimisation level.
- Controller expects the program to run for as long again as it has run so far. If 10 % of the time was spent in a certain method, the controller expects the same amount of time to be spent there in the future as well.
- Costs of recompilation are statically based on an offline tuned database.
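A minimal sketch of this cost-benefit test; the method and parameter names are hypothetical, and the placeholder numbers stand in for the offline-tuned cost model:

```java
class RecompileController {
    // Heuristic from above: expected future program time == time spent so far.
    static boolean shouldRecompile(double secondsSoFar,
                                   double methodFraction,   // share of samples in this method
                                   double speedup,          // speed of level j relative to current level (> 1)
                                   double compileSeconds) { // taken from an offline-tuned cost model
        double futureTimeCurrent = methodFraction * secondsSoFar;    // keep current code
        double futureTimeRecompiled = futureTimeCurrent / speedup;   // run at level j instead
        return compileSeconds + futureTimeRecompiled < futureTimeCurrent;
    }

    public static void main(String[] args) {
        // Program ran 60 s, 10 % of samples in the method, level-j code assumed
        // 2x faster, compilation estimated at 1 s: 1 + 3 < 6 -> recompile.
        System.out.println(shouldRecompile(60.0, 0.10, 2.0, 1.0));
    }
}
```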
- Feedback-directed inlining
• Uses an edge listener to record callee, caller and call-site
• Once a certain threshold is reached, the method is inlined during the next recompilation phase
• Threshold is more conservative in the beginning but decreases over time until it reaches a fixed value
• Java HotSpot Virtual Machine
• Comprises an interpreter and Just-In-Time compilers
• VM can recompile method during invocation -> On-Stack Replacement
• Deoptimization in case assumptions leading to optimisation are invalidated
- HIR in SSA-form
- Control-flow graph with singly-linked lists of instructions
• Each node represents a basic block
- Used for constant-folding, value numbering, method inlining, etc.
- HIR translated to LIR
• Platform-dependent (but reuses lots of data structures from the HIR)
• Operates on virtual and physical registers, memory and stack locations
• Linear Scan for register allocation (instead of graph colouring as used by the server compiler)
- Methods are statically inlined as long as they are below 35 bytes of bytecode
• Virtual calls are only inlined if the compiler can show that currently only one implementation of the method exists -> Optimistic inlining
• If this assumption is later invalidated, the method is deoptimised
• Stack frames of compiled code and the interpreter differ. Thus, a virtual stack frame is constructed to populate the stack frame according to the interpreter's needs
• Further improvements:
- Object inlining - Copies closely related child objects into the parent object • Saves indirect loads
- Array Bounds Check Elimination
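An illustrative example of the kind of transformation bounds-check elimination enables; the second method shows conceptually what the compiler derives, it is not HotSpot's actual output:

```java
class BoundsCheckExample {
    // Naive form: every a[i] carries an implicit check 0 <= i < a.length.
    static int sum(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }

    // Conceptually what the JIT derives: one range test up front, then the
    // loop body runs without a per-iteration check in the generated code.
    static int sumChecksHoisted(int[] a, int n) {
        if (n > a.length) throw new ArrayIndexOutOfBoundsException(a.length);
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i]; // here 0 <= i < n <= a.length is already known
        }
        return s;
    }
}
```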
• Virtual Machines
- Selective Optimizations
- Requires a profiling mechanism, a decision-making component and a dynamic optimising compiler executing the decision
- Profiling can either occur via counting or sampling
• Counting: Each method invocation/loop iteration increases a counter
• Sampling: Execution is halted after a certain interval and the currently executed method is recorded
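A minimal sketch of the counting variant, assuming a bytecode interpreter; the method id, threshold and class name are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class HotnessCounters {
    private static final int THRESHOLD = 10_000;  // invented value
    private final Map<String, Integer> counters = new HashMap<>();

    // Called by the interpreter on every method entry and loop back-edge;
    // returns true exactly once, when the method crosses the threshold.
    boolean bumpAndCheck(String methodId) {
        return counters.merge(methodId, 1, Integer::sum) == THRESHOLD;
    }
}
```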
- Partial Compilation • Only frequently executed basic blocks are compiled
- Feedback-directed optimisation
• Runtime service monitoring, hardware performance counters, sampling
• Profiling unoptimised code
• Profiling code is removed in optimised code
• Does not adapt to behavioural changes
- Profile Stability
• A change in the program's behaviour might have an effect on the optimisations used. Phase shifting is an open problem according to the paper.
• Inlining
• Cost-Benefit Model
- Multiversioning
• Generate multiple versions of a code segment
• Example: Optimistic inlining - generate a fallback version in case the virtual call resolves to another class
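A hedged sketch of such multiversioning for optimistic inlining; the classes and the shape of the guard are invented for illustration:

```java
abstract class Shape { abstract double area(); }

final class Circle extends Shape {
    double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

class MultiVersionExample {
    // What the compiler conceptually emits after speculating on Circle:
    static double areaSpecialized(Shape s) {
        if (s.getClass() == Circle.class) {   // guard protecting the fast version
            Circle c = (Circle) s;
            return Math.PI * c.r * c.r;       // inlined fast version
        }
        return s.area();                      // fallback version: original virtual call
    }
}
```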
- Memory
• Object splitting: Separates object into hot and cold areas. Cold area is indirectly accessible via pointer
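An illustrative sketch of object splitting; the field split and the lazy allocation are assumptions, not a specific VM's layout:

```java
class SplitObject {
    // hot part: accessed on every use, kept inline in the object
    int x, y;

    // cold part: reached through one extra indirection only when needed
    static final class Cold {
        String debugName;
        long creationTimestamp;
    }
    Cold cold; // allocated lazily

    Cold cold() {
        if (cold == null) cold = new Cold();
        return cold;
    }
}
```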
• Adaptive Inlining and OSR
• Code Repository - Manages generated machine code
• Method database - Assumptions and properties about methods
• Approach
• Countdown traps for method invocations and loops
• If countdown is triggered, code is recompiled with instrumentation support
• Includes execution counts for each basic block
• If instrumented code triggers recompilation, the compiler produces an optimised version without instrumentation code
• Inlining Mechanism
• Aggressive Depth-First
• If a valid call-site is found, inlines method until a certain depth or code size is reached.
- Aggressive Breadth-First
• Inlines all calls in the root method
• Continues until a certain depth has been reached
- Knapsack heuristic
• Has a certain inlining budget
• Picks the call sites in the root node with the highest cost-benefit ratio
• Call sites of an inlined function are added to the call-site list
• Calculating the costs and benefits is the largest problem
• OSR
- Execution State - Stack frame as well as register values of the optimised code
- Source State - State independent from any optimisation
- Replacement Point - Location in code where execution state can be used to reconstruct source state
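A minimal sketch, with hypothetical types and location strings, of how a replacement point can map the execution state back to the source state:

```java
class OsrDescriptor {
    final int bytecodeIndex;        // replacement point in the source program
    final String[] locationOfLocal; // locationOfLocal[i] = where local i lives, e.g. "rbx" or "[sp+16]"

    OsrDescriptor(int bytecodeIndex, String[] locationOfLocal) {
        this.bytecodeIndex = bytecodeIndex;
        this.locationOfLocal = locationOfLocal;
    }

    interface FrameReader {          // reads registers/stack slots of the optimised frame
        long read(String location);
    }

    // Builds the source state (interpreter locals) from the execution state.
    long[] reconstructSourceState(FrameReader optimisedFrame) {
        long[] locals = new long[locationOfLocal.length];
        for (int i = 0; i < locals.length; i++) {
            locals[i] = optimisedFrame.read(locationOfLocal[i]);
        }
        return locals;
    }
}
```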
- Escape Analysis
• Allocate objects on the stack instead of the heap
• Requires that the object does not outlive its creation site (e.g. by being returned from the function or stored elsewhere)
• Static escape analysis assumes all paths can be taken
• A dynamic approach considers the actual execution
• Objects are split into global (heap) and local (stack-frame) regions
- If an object in a small region is reachable from another object in a larger region, the object is promoted to the larger region.
- If reachable from a global variable, thrown as an exception or passed to a native function => Global
• Native functions: a hardcoded table lists the functions from which the object does not escape
• Cacao implementation
• Uses the control-flow and exception control-flow graph
• Call-stack analysis is bounded
• States describe an object's escape state and define an ordering from None to Global
- None: Only accessible in its creating method. Can be replaced by scalars
- Method: Does not escape thread because the object is only passed to its callee. Can be allocated on the stack and does not require synchronisation specific code.
- Method_return: Object is returned to its caller. This case is not handled by Cacao.
- Global: See above. No optimisations implemented.
• Variables are grouped into equi-escape sets (EES) (Steensgaard-style analysis, sketched below)
• Sets are adjusted at copy or phi functions
• Models a unidirectional relationship between objects
• Dependency lists allow lookups in both directions
• The effect of the algorithm improves with method inlining
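A hedged union-find sketch of equi-escape sets as described above; the method names and lattice encoding are assumptions:

```java
class EquiEscapeSets {
    enum Escape { NONE, METHOD, METHOD_RETURN, GLOBAL }  // ordering from None to Global

    private final int[] parent;
    private final Escape[] state;

    EquiEscapeSets(int numVars) {
        parent = new int[numVars];
        state = new Escape[numVars];
        for (int i = 0; i < numVars; i++) { parent[i] = i; state[i] = Escape.NONE; }
    }

    int find(int v) {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }

    // Called for copies and phi functions: a and b must end up in one EES.
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        parent[ra] = rb;
        // the merged set escapes as far as its worst member
        if (state[ra].ordinal() > state[rb].ordinal()) state[rb] = state[ra];
    }

    // Called when an operation (return, store to a global, native call, ...)
    // forces a variable to at least the given escape state.
    void raise(int v, Escape e) {
        int r = find(v);
        if (e.ordinal() > state[r].ordinal()) state[r] = e;
    }

    Escape stateOf(int v) { return state[find(v)]; }
}
```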
- Linear Scan
- Fast Single-Pass Register Allocation Algorithm compared to Bin Packing and Graph Coloring
- Tracks live ranges of variables
- Algorithm iterates over start and end points of live ranges
- At a start point, make the variable active (= assign it a register)
• If no register is available, spill the variable whose live interval ends last to memory
- At end point, let variable expire
- Original algorithm does not consider lifetime holes
- Original algorithm cannot put same variable in different registers
- Preallocation of registers required in case certain instructions expect specific registers
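A compact sketch of this original linear-scan algorithm; the interval representation and register count are assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinearScan {
    static class Interval {
        final int start, end;     // first and last use position of the variable
        int reg = -1;             // assigned register, -1 once spilled
        boolean spilled = false;
        Interval(int s, int e) { start = s; end = e; }
    }

    static void allocate(List<Interval> intervals, int numRegs) {
        intervals.sort((a, b) -> Integer.compare(a.start, b.start));
        List<Interval> active = new ArrayList<>();   // kept sorted by increasing end point
        boolean[] free = new boolean[numRegs];
        Arrays.fill(free, true);

        for (Interval cur : intervals) {
            // expire: free the registers of intervals that ended before cur starts
            active.removeIf(a -> {
                if (a.end < cur.start) { free[a.reg] = true; return true; }
                return false;
            });
            int r = firstFree(free);
            if (r >= 0) {
                cur.reg = r;
                free[r] = false;
                active.add(cur);
            } else {
                // no free register: spill the interval that ends last
                Interval last = active.get(active.size() - 1);
                if (last.end > cur.end) {
                    cur.reg = last.reg;          // steal its register
                    last.reg = -1;
                    last.spilled = true;
                    active.remove(last);
                    active.add(cur);
                } else {
                    cur.spilled = true;
                }
            }
            active.sort((a, b) -> Integer.compare(a.end, b.end));
        }
    }

    private static int firstFree(boolean[] free) {
        for (int i = 0; i < free.length; i++) if (free[i]) return i;
        return -1;
    }
}
```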
- Second-Chance Binpacking
• Spill decisions only affect future allocations of a variable
• If a spilled variable is used again, another variable is evicted and the spilled variable moves back into a register. It then remains in that register until it is evicted again by a higher-priority variable -> Lifetime splitting
- Since a variable may end up in different locations in different blocks, conflicts can arise at basic-block boundaries (resolved as sketched below)
- Case 1 (variable in register at end of block but in memory at beginning of successor block): Insert store instruction
- Case 2 (variable in memory at end of block but in register at beginning of successor block): Insert load instruction
- Case 3 (variable in two different registers): Insert move instruction
- For calling conventions, try to move register into available callee-saved register (Early second chance)
- Move coalescing via peephole optimisation
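A hedged sketch of the three resolution cases listed above; the location model and emit API are hypothetical:

```java
class EdgeResolver {
    enum Kind { REG, MEM }
    record Location(Kind kind, int regOrSlot) {}

    interface CodeBuffer {
        void emitStore(int reg, int slot);     // case 1: register at exit, memory at entry
        void emitLoad(int slot, int reg);      // case 2: memory at exit, register at entry
        void emitMove(int fromReg, int toReg); // case 3: two different registers
    }

    static void resolve(Location atExit, Location atEntry, CodeBuffer onEdge) {
        if (atExit.equals(atEntry)) return;    // same location: nothing to do
        if (atExit.kind() == Kind.REG && atEntry.kind() == Kind.MEM)
            onEdge.emitStore(atExit.regOrSlot(), atEntry.regOrSlot());
        else if (atExit.kind() == Kind.MEM && atEntry.kind() == Kind.REG)
            onEdge.emitLoad(atExit.regOrSlot(), atEntry.regOrSlot());
        else if (atExit.kind() == Kind.REG && atEntry.kind() == Kind.REG)
            onEdge.emitMove(atExit.regOrSlot(), atEntry.regOrSlot());
        // memory-to-memory with different slots would need a temp; omitted here
    }
}
```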
- Optimised Interval Splitting
- Includes use information telling the allocator whether a variable must be kept in a register at a given use. This allows modelling memory operands as they exist on x86
- The algorithm handles each lifetime interval separately. All other intervals fall into four sets:
• Unhandled: Start after the start position of the current interval
• Active: Cover the start position and have a physical register assigned
• Inactive: Start before and end after the start of the current interval, but do not cover it because of a lifetime hole
• Handled: End before the current interval or are spilled to memory
- Spilling
• Spill the active or inactive interval whose next use position is furthest away
• Spill positions are optimised in the presence of loops and control flow
- Register hints are used to allow coalescing of two lifetime intervals
- Spill Store Elimination in case it can be proven that value has not changed
- Dynamo
• Binary-to-binary JIT
• Interprets code until a hot trace is found
• The hot trace is compiled into an optimised fragment
• A trace begins and ends at a backward branch (except if the edge points to an already existing fragment)
• All branches are flattened; additional exit conditions might be required
• Allows for further optimisations such as removal of full redundancies
• Fragment Linking
• Fragments are directly linked to other existing fragments to circumvent the interpreter
• The fragment cache is cleared if a spike in hot-trace creation is detected
• DynamoRIO
• No interpreter used
• Uses basic blocks as base unit
• Multiple basic blocks can be combined into a trace
• Provides API to hook into Dynamo routines allowing for custom optimisations
• Example optimisations: redundant load removal, strength reduction, indirect branch dispatch, etc.
- Linear Scan on SSA
- When resolving phi functions in the LIR, move instructions in an arbitrary order are required
• -> Lifetime analysis does not know about this
- Phi functions represent parallel moves (see the sketch below)
- Linear Scan may decide which interval to work on
- Lifetime holes can be eliminated
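A minimal sketch (hypothetical register names) of treating phi functions as parallel moves: emit a move only when its destination is no longer needed as a source, and break cycles with a temporary:

```java
import java.util.ArrayList;
import java.util.List;

class ParallelMoves {
    record Move(String src, String dst) {}

    static List<Move> sequentialize(List<Move> pending, String temp) {
        List<Move> out = new ArrayList<>();
        List<Move> work = new ArrayList<>(pending);
        while (!work.isEmpty()) {
            boolean progress = false;
            for (int i = 0; i < work.size(); i++) {
                Move m = work.get(i);
                // safe to emit if no other pending move still reads m.dst
                boolean dstStillNeeded = work.stream()
                        .anyMatch(o -> o != m && o.src().equals(m.dst()));
                if (!dstStillNeeded) {
                    out.add(m);
                    work.remove(i);
                    progress = true;
                    break;
                }
            }
            if (!progress) {
                // only cycles remain: rotate one value through the temporary
                Move m = work.remove(0);
                out.add(new Move(m.src(), temp));
                work.add(new Move(temp, m.dst()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // swap r1 and r2 (a cycle), plus an independent move r3 -> r4
        System.out.println(sequentialize(
            List.of(new Move("r1", "r2"), new Move("r2", "r1"), new Move("r3", "r4")),
            "tmp"));
    }
}
```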
- PyPy
- Provides a tracing JIT compiler for a subset of Python • The subset must allow type inference
- Focus on optimising interpreter code
- Similar to Dynamo Traces, interprets most parts but hot loops are aggressively compiled
- New feature: Loops might contain conditional branches. Not handled by classical traces
- The programmer can provide hints to help the JIT understand the code structure
- Green variables represent position keys
• If the position key has a certain fixed value, the trace continues
• Identification of green variables also helps the JIT to aggressively optimise the code (see the interpreter sketch below)
• Escape analysis and redundant guard elimination
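A hedged sketch of what green variables mean for a tracing JIT of an interpreter: hotness is counted per position in the interpreted program (the pair of program and pc is the position key), not per position in the interpreter itself. The opcode set, threshold and class name are invented:

```java
import java.util.HashMap;
import java.util.Map;

class TinyInterpreter {
    static final int JUMP_BACK = 0, ADD = 1, HALT = 2;
    static final int HOT_THRESHOLD = 1000;

    // key: (identity of the interpreted program, pc) -- the green variables
    private final Map<String, Integer> loopCounters = new HashMap<>();

    int run(int[] program, String programId) {
        int acc = 0;   // "red" variable: ordinary runtime value
        int pc = 0;    // green together with programId
        while (true) {
            switch (program[pc]) {
                case ADD -> { acc += program[pc + 1]; pc += 2; }
                case JUMP_BACK -> {
                    pc = program[pc + 1];
                    String key = programId + ":" + pc;   // position key
                    if (loopCounters.merge(key, 1, Integer::sum) == HOT_THRESHOLD) {
                        // a real tracing JIT would start recording the loop trace here
                    }
                }
                case HALT -> { return acc; }
                default -> throw new IllegalStateException("unknown opcode");
            }
        }
    }
}
```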
- JIT for dynamic languages
• Dynamic typing is difficult for a JIT
• Hot loops can be compiled by speculating on the types
- Nested loops are problematic
• Solved by recording nested trace trees
• The inner loop is traced first
• If the outer loop is traced and the inner loop's trace tree type-matches, the inner-loop trace is integrated into the outer-loop trace
- Side exits also traceable -> Useful for branches inside a loop
- Type map states what types for the used variables are expected in the trace
- Uses SSA LIR
- Trace record also includes object representation • Simplifies access to property maps • Guards ensure the property map remains equal
- JavaScript numbers
• TraceMonkey prefers integers as much as possible
• If an integer unexpectedly turns into a double, the type speculation fails and the variable is treated as a double in future traces
• An oracle is responsible for tracking whether a variable is type-stable
• TraceMonkey uses trace trees
• Side exits can be patched later if they are found to be hot
• Blacklisting is used for traces that repeatedly fail recording
• Uses a simple greedy register allocator
- HotpathVM
• Designed for memory-constrained devices
• Differences to the Dynamo approach
• The hash table for hotness counters is limited in size; collisions are deliberately accepted
• Guards for control paths and virtual calls
• Traces are aborted if the code path hits an exception or a native function call
• Code generation
- Stack deconstruction: Convert Java Stack into SSA form • Indices are tracked via a renaming table
- Code Analysis: Loop Invariant Code Motion, CSE • Loop variables receive fixed registers • Live range analysis
- Code generation: Both register allocation and code generation proceed backwards
- Loops with memory allocation are ignored
- Side exit might lead to child trace generation • Registers are allocated according to the parent trace
- Retrofitting a method-based JIT
• A trace is aborted when encountering a native call, except for a couple of special functions
• Identified that dead-store elimination is unsound in trace JITs
• Solved by introducing a helper function that analyses the entire method
• Only one optimisation level
• The hash lookup for hotness counters uses a shadow array
- Generalised Just-In-Time Trace Compilation
• Implements parallel JIT-compilation
• Interpreter continues while a farm of JIT threads generates code
• Traces are scheduled according to their 'freshness'
• Lock-free implementation
• Instruction Set Simulation
• Mixed-Mode Interpreter and JIT
• Basic Blocks compiled, multiple blocks combined into a region