% How HotSpot cross-modifies code -- a summary
% Erik Österlund & John Rose (Oracle)
% May 2022 (draft version 0.3)
## Introduction
_Cross-modified code_ (abbreviated _CMC_) happens when one thread
edits, as normal data in memory, an instruction stream that may be
in the process of concurrent execution by another thread. This can lead to a
special kind of race condition, requiring that the relevant data
memory writes from the first thread will be properly detected by the
instruction fetch unit of the second thread. In general, it is
problematic, not only because of races when synchronization is
lacking, but also because the functional units that must communicate
are not designed to do so at speed.
Broadly speaking, some limited cross-modification of code is required
in order to bootstrap dynamically-linked code into processes (in
modern operating systems). Hardware vendors specify the narrow
conditions required for successful CMC. The basic requirement is that
all threads which might read a CMC-affected code stream must avoid
reading the edited code until the editing processor performs a
suitable specialized _CMC-release_ operation ("flush instructions").
At that point, in some globally serialized sense, all potentially
executing processors must perform a specialized _CMC-acquire_
operation, which cleanly discards all state in any instruction cache
that might be affected by any previous version of the edited data. It
is convenient to think of that CMC-release and CMC-acquire operation
as defining what might be called an _epoch boundary_, between an
"epoch of data editing" and an "epoch of executing new instructions".
A handshake must be used for each thread to move from the old epoch to
the new epoch. This is clearly a slow and disruptive operation.
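To make the epoch idea concrete, here is a minimal sketch (not HotSpot
source; the epoch counter and the `cmc_release`/`cmc_acquire` helpers are
hypothetical stand-ins for the platform primitives) of how a code-writing
thread and a code-executing thread might coordinate across such a boundary:

```
#include <atomic>
#include <cstdint>
#include <cstddef>

std::atomic<uint64_t> g_code_epoch{0};   // hypothetical global epoch counter
thread_local uint64_t t_code_epoch = 0;  // epoch this thread has acquired

void cmc_release() { /* platform-specific: e.g. clflush loop or icache clean */ }
void cmc_acquire() { /* platform-specific: e.g. cpuid on x86, isb on AArch64 */ }

// Writer: edit instructions as ordinary data, then open a new epoch.
void publish_edited_code(void* code, size_t len) {
    // ... store the new instruction bytes into [code, code+len) ...
    cmc_release();                                         // "flush instructions"
    g_code_epoch.fetch_add(1, std::memory_order_release);  // announce the new epoch
}

// Executor: before running possibly-edited code, join the current epoch.
void before_executing_code() {
    uint64_t e = g_code_epoch.load(std::memory_order_acquire);
    if (e != t_code_epoch) {
        cmc_acquire();        // discard stale instruction-cache state
        t_code_epoch = e;
    }
}
```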
It might seem correct that edits which are "small enough" do not need
an epoch change. For example, changing one byte, from one opcode to
another, without breaking the syntax of the instruction stream in
which it appears, would seem to be "small". Or changing one word that
is the immediate operand of a move or jump would seem to be "small".
And yet, in both cases, there is no guarantee that an instruction
fetch unit would "notice" the changes, ever, short of an explicit
CMC-acquire. What is even worse, there is no guarantee that the write
would not put an instruction fetch unit into a state which causes it to
execute neither the old instructions nor the new instructions, but
some combination which leads to unpredictable behavior.
An instruction fetcher might build a complex model of the instructions
it sees, which might be only partially invalidated if it takes an
interrupt, pauses, and then happens to see new data in memory.
As an extreme (though unlikely) example, unpredictable behavior could
happen if one thread ever fetched instructions as single bytes, and
without concern for their order of storing (that is, not using total
store order). For more unpredictability, if that thread fetches
instruction bytes in a random order (not just as it executes), it
could "see" and attempt to execute any mix of old and new instruction
bytes, regardless of any careful order in which they were stored.
Normal data races are relatively well understood, and have even been the
subject of [formal models]. But races which affect an instruction
fetch unit are not well specified or understood.
[formal models]:
Meanwhile, HotSpot has used cross modifying code techniques from its
very beginning, a quarter century ago. For example, it has always
performed dynamic linkage of call sites using CMC. There are several
other instances as well, which this document describes.
As a basic design principle, HotSpot "surfs the races" in a particular
way. HotSpot avoids synchronizing (both CMC and simple data updates)
along "hot paths", and accepts the inevitably resulting races that may
occur. HotSpot ensures that these races are benign by allowing
threads to execute (for CMC, or else, for data updates, read or write
data) from memory in both old and new states. (There can be multiple
new states as well.) The old and new states are made well-defined by
ensuring that state-changing writes are performed using naturally
atomic operations, often aligned single-word stores. (Using single
stores is an important detail; if
HotSpot mistakenly uses _N_ non-atomic byte-wise writes to move memory
from old to new states, many intermediate states, theoretically up to
2^_N_^-2 such states, could appear to racing processors. There were
bugs like this in the early days of HotSpot.) HotSpot uses actual
mutex lock/unlock or CAS operations only when it is necessary to get
additional separation between old and new states. Of course, to get
guaranteed forward progress, HotSpot needs, eventually, to completely
drive out the old states. This is obtained by occasionally performing
an expensive global epoch transition, a big "handshake" which HotSpot
calls a _safepoint_. More recently, other kinds of thread handshakes
have been defined. These handshakes are similar to (and may use) the
lower-level phenomenon of virtual memory "shootdown", where one thread
must, by some sort of strong signal, affect another thread's virtual
address mappings.
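As an illustration of the single-store rule (a sketch, not HotSpot source;
it assumes the patched immediate is naturally aligned), compare one aligned
store with the byte-wise variant that must be avoided:

```
#include <atomic>
#include <cstdint>

// One aligned 32-bit store: racing fetchers see either the old or the new
// immediate, never a mix.
void patch_imm32(uint32_t* site, uint32_t new_imm) {
    reinterpret_cast<std::atomic<uint32_t>*>(site)
        ->store(new_imm, std::memory_order_release);
}

// What must be avoided: four separate byte stores can expose up to
// 2^4 - 2 = 14 intermediate old/new mixes to a racing instruction fetcher.
void patch_imm32_broken(uint8_t* site, uint32_t new_imm) {
    for (int i = 0; i < 4; i++)
        site[i] = uint8_t(new_imm >> (8 * i));
}
```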
Those generalities seem reasonable, and in fact they usually work for
HotSpot. But they also work past the edges of the general assurances
provided by hardware system programming manuals, by not immediately
performing epoch transitions on every edit to instruction streams.
Instead, HotSpot makes reasonable assumptions about natural atomicity
of edits to instruction streams and the effects of races on such
edits. These assumptions are validated for each specific kind of
hardware that HotSpot executes on. Sometimes hardware is less
welcoming of the race conditions that HotSpot "surfs"; for example,
memory on some particular platform does not implement total store
order. In such cases, HotSpot is configured to use more conservative
techniques, which typically causes execution on the affected platform
to be somewhat less competitive. It is in HotSpot's interest to run
as fast as possible on each platform, so even when "de-racing"
techniques are developed for some platforms, they are not applied to
platforms that do not need them.
All of this can lead to unwelcome discoveries on new implementations or new
platforms. When hardware designers design to a very restrictive
system programming model, there can be occasional surprises where
atomicity or coherency can be lost during cross-modification of code.
Luckily for HotSpot, the prevailing use of cache-wise fetches (which is
natural for good hardware performance) tends to present only old or
new states to processors concurrently executing CMC in HotSpot. But
this is a _useful observation_ about _most_ platforms, rather than a
_guarantee_ about _all_ platforms.
To shed more light on these assumptions, and to allow them to be
evaluated so that future surprises can be avoided, this document
lists all the known cross-modifying code in HotSpot.
## Intel Assumptions
The Intel Software Developer Manual (the _SDM_) makes it clear, in
section 8.1.3, that both self-modifying and cross-modifying code
should be completely synchronized with a serializing instruction like
`cpuid`. In other words, every thread that can execute a dynamically
modified instruction, should first execute the `cpuid` instruction;
this moves it into a new "epoch" in which all current memory writes
will be correctly posted to the thread's instruction fetch unit.
The contract described by the SDM is hence more prohibitive than even
the AArch64 spec, which at least allows jumps to be patched over `nop`s
(and vice versa), without synchronization.
An unfortunate aspect of the contract offered by the SDM is that the
`cpuid` instruction serializes everything the core is doing, making it
possibly the slowest instruction there is. This can't be used in any
code appearing in a fast path. And unsurprisingly, all the uses of
cross-modifying code in the entire JVM, are there to make fast-path
code as fast as possible.
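For reference, this is roughly what issuing `cpuid` as a serializing
CMC-acquire looks like from C++ (a sketch assuming GCC/Clang inline
assembly, not HotSpot's own code):

```
// cpuid serializes the core: all prior stores become visible to this core's
// instruction fetch before any subsequent instruction is fetched.
static inline void serialize_with_cpuid() {
    unsigned a = 0, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                         :
                         : "memory");   // also acts as a compiler barrier
}
```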
But in practice Intel hardware has proved to support other, less
synchronized cases of cross-modifying code than the SDM describes. In
particular, naturally atomic (naturally aligned) writes of 32-bit and
64-bit instruction operand words in code streams are generally safe in
practice: The concurrently executing threads "see" only the old or new
states, and there is no disruption to the decoding of the instructions
which contain those words. Even 64-bit writes of whole instructions
at jump targets (in particular, method entry points) seem to be safe,
depending on the structure of the old and new instruction stream
(before and after the write). Other techniques, such as those which rely
on coherently ordered reads of multiple editing events, have worked
in the past, but require testing on new platforms.
The Intel erratum "Unsynchronized Cross-Modifying Code Operations Can
Cause Unexpected Instruction Execution Results" suggests that at least
one of HotSpot's current techniques, of performing unsynchronized CMC,
causes a crash. Some problems like that can be considered as CPU
bugs, but some of the fault might be in HotSpot, if the bug only
happens when doing unsynchronized CMC. In the end, HotSpot has two
conflicting goals: Do only what the hardware supports, and gain the
best possible performance on each of its platforms. Part of the
burden goes back to those hardware platforms which wish to preserve
and improve Java performance. Part of the burden is on HotSpot to use
techniques which have been properly documented to, and reviewed by,
the hardware developers. The Oracle/Intel collaborations have
historically been good at carrying those burdens.
These are the current general assumptions we on the HotSpot team have
made about cross-modifying code:
- Loading completely new instructions into freshly allocated memory
is safe, as long as a CMC-release is done after the instruction memory
is loaded.
> That is, a thread will not "peek" into fresh memory until a fresh
memory address has been provided explicitly. We also
require that the address of the instructions is explicitly
published by safe means, such as a naturally atomic 32- or 64-bit
store to data or to an instruction immediate operand.
- Previously used instruction memory can be recycled, as if it were
freshly allocated, as long as all processors that may have ever
executed in that memory, or that may have prefetched from it, have
stopped doing so, and have performed a CMC-acquire.
> In addition, we should (and probably will) ensure that the cache
lines of the recycled memory are disjoint from cache lines of
instruction memory currently in use. This concern arises when fresh
blocks of memory are pre-allocated in live HotSpot code regions, and
are patched with instructions during the same epoch that processors
may be executing or prefetching in neighboring cache lines.
- Modifying immediate operands of instructions not crossing cache
line boundaries, where those operands are naturally aligned, should
result in concurrent executions seeing either the old instruction with
the old immediate, or the old instruction with the new immediate.
> For HotSpot, either result would be correct, but observing the
new instruction operand would be faster. If an instruction fetcher on
some platform ever reads part of an operand, pauses, and then reads
the rest of the operand, this assumption would be violated.
- Modifying two instructions in order, where the second instruction
would only be executed if the modified version of the first
instruction is executed, will result in only the modified version of
the second instruction being executed. We refer to this as
_instruction cache coherency_.
> For example, in a compiled method with managed pointer
immediates, the GC may update those immediates in a batch while method
entry is paused; it must be possible to finish the batch edit and then
re-enable method entry by a second instruction patch, so that threads
passing through a disarmed method entry barrier must witness
coherently updated values in the managed pointer immediates. An
incoherent instruction cache might hold onto old immediates (unedited
by the GC) and feed them to threads passing through the disarmed
barrier. On platforms where this is a problem, it can be fixed with a
more expensive entry barrier, that can incorporate a CMC-acquire for
each entering thread.
- We assume that we can patch method entry sequences (and also method
_re-entry_ sequences, after embedded call sites) under some
conditions. Specifically, we expect to be able to patch jumps over
`nop`s (and vice versa), to arm (and disarm) entry barriers to methods
(or re-entry barriers, for re-entry points within methods).
> There is more detail below about method entry barriers. Ideally
(and moving forward) we use either naturally atomic word writes to
perform such patching, or else transactional instructions (such as
128-bit CAS), and we ensure that the affected word is naturally
aligned (not crossing a cache line boundary). Ideally, we also ensure
(though we do not at present) that exactly one old instruction is
replaced by exactly one new instruction, lest a thread stuck on an
instruction boundary inside the patched area try to execute part of
the interior of the new instruction.
## CMC release and acquire operations on Intel
Older code in HotSpot assumes we need to explicitly flush caches with
`clflush`, while some newer code assumes that, for cross-modifying-code
purposes, this does not help, and instead depends on the instruction
cache coherency aspect (making the data vs. instruction cache
synchronization issues less interesting). Thus, we tend to use
`clflush` for CMC-release, but since it is effective only for the
executing thread doing self-modifying code, we could omit `clflush` in
the case of a JIT-only thread.
In the case of self-modifying code where the thread itself expects to
be able to observe the new instructions, we generally invoke a `cpuid`
instruction as our CMC-acquire, but sometimes HotSpot resorts to
explicit cache flushing instead. We are moving towards using `cpuid`
instead of cache flushing, but we are not quite there yet.
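A sketch of a `clflush`-based CMC-release (illustrative only, not HotSpot's
actual helper; it assumes a 64-byte cache line and the standard
`_mm_clflush`/`_mm_mfence` intrinsics):

```
#include <immintrin.h>   // _mm_clflush, _mm_mfence
#include <cstdint>
#include <cstddef>

void flush_code_range(const void* start, size_t len) {
    const size_t line = 64;   // assumed cache-line size
    uintptr_t p   = reinterpret_cast<uintptr_t>(start) & ~uintptr_t(line - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(start) + len;
    for (; p < end; p += line)
        _mm_clflush(reinterpret_cast<const void*>(p));
    _mm_mfence();             // order the flushes before any publishing store
}
```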
## ARMv8 Assumptions
On ARMv8, we assume we can not rely on instruction cache coherency. We
also assume that we cannot patch immediates atomically. The
architecture gives only limited guarantees about patching, that
calls/jumps can replace `nop`s and vice versa. We aim to avoid other
kinds of instruction patching. When we patch, we assume that we need
to manually execute instruction cache flush on the modified
instructions (this plus possibly a data fence is our CMC-release). We
know we cannot rely on the changes becoming observable from other
threads until a subsequent rendezvous invoking `isb` barriers (our
CMC-acquire) on all threads. We do this occasionally (_question:
when?_) for correctness reasons.
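Sketched in C++ (GCC/Clang builtins and inline assembly; not the actual
HotSpot port code), the two halves of that protocol look roughly like this:

```
// Writer side (CMC-release): clean the data cache and invalidate the
// instruction cache over the modified range.
void aarch64_cmc_release(void* start, void* end) {
    __builtin___clear_cache(static_cast<char*>(start),
                            static_cast<char*>(end));
}

// Executor side (CMC-acquire): discard any stale prefetched/decoded state.
void aarch64_cmc_acquire() {
    __asm__ __volatile__("isb" ::: "memory");
}
```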
There are some situations where we still do things out of spec, and
just hope for the best, because doing anything else is a challenge.
More details below.
## Uses of Cross-Modifying-Code
In general, there are several cases of cross-modifying code in
HotSpot, and each one is there to support a fast path that would be
significantly slower if it used other techniques, such as querying
data variables to detect barriers or state changes. Here is the list:
- JIT compilation (into either totally fresh or recycled memory)
- one-shot barriers for deferred initialization (C1 class and field references)
- dynamic method linking (multi-state link and re-link of call sites)
- dynamic method call dispatch (adds more metadata and states to method linking)
- small preallocated "stubs" near active JIT code (launch pads for some calls, populated just before use)
- multi-shot method entry barriers (for methods that are temporarily paused or permanently not-entrant)
- method re-entry barriers (after out-of-line call sites, for not-entrant methods)
- managed pointer update (editing of immediates by the GC)
## JIT compilation
The elephant in the room regarding cross-modifying code is the
technique of JIT-compiling methods. The JIT-compiler writes
instructions into executable pages, and then publishes the compiled
code, such that it becomes available for execution by concurrent calls to
Java methods. Threads that execute the JIT-compiled methods do not
execute the `cpuid` instruction first, so this is a direct violation
of recommended practices, and is neither allowed by the Intel SDM nor
the ARMv8 spec.
To be fair, dynamically linked but statically compiled ABI-based
languages like C/C++ also seem to "surf" these race conditions, if a
dynamic linker does not "shoot down" all process threads when it loads
a new instruction segment. (_Question: Do they do that?_) The major
difference between the dynamic linkage of HotSpot JIT code and code
statically compiled for an ABI is that ABIs tend to specify data-based
publication of dynamically linked function entry points, so that an
ABI-compliant call will load a function pointer from a patchable
variable, rather than (as in HotSpot) execute a call or jump with a
patchable address immediate. Over most of HotSpot's career, and
probably even at present, the latency of method entry via a patchable
call instruction is less than the latency of method entry that first
fetches the address from a variable. This is especially true when the
complex linkage behavior of HotSpot methods must be supported.
(HotSpot call sites have many more states than those supported by ABI
dynamic linkers.)
Full compliance is tricky, in part because there is a chicken-and-egg
problem with installing CMC-acquires wherever they are required.
Suppose we place an expensive CMC-acquire instruction (`cpuid` on x86)
at the entry point of a new method, with the intention of removing it
when it is no longer needed. But that instruction itself is inserted
(and also hopefully removed) as cross-modifying code. So whatever we
emit in the JIT's code area (including synchronization code that
complies with the Intel SDM or some other platform's rules) would
already have to be synchronized through some other means.
A fully synchronizing poll of a global variable at every method entry
might help, but such things tend to cost a significant slice of
performance. Java, like most languages, is somewhat sensitive to the
overheads of procedure linkage.
## C1 deferred initialization
HotSpot actually runs with two JITs, a warmup JIT called C1 and a
performance JIT.
The C1 JIT is mostly used before we have really felt the need to
compile with the more optimizing compiler (usually C2, sometimes
Graal). In this more initial phase, it may happen that we require
information about some class which has not yet been loaded. (This is
common in Java, where the unit of dynamic loading and linking is the
class, not the DLL.) The information needed could be the offset of a
field (for a field read or write) or an object reference to the class
object (for a type test or reflective query).
C1 prefers to support this with one-time (fire and fix) execution
barriers which precede the instruction that would need the missing
information. When the barrier is hit, HotSpot finds the missing
information, triggering class loading (and even reporting errors) as
necessary. With the information (field offset or class metaobject) in
hand, an instruction after (or perhaps under) the barrier is patched.
Concurrent races that fire the barrier will synchronize in the HotSpot
runtime, and in the end the barrier will be disarmed by replacing it
with a `nop` or some other instruction.
In this same scenario, the AArch64 port simply emits traps to
deoptimize (discard and eventually recompile) the compiled method
whenever this code path is reached. This seems to be necessary
because (as described below) an AArch64 processor will sometimes fail
to observe the patching of the _second_ instruction even after the
_first_ barrier instruction is inactivated. However, deoptimizing
like this has a noticeable effect on startup and warmup times, since
it requires a full C1-level recompilation after any of these execution
barriers is hit.
The other platforms solve the problem with cross-modifying code as
described above. The execution barrier is a branch to the runtime,
which will ensure the needed information is loaded, then patch the
desired instructions into the compiled method, and subsequently
replace the branch with a `nop`, such that the new instruction
containing the loaded information is executed. This way we can patch
in the information that was missing rather than recompiling the entire
thing.
This code patching must be done with correct release fencing at each
point. It is especially delicate because it happens while other
threads are freely executing in the same method as the code being
patched. We must assume that, when the barrier is removed, the
patched instruction will be correctly observed.
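A sketch of that patch-then-publish order (hypothetical helper, not HotSpot
source; it assumes the guarding branch word is naturally aligned and the
same size as a `nop` encoding):

```
#include <atomic>
#include <cstdint>
#include <cstring>

void c1_patch_and_publish(uint8_t* patch_site, const uint8_t* new_insns,
                          size_t len, uint32_t* branch_word,
                          uint32_t nop_encoding) {
    // 1. Write the new instruction bytes; they are not yet reachable,
    //    because the guarding branch still bypasses them.
    std::memcpy(patch_site, new_insns, len);

    // 2. Order those stores before the publishing store (free under TSO on
    //    Intel; an explicit release is needed elsewhere).
    std::atomic_thread_fence(std::memory_order_release);

    // 3. Publish: replace the guarding branch with a nop in one aligned store.
    reinterpret_cast<std::atomic<uint32_t>*>(branch_word)
        ->store(nop_encoding, std::memory_order_relaxed);
}
```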
In order for this to work, we rely on instruction cache coherency. We
ensure the stores are ordered (easy on Intel, since it is a TSO
machine): the assumption is that we prepare some instructions that are
not yet reachable, and once they are prepared, we publish them by
patching an instruction. If the instruction fetcher observes the new
publishing instruction, then it will also observe the instructions
guarded by that patch. Except for
matters of scale and proximity, this is similar conceptually to
JIT-compilation. In the case of patching already-published methods,
the code being patched is conceptually in "fresh" memory, but it is
very close to instructions that are being actively executed (perhaps
by many threads). It appears that, on at least some platforms,
"fresh" (unused) memory emitted by a JIT might be "contaminated" by
nearby executions. There may be a scaling parameter, such as cache
line or page size, which would help us measure the separation required
on such platforms so that fresh memory stays fresh.
The issue [JDK-8223613] tracks alternatives to our current techniques
in C1 for patch-based initialization barriers. These alternatives are
not being pursued actively at present.
[JDK-8223613]:
## Dynamic method dispatch
Dynamic method dispatch is optimized by HotSpot with a technique
called _[inline caches]_. The basic idea is that you have a callsite
with some metadata and a destination. The metadata is embodied in an
instruction that sets a register, and the destination is the immediate
destination operand of a call instruction. The inline cache is
initially in the so-called _unlinked_ state, in which it points to
"resolution stub" as the call destination; this trampoline will call
into a linkage service routine provided by HotSpot. (This is similar
to ABIs where a dynamically linked call is initially directed to a
trampoline to the runtime linker.) The initial assumption is that
this call site will only need to access a single callee method, and so
the inline cache site is linked optimistically to the method that it
reaches first. We call the new state of the call site the
_monomorphic_ state, and in many cases that is the state it stays in
forever.
This transition is made by patching the immediates of the two
instructions, which are ensured to not span cache lines. The
assumption made is that the result of executing either of the two
instructions is going to be either execution of the old instruction,
or the new instruction, in either order. We don't rely on any effects
being made observable to the instruction fetcher, or that they are
made observable in any particular order, only that the result of
executing the instructions is some combination of the old and new
instructions.
(Specifically, it is harmless to set the new metadata value and then
run into the old linkage service routine, and it is also harmless to
run into the new method entry, but have the stale old metadata value;
both mismatches will take a slow path which will end up executing the
correct method.)
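Schematically (illustrative x86_64 shape only; the exact instructions and
registers differ in the real code), a monomorphic inline cache call site
looks like this:

```
// mov  rax, <cached_metadata>   ; naturally aligned 64-bit immediate, patchable
// call <destination>            ; naturally aligned 32-bit rel. immediate, patchable
//
// The unlinked -> monomorphic transition patches each immediate with a
// separate aligned store. A racing thread may observe any old/new combination
// of the two; every mismatch falls into a slow path that re-resolves the call
// and still ends up executing the correct method.
```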
If we call through this callsite again with a different receiver
object, we will find out that this callsite is actually megamorphic.
This happens when the callee method checks the metadata, and sees that
it does not (in fact) match the type of the receiver. In that case
the callee method jumps to the linkage service routine, where the JVM
again fixes things up. This is the second state change for the inline
cache call site, into the so-called _megamorphic_ state.
In this case, we cannot patch the inline cache directly to the third
state due to various races (between the third state and the first
two). Instead we emulate an atomic update of both components of the
inline cache, by JIT-compiling the new (third) inline cache state into
a transient buffer (another kind of stub). At suitable epoch
boundaries, these temporary buffers are copied back into the original
inline cache instructions. (The boundary requires that all
Java-executing threads that wake up from a safepoint must run a
`cpuid` instruction, at least since [JDK-8220351].) This trick is,
again, relying either on instruction cache coherency, or on the
temporary buffer storage being "fresh" memory with respect to all
instruction fetchers. The temporary stub is built first, and then
published by modifying the immediate of the call instruction in the
inline cache, to point at the stub. If the instruction fetcher
observes the updated call instruction that points at the stub, we
assume it will then observe the instructions we just stored in the
fresh memory of the stub.
[inline caches]:
[JDK-8220351]:
[JEP-8221828]:
Here is a summary of the inline cache states:
- _unlinked:_ not executed yet, or in the process of first linkage
- _monomorphic:_ optimistically pointing at the only method needed so far
- _megamorphic:_ pointing at a dynamic dispatch stub (v-table or i-table lookup plus tail-call)
These states are crossed with these other states which exist to avoid
races:
- _immediate_: metadata and call target are in the original instructions
- _buffered:_ metadata and call target are in a temporary buffer (reached by the original call)
There are a few more state variations due to the mixing of interpreted
and compiled methods; these are described in the next section. There
can be, in principle, yet more states which pertain to the safe
disposal of compiled code that is no longer useful.
Some people advocate a new method invocation scheme that moves away
from inline caches, in favour of more optimized megamorphic call data
structures (cf. [JEP-8221828]). Historically, the main job of the
inline cache has been to avoid indirect calls, because they have been
slow. Now indirect calls with dynamically monomorphic callsites are
fast on some platforms, and both vtable and itable calls can be
implemented efficiently without inline caches. The lack of support
for (on some platforms) instruction cache coherency and the
uncertainty of rules about "fresh" JIT memory, are part of the
motivation for reconsidering the use of inline caches.
## Static method dispatch
When we have statically dispatched calls, the callee method is already
known at compile time, so we don't need the full machinery of inline
caches. But this linkage of methods is still dynamic, and requires
trips through linkage service routines to resolve call sites.
Since the target method might not be loaded at compile time,
resolution of the callsite is still deferred until the first call
through this callsite. At that point, the concrete method is known,
and there could either be a compiled method for said callee, or we
might have to go into the interpreter.
A call to an interpreted method from compiled code requires passing a
reference to the target method into a compiler-to-interpreter adapter
stub. This adapter will lay down the outgoing arguments on the stack
in the array-like form required by the interpreter, and then jump into
the interpreter with a request to enter the target method, as
identified in a known linkage register (`rbx` on x86).
The fast path is, of course, compiled code calling compiled code,
which uses a register-based ABI-like calling sequence that does not
need any metadata. (Since all this about interpreter transitions is
true of inline caches as well as static call sites, the fast path for
an inline cache might use metadata to perform the receiver check as
described above.) Thus, the structure of a static call, in compiled
code, does not contain a metadata pointer setting instruction; that
would be just a useless interruption when the Java program has warmed
up.
So, when a static call site is linked to call the interpreter, a
metadata pointer setting instruction must be created, so that the
interpreter can be entered properly. Therefore, for every static call
that might need to enter the interpreter, there is a corresponding
launch pad (called a "stub", again), which is pre-allocated at the end of
the compiled method. This stub is a small amount of fresh memory
which is filled in during method resolution, containing a metadata
setting instruction that patches in the target method reference, and a
jump to the argument shuffling adapter described above. So when we
resolve the callsite to go to the interpreter, we first prepare the
stub, and then publish it by patching in the destination of the call
instruction to go to the stub (which in this case lives a few cache
lines down from the end of the compiled method code). Once again,
this relies on instruction cache coherency and/or rules about avoiding
"fresh" memory, and also atomicity of patching the call destination
in the main body of the compiled method.
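The shape of such a stub, schematically (illustrative x86_64 form only; the
real stub is emitted by the JIT and its exact encoding varies):

```
// stub (a few cache lines past the end of the compiled method):
//   mov  rbx, <Method* of the resolved callee>   ; metadata the interpreter needs
//   jmp  <compiled-to-interpreter adapter>       ; shuffles arguments, enters interpreter
//
// The stub is filled in first; only afterwards is the call instruction in the
// method body patched (one aligned store of its destination) to point here.
```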
Since AArch64 doesn't have instruction cache coherency, an `isb`
instruction was inserted in the stub path, which seems okay as the
code is essentially rolling into the interpreter anyway (see [JDK-8219993]).
[JDK-8219993]:
## Garbage collection
Some of the HotSpot garbage collectors are "stop the world"
collectors. Collectively, these modify object references appearing in
immediates in compiled code, inside of a safepoint. During this
safepoint (which may be viewed as a global epoch transition), every
Java-executing thread is inactivated, and will not reactivate without
running a `cpuid` instruction (see [JDK-8220351], as above).
Other garbage collectors, like ZGC and Shenandoah, are fully
concurrent. These prefer to modify object references in code during
concurrent execution. Here we rely on a mechanism we refer to as
method entry barriers. (In the source code, "nmethod" is the name for
a block of JIT-compiled code, for no clear reason; it just stuck. So
we also call these barriers "nmethod entry barriers", when we want to
be clear it's about compiled code.) An entry barrier is a guard
instruction at the entry point of each compiled method that checks if
we are allowed to call the method through a fast path; if it is
disarmed, the fast path is permitted, but if it is armed, control
from any attempted method entry is directed to some runtime support
routine.
When a GC phase changes, method entries start to take the slow path,
so HotSpot can "fix up" affected managed pointers and metadata (in
inline caches) before continuing execution. The guard itself is a
conditional branch. When the guard is triggered, we patch pointers in
the code as needed, and then disarm the guard by patching the
immediate in the conditional branch. This again relies on instruction
cache coherency (or perhaps memory freshness), and assumes the
effectiveness of patching immediates of an instruction. We uphold the
usual requirements: We don't cross cache line boundaries, and we
update the immediate with a single store to a naturally aligned
location.
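Schematically, the x86_64 entry barrier looks roughly like this (simplified
and illustrative; the real code is generated by the barrier-set assembler):

```
// entry:
//   cmp  dword ptr [r15 + <thread_disarmed_offset>], <guard_imm>  ; guard_imm is patchable
//   je   continue                              ; values match: disarmed, take the fast path
//   call <nmethod entry barrier runtime stub>  ; armed: fix up pointers, then disarm
// continue:
//   ... normal method body ...
//
// Disarming is a single aligned 32-bit store to <guard_imm>, performed after
// the managed pointer immediates in the method have been updated.
```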
Note that we cannot appeal to the "freshness" of the updated
instruction storage in the fully concurrent case, unless we contrive
to ensure that any concurrently entering thread has already
transitioned to a "freshness" epoch which comes after the updates of
the managed pointers. This is possible but requires a more complex
method entry barrier sequence.
In an upcoming release of ZGC, we will want to patch GC barrier code,
as well as object references. But no more assumptions are made about
the safety of doing that than already apply when patching the
object references.
On AArch64, where we do not have instruction cache coherency, method
entry barriers initially did not support patching object references,
and hence the object references were all moved to data, and loaded by
indirection every time they were used. However, with the new
upcoming ZGC release, a new guard has been designed that conditionally
executes an `isb` instruction around the first time that a compiled
method is invoked, per thread, since it was disarmed. The effect is
that each thread that passes the barrier will witness freshly created
code, even if it is a messily edited version of code from a previous
epoch. This allows fully synchronized cross-modifying code for
AArch64. (We could do the same thing on x86_64, if that helps, but it
will make method entry a little slower.) [Code for the new guard]
looks like this:
```
// If we patch code we need both a code patching and a loadload
// fence. It's not super cheap, so we use a global epoch mechanism
// to hide them in a slow path.
// The high level idea of the global epoch mechanism is to detect
// when any thread has performed the required fencing, after the
// last nmethod was disarmed. This implies that the required
// fencing has been performed for all preceding nmethod disarms
// as well. Therefore, we do not need any further fencing.
__ lea(rscratch2, ExternalAddress((address)&_patching_epoch));
// Embed an artificial data dependency to order the guard load
// before the epoch load.
__ orr(rscratch2, rscratch2, rscratch1, Assembler::LSR, 32);
// Read the global epoch value.
__ ldrw(rscratch2, rscratch2);
// Combine the guard value (low order) with the epoch value (high order).
__ orr(rscratch1, rscratch1, rscratch2, Assembler::LSL, 32);
// Compare the global values with the thread-local values.
Address thread_disarmed_and_epoch_addr(rthread, in_bytes(bs_nm->thread_disarmed_offset()));
__ ldr(rscratch2, thread_disarmed_and_epoch_addr);
__ cmp(rscratch1, rscratch2);
__ br(Assembler::EQ, skip_barrier);
```
[Code for the new guard]:
The main idea is that there is a global epoch counter, and a
thread-local epoch counter, combined with the normal guard value (part
of the word is the epoch and part is the guard value). At the entry of
the method, the thread-local guard and epoch value is compared against
the current global epoch and the guard value of the method. We pass
the fast path if the global epoch is consistent with the epoch of the
thread, and the guard is in a valid state for the current GC
phase. This way, we never let any further execution into modified code
happen without running a full `isb` instruction, but we only execute it
a small bounded number of times before switching to a fast path.
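The word layout assumed by that comparison can be sketched as follows
(illustrative only; the real packing is done by the assembly shown above):

```
#include <cstdint>

// High half: the patching epoch; low half: the per-nmethod guard value.
static inline uint64_t combine(uint32_t epoch, uint32_t guard) {
    return (uint64_t(epoch) << 32) | guard;
}

// Fast-path condition checked at method entry (cf. the cmp/br sequence above):
//   combine(global_patching_epoch, nmethod_guard)
//       == thread_local_disarmed_value_and_epoch
// On a mismatch the slow path runs an isb, refreshes the thread-local word,
// and re-checks before entering the method body.
```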
## Deoptimization
The highly dynamic nature of Java makes it a near certainty that some
fraction of all optimized code will become obsolete and need
reoptimization. Supporting this requires a way to quickly get all
threads to abandon obsolete code and fall back to a safer execution
mode, which is typically the interpreter. This process of falling
back is called _deoptimization_. A compiled method that is
deoptimized is destined by the JVM for a complicated process of safe
removal and replacement.
A first step in deoptimizing a method is patching its entry point to a
jump that re-resolves whatever call site got the thread to the entry
point. We refer to this action as making the compiled method
_not-entrant_. We do not rely on the effect being instantly observed
by other threads. Instead, we have a mechanism for gradually phasing
out the obsolete code. Eventually, a rendezvous is performed with all
threads in the system, forcing them to execute a `cpuid` instruction,
after which we rely on all instructions having been made observable.
(At this point they are all logically "in fresh memory".) Before that
point it's a nice bonus if the jump is observed, but it isn't
necessarily a problem if it is not observed.
The most dubious thing about this code is that it is the only place
where we patch right over existing instructions, and hope for the
best, on x86_64. On AArch64, there is a `nop` instruction that exists
simply to be overwritten with a jump; this transition is explicitly
supported by the programming manual. But on x86_64, there is a random
other instruction there. (Instructions at method entry perform tasks
related to frame setup, such as stack-banging to detect overflow.)
Moreover, the x86 code was written a long time ago, when we had 32 bit
atomicity only. The jump we want is 5 bytes. So the code goes through
different phases. First we patch in two self-looping 2-byte jumps
(total 4 bytes) with a single 4-byte store, followed by patching in
the 5th byte of the final jump (the last byte of the destination
offset), and then atomically writing the first 4 bytes (the jump
opcode plus the first three offset bytes) with another 4-byte store.
I think the idea was to have valid instructions on each step of the
road, but it does seem dubious at best. This particular use of
cross-modifying code seems extra dangerous, and is known to have
caused crashes in the wild, leading to the discovery of the Intel
erratum "Unsynchronized
Cross-Modifying Code Operations Can Cause Unexpected Instruction
Execution Results". It would appear that the wording of said erratum
suggests the use of unsynchronized cross-modifying code should not
result in crashes.
It seems very likely that this particular use of cross modifying code
could patch a 5-byte jump over a 5-byte `nop` instead, or even a
5-byte instruction that does useful frame setup work. At this point,
it seems risky to patch over any instruction that is smaller than 5
bytes.
The Graal JIT compiler uses an 8-byte natural store to update method
entry, which seems to be the right call here. Another option would be
to use a transactional CAS instruction of 8 bytes or even 16 bytes.
That might be a good call on 32-bit machines which do not guarantee
atomicity of natural 64-bit stores (if there are any such machines).
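A sketch of that style of patch (not Graal or HotSpot source; it assumes the
verified entry point is 8-byte aligned and that its first 8 bytes may be
replaced wholesale):

```
#include <atomic>
#include <cstdint>
#include <cstring>

// Replace the first 8 bytes at 'entry' with "jmp rel32" plus nop padding,
// using one naturally atomic 64-bit store.
void patch_entry_with_jump(uint8_t* entry /* 8-byte aligned */, int32_t rel32) {
    uint8_t bytes[8] = { 0xE9, 0, 0, 0, 0,     // jmp rel32 (displacement below)
                         0x90, 0x90, 0x90 };   // nop padding
    std::memcpy(&bytes[1], &rel32, sizeof rel32);

    uint64_t word;
    std::memcpy(&word, bytes, sizeof word);
    reinterpret_cast<std::atomic<uint64_t>*>(entry)
        ->store(word, std::memory_order_release);
}
```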
Deoptimizing a method requires more than making a method not-entrant:
It must also not be _re-entrant_ by return from a pending out-of-line
call. (A third case is dealing with methods which are executing their
random instructions at this very moment in some thread: That is
handled by roll-forward.) To handle this, the deoptimization logic
walks all frames in the system to ensure that frames of obsolete
compiled methods get a return barrier installed in their callee. This
return barrier prevents re-entry to frames of deoptimized methods, and
instead jumps to a deoptimization handler that replaces the frame with
something safer (an interpreter frame).
Project Loom introduces massively-scaling virtual threads, where the
stacks of millions of parked threads can be scattered across the Java
heap. In that state, those threads are impossible to locate apart
from an exhaustive heap walk, which is not practical. Thus, Loom
requires yet another deoptimization-related patching technique. Since
Loom, there are some `nop`s after each Java method call, which may be
patched into a re-entry barrier if that method must be deoptimized.
This is a scalable alternative to patching return addresses. It
ensures, after the global rendezvous with active threads has happened,
that any returns into frames unparked from the heap will call a
deoptimization handler, even though we could not visit them all
during the initial deoptimization step.
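Schematically, a post-call site with this reserved slot looks like the
following (illustrative shape only, not the exact HotSpot encoding):

```
//   call <compiled Java method>
//   nop                        ; reserved post-call slot, per call site
//   ... continuation ...
//
// To deoptimize the surrounding method, the nop is patched into a jump/call
// to the deoptimization handler, so any frame later unparked from the heap
// that returns to this point is intercepted instead of re-entering obsolete
// code.
```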