% How HotSpot cross-modifies code -- a summary
% Erik Österlund & John Rose (Oracle)
% May 2022 (draft version 0.3)
## Introduction
_Cross-modified code_ (abbreviated _CMC_) happens when one thread
edits, as normal data in memory, an instruction stream that may be
in the process of concurrent execution by another thread. This can lead to a
special kind of race condition, requiring that the relevant data
memory writes from the first thread will be properly detected by the
instruction fetch unit of the second thread. In general, it is
problematic, not only because of races when synchronization is
lacking, but also because the functional units that must communicate
are not designed to do so at speed.
Broadly speaking, some limited cross-modification of code is required
in order to bootstrap dynamically-linked code into processes (in
modern operating systems). Hardware vendors specify the narrow
conditions required for successful CMC. The basic requirement is that
all threads which might read a CMC-affected code stream must avoid
reading the edited code until the editing processor performs a
suitable specialized _CMC-release_ operation ("flush instructions").
At that point, in some globally serialized sense, all potentially
executing processors must perform a specialized _CMC-acquire_
operation, which cleanly discards all state in any instruction cache
that might be affected by any previous version of the edited data. It
is convenient to think of that CMC-release and CMC-acquire operation
as defining what might be called an _epoch boundary_, between an
"epoch of data editing" and an "epoch of executing new instructions".
A handshake must be used for each thread to move from the old epoch to
the new epoch. This is clearly a slow and disruptive operation.
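To make the epoch idea concrete, here is a minimal sketch (not HotSpot
source; the epoch counter and the `cmc_release`/`cmc_acquire` helpers are
hypothetical stand-ins for the platform primitives) of how a code-writing
thread and a code-executing thread might coordinate across such a boundary:

```
#include <atomic>
#include <cstdint>
#include <cstddef>

std::atomic<uint64_t> g_code_epoch{0};   // hypothetical global epoch counter
thread_local uint64_t t_code_epoch = 0;  // epoch this thread has acquired

void cmc_release() { /* platform-specific: e.g. clflush loop or icache clean */ }
void cmc_acquire() { /* platform-specific: e.g. cpuid on x86, isb on AArch64 */ }

// Writer: edit instructions as ordinary data, then open a new epoch.
void publish_edited_code(void* code, size_t len) {
    // ... store the new instruction bytes into [code, code+len) ...
    cmc_release();                                         // "flush instructions"
    g_code_epoch.fetch_add(1, std::memory_order_release);  // announce the new epoch
}

// Executor: before running possibly-edited code, join the current epoch.
void before_executing_code() {
    uint64_t e = g_code_epoch.load(std::memory_order_acquire);
    if (e != t_code_epoch) {
        cmc_acquire();        // discard stale instruction-cache state
        t_code_epoch = e;
    }
}
```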
It might seem correct that edits which are "small enough" do not need
an epoch change. For example, changing one byte, from one opcode to
another, without breaking the syntax of the instruction stream in
which it appears, would seem to be "small". Or changing one word that
is the immediate operand of a move or jump would seem to be "small".
And yet, in both cases, there is no guarantee that an instruction
fetch unit would "notice" the changes, ever, short of an explicit
CMC-acquire. What is even worse, there is no guarantee that the write
would not put an instruction fetch unit into a state which causes it to
execute neither the old instructions nor the new instructions, but
some combination which leads to unpredictable behavior.
An instruction fetcher might build a complex model of the instructions
it sees, which might be only partially invalidated if it takes an
interrupt, pauses, and then happens to see new data in memory.
As an extreme (though unlikely) example, unpredictable behavior could
happen if one thread ever fetched instructions as single bytes, and
without concern for their order of storing (that is, not using total
store order). For more unpredictability, if that thread fetches
instruction bytes in a random order (not just as it executes), it
could "see" and attempt to execute any mix of old and new instruction
bytes, regardless of any careful order in which they were stored.
Normal data races are relatively well understood, and have even been the
subject of [formal models]. But races which affect an instruction
fetch unit are not well specified or understood.
[formal models]:
Meanwhile, HotSpot has used cross modifying code techniques from its
very beginning, a quarter century ago. For example, it has always
performed dynamic linkage of call sites using CMC. There are several
other instances as well, which this document describes.
As a basic design principle, HotSpot "surfs the races" in a particular
way. HotSpot avoids synchronizing (both CMC and simple data updates)
along "hot paths", and accepts the inevitably resulting races that may
occur. HotSpot ensures that these races are benign by allowing
threads to execute (for CMC, or else, for data updates, read or write
data) from memory in both old and new states. (There can be multiple
new states as well.) The old and new states are made well-defined by
ensuring that state-changing writes are performed using naturally
atomic operations, often aligned single-word stores. (Using single
stores is an important detail; if
HotSpot mistakenly uses _N_ non-atomic byte-wise writes to move memory
from old to new states, many intermediate states, theoretically up to
2^_N_^-2 such states, could appear to racing processors. There were
bugs like this in the early days of HotSpot.) HotSpot uses actual
mutex lock/unlock or CAS operations only when it is necessary to get
additional separation between old and new states. Of course, to get
guaranteed forward progress, HotSpot needs, eventually, to completely
drive out the old states. This is obtained by occasionally performing
an expensive global epoch transition, a big "handshake" which HotSpot
calls a _safepoint_. More recently, other kinds of thread handshakes
have been defined. These handshakes are similar to (and may use) the
lower-level phenomenon of virtual memory "shootdown", where one thread
must, by some sort of strong signal, affect another thread's virtual
address mappings.
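As an illustration of the single-store rule (a sketch, not HotSpot source;
it assumes the patched immediate is naturally aligned), compare one aligned
store with the byte-wise variant that must be avoided:

```
#include <atomic>
#include <cstdint>

// One aligned 32-bit store: racing fetchers see either the old or the new
// immediate, never a mix.
void patch_imm32(uint32_t* site, uint32_t new_imm) {
    reinterpret_cast<std::atomic<uint32_t>*>(site)
        ->store(new_imm, std::memory_order_release);
}

// What must be avoided: four separate byte stores can expose up to
// 2^4 - 2 = 14 intermediate old/new mixes to a racing instruction fetcher.
void patch_imm32_broken(uint8_t* site, uint32_t new_imm) {
    for (int i = 0; i < 4; i++)
        site[i] = uint8_t(new_imm >> (8 * i));
}
```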
Those generalities seem reasonable, and in fact they usually work for
HotSpot. But they also work past the edges of the general assurances
provided by hardware system programming manuals, by not immediately
performing epoch transitions on every edit to instruction streams.
Instead, HotSpot makes reasonable assumptions about natural atomicity
of edits to instruction streams and the effects of races on such
edits. These assumptions are validated for each specific kind of
hardware that HotSpot executes on. Sometimes hardware is less
welcoming of the race conditions that HotSpot "surfs"; for example,
memory on some particular platform does not implement total store
order. In such cases, HotSpot is configured to use more conservative
techniques, which typically causes execution on the affected platform
to be somewhat less competitive. It is in HotSpot's interest to run
as fast as possible on each platform, so even when "de-racing"
techniques are developed for some platforms, they are not applied to
platforms that do not need them.
All of this can lead to unwelcome discoveries on new implementations or new
platforms. When hardware designers design to a very restrictive
system programming model, there can be occasional surprises where
atomicity or coherency can be lost during cross-modification of code.
Luckily for HotSpot, the prevailing use of cache-wise fetches (which is
natural for good hardware performance) tends to present only old or
new states to processors concurrently executing CMC in HotSpot. But
this is a _useful observation_ about _most_ platforms, rather than a
_guarantee_ about _all_ platforms.
To shed more light on these assumptions, and to allow them to be
evaluated so that future surprises can be avoided, this document
lists all the known cross-modifying code in HotSpot.
## Intel Assumptions
The Intel Software Developer Manual (the _SDM_) makes it clear, in
section 8.1.3, that both self-modifying and cross-modifying code
should be completely synchronized with a serializing instruction like
`cpuid`. In other words, every thread that can execute a dynamically
modified instruction, should first execute the `cpuid` instruction;
this moves it into a new "epoch" in which all current memory writes
will be correctly posted to the thread's instruction fetch unit.
The contract described by the SDM is hence more prohibitive than even
the AArch64 spec, which at least allows jumps to be patched over `nop`s
(and vice versa), without synchronization.
An unfortunate aspect of the contract offered by the SDM is that the
`cpuid` instruction serializes everything the core is doing, making it
possibly the slowest instruction there is. This can't be used in any
code appearing in a fast path. And unsurprisingly, all the uses of
cross-modifying code in the entire JVM, are there to make fast-path
code as fast as possible.
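For reference, this is roughly what issuing `cpuid` as a serializing
CMC-acquire looks like from C++ (a sketch assuming GCC/Clang inline
assembly, not HotSpot's own code):

```
// cpuid serializes the core: all prior stores become visible to this core's
// instruction fetch before any subsequent instruction is fetched.
static inline void serialize_with_cpuid() {
    unsigned a = 0, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                         :
                         : "memory");   // also acts as a compiler barrier
}
```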
But in practice Intel hardware has proved to support other, less
synchronized cases of cross-modifying code than the SDM describes. In
particular, naturally atomic (naturally aligned) writes of 32-bit and
64-bit instruction operand words in code streams are generally safe in
practice: The concurrently executing threads "see" only the old or new
states, and there is no disruption to the decoding of the instructions
which contain those words. Even 64-bit writes of whole instructions
at jump targets (in particular, method entry points) seem to be safe,
depending on the structure of the old and new instruction stream
(before and after the write). Other techniques, such as those which rely
on coherently ordered reads of multiple editing events, have worked
in the past, but require testing on new platforms.
The Intel erratum "Unsynchronized Cross-Modifying Code Operations Can
Cause Unexpected Instruction Execution Results" suggests that at least
one of HotSpot's current techniques, of performing unsynchronized CMC,
causes a crash. Some problems like that can be considered as CPU
bugs, but some of the fault might be in HotSpot, if the bug only
happens when doing unsynchronized CMC. In the end, HotSpot has two
conflicting goals: Do only what the hardware supports, and gain the
best possible performance on each of its platforms. Part of the
burden goes back to those hardware platforms which wish to preserve
and improve Java performance. Part of the burden is on HotSpot to use
techniques which have been properly documented to, and reviewed by,
the hardware developers. The Oracle/Intel collaborations have
historically been good at carrying those burdens.
These are the current general assumptions we on the HotSpot team have
made about cross-modifying code:
- Loading completely new instructions into freshly allocated memory
is safe, as long as a CMC-release is done after the instruction memory
is loaded.
> That is, a thread will not "peek" into fresh memory until a fresh
memory address has been provided explicitly. We also
require that the address of the instructions is explicitly
published by safe means, such as a naturally atomic 32- or 64-bit
store to data or to an instruction immediate operand.
- Previously used instruction memory can be recycled, as if it were
freshly allocated, as long as all processors that may have ever
executed in that memory, or that may have prefetched from it, have
stopped doing so, and have performed a CMC-acquire.
> In addition, we should (and probably will) ensure that the cache
lines of the recycled memory are disjoint from cache lines of
instruction memory currently in use. This concern arises when fresh
blocks of memory are pre-allocated in live HotSpot code regions, and
are patched with instructions during the same epoch that processors
may be executing or prefetching in neighboring cache lines.
- Modifying immediate operands of instructions not crossing cache
line boundaries, where those operands are naturally aligned, should
result in concurrent executions seeing either the old instruction with
the old immediate, or the old instruction with the new immediate.
> For HotSpot, either result would be correct, but observing the
new instruction operand would be faster. If an instruction fetcher on
some platform ever reads part of an operand, pauses, and then reads
the rest of the operand, this assumption would be violated.
- Modifying two instructions in order, where the second instruction
would only be executed if the modified version of the first
instruction is executed, will result in only the modified version of
the second instruction being executed. We refer to this as
_instruction cache coherency_.
> For example, in a compiled method with managed pointer
immediates, the GC may update those immediates in a batch while method
entry is paused; it must be possible to finish the batch edit and then
re-enable method entry by a second instruction patch, so that threads
passing through a disarmed method entry barrier must witness
coherently updated values in the managed pointer immediates. An
incoherent instruction cache might hold onto old immediates (unedited
by the GC) and feed them to threads passing through the disarmed
barrier. On platforms where this is a problem, it can be fixed with a
more expensive entry barrier, that can incorporate a CMC-acquire for
each entering thread.
- We assume that we can patch method entry sequences (and also method
_re-entry_ sequences, after embedded call sites) under some
conditions. Specifically, we expect to be able to patch jumps over
`nop`s (and vice versa), to arm (and disarm) entry barriers to methods
(or re-entry barriers, for re-entry points within methods).
> There is more detail below about method entry barriers. Ideally
(and moving forward) we use either naturally atomic word writes to
perform such patching, or else transactional instructions (such as
128-bit CAS), and we ensure that the affected word is naturally
aligned (not crossing a cache line boundary). Ideally, we also ensure
(though we do not at present) that exactly one old instruction is
replaced by exactly one new instruction, lest a thread stuck on an
instruction boundary inside the patched area try to execute part of
the interior of the new instruction.
## CMC release and acquire operations on Intel
Older code in HotSpot assumes we need to explicitly flush caches with
`clflush`, while some newer code assumes that, for cross-modifying-code
purposes, this does not help, and instead depends on the instruction
cache coherency aspect (making the data vs. instruction cache
synchronization issues less interesting). Thus, we tend to use
`clflush` for CMC-release, but since it is effective only for the
executing thread doing self-modifying code, we could omit `clflush` in
the case of a JIT-only thread.
In the case of self-modifying code where the thread itself expects to
be able to observe the new instructions, we generally invoke a `cpuid`
instruction as our CMC-acquire, but sometimes HotSpot resorts to
explicit cache flushing instead. We are moving towards using `cpuid`
instead of cache flushing, but we are not quite there yet.
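A sketch of a `clflush`-based CMC-release (illustrative only, not HotSpot's
actual helper; it assumes a 64-byte cache line and the standard
`_mm_clflush`/`_mm_mfence` intrinsics):

```
#include <immintrin.h>   // _mm_clflush, _mm_mfence
#include <cstdint>
#include <cstddef>

void flush_code_range(const void* start, size_t len) {
    const size_t line = 64;   // assumed cache-line size
    uintptr_t p   = reinterpret_cast<uintptr_t>(start) & ~uintptr_t(line - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(start) + len;
    for (; p < end; p += line)
        _mm_clflush(reinterpret_cast<const void*>(p));
    _mm_mfence();             // order the flushes before any publishing store
}
```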
## ARMv8 Assumptions
On ARMv8, we assume we can not rely on instruction cache coherency. We
also assume that we cannot patch immediates atomically. The
architecture gives only limited guarantees about patching, that
calls/jumps can replace `nop`s and vice versa. We aim to avoid other
kinds of instruction patching. When we patch, we assume that we need
to manually execute instruction cache flush on the modified
instructions (this plus possibly a data fence is our CMC-release). We
know we cannot rely on the changes becoming observable from other
threads until a subsequent rendezvous invoking `isb` barriers (our
CMC-acquire) on all threads. We do this occasionally (_question:
when?_) for correctness reasons.
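Sketched in C++ (GCC/Clang builtins and inline assembly; not the actual
HotSpot port code), the two halves of that protocol look roughly like this:

```
// Writer side (CMC-release): clean the data cache and invalidate the
// instruction cache over the modified range.
void aarch64_cmc_release(void* start, void* end) {
    __builtin___clear_cache(static_cast<char*>(start),
                            static_cast<char*>(end));
}

// Executor side (CMC-acquire): discard any stale prefetched/decoded state.
void aarch64_cmc_acquire() {
    __asm__ __volatile__("isb" ::: "memory");
}
```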
There are some situations where we still do things out of spec, and
just hope for the best, because doing anything else is a challenge.
More details below.
## Uses of Cross-Modifying-Code
In general, there are several cases of cross-modifying code in
HotSpot, and each one is there to support a fast path that would be
significantly slower if it used other techniques, such as querying
data variables to detect barriers or state changes. Here is the list:
- JIT compilation (into either totally fresh or recycled memory)
- one-shot barriers for deferred initialization (C1 class and field references)
- dynamic method linking (multi-state link and re-link of call sites)
- dynamic method call dispatch (adds more metadata and states to method linking)
- small preallocated "stubs" near active JIT code (launch pads for some calls, populated just before use)
- multi-shot method entry barriers (for methods that are temporarily paused or permanently not-entrant)
- method re-entry barriers (after out-of-line call sites, for not-entrant methods)
- managed pointer update (editing of immediates by the GC)
## JIT compilation
The elephant in the room regarding cross-modifying code is the
technique of JIT-compiling methods. The JIT-compiler writes
instructions into executable pages, and then publishes the compiled
code, such that it becomes available for execution by concurrent calls to
Java methods. Threads that execute the JIT-compiled methods do not
execute the `cpuid` instruction first, so this is a direct violation
of recommended practices, and is neither allowed by the Intel SDM nor
the ARMv8 spec.
To be fair, dynamically linked but statically compiled ABI-based
languages like C/C++ also seem to "surf" these race conditions, if a
dynamic linker does not "shoot down" all process threads when it loads
a new instruction segment. (_Question: Do they do that?_) The major
difference between the dynamic linkage of HotSpot JIT code and code
statically compiled for an ABI is that ABIs tend to specify data-based
publication of dynamically linked function entry points, so that an
ABI-compliant call will load a function pointer from a patchable
variable, rather than (as in HotSpot) execute a call or jump with a
patchable address immediate. Over most of HotSpot's career, and
probably even at present, the latency of method entry via a patchable
call instruction is less than the latency of method entry that first
fetches the address from a variable. This is especially true when the
complex linkage behavior of HotSpot methods must be supported.
(HotSpot call sites have many more states than those supported by ABI
dynamic linkers.)
Full compliance is tricky, in part because there is a chicken-and-egg
problem with installing CMC-acquires wherever they are required.
Suppose we place an expensive CMC-acquire instruction (`cpuid` on x86)
at the entry point of a new method, with the intention of removing it
when it is no longer needed. But that instruction itself is inserted
(and also hopefully removed) as cross-modifying code. So whatever we
emit in the JIT's code area (including synchronization code that
complies with the Intel SDM or some other platform's rules) would
already have to be synchronized through some other means.
A fully synchronizing poll of a global variable at every method entry
might help, but such things tend to cost a significant slice of
performance. Java, like most languages, is somewhat sensitive to the
overheads of procedure linkage.
## C1 deferred initialization
HotSpot actually runs with two JITs, a warmup JIT called C1 and a
performance JIT.
The C1 JIT is mostly used before we have really felt the need to
compile with the more optimizing compiler (usually C2, sometimes
Graal). In this more initial phase, it may happen that we require
information about some class which has not yet been loaded. (This is
common in Java, where the unit of dynamic loading and linking is the
class, not the DLL.) The information needed could be the offset of a
field (for a field read or write) or an object reference to the class
object (for a type test or reflective query).
C1 prefers to support this with one-time (fire and fix) execution
barriers which precede the instruction that would need the missing
information. When the barrier is hit, HotSpot finds the missing
information, triggering class loading (and even reporting errors) as
necessary. With the information (field offset or class metaobject) in
hand, an instruction after (or perhaps under) the barrier is patched.
Concurrent races that fire the barrier will synchronize in the HotSpot
runtime, and in the end the barrier will be disarmed by replacing it
with a `nop` or some other instruction.
In this same scenario, the AArch64 port simply emits traps to
deoptimize (discard and eventually recompile) the compiled method
whenever this code path is reached. This seems to be necessary
because (as described below) an AArch64 processor will sometimes fail
to observe the patching of the _second_ instruction even after the
_first_ barrier instruction is inactivated. However, deoptimizing
like this has a noticeable effect on startup and warmup times, since
it requires a full C1-level recompilation after any of these execution
barriers is hit.
The other platforms solve the problem with cross-modifying code as
described above. The execution barrier is a branch to the runtime,
which will ensure the needed information is loaded, then patch the
desired instructions into the compiled method, and subsequently
replace the branch with a `nop`, such that the new instruction
containing the loaded information is executed. This way we can patch
in the information that was missing rather than recompiling the entire
thing.
This code patching must be done with correct release fencing at each
point. It is especially delicate because it happens while other
threads are freely executing in the same method as the code being
patched. We must assume that, when the barrier is removed, the
patched instruction will be correctly observed.
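A sketch of that patch-then-publish order (hypothetical helper, not HotSpot
source; it assumes the guarding branch word is naturally aligned and the
same size as a `nop` encoding):

```
#include <atomic>
#include <cstdint>
#include <cstring>

void c1_patch_and_publish(uint8_t* patch_site, const uint8_t* new_insns,
                          size_t len, uint32_t* branch_word,
                          uint32_t nop_encoding) {
    // 1. Write the new instruction bytes; they are not yet reachable,
    //    because the guarding branch still bypasses them.
    std::memcpy(patch_site, new_insns, len);

    // 2. Order those stores before the publishing store (free under TSO on
    //    Intel; an explicit release is needed elsewhere).
    std::atomic_thread_fence(std::memory_order_release);

    // 3. Publish: replace the guarding branch with a nop in one aligned store.
    reinterpret_cast<std::atomic<uint32_t>*>(branch_word)
        ->store(nop_encoding, std::memory_order_relaxed);
}
```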
In order for this to work, we rely on instruction cache coherency. We
ensure the stores are ordered (easy on Intel, since it is a TSO
machine): the assumption is that we prepare some instructions that are
not yet reachable, and once they are prepared, we publish them by
patching an instruction. If the instruction fetcher observes the new
publishing instruction, then it will also observe the instructions
guarded by that patch. Except for
matters of scale and proximity, this is similar conceptually to
JIT-compilation. In the case of patching already-published methods,
the code being patched is conceptually in "fresh" memory, but it is
very close to instructions that are being actively executed (perhaps
by many threads). It appears that, on at least some platforms,
"fresh" (unused) memory emitted by a JIT might be "contaminated" by
nearby executions. There may be a scaling parameter, such as cache
line or page size, which would help us measure the separation required
on such platforms so that fresh memory stays fresh.
The issue [JDK-8223613] tracks alternatives to our current techniques
in C1 for patch-based initialization barriers. These alternatives are
not being pursued actively at present.
[JDK-8223613]:
## Dynamic method dispatch
Dynamic method dispatch is optimized by HotSpot with a technique
called _[inline caches]_. The basic idea is that you have a callsite
with some metadata and a destination. The metadata is embodied in an
instruction that sets a register, and the destination is the immediate
destination operand of a call instruction. The inline cache is
initially in the so-called _unlinked_ state, in which it points to
"resolution stub" as the call destination; this trampoline will call
into a linkage service routine provided by HotSpot. (This is similar
to ABIs where a dynamically linked call is initially directed to a
trampoline to the runtime linker.) The initial assumption is that
this call site will only need to access a single callee method, and so
the inline cache site is linked optimistically to the method that it
reaches first. We call the new state of the call site the
_monomorphic_ state, and in many cases that is the state it stays in
forever.
This transition is made by patching the immediates of the two
instructions, which are ensured to not span cache lines. The
assumption made is that the result of executing either of the two
instructions is going to be either execution of the old instruction,
or the new instruction, in either order. We don't rely on any effects
being made observable to the instruction fetcher, or that they are
made observable in any particular order, only that the result of
executing the instructions is some combination of the old and new
instructions.
(Specifically, it is harmless to set the new metadata value and then
run into the old linkage service routine, and it is also harmless to
run into the new method entry, but have the stale old metadata value;
both mismatches will take a slow path which will end up executing the
correct method.)
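Schematically (illustrative x86_64 shape only; the exact instructions and
registers differ in the real code), a monomorphic inline cache call site
looks like this:

```
// mov  rax, <cached_metadata>   ; naturally aligned 64-bit immediate, patchable
// call <destination>            ; naturally aligned 32-bit rel. immediate, patchable
//
// The unlinked -> monomorphic transition patches each immediate with a
// separate aligned store. A racing thread may observe any old/new combination
// of the two; every mismatch falls into a slow path that re-resolves the call
// and still ends up executing the correct method.
```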
If we call through this callsite again with a different receiver
object, we will find out that this callsite is actually megamorphic.
This happens when the callee method checks the metadata, and sees that
it does not (in fact) match the type of the receiver. In that case
the callee method jumps to the linkage service routine, where the JVM
again fixes things up. This is the second state change for the inline
cache call site, into the so-called _megamorphic_ state.
In this case, we cannot patch the inline cache directly to the third
state due to various races (between the third state and the first
two). Instead we emulate an atomic update of both components of the
inline cache, by JIT-compiling the new (third) inline cache state into
a transient buffer (another kind of stub). At suitable epoch
boundaries, these temporary buffers are copied back into the original
inline cache instructions. (The boundary requires that all
Java-executing threads that wake up from a safepoint must run a
`cpuid` instruction, at least since [JDK-8220351].) This trick is,
again, relying either on instruction cache coherency, or on the
temporary buffer storage being "fresh" memory with respect to all
instruction fetchers. The temporary stub is built first, and then
published by modifying the immediate of the call instruction in the
inline cache, to point at the stub. If the instruction fetcher
observes the updated call instruction that points at the stub, we
assume it will then observe the instructions we just stored in the
fresh memory of the stub.
[inline caches]:
[JDK-8220351]:
[JEP-8221828]:
Here is a summary of the inline cache states:
- _unlinked:_ not executed yet, or in the process of first linkage
- _monomorphic:_ optimistically pointing at the only method needed so far
- _megamorphic:_ pointing at a dynamic dispatch stub (v-table or i-table lookup plus tail-call)
These states are crossed with these other states which exist to avoid
races:
- _immediate_: metadata and call target are in the original instructions
- _buffered:_ metadata and call target are in a temporary buffer (reached by the original call)
There are a few more state variations due to the mixing of interpreted
and compiled methods; these are described in the next section. There
can be, in principle, yet more states which pertain to the safe
disposal of compiled code that is no longer useful.
Some people advocate a new method invocation scheme that moves away
from inline caches, in favour of more optimized megamorphic call data
structures (cf. [JEP-8221828]). Historically, the main job of the
inline cache has been to avoid indirect calls, because they have been
slow. Now indirect calls with dynamically monomorphic callsites are
fast on some platforms, and both vtable and itable calls can be
implemented efficiently without inline caches. The lack of support
for (on some platforms) instruction cache coherency and the
uncertainty of rules about "fresh" JIT memory, are part of the
motivation for reconsidering the use of inline caches.
## Static method dispatch
When we have statically dispatched calls, the callee method is already
known at compile time, so we don't need the full machinery of inline
caches. But this linkage of methods is still dynamic, and requires
trips through linkage service routines to resolve call sites.
Since the target method might not be loaded at compile time,
resolution of the callsite is still deferred until the first call
through this callsite. At that point, the concrete method is known,
and there could either be a compiled method for said callee, or we
might have to go into the interpreter.
A call to an interpreted method from compiled code requires passing a
reference to the target method into a compiler-to-interpreter adapter
stub. This adapter will lay down the outgoing arguments on the stack
in the array-like form required by the interpreter, and then jump into
the interpreter with a request to enter the target method, as
identified in a known linkage register (`rbx` on x86).
The fast path is, of course, compiled code calling compiled code,
which uses a register-based ABI-like calling sequence that does not
need any metadata. (Since all this about interpreter transitions is
true of inline caches as well as static call sites, the fast path for
an inline cache might use metadata to perform the receiver check as
described above.) Thus, the structure of a static call, in compiled
code, does not contain a metadata pointer setting instruction; that
would be just a useless interruption when the Java program has warmed
up.
So, when a static call site is linked to call the interpreter, a
metadata pointer setting instruction must be created, so that the
interpreter can be entered properly. Therefore, for every static call
that might need to enter the interpreter, there is a corresponding
launch pad (called a "stub", again), which is pre-allocated at the end of
the compiled method. This stub is a small amount of fresh memory
which is filled in during method resolution, containing a metadata
setting instruction that patches in the target method reference, and a
jump to the argument shuffling adapter described above. So when we
resolve the callsite to go to the interpreter, we first prepare the
stub, and then publish it by patching in the destination of the call
instruction to go to the stub (which in this case lives a few cache
lines down from the end of the compiled method code). Once again,
this relies on instruction cache coherency and/or rules about avoiding
"fresh" memory, and also atomicity of patching the call destination
in the main body of the compiled method.
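The shape of such a stub, schematically (illustrative x86_64 form only; the
real stub is emitted by the JIT and its exact encoding varies):

```
// stub (a few cache lines past the end of the compiled method):
//   mov  rbx, <Method* of the resolved callee>   ; metadata the interpreter needs
//   jmp  <compiled-to-interpreter adapter>       ; shuffles arguments, enters interpreter
//
// The stub is filled in first; only afterwards is the call instruction in the
// method body patched (one aligned store of its destination) to point here.
```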
Since AArch64 doesn't have instruction cache coherency, an `isb`
instruction was inserted in the stub path, which seems okay as the
code is essentially rolling into the interpreter anyway (see [JDK-8219993]).
[JDK-8219993]:
## Garbage collection
Some of the HotSpot garbage collectors are "stop the world"
collectors. Collectively, these modify object references appearing in
immediates in compiled code, inside of a safepoint. During this
safepoint (which may be viewed as a global epoch transition), every
Java-executing thread is inactivated, and will not reactivate without
running a `cpuid` instruction (see [JDK-8220351], as above).
Other garbage collectors, like ZGC and Shenandoah, are fully
concurrent. These prefer to modify object references in code during
concurrent execution. Here we rely on a mechanism we refer to as
method entry barriers. (In the source code, "nmethod" is the name for
a block of JIT-compiled code, for no clear reason; it just stuck. So
we also call these barriers "nmethod entry barriers", when we want to
be clear it's about compiled code.) An entry barrier is a guard
instruction at the entry point of each compiled method that checks if
we are allowed to call the method through a fast path; if it is
disarmed, the fast path is permitted, but if it is armed, control
from any attempted method entry is directed to some runtime support
routine.
When a GC phase changes, method entries start to take the slow path,
so HotSpot can "fix up" affected managed pointers and metadata (in
inline caches) before continuing execution. The guard itself is a
conditional branch. When the guard is triggered, we patch pointers in
the code as needed, and then disarm the guard by patching the
immediate in the conditional branch. This again relies on instruction
cache coherency (or perhaps memory freshness), and assumes the
effectiveness of patching immediates of an instruction. We uphold the
usual requirements: We don't cross cache line boundaries, and we
update the immediate with a single store to a naturally aligned
location.
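Schematically, the x86_64 entry barrier looks roughly like this (simplified
and illustrative; the real code is generated by the barrier-set assembler):

```
// entry:
//   cmp  dword ptr [r15 + <thread_disarmed_offset>], <guard_imm>  ; guard_imm is patchable
//   je   continue                              ; values match: disarmed, take the fast path
//   call <nmethod entry barrier runtime stub>  ; armed: fix up pointers, then disarm
// continue:
//   ... normal method body ...
//
// Disarming is a single aligned 32-bit store to <guard_imm>, performed after
// the managed pointer immediates in the method have been updated.
```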
Note that we cannot appeal to the "freshness" of the updated
instruction storage in the fully concurrent case, unless we contrive
to ensure that any concurrently entering thread has already
transitioned to a "freshness" epoch which comes after the updates of
the managed pointers. This is possible but requires a more complex
method entry barrier sequence.
In an upcoming release of ZGC, we will want to patch GC barrier code,
as well as object references. But no more assumptions are made about
the safety of doing that than already apply when patching the
object references.
On AArch64, where we do not have instruction cache coherency, method
entry barriers initially did not support patching object references,
and hence the object references were all moved to data, and loaded by
indirection every time they were used. However, with the new
upcoming ZGC release, a new guard has been designed that conditionally
executes an `isb` instruction around the first time that a compiled
method is invoked, per thread, since it was disarmed. The effect is
that each thread that passes the barrier will witness freshly created
code, even if it is a messily edited version of code from a previous
epoch. This allows fully synchronized cross-modifying code for
AArch64. (We could do the same thing on x86_64, if that helps, but it
will make method entry a little slower.) [Code for the new guard]
looks like this:
```
// If we patch code we need both a code patching and a loadload
// fence. It's not super cheap, so we use a global epoch mechanism
// to hide them in a slow path.
// The high level idea of the global epoch mechanism is to detect
// when any thread has performed the required fencing, after the
// last nmethod was disarmed. This implies that the required
// fencing has been performed for all preceding nmethod disarms
// as well. Therefore, we do not need any further fencing.
__ lea(rscratch2, ExternalAddress((address)&_patching_epoch));
// Embed an artificial data dependency to order the guard load
// before the epoch load.
__ orr(rscratch2, rscratch2, rscratch1, Assembler::LSR, 32);
// Read the global epoch value.
__ ldrw(rscratch2, rscratch2);
// Combine the guard value (low order) with the epoch value (high order).
__ orr(rscratch1, rscratch1, rscratch2, Assembler::LSL, 32);
// Compare the global values with the thread-local values.
Address thread_disarmed_and_epoch_addr(rthread, in_bytes(bs_nm->thread_disarmed_offset()));
__ ldr(rscratch2, thread_disarmed_and_epoch_addr);
__ cmp(rscratch1, rscratch2);
__ br(Assembler::EQ, skip_barrier);
```
[Code for the new guard]:
The main idea is that there is a global epoch counter, and a
thread-local epoch counter, combined with the normal guard value (part
of the word is the epoch and part is the guard value). At the entry of
the method, the thread-local guard and epoch value is compared against
the current global epoch and the guard value of the method. We pass
the fast path if the global epoch is consistent with the epoch of the
thread, and the guard is in a valid state for the current GC
phase. This way, we never let any further execution into modified code
happen without running a full `isb` instruction, but we only execute it
a small bounded number of times before switching to a fast path.
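The word layout assumed by that comparison can be sketched as follows
(illustrative only; the real packing is done by the assembly shown above):

```
#include <cstdint>

// High half: the patching epoch; low half: the per-nmethod guard value.
static inline uint64_t combine(uint32_t epoch, uint32_t guard) {
    return (uint64_t(epoch) << 32) | guard;
}

// Fast-path condition checked at method entry (cf. the cmp/br sequence above):
//   combine(global_patching_epoch, nmethod_guard)
//       == thread_local_disarmed_value_and_epoch
// On a mismatch the slow path runs an isb, refreshes the thread-local word,
// and re-checks before entering the method body.
```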
## Deoptimization
The highly dynamic nature of Java makes it a near certainty that some
fraction of all optimized code will become obsolete and need
reoptimization. Supporting this requires a way to quickly get all
threads to abandon obsolete code and fall back to a safer execution
mode, which is typically the interpreter. This process of falling
back is called _deoptimization_. A compiled method that is
deoptimized is destined by the JVM for a complicated process of safe
removal and replacement.
A first step in deoptimizing a method is patching its entry point to a
jump that re-resolves whatever call site got the thread to the entry
point. We refer to this action as making the compiled method
_not-entrant_. We do not rely on the effect being instantly observed
by other threads. Instead, we have a mechanism for gradually phasing
out the obsolete code. Eventually, a rendezvous is performed with all
threads in the system, forcing them to execute a `cpuid` instruction,
after which we rely on all instructions having been made observable.
(At this point they are all logically "in fresh memory".) Before that
point it's a nice bonus if the jump is observed, but it isn't
necessarily a problem if it is not observed.
The most dubious thing about this code is that it is the only place
where we patch right over existing instructions, and hope for the
best, on x86_64. On AArch64, there is a `nop` instruction that exists
simply to be overwritten with a jump; this transition is explicitly
supported by the programming manual. But on x86_64, there is a random
other instruction there. (Instructions at method entry perform tasks
related to frame setup, such as stack-banging to detect overflow.)
Moreover, the x86 code was written a long time ago, when we had 32 bit
atomicity only. The jump we want is 5 bytes. So the code goes through
different phases. First we patch in two self-looping 2-byte jumps
(total 4 bytes) with a single 4-byte store, followed by patching in
the 5th byte of the final jump (the last byte of the destination
offset), and then atomically writing the first 4 bytes (the jump
opcode plus the first three offset bytes) with another 4-byte store.
I think the idea was to have valid instructions on each step of the
road, but it does seem dubious at best. This particular use of
cross-modifying code seems extra dangerous, and is known to have
caused crashes in the wild, leading to the discovery of the Intel
erratum "Unsynchronized
Cross-Modifying Code Operations Can Cause Unexpected Instruction
Execution Results". It would appear that the wording of said erratum
suggests the use of unsynchronized cross-modifying code should not
result in crashes.
It seems very likely that this particular use of cross modifying code
could patch a 5-byte jump over a 5-byte `nop` instead, or even a
5-byte instruction that does useful frame setup work. At this point,
it seems risky to patch over any instruction that is smaller than 5
bytes.
The Graal JIT compiler uses an 8-byte natural store to update method
entry, which seems to be the right call here. Another option would be
to use a transactional CAS instruction of 8 bytes or even 16 bytes.
That might be a good call on 32-bit machines which do not guarantee
atomicity of natural 64-bit stores (if there are any such machines).
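A sketch of that style of patch (not Graal or HotSpot source; it assumes the
verified entry point is 8-byte aligned and that its first 8 bytes may be
replaced wholesale):

```
#include <atomic>
#include <cstdint>
#include <cstring>

// Replace the first 8 bytes at 'entry' with "jmp rel32" plus nop padding,
// using one naturally atomic 64-bit store.
void patch_entry_with_jump(uint8_t* entry /* 8-byte aligned */, int32_t rel32) {
    uint8_t bytes[8] = { 0xE9, 0, 0, 0, 0,     // jmp rel32 (displacement below)
                         0x90, 0x90, 0x90 };   // nop padding
    std::memcpy(&bytes[1], &rel32, sizeof rel32);

    uint64_t word;
    std::memcpy(&word, bytes, sizeof word);
    reinterpret_cast<std::atomic<uint64_t>*>(entry)
        ->store(word, std::memory_order_release);
}
```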
Deoptimizing a method requires more than making a method not-entrant:
It must also not be _re-entrant_ by return from a pending out-of-line
call. (A third case is dealing with methods which are executing their
random instructions at this very moment in some thread: That is
handled by roll-forward.) To handle this, the deoptimization logic
walks all frames in the system to ensure that frames of obsolete
compiled methods get a return barrier installed in their callee. This
return barrier prevents re-entry to frames of deoptimized methods, and
instead jumps to a deoptimization handler that replaces the frame with
something safer (an interpreter frame).
Project Loom introduces massively-scaling virtual threads, where the
stacks of millions of parked threads can be scattered across the Java
heap. In that state, those threads are impossible to locate apart
from an exhaustive heap walk, which is not practical. Thus, Loom
requires yet another deoptimization-related patching technique. Since
Loom, there are some `nop`s after each Java method call, which may be
patched into a re-entry barrier if that method must be deoptimized.
This is a scalable alternative to patching return addresses. It
ensures, after the global rendezvous with active threads has happened,
that any returns into frames unparked from the heap will call a
deoptimization handler, even though we could not visit them all
during the initial deoptimization step.
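Schematically, a post-call site with this reserved slot looks like the
following (illustrative shape only, not the exact HotSpot encoding):

```
//   call <compiled Java method>
//   nop                        ; reserved post-call slot, per call site
//   ... continuation ...
//
// To deoptimize the surrounding method, the nop is patched into a jump/call
// to the deoptimization handler, so any frame later unparked from the heap
// that returns to this point is intercepted instead of re-entering obsolete
// code.
```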