Thus much higher performance than Fastswap, especially for
fine-grained objects.
Requires re-implementation of program
Compiler-based remote memory
Possible to achieve performance and programmer transparency using
modern compiler transformations and analysis
TrackFM compiler/runtime is fully transparent, with AIFM as a backend
TrackFM Design
Reuses the AIFM far memory runtime
and automates its integration into the application
Aims to transform C/C++ applications to use remote memory
automatically
Uses LLVM-based, middle-end analysis and transformations to remote
certain mallocs via AIFM
Produces modified binary that runs on a far memory cluster
Transformations take place at IR level
Primary obstacle is navigating the semantic gap between the application
developer’s knowledge of data structures and what the compiler sees at
the granularity of memory accesses.
TrackFM must automatically determine the mapping of memory
allocations to AIFM objects (drawing
boundaries around chunks of contiguous memory allocations)
This means that any heap-allocated data structure may be swapped
out
Whether they are swapped out depends on temporal access patterns
(hot regions kept local, etc.)
TrackFM must transform pointers to work in AIFM, but pointers may easily escape to
external libraries, which cannot handle AIFM pointers
A library may try to access remote memory that isn’t yet localized by
the TrackFM runtime
Could do:
Programmers run all libraries through TrackFM runtime
Only allow pre-transformed versions provided by TrackFM
Thus all pointers are made remote aware (and can access either
remote or local memory)
Once tracked, translation/indirection layer between remote and local
pointers, so that local can be accessed correctly
Far Memory Pointer
Transformation
TrackFM must manage all heap allocations
Uses non-canonical x86 addresses (60th bit) to note if pointer is
local or remote
Uses indirection layer to guard accesses to pointers
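A minimal sketch of the tagging idea, modelling 64-bit pointers as Python integers. The helper names (`tag`, `is_tracked`, `untag`) are hypothetical, and reading "60th bit" as bit index 60 is an assumption; the notes only say that a non-canonical bit marks the pointer.

```python
# Sketch only: pointers are modelled as 64-bit unsigned integers.
TAG_BIT = 1 << 60            # assumed position of the marker bit
ADDR_MASK = (1 << 64) - 1    # keep values within 64 bits


def tag(addr: int) -> int:
    """Mark an address as TrackFM-managed by setting the tag bit."""
    return (addr | TAG_BIT) & ADDR_MASK


def is_tracked(ptr: int) -> bool:
    """True if the pointer carries the tag bit."""
    return bool(ptr & TAG_BIT)


def untag(ptr: int) -> int:
    """Recover the original (canonical) address."""
    return ptr & ~TAG_BIT & ADDR_MASK


heap_addr = 0x0000_7F12_3456_7000   # a canonical user-space address
p = tag(heap_addr)
```

With 48-bit virtual addressing, a canonical user-space address has bits 47 to 63 all zero, so setting bit 60 makes the value non-canonical. Dereferencing it unguarded would fault, which is why every access has to go through the indirection layer.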
The indirection layer is compiler-injected via LLVM passes that
introduce high-level and program-wide abstractions, including:
Pointer guards
Finds load and store instructions that correspond with malloc, and
then marks as eligible for guard transformation
Candidate heap pointers then transformed by guard transformation
pass
Loop Chunking
Libc Transformation
Transforms all mem alloc calls in libc into TrackFM managed memory
runtime calls
Leverages AIFM region-based
allocator to allocate remotable memory
Bridging AIFM with Compiler
Must somehow transform contiguous heap allocations into fixed-size
AIFM objects that can either be in local or remote state.
TrackFM extends AIFM with an abstract class that compiler uses to
capture all remotable allocations
Captures remotable allocations and attaches them to a runtime-managed
object pool
This pool represents the total far memory that an application can
use
Interposes allocation sites and chunks allocations into objects in
the global pool at runtime
Object Size Selection
Constrained to choose a single object size at compile time for the
entire application
Auto-tuning approach is feasible (could just try sizes from 2^6 to 2^12 bytes and recompile for each)
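The sweep could be sketched as below. `measure` stands in for "recompile with this object size and time the application"; it is hypothetical, and `fake_measure` is a made-up placeholder cost, not a measured result.

```python
from typing import Callable

# Candidate object sizes: powers of two from 2^6 (64 B) to 2^12 (4096 B).
CANDIDATE_SIZES = [2 ** k for k in range(6, 13)]


def autotune(measure: Callable[[int], float]) -> int:
    """Return the candidate object size with the lowest measured cost."""
    return min(CANDIDATE_SIZES, key=measure)


def fake_measure(object_size: int) -> float:
    """Placeholder cost curve (an assumption) so the sketch runs."""
    return abs(object_size - 256)


best = autotune(fake_measure)
```

Only seven candidates exist in this range, so an exhaustive sweep costs seven recompiles and benchmark runs.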
Object State Table
AIFM requires a metadata reference and an object reference
TrackFM eliminates need for metadata reference by using a cache state table, using index
calculation instead of indirect memory reference
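A rough sketch of what "index calculation instead of indirect memory reference" could look like. It assumes a contiguous managed heap starting at a hypothetical `HEAP_BASE` and a power-of-two object size, so the table index is a subtract and a shift; the real table layout is not described in these notes.

```python
OBJECT_SIZE = 4096                           # assumed; fixed at compile time
OBJECT_SHIFT = OBJECT_SIZE.bit_length() - 1  # log2 of a power of two
HEAP_BASE = 0x1000_0000                      # hypothetical heap start


def state_index(addr: int) -> int:
    """Map an address to its state table slot with pure arithmetic."""
    return (addr - HEAP_BASE) >> OBJECT_SHIFT


# One entry per object; every object starts out remote in this sketch.
state_table = ["remote"] * 8
state_table[state_index(HEAP_BASE + 2 * OBJECT_SIZE + 100)] = "local"
```

No pointer has to be loaded to find the metadata: the address alone determines the slot.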
Guards
Custody check
Checks whether pointer is managed by TrackFM
If not, jump straight to the target load/store
If so, perform a table lookup to find the state table entry
corresponding to the AIFM object
Fast Path Guard
Checks if object is guaranteed to be local, prevents evacuation of
object until load/store
Slow Path Guard
If unsafe, call into the TrackFM runtime, which calls into the AIFM
runtime to dereference the object (may involve a remote fetch)
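The three-stage guard can be sketched as follows. `Runtime` is a hypothetical stand-in for the TrackFM/AIFM runtime, the tag bit position and the object shift are assumptions, and the heap is assumed to start at address 0 so the index is a plain shift. Pinning the object against evacuation is left out.

```python
TAG_BIT = 1 << 60      # assumed marker for TrackFM-managed pointers
OBJECT_SHIFT = 12      # assumed 4 KB objects


class Runtime:
    """Hypothetical stand-in for the TrackFM/AIFM runtime."""

    def __init__(self, num_objects: int) -> None:
        self.local = [False] * num_objects  # state table: is object local?
        self.fetches = 0                    # number of simulated remote fetches

    def slow_path(self, idx: int) -> None:
        """Simulate dereferencing through the runtime (remote fetch)."""
        self.fetches += 1
        self.local[idx] = True


def guard(ptr: int, rt: Runtime) -> str:
    """Run the guard for one access and report which path it took."""
    if not ptr & TAG_BIT:                    # custody check
        return "untracked"
    idx = (ptr & ~TAG_BIT) >> OBJECT_SHIFT   # state table lookup
    if rt.local[idx]:                        # fast path: already local
        return "fast"
    rt.slow_path(idx)                        # slow path: call into runtime
    return "slow"


rt = Runtime(num_objects=4)
p = TAG_BIT | (1 << OBJECT_SHIFT) | 8        # tracked pointer into object 1
```

The first access to `p` takes the slow path and localizes the object; repeated accesses then hit the fast path, and untagged pointers skip the guard body entirely.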
Loop Overheads
Compiler can determine the induction variable of a loop, so it can tell
when per-element guards over a run of array elements are redundant.
Instead of guarding every access, it places locality guards at object
boundaries to check that the entire chunk is safe.
Also uses prefetching since it can detect sequential access at
compile-time. This is a strength of compiler-based over
kernel-based.
Uses a simple cost model to decide whether the iteration space is large
enough for chunking to beat the simple per-access guard approach.
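To make the saving concrete, the sketch below counts guards for a sequential array traversal under both schemes and applies a toy cost model. The cost constants (`guard_cost`, `boundary_cost`, `setup_cost`) are invented for illustration and are not TrackFM's actual model; the array is assumed to start on an object boundary.

```python
def naive_guards(n_elems: int) -> int:
    """One guard per element access."""
    return n_elems


def chunked_guards(n_elems: int, elem_size: int, object_size: int) -> int:
    """One locality guard per object the traversal touches."""
    total_bytes = n_elems * elem_size
    return -(-total_bytes // object_size)   # ceiling division


def should_chunk(n_elems: int, elem_size: int, object_size: int,
                 guard_cost: float = 1.0, boundary_cost: float = 3.0,
                 setup_cost: float = 50.0) -> bool:
    """Toy cost model: chunk only if it beats per-access guards."""
    chunked = setup_cost + boundary_cost * chunked_guards(
        n_elems, elem_size, object_size)
    naive = guard_cost * naive_guards(n_elems)
    return chunked < naive
```

For 1,000,000 four-byte elements and 4096-byte objects, chunking needs 977 boundary guards instead of 1,000,000 per-element guards. For an 8-element loop the fixed setup cost dominates in this toy model, so the simple guards stay.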
Key Findings
Compiler based approach to software-based far memory
Path to simultaneous performance and programmer transparency
Modern compiler analysis and transformation automatically transforms
existing applications to support far memory
New compiler analysis and transformation passes that improve
performance for target applications
Design and implementation of TrackFM, compiler-based far memory
system
Near parity with AIFM performance
Hybrid (kernel + compiler) would be promising as kernel-based
performs well with temporal locality
Hardware that compiler could manage (similar to page mechanism)
might be beneficial
Critique/Gaps
Future work could allow for multiple object sizes (only a few fixed
object sizes make sense, likely to be powers of 2 from 64B to 4KB)
Capture inter-procedural data structure semantics
Languages whose memory semantics more closely match those of far
memory (Rust)
Remote fetching for trivial computation is wasteful, so offload to
remote node to employ near-data processing (using static analysis)
Profiling stage to prune set of heap allocations available for
remoting based on access frequency (similar to MaPHeA)
Questions
If transformations take place at IR level, does that mean that any
LLVM frontend can make use of TrackFM?
Could Derive actually be a
great test bed for trying to make all libraries TrackFM aware?
diff --git a/lit-reviews/tauro2024trackfm.md b/lit-reviews/tauro2024trackfm.md
index 6311217..bc8bdd8 100644
--- a/lit-reviews/tauro2024trackfm.md
+++ b/lit-reviews/tauro2024trackfm.md
@@ -6,10 +6,6 @@ bibliography: bibliography/references.bib
csl: bibliography/chicago-author-date.csl
---
-## Progress
-
-pg. 4/19
-
## Source Information
- **Title:** TrackFM: Far-out Compiler Support for a Far Memory World
@@ -139,6 +135,46 @@ Largest gap is remote pointer transformation, so:
- Interposes allocation sites and chunks allocations into objects in the
global pool at runtime
+#### Object Size Selection
+
+- Constrained to choose a single object size at compile time for the entire
+ application
+- Auto tuning approach is feasible (could just try from $2^6$ to $2^{12}$ and
+ recompile for each)
+
+#### Object State Table
+
+- AIFM requires a metadata reference and an object reference
+- TrackFM eliminates need for metadata reference by using a
+ [[../concepts/cache]] state table, using index calculation instead of indirect
+ memory reference
+
+### Guards
+
+- Custody check
+ - Checks whether pointer is managed by TrackFM
+ - If not jump to target load/store
+ - If true perform table lookup to find state table entry corresponding to
+ AIFM object
+- Fast Path Guard
+ - Checks if object is guaranteed to be local, prevents evacuation of object
+ until load/store
+- Slow Path Guard
+ - If unsafe, call into TrackFM runtime $\rightarrow$ call into AIFM runtime to
+ dereference object (may involve remote fetch)
+
+### Loop Overheads
+
+Compiler can determine induction variable of a loop, so it can know if guards
+within a bunch of array elements are redundant. So instead it uses locality
+guards at object boundaries to check that the entire chunk is safe.
+
+Also uses prefetching since it can detect sequential access at compile-time.
+This is a strength of compiler-based over kernel-based.
+
+Uses a simple cost model to understand if iteration space is large enough to be
+faster than the simple guard approach.
+
## Key Findings
- Compiler based approach to software-based far memory
@@ -148,10 +184,22 @@ Largest gap is remote pointer transformation, so:
- New compiler analysis and transformation passes that improve performance for
target applications
- Design and implementation of TrackFM, compiler-based far memory system
+- Near parity with AIFM performance
+- Hybrid (kernel + compiler) would be promising as kernel-based performs well
+ with temporal locality
+- Hardware that compiler could manage (similar to page mechanism) might be
+ beneficial
## Critique/Gaps
--
+- Future work could allow for multiple object sizes (only a few fixed object
+ sizes make sense, likely to be powers of 2 from 64B to 4KB)
+- Capture inter-procedural data structure semantics
+- Languages whose memory semantics more closely match those of far memory (Rust)
+- Remote fetching for trivial computation is wasteful, so offload to remote node
+ to employ near-data processing (using static analysis)
+- Profiling stage to prune set of heap allocations available for remoting based
+ on access frequency (similar to MaPHeA)
## Questions
@@ -159,9 +207,12 @@ Largest gap is remote pointer transformation, so:
frontend can make use of TrackFM?
- Could [Derive](http://derivelinux.org) actually be a great test bed for trying
to make all libraries TrackFM aware?
+- So AIFM can't use libc++ objects?
## Relations
- [[../concepts/locality]]
+- [[../concepts/DiLOS]]
+- [[../concepts/MaPHeA]]
## References (if any)