This post is the script of my presentation of our work GoPTX at the DAC conference.
Good afternoon, everyone. Thank you for joining this session. I’m xxx, working with the talented arcSYSu team from Sun Yat-sen University. Today, I’m excited to present GoPTX, our work on fine-grained GPU kernel fusion through PTX-level instruction flow weaving.
As we know, GPUs are pivotal in HPC and AI, achieving high throughput via massive numbers of parallel threads grouped into warps. However, their ILP is often constrained by warp stalls. As illustrated, warps frequently stall, forcing schedulers to switch contexts and underutilize the hardware. On an A100, each warp switch takes about 5ns, roughly the latency of a floating-point multiply.
Our analysis reveals that scoreboard stalls are the major bottleneck. The scoreboard, a hardware mechanism that manages data hazards for out-of-order execution, leaves pipeline bubbles behind. Even in theory, traditional compile-time or runtime reordering cannot fully overlap these delays. This led to our key insight: can we fill these bubbles with useful work from another concurrent kernel?
By weaving independent instruction flows, we increase the dependency length (the distance between dependent instructions) and thus ILP. This example shows how weaving transforms two kernels into one with improved latency hiding: after weaving, the dependency length increases from 1 to 2. Meanwhile, weaving two kernels into one also enforces intra-SM resource sharing, which improves utilization. Hardware solutions such as hyper-threading are costly for GPUs, so we implement this purely in software.
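As a minimal sketch of the idea (the PTX instructions below are illustrative examples of mine, not taken from the paper), weaving two streams round-robin doubles the distance between dependent instructions:

```python
# Illustrative PTX snippets (hypothetical, for exposition only).
# Within each kernel, every instruction depends on the previous one,
# so the dependency length is 1 and each issue waits on the last result.
kernel_a = [
    "mul.f32 %fa1, %fa0, %fa0;",
    "add.f32 %fa2, %fa1, %fa0;",  # depends on %fa1, one slot earlier
]
kernel_b = [
    "mul.f32 %fb1, %fb0, %fb0;",
    "add.f32 %fb2, %fb1, %fb0;",
]

# After round-robin weaving, dependent pairs sit two slots apart
# (dependency length 2): the other kernel's instruction fills the bubble.
woven = [ins for pair in zip(kernel_a, kernel_b) for ins in pair]
print("\n".join(woven))
```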
However, software-based kernel instruction weaving faces key challenges. First, complex control flow (if/while/break) in CUDA C necessitates working at the IR level. Second, deadlock risks in multi-threaded kernels require careful handling. Third, instruction scheduling is NP-hard, demanding efficient heuristics. Finally, NVIDIA provides no PTX-processing tools, forcing a custom implementation.
To address these challenges, we propose GoPTX, a novel design for kernel fusion that improves ILP by deliberately weaving instructions at the PTX level. GoPTX first preprocesses the PTX code, appending unique register suffixes to avoid name conflicts. The PTX codes are then converted into equivalent control flow graphs (CFGs) and merged while preserving semantics. Thirdly, GoPTX weaves instructions in a latency-aware manner, guided by microbenchmarks. Finally, it postprocesses the result into valid PTX output.
First, GoPTX ensures semantic correctness through its CFG merging algorithm. Since each CFG node has at most two branch targets (if-else), merging two nodes from different CFGs yields 2x2 = 4 possible combinations. As illustrated here, this requires inserting two dummy nodes for branch coordination. Our algorithm exhaustively explores all valid combinations in $O(N^2)$ time.
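A simplified sketch of the combination enumeration (my own illustration based on the description above, not the paper's exact algorithm):

```python
from itertools import product

# Each CFG node has at most two successors (taken / not-taken), so merging
# a node pair (a, b) must cover every combination of their branch outcomes.
succs_a = {"A0": ["A1", "A2"], "A1": [], "A2": []}
succs_b = {"B0": ["B1", "B2"], "B1": [], "B2": []}

def merged_successors(a, b):
    """Enumerate the up-to-4 successor pairs of merged node (a, b).

    When the two sides branch differently, a real implementation inserts
    dummy nodes so that each side can wait for the other to re-converge.
    """
    sa = succs_a[a] or [None]      # None: this side has already exited
    sb = succs_b[b] or [None]
    return list(product(sa, sb))   # at most 2 x 2 = 4 combinations

print(merged_successors("A0", "B0"))
# [('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1'), ('A2', 'B2')]
```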
Since instruction scheduling is NP-hard, we adopt a heuristic approach to generate optimized inputs for ptxas. The weaving process works like a card shuffle: in each iteration, GoPTX greedily selects instructions from the flow with lower cumulative latency (measured via microbenchmarking) to maximize pipeline efficiency.
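Here is a minimal sketch of that greedy selection, assuming made-up per-instruction latencies (GoPTX measures real ones via microbenchmarks):

```python
# Two instruction flows as (opcode, latency-in-cycles) pairs; the
# latencies here are invented for illustration.
flow_a = [("ld.global.f32", 30), ("add.f32", 4), ("st.global.f32", 20)]
flow_b = [("mul.f32", 4), ("fma.rn.f32", 4), ("st.global.f32", 20)]

def weave(a, b):
    """Greedily take the next instruction from the flow whose
    cumulative latency so far is lower, like a card shuffle."""
    woven, ia, ib = [], 0, 0
    lat_a = lat_b = 0
    while ia < len(a) or ib < len(b):
        take_a = ib >= len(b) or (ia < len(a) and lat_a <= lat_b)
        if take_a:
            op, lat = a[ia]; ia += 1; lat_a += lat
        else:
            op, lat = b[ib]; ib += 1; lat_b += lat
        woven.append(op)
    return woven

print(weave(flow_a, flow_b))
```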
To enhance weaving opportunities, we introduce adaptive code slicing. Consider two basic blocks (BBs) with equal 5ms runtimes: weaving can theoretically achieve a 50% speedup, since perfect overlap cuts the total time from 10ms to 5ms. With imbalanced BBs (9ms vs. 1ms), however, the overlapped time is still bounded by the longer block, so the maximum speedup drops to 10%. Motivated by this observation, GoPTX takes the average BB execution time as a threshold and then slices larger BBs to balance durations.
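A sketch of the threshold computation and slicing, with illustrative block times:

```python
# Estimated basic-block runtimes in ms (illustrative numbers).
bb_times = [9.0, 1.0, 4.0, 2.0]
threshold = sum(bb_times) / len(bb_times)  # average = 4.0 ms

def slice_block(time, threshold):
    """Split one block into pieces no larger than the threshold,
    so woven partners end up with comparable durations."""
    pieces = []
    while time > threshold:
        pieces.append(threshold)
        time -= threshold
    pieces.append(time)
    return pieces

sliced = [slice_block(t, threshold) for t in bb_times]
print(sliced)  # [[4.0, 4.0, 1.0], [1.0], [4.0], [2.0]]
```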
Deadlock prevention is critical when merging CFGs containing synchronization instructions in multi-threaded environments. Our approach scans one CFG to mark nodes (and their parents) that have sync instructions, then strategically inserts dummy blocks during merging whenever these marked nodes interact with sync-containing nodes from the second CFG. This systematic insertion guarantees proper synchronization ordering while maintaining execution correctness, as visually demonstrated in this example.
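A simplified sketch of the marking pass, based on the description above (node names and CFG structure are hypothetical):

```python
# Map each node to its predecessors, and record which nodes contain a
# synchronization instruction (e.g., bar.sync).
preds = {"entry": [], "loop": ["entry"], "sync_bb": ["loop"], "exit": ["sync_bb"]}
has_sync = {"entry": False, "loop": False, "sync_bb": True, "exit": False}

def mark_sync_region(preds, has_sync):
    """Return the nodes that contain or lead to a sync instruction;
    the merger inserts dummy blocks where these meet the other CFG's
    sync-containing nodes."""
    marked = set()
    work = [n for n, s in has_sync.items() if s]
    while work:
        n = work.pop()
        if n in marked:
            continue
        marked.add(n)
        work.extend(preds[n])  # propagate the mark to all parents
    return marked

print(mark_sync_region(preds, has_sync))
# {'entry', 'loop', 'sync_bb'} (set order may vary)
```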
We compare GoPTX with previous kernel fusion methods, VFuse and HFuse, on an A100 with 49 fused workloads from cuda-samples, onnx-runtime, etc. Vertical fusion concatenates two kernels sequentially, while horizontal fusion schedules the two kernels onto different warps.
GoPTX achieves a geometric-mean speedup of 11.2%, with a maximum of 23%, outperforming both VFuse and HFuse. Notably, it boosts ILP, measured in eligible warps per cycle, by 5.5% geomean and up to 50%. GoPTX also improves hardware utilization; please refer to our paper for details.
We analyze the impact of instruction weaving on warp stalls using WMMA+GELU and WMMA+HARRIS. These workloads share the common kernel WMMA but exhibit significantly different performance improvements. For WMMA+HARRIS, the speedup reaches 20%, while WMMA+GELU sees a slight regression due to contention for half-precision resources. Still, GoPTX reduces stalls more than the alternatives.
Merging alone delivers a 9.7% gain; weaving adds another 1.5%. Slicing helps further but introduces overhead, a trade-off that pays off only in specific cases.
To conclude, motivated by the fact that GPUs suffer from scoreboard warp stalls, this paper presents GoPTX, which fills those pipeline bubbles with work from another kernel. GoPTX first merges CFGs at the PTX level while preserving the semantics of the input kernels. It then uses latency-aware instruction weaving to increase dependency length. To enhance weaving opportunities, adaptive code slicing is introduced. Experimental results demonstrate that GoPTX effectively improves performance, with higher ILP and utilization.
Thank you for your attention! We’re happy to discuss this further.
GoPTX can enhance existing horizontal loop fusion pipelines in ML compilers such as XLA and TVM. It can also automate the fusion of operators (e.g., matrix multiplication with communication), reducing development effort. We are now exploring this direction in our future work, XFuse, which will be submitted to TACO.
We provide both a command-line tool for merging PTX files, available via Docker, and open-source code packaged as a C library for integration into existing applications.
The approach can also be implemented on LLVM IR and thus applied to other architectures, but the benefits depend on the hardware's features.
We have discussed this in our paper; as time is limited, please see the paper for details.