This post is the script of my presentation of our work GoPTX at the DAC conference.
Good afternoon, everyone. Thank you for joining this session. I’m xxx, working with the talented arcSYSu team from Sun Yat-sen University. Today, I’m excited to present GoPTX, our work on fine-grained GPU kernel fusion through PTX-level instruction flow weaving.
As we know, GPUs are pivotal in HPC and AI, achieving high throughput via massive numbers of parallel threads grouped into warps. However, their ILP is often constrained by warp stalls. As illustrated, warps frequently stall, forcing schedulers to switch contexts and underutilize the hardware. On an A100, each warp switch takes about 5ns, roughly the latency of a floating-point multiply.
Our analysis reveals that scoreboard stalls are the major bottleneck. The scoreboard, a hardware mechanism that manages data hazards for out-of-order execution, leaves pipeline bubbles behind. Even in theory, traditional compile-time or runtime reordering cannot fully overlap these delays. This led to our key insight: can we fill these bubbles with useful work from another concurrent kernel?
By weaving independent instruction flows, we increase the dependency length (the distance between dependent instructions) and thus ILP. This example shows how weaving transforms two kernels into one with improved latency hiding: after weaving, the dependency length increases from 1 to 2. Meanwhile, weaving two kernels into one also enforces intra-SM resource sharing, which improves utilization. Hardware solutions such as hyper-threading are costly for GPUs, so we implement this purely in software.
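As a minimal sketch of the idea (the PTX instructions below are illustrative examples of mine, not taken from the paper), weaving two streams round-robin doubles the distance between dependent instructions:

```python
# Illustrative PTX snippets (hypothetical, for exposition only).
# Within each kernel, every instruction depends on the previous one,
# so the dependency length is 1 and each issue waits on the last result.
kernel_a = [
    "mul.f32 %fa1, %fa0, %fa0;",
    "add.f32 %fa2, %fa1, %fa0;",  # depends on %fa1, one slot earlier
]
kernel_b = [
    "mul.f32 %fb1, %fb0, %fb0;",
    "add.f32 %fb2, %fb1, %fb0;",
]

# After round-robin weaving, dependent pairs sit two slots apart
# (dependency length 2): the other kernel's instruction fills the bubble.
woven = [ins for pair in zip(kernel_a, kernel_b) for ins in pair]
print("\n".join(woven))
```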
However, software-based kernel instruction weaving faces key challenges. First, complex control flow (if/while/break) in CUDA C necessitates working at the IR level. Second, deadlock risks in multi-threaded kernels require careful handling. Third, instruction scheduling is NP-hard, demanding efficient heuristics. Finally, NVIDIA provides no PTX-processing tools, forcing a custom implementation.
To address these challenges, we propose GoPTX, a novel design for kernel fusion that improves ILP by deliberately weaving instructions at the PTX level. GoPTX first preprocesses the PTX code, appending unique register suffixes to avoid name conflicts. The PTX codes are then converted into equivalent control flow graphs (CFGs) and merged while preserving semantics. Thirdly, GoPTX weaves instructions in a latency-aware manner, guided by microbenchmarks. Finally, it postprocesses the result into valid PTX output.
First, GoPTX ensures semantic correctness through its CFG merging algorithm. Since each CFG node has at most two branch targets (if-else), merging two nodes from different CFGs yields 2x2 = 4 possible combinations. As illustrated here, this requires inserting two dummy nodes for branch coordination. Our algorithm exhaustively explores all valid combinations in $O(N^2)$ time.
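A simplified sketch of the combination enumeration (my own illustration based on the description above, not the paper's exact algorithm):

```python
from itertools import product

# Each CFG node has at most two successors (taken / not-taken), so merging
# a node pair (a, b) must cover every combination of their branch outcomes.
succs_a = {"A0": ["A1", "A2"], "A1": [], "A2": []}
succs_b = {"B0": ["B1", "B2"], "B1": [], "B2": []}

def merged_successors(a, b):
    """Enumerate the up-to-4 successor pairs of merged node (a, b).

    When the two sides branch differently, a real implementation inserts
    dummy nodes so that each side can wait for the other to re-converge.
    """
    sa = succs_a[a] or [None]      # None: this side has already exited
    sb = succs_b[b] or [None]
    return list(product(sa, sb))   # at most 2 x 2 = 4 combinations

print(merged_successors("A0", "B0"))
# [('A1', 'B1'), ('A1', 'B2'), ('A2', 'B1'), ('A2', 'B2')]
```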
Since instruction scheduling is NP-hard, we adopt a heuristic approach to generate optimized inputs for ptxas. The weaving process works like a card shuffle: in each iteration, GoPTX greedily selects instructions from the flow with lower cumulative latency (measured via microbenchmarking) to maximize pipeline efficiency.
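Here is a minimal sketch of that greedy selection, assuming made-up per-instruction latencies (GoPTX measures real ones via microbenchmarks):

```python
# Two instruction flows as (opcode, latency-in-cycles) pairs; the
# latencies here are invented for illustration.
flow_a = [("ld.global.f32", 30), ("add.f32", 4), ("st.global.f32", 20)]
flow_b = [("mul.f32", 4), ("fma.rn.f32", 4), ("st.global.f32", 20)]

def weave(a, b):
    """Greedily take the next instruction from the flow whose
    cumulative latency so far is lower, like a card shuffle."""
    woven, ia, ib = [], 0, 0
    lat_a = lat_b = 0
    while ia < len(a) or ib < len(b):
        take_a = ib >= len(b) or (ia < len(a) and lat_a <= lat_b)
        if take_a:
            op, lat = a[ia]; ia += 1; lat_a += lat
        else:
            op, lat = b[ib]; ib += 1; lat_b += lat
        woven.append(op)
    return woven

print(weave(flow_a, flow_b))
```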
To enhance weaving opportunities, we introduce adaptive code slicing. Consider two basic blocks (BBs) with equal 5ms runtimes: weaving can theoretically achieve a 50% speedup, since perfect overlap cuts the total time from 10ms to 5ms. With imbalanced BBs (9ms vs. 1ms), however, the overlapped time is still bounded by the longer block, so the maximum speedup drops to 10%. Motivated by this observation, GoPTX takes the average BB execution time as a threshold and then slices larger BBs to balance durations.
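A sketch of the threshold computation and slicing, with illustrative block times:

```python
# Estimated basic-block runtimes in ms (illustrative numbers).
bb_times = [9.0, 1.0, 4.0, 2.0]
threshold = sum(bb_times) / len(bb_times)  # average = 4.0 ms

def slice_block(time, threshold):
    """Split one block into pieces no larger than the threshold,
    so woven partners end up with comparable durations."""
    pieces = []
    while time > threshold:
        pieces.append(threshold)
        time -= threshold
    pieces.append(time)
    return pieces

sliced = [slice_block(t, threshold) for t in bb_times]
print(sliced)  # [[4.0, 4.0, 1.0], [1.0], [4.0], [2.0]]
```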
Deadlock prevention is critical when merging CFGs containing synchronization instructions in multi-threaded environments. Our approach scans one CFG to mark nodes (and their parents) that have sync instructions, then strategically inserts dummy blocks during merging whenever these marked nodes interact with sync-containing nodes from the second CFG. This systematic insertion guarantees proper synchronization ordering while maintaining execution correctness, as visually demonstrated in this example.
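A simplified sketch of the marking pass, based on the description above (node names and CFG structure are hypothetical):

```python
# Map each node to its predecessors, and record which nodes contain a
# synchronization instruction (e.g., bar.sync).
preds = {"entry": [], "loop": ["entry"], "sync_bb": ["loop"], "exit": ["sync_bb"]}
has_sync = {"entry": False, "loop": False, "sync_bb": True, "exit": False}

def mark_sync_region(preds, has_sync):
    """Return the nodes that contain or lead to a sync instruction;
    the merger inserts dummy blocks where these meet the other CFG's
    sync-containing nodes."""
    marked = set()
    work = [n for n, s in has_sync.items() if s]
    while work:
        n = work.pop()
        if n in marked:
            continue
        marked.add(n)
        work.extend(preds[n])  # propagate the mark to all parents
    return marked

print(mark_sync_region(preds, has_sync))
# {'entry', 'loop', 'sync_bb'} (set order may vary)
```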
We compare GoPTX with previous kernel fusion methods, VFuse and HFuse, on an A100 with 49 fused workloads from cuda-samples, onnx-runtime, etc. Vertical fusion concatenates two kernels sequentially, while horizontal fusion schedules the two kernels onto different warps.
GoPTX achieves a geometric-mean speedup of 11.2%, with a maximum of 23%, outperforming both VFuse and HFuse. Notably, it boosts ILP, measured in eligible warps per cycle, by 5.5% geomean and up to 50%. GoPTX also improves hardware utilization; please refer to our paper for details.
We analyze the impact of instruction weaving on warp stalls using WMMA+GELU and WMMA+HARRIS. These workloads share the common kernel WMMA but exhibit significantly different performance improvements. For WMMA+HARRIS, the speedup reaches 20%, while WMMA+GELU sees a slight regression due to contention for half-precision resources. Still, GoPTX reduces stalls more than the alternatives.
Merging alone delivers a 9.7% gain; weaving adds another 1.5%. Slicing helps further but introduces overhead, a trade-off that pays off only in specific cases.
To conclude, motivated by the fact that GPUs suffer from scoreboard warp stalls, this paper presents GoPTX, which fills those pipeline bubbles with work from another kernel. GoPTX first merges CFGs at the PTX level while preserving the semantics of the input kernels. It then uses latency-aware instruction weaving to increase dependency length. To enhance weaving opportunities, adaptive code slicing is introduced. Experimental results demonstrate that GoPTX effectively improves performance, with higher ILP and utilization.
Thank you for your attention! We’re happy to discuss this further.
GoPTX can enhance existing horizontal loop fusion pipelines in ML compilers such as XLA and TVM. It can also automate the fusion of operators (e.g., matrix multiplication with communication), reducing development effort. We are now exploring this direction in our future work, XFuse, which will be submitted to TACO.
We provide both a command-line tool for merging PTX files, available via Docker, and open-source code packaged as a C library for integration into existing applications.
The approach can also be implemented on LLVM IR and thus applied to other architectures, but the benefits depend on the hardware's features.
We have discussed this in our paper; as time is limited, please see the paper for details.