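Optimizing CUDA Matrix Multiplication wu-kan

The new third-year members of our supercomputing team are working on a CUDA matrix multiplication assignment, so I set up an internal training session on CUDA for them. Looking back at the experiment I did a year ago, when I was just learning CUDA, the result was a bit painful to read, so let's redo the whole thing.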

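This post implements the dense matrix product $C\gets\alpha AB+\beta C$ in CUDA and compares it against cuBLAS and other common implementations on test data with $A,B,C\in\mathbb{R}^{2^{13}\times 2^{13}}$. Without any inline assembly, the final kernel reaches 92.1% of the theoretical peak of an NVIDIA V100 GPU (14.46 TFLOPS / 15.7 TFLOPS). Under the same conditions cuBLAS reaches 97.0% (15.23 TFLOPS / 15.7 TFLOPS) and the CUTLASS library reaches 92.6% (14.54 TFLOPS / 15.7 TFLOPS).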

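References

Experimental environment

The experiments ran on a gpu_v100 node of the TH-2K cluster at the Guangzhou supercomputing center. The relevant hardware: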

  • 4x NVIDIA V100 GPUs with NVLink, each with:
    • 5120 CUDA cores
    • 5120 FP32 units
    • 2560 FP64 units
    • a theoretical single-precision peak of $1530\text{ MHz}\times5120\text{ cores}\times 2\approx 15.7\text{ TFLOPS}$

The relevant software:

  • CentOS@7.6
    • gcc@7.5.0
      • cuda@10.1.243%gcc@7.5.0
      • cmake@3.18.4%gcc@7.5.0
      • python@3.8.6%gcc@7.5.0

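Experiment

Here I only walk through my own optimization steps; all the kernels live in namespace wuk. Following the cuBLAS convention, every matrix is stored in column-major order, i.e. C[i + j * ldc] is the element in row $i$, column $j$ of $C$. This is the opposite of the row-major default that C uses on the CPU; the reason is explained below. Row-major and column-major layouts can be converted into each other with a single transpose, so in practice this is rarely a real obstacle.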
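To make the storage convention concrete, here is a minimal CPU reference for the same operation. It is a sketch for checking results, not one of the kernels in namespace wuk; the name gemm_ref and its parameter order are assumptions modelled loosely on the BLAS style:

```cpp
// Reference GEMM on the CPU: C <- alpha * A * B + beta * C.
// All matrices are column-major: element (i, j) of C is C[i + j * ldc],
// A is m x k with leading dimension lda, B is k x n with leading dimension ldb.
void gemm_ref(int m, int n, int k, float alpha,
              const float *A, int lda, const float *B, int ldb,
              float beta, float *C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[i + l * lda] * B[l + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Note how the innermost index i is the contiguous one for both A and C, which is what the thread mapping discussed later relies on.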

gemm_32x32_v0

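Strictly speaking, I started optimizing from gemm_32x32_v1, but v0 is still useful as a baseline for v1. In gemm_32x32_v0, 32x32 means that one block computes a $32\times 32$ sub-matrix of the result matrix $C$ (the same naming is used below).

In this version each thread computes one element of the result and reads one row of $A$ and one column of $B$ on its own; threads in the same block do not communicate. The x dimension of the thread index maps to the column dimension of $C$, and the y dimension maps to the row dimension.

This version runs in 4687.326172 ms, i.e. 2.345712e+11 FLOPS, which is less than 1.5% of the theoretical peak.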

gemm_32x32_v1

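v1 changes the mapping from threads to $C$ relative to v0: the x dimension of the thread index now maps to the row dimension of $C$, and the y dimension maps to the column dimension. Anyone with some CUDA experience will recognise that, with column-major storage, this lets the GPU coalesce memory accesses: the reads and writes that adjacent threads issue to consecutive global memory addresses are merged into a single operation.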

[Figure: gemm-tile-structure]
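The effect of the thread mapping on coalescing can be illustrated with a deliberately simplified model: count how many 128-byte segments a warp touches when adjacent lanes are `stride` floats apart. The model (aligned base address, 128-byte segments, one float per lane) and the name segments_per_warp are assumptions for this sketch; real hardware works with finer-grained sectors and caches.

```cpp
#include <cstdint>
#include <set>

// Number of distinct 128-byte segments touched when the 32 lanes of a warp
// each read one 4-byte float and consecutive lanes are `stride` elements apart.
int segments_per_warp(std::int64_t stride) {
    std::set<std::int64_t> segments;
    for (std::int64_t lane = 0; lane < 32; ++lane)
        segments.insert(lane * stride * 4 / 128);  // byte offset / segment size
    return static_cast<int>(segments.size());
}
```

With threadIdx.x mapped to rows of a column-major matrix the stride is 1 and the warp touches a single segment; with threadIdx.x mapped to columns the stride is the leading dimension (8192 here) and every lane lands in its own segment.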

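Even without tiling, this kernel runs in 638.216309 ms, almost matching the 614.487305 ms of the tiled kernel that Alcanderian, a senior teammate, showed in an earlier training session. Why? I went back to his code and found that, to support matrix sizes that are not a multiple of 32, it checks conditions such as if (a_ir < M && a_ic < K) every time it reads $A$, $B$ or $C$. That puts a large number of branches into the loop, and GPUs handle branches far less gracefully than CPUs do (they lack CPU-style branch prediction), so performance drops badly. My kernel instead checks once, at the very start, whether the block would run past the matrix boundary; if so, it shifts the $A$, $B$ and $C$ pointers so that the tile stays inside the matrix, which removes the bounds checks from the loop.

This approach has its own problems. First, the kernel cannot handle a matrix with any dimension smaller than $32$; in that case, though, we can fall back to matrix-vector products or even a CPU BLAS. Second, it does not handle $k \bmod 32\neq 0$, although that can be fixed by zero-padding (why not zero-pad along $m$ and $n$ as well? Because padding costs noticeably more code than simply shifting the pointers). Finally, wherever tiles overlap, some elements are read and written twice, so strictly speaking the kernel is only a valid demonstration for $\beta=0$!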
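One way to picture the pointer shift is to clamp the start of the last tile so that it ends exactly at the matrix boundary. The helper below is a hypothetical sketch of that idea, not the actual kernel code, and it assumes dim >= tile:

```cpp
#include <algorithm>

// First row (or column) handled by block `block_idx` when tiles of size `tile`
// cover a dimension of length `dim`. A tile that would run past the boundary
// is shifted back so that it ends at dim instead. Requires dim >= tile.
int tile_start(int block_idx, int tile, int dim) {
    return std::min(block_idx * tile, dim - tile);
}
```

For dim = 40 and tile = 32, block 0 covers rows 0..31 and block 1 covers rows 8..39. Rows 8..31 are therefore computed twice, which is harmless when the result is simply overwritten but wrong when $\beta C$ is accumulated into the same location twice.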

gemm_32x32_v2

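I stumbled on this version while writing this post. Compared with v1, the only change is that threadIdx.x is read once into a local variable, idx = threadIdx.x. That alone cut the run time to 462.786011 ms, a 37.9% gain in throughput over the previous version!

At first I assumed the reason was that threadIdx.x does not live in an ordinary register and has to be reloaded on every use, adding run-time overhead. The generated PTX shows otherwise: in both versions the value is loaded into a register before the loop. So let's use the PTX to explain what is going on.

Here is the PTX of the core multiplication loop in gemm_32x32_v2: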

BB8_13:
	cvt.s64.s32	%rd52, %r58;
	add.s64 	%rd53, %rd52, %rd12;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd64, %rd54;
	ld.global.f32 	%f22, [%rd68];
	ld.global.f32 	%f23, [%rd55];
	fma.rn.f32 	%f24, %f23, %f22, %f40;
	add.s64 	%rd56, %rd55, %rd15;
	ld.global.f32 	%f25, [%rd68+4];
	ld.global.f32 	%f26, [%rd56];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	add.s64 	%rd57, %rd56, %rd15;
	ld.global.f32 	%f28, [%rd68+8];
	ld.global.f32 	%f29, [%rd57];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	add.s64 	%rd58, %rd57, %rd15;
	ld.global.f32 	%f31, [%rd68+12];
	ld.global.f32 	%f32, [%rd58];
	fma.rn.f32 	%f40, %f32, %f31, %f30;
	add.s64 	%rd68, %rd68, 16;
	add.s32 	%r58, %r58, %r13;
	add.s32 	%r57, %r57, 4;
	setp.lt.s32	%p8, %r57, %r19;
	@%p8 bra 	BB8_13;

In gemm_32x32_v1 the same loop is expanded into this:

BB7_12:
	mad.lo.s32 	%r49, %r63, %r17, %r6;
	cvt.u64.u32	%rd44, %r49;
	add.s64 	%rd45, %rd44, %rd12;
	shl.b64 	%rd46, %rd45, 2;
	add.s64 	%rd47, %rd81, %rd46;
	add.s32 	%r50, %r13, %r63;
	cvt.u64.u32	%rd48, %r50;
	add.s64 	%rd49, %rd48, %rd13;
	shl.b64 	%rd50, %rd49, 2;
	add.s64 	%rd51, %rd83, %rd50;
	ld.global.f32 	%f22, [%rd51];
	ld.global.f32 	%f23, [%rd47];
	fma.rn.f32 	%f24, %f23, %f22, %f40;
	add.s32 	%r51, %r63, 1;
	mad.lo.s32 	%r52, %r51, %r17, %r6;
	cvt.u64.u32	%rd52, %r52;
	add.s64 	%rd53, %rd52, %rd12;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd81, %rd54;
	add.s32 	%r53, %r13, %r51;
	cvt.u64.u32	%rd56, %r53;
	add.s64 	%rd57, %rd56, %rd13;
	shl.b64 	%rd58, %rd57, 2;
	add.s64 	%rd59, %rd83, %rd58;
	ld.global.f32 	%f25, [%rd59];
	ld.global.f32 	%f26, [%rd55];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	add.s32 	%r54, %r63, 2;
	mad.lo.s32 	%r55, %r54, %r17, %r6;
	cvt.u64.u32	%rd60, %r55;
	add.s64 	%rd61, %rd60, %rd12;
	shl.b64 	%rd62, %rd61, 2;
	add.s64 	%rd63, %rd81, %rd62;
	add.s32 	%r56, %r13, %r54;
	cvt.u64.u32	%rd64, %r56;
	add.s64 	%rd65, %rd64, %rd13;
	shl.b64 	%rd66, %rd65, 2;
	add.s64 	%rd67, %rd83, %rd66;
	ld.global.f32 	%f28, [%rd67];
	ld.global.f32 	%f29, [%rd63];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	add.s32 	%r57, %r63, 3;
	mad.lo.s32 	%r58, %r57, %r17, %r6;
	cvt.u64.u32	%rd68, %r58;
	add.s64 	%rd69, %rd68, %rd12;
	shl.b64 	%rd70, %rd69, 2;
	add.s64 	%rd71, %rd81, %rd70;
	add.s32 	%r59, %r13, %r57;
	cvt.u64.u32	%rd72, %r59;
	add.s64 	%rd73, %rd72, %rd13;
	shl.b64 	%rd74, %rd73, 2;
	add.s64 	%rd75, %rd83, %rd74;
	ld.global.f32 	%f31, [%rd75];
	ld.global.f32 	%f32, [%rd71];
	fma.rn.f32 	%f40, %f32, %f31, %f30;
	add.s32 	%r63, %r63, 4;
	setp.lt.s32	%p8, %r63, %r16;
	@%p8 bra 	BB7_12;

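In both listings the compiler unrolled the loop by a factor of 4. Even so, the v2 loop contains one fma per 6 instructions on average (4/24), whereas the v1 loop manages only 4/58, because it recomputes every address from scratch with mad, cvt, add and shl instead of advancing a pointer. The cvt.u64.u32 in v1 versus cvt.s64.s32 in v2 suggests that the index arithmetic is unsigned in v1 and signed in v2, which may be what keeps the compiler from turning the address computation into a running pointer. The conclusion is simply that cuda@10.1.243 did not generate good PTX for v1, and did not optimize it well when lowering the PTX to machine code either. Unfortunately I cannot freely upgrade the 418.67 driver installed on the supercomputing center's nodes, which supports at most cuda@10.1.243, so I could not check whether the latest cuda@11.1.0 fixes this.

I also noticed that this untiled kernel is already faster than many mediocre tiled implementations. The reason is that the V100 uses luxurious HBM2 memory, which is considerably faster than the GDDR5 found on most of the GPUs used in other matrix multiplication tutorials online, so memory access hurts performance somewhat less.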

gemm_32x32_v3

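This is the classic CUDA tiled matrix multiplication; most implementations you can find online work this way.

[Figure: tiled matrix multiplication, drawn with $3\times3$ tiles]

As in the figure above (drawn with $3\times3$ tiles for readability; the real tiles are $32\times 32$), the matrices are split into tiles before computing. Each tile is first read into the much faster shared memory, which cuts global memory traffic to $\frac{1}{32}$ of the original. Each thread in the block then reads one row / one column of the sub-matrices, as in the figure below.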

[Figure: each thread reading one row and one column of the tiles held in shared memory]
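The $\frac{1}{32}$ figure follows from a simple count. Each of the $(m/T)(n/T)$ blocks loads $2Tk$ elements in total, so the traffic is $2mnk/T$. A sketch of that count, ignoring caches and assuming the dimensions are multiples of the tile size (global_loads is a name made up here):

```cpp
// Number of float loads from global memory for an m x n x k product when each
// block computes a tile x tile piece of C and stages its inputs in shared
// memory. tile = 1 models the untiled kernels, where every thread reads a full
// row of A and a full column of B by itself.
double global_loads(double m, double n, double k, double tile) {
    return 2.0 * m * n * k / tile;
}
```

Going from tile = 1 to tile = 32 reduces the count by exactly a factor of 32.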

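The run time actually regressed to 1565.291992 ms, which shows that careless use of shared memory can itself cause serious performance problems. The improvement over Alcanderian's cuda_kernel_sgemm_1 comes from moving the matrix boundary checks out of the loop.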

gemm_32x32_v4

v4 builds on v3 by taking bank conflicts into account. Shared memory has 32 banks, and declaring the arrays as __shared__ T sA[TILE_WIDTH][TILE_WIDTH | 1]; is a trick that avoids bank conflicts: for TILE_WIDTH = 32 it pads each row to 33 elements (see also Alcanderian's cuda_kernel_sgemm_2 and cuda_kernel_sgemm_3).
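Why one element of padding helps can be checked with a small model of the banks. The sketch assumes 32 banks of 4-byte words and a warp whose lane r reads tile[r][c], i.e. walks down a column of the tile; conflict_degree is a hypothetical helper, not part of the kernels:

```cpp
#include <algorithm>
#include <array>

// Worst-case number of lanes of a warp that hit the same shared-memory bank
// when lane r reads tile[r][c] from a `float tile[32][width]` array.
// Model: 32 banks, each 4 bytes wide, bank = word index % 32.
int conflict_degree(int width) {
    int worst = 0;
    for (int c = 0; c < 32; ++c) {
        std::array<int, 32> hits{};  // lanes per bank for this column
        for (int r = 0; r < 32; ++r)
            worst = std::max(worst, ++hits[(r * width + c) % 32]);
    }
    return worst;
}

static_assert((32 | 1) == 33, "TILE_WIDTH | 1 pads a 32-wide row to 33");
```

With width 32 every lane of the warp lands in the same bank (a 32-way conflict); with width 33 the lanes spread over all 32 banks.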

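The final run time is 326.478821 ms, which is probably about as good as this classic tiling scheme gets, yet it reaches only 21.5% of the theoretical peak (3.37 TFLOPS / 15.7 TFLOPS). There are two reasons. On the one hand, tiled multiplication uses a lot of shared memory, which limits how many threads can be active at once; on the other hand, the compiled PTX (below) shows that every useful fma instruction comes with two ld.shared loads from shared memory, so $\frac{2}{3}$ of the instruction throughput is wasted.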

	ld.shared.f32 	%f10, [%r9];
	ld.shared.f32 	%f11, [%r8];
	fma.rn.f32 	%f12, %f11, %f10, %f109;

gemm_32x32_v5

v5 tackles the two problems of v4, based on the following passage from the PTX ISA documentation:

Limited-length vector types are supported. Vectors of length 2 and 4 of any non-predicate fundamental type can be declared by prefixing the type with .v2 or .v4. Vectors must be based on a fundamental type, and they may reside in the register space. Vectors cannot exceed 128-bits in length; for example, .v4.f64 is not allowed. Three-element vectors may be handled by using a .v4 vector, where the fourth element provides padding. This is a common case for three-dimensional grids, textures, etc.

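So the GPU lets a single thread operate on vector types of up to 128 bits, which means we can use the built-in float4 or double2 types to speed up memory access. Four kinds of memory access are involved: global memory reads (ld.global), global memory writes (st.global), shared memory reads (ld.shared) and shared memory writes (st.shared). With vectorization a thread can move up to four float elements in one access, so we let one thread take over the work of four adjacent threads along the row dimension, as in the figure below (which shows three threads).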

[Figure: one thread taking over the work of adjacent threads along the row dimension, drawn with three threads]

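This version runs in 203.944931 ms, about 37% less than the previous version (a 1.6x speedup), so the effect is very noticeable! The compiled PTX indeed uses vectorized instructions (such as ld.shared.v4.f32), and the share of useful fma instructions in the core loop has risen to $\frac{2}{3}$.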

	ld.shared.v4.f32 	{%f68, %f69, %f70, %f71}, [%r58+128];
	ld.shared.f32 	%f76, [%r60+4224];
	fma.rn.f32 	%f77, %f76, %f68, %f64;
	fma.rn.f32 	%f78, %f76, %f69, %f65;
	fma.rn.f32 	%f79, %f76, %f70, %f66;
	fma.rn.f32 	%f80, %f76, %f71, %f67;

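Two more details are worth noting:

  1. s_B is stored in row-major order so that the global memory reads can use ld.global.v4.f32. The shared memory reads are cheaper, so they were strategically given up and fall back to scalar ld.shared.f32 (the elements are not contiguous in storage).
  2. As shown below, I first preload the data for the next iteration into registers and only then run the multiplication. Moving this code after the multiplication would be just as correct but costs a noticeable amount of performance: a single memory access takes a long time and easily stalls the instruction pipeline, whereas issuing it early lets the arithmetic hide that latency.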

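    if (l + 32 < k) // another 32-wide slice along k remains
    {
    	A += lda << 5; // column-major A: advance 32 columns along k
    	B += 32;       // column-major B: advance 32 rows along k
    	// Vector loads (type Tv) into registers, issued before the arithmetic.
    	resA = *(Tv *)&A[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * lda];
    	resB = *(Tv *)&B[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * ldb];
    }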
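    The ordering described in item 2 (load the next slice first, then compute on the current one) can be sketched on the CPU with a tiled dot product. This only illustrates the control flow; on a CPU nothing actually overlaps, and dot_prefetched with its staging buffers is a made-up stand-in for the registers resA and resB. It assumes tile > 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Dot product processed in slices of `tile` elements. The next slice is copied
// into staging buffers *before* the arithmetic on the current slice, mirroring
// the "prefetch, then multiply" order of the kernel.
float dot_prefetched(const std::vector<float> &a, const std::vector<float> &b,
                     std::size_t tile) {
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<float> cur_a(tile), cur_b(tile), next_a(tile), next_b(tile);
    auto load = [&](std::size_t base, std::vector<float> &ra, std::vector<float> &rb) {
        const std::size_t len = std::min(tile, n - base);
        std::copy_n(a.begin() + static_cast<std::ptrdiff_t>(base), len, ra.begin());
        std::copy_n(b.begin() + static_cast<std::ptrdiff_t>(base), len, rb.begin());
        return len;
    };
    float acc = 0.0f;
    std::size_t cur_len = n > 0 ? load(0, cur_a, cur_b) : 0;
    for (std::size_t base = 0; base < n; base += tile) {
        std::size_t next_len = 0;
        if (base + tile < n)  // same shape of guard as `l + 32 < k`
            next_len = load(base + tile, next_a, next_b);  // issue loads first
        for (std::size_t i = 0; i < cur_len; ++i)          // then do the math
            acc += cur_a[i] * cur_b[i];
        std::swap(cur_a, next_a);  // the prefetched slice becomes current
        std::swap(cur_b, next_b);
        cur_len = next_len;
    }
    return acc;
}
```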
    

gemm_64x64

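v5 raised the share of useful instructions to $\frac{2}{3}$ over v4, but that is still not a satisfying level, and the reads from s_B are still not vectorized, which bothers the perfectionist in me. So consider letting one thread compute a $4\times 4$ sub-matrix. If a block still computed only a $32\times 32$ sub-matrix, it would launch just $64$ threads, which wastes a lot of the hardware. So we let one block do the work of four of the old blocks, i.e. $64\times 64$, which again takes 256 threads per block.

Note that the amount of shared memory is limited. My V100 has Compute Capability 7.0, where a single block can use up to 96 KB of shared memory, but anything above 48 KB has to be allocated dynamically at run time. Shared memory is essentially a programmable L1 cache whose size can be adjusted with cudaDeviceSetCacheConfig; the more shared memory is used, the less L1 cache remains, which also affects performance. To keep the comparison with the other versions simple, I do not use more shared memory here and stick to 2048 floats, i.e. 8 KB. That means the tiling of $A$ and $B$ has to change once more: each step reads a $64\times 16$ tile of $A$ and a $16\times 64$ tile of $B$.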
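The 8 KB budget can be verified at compile time. The helper below is a sketch that assumes no padding and no double buffering; smem_bytes and its parameter names are invented for this example:

```cpp
#include <cstddef>

// Bytes of shared memory needed to stage one (tile_m x tile_k) slice of A and
// one (tile_k x tile_n) slice of B, without padding or double buffering.
constexpr std::size_t smem_bytes(std::size_t tile_m, std::size_t tile_n,
                                 std::size_t tile_k,
                                 std::size_t elem_size = sizeof(float)) {
    return (tile_m * tile_k + tile_k * tile_n) * elem_size;
}

static_assert(smem_bytes(64, 64, 16) == 8 * 1024,
              "a 64x64 block with a k-slice of 16 needs 8 KB");
static_assert(smem_bytes(32, 32, 32) == 8 * 1024,
              "a 32x32 block with a k-slice of 32 needs the same 8 KB");
```

Keeping the k-slice at 32 for a 64x64 block would double the footprint to 16 KB, which is why the slice is halved instead.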

The final run time is 83.388191 ms, 84% of the theoretical peak, a very significant improvement! The generated PTX shows that the share of useful instructions has reached $\frac{8}{9}$, as expected!

	ld.shared.v4.f32 	{%f147, %f148, %f149, %f150}, [%r79];
	ld.shared.v4.f32 	{%f155, %f156, %f157, %f158}, [%r80+4096];
	fma.rn.f32 	%f163, %f147, %f155, %f294;
	fma.rn.f32 	%f164, %f147, %f156, %f293;
	fma.rn.f32 	%f165, %f147, %f157, %f292;
	fma.rn.f32 	%f166, %f147, %f158, %f291;
	fma.rn.f32 	%f167, %f148, %f155, %f295;
	fma.rn.f32 	%f168, %f148, %f156, %f296;
	fma.rn.f32 	%f169, %f148, %f157, %f297;
	fma.rn.f32 	%f170, %f148, %f158, %f298;
	fma.rn.f32 	%f171, %f149, %f155, %f299;
	fma.rn.f32 	%f172, %f149, %f156, %f300;
	fma.rn.f32 	%f173, %f149, %f157, %f301;
	fma.rn.f32 	%f174, %f149, %f158, %f302;
	fma.rn.f32 	%f175, %f150, %f155, %f303;
	fma.rn.f32 	%f176, %f150, %f156, %f304;
	fma.rn.f32 	%f177, %f150, %f157, %f305;
	fma.rn.f32 	%f178, %f150, %f158, %f306;
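The sixteen fma instructions above are a rank-1 update of a $4\times4$ accumulator tile: four values from one operand and four from the other feed sixteen multiply-adds. A plain C++ sketch of that per-thread step (micro_kernel_4x4 is a name made up here, and which operand plays the role of a or b is left open):

```cpp
// Rank-1 update of a 4x4 accumulator tile held by one thread:
// acc[i][j] += a[i] * b[j]. Eight loaded values feed sixteen multiply-adds.
void micro_kernel_4x4(const float a[4], const float b[4], float acc[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc[i][j] += a[i] * b[j];
}
```

In the kernel the two loops are fully unrolled so that acc, a and b all stay in registers.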

gemm_128x128

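I thought I had run out of ideas, but then I came across the code from fynv/optimal_sgemm_cuda_c. The original author is very skilled, and you can tell that he optimized in the same order. Building on $64\times 64$, he again lets each block do the work of four blocks, brute-forcing a 128x128 kernel. He also uses a few other tricks, such as switching back and forth between two buffers to save one __syncthreads() call.

So is there really no room left? To save face at the training session, I decided to squeeze a little more out of it. The generated PTX shows that the loop for (int j = 0; j < 8; ++j) is not fully unrolled, so I tried unrolling it with #pragma unroll(4). Interestingly, #pragma unroll(8) made things slower instead; my guess is that the code becomes long enough to cause instruction cache misses.

The final run time is 76.045088 ms, 92.1% of the theoretical peak. The generated PTX shows that the share of useful instructions has reached $\frac{16}{17}$.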

	ld.shared.v4.f32 	{%f531, %f532, %f533, %f534}, [%r63+512];
	ld.shared.v4.f32 	{%f539, %f540, %f541, %f542}, [%r63+768];
	ld.shared.v4.f32 	{%f547, %f548, %f549, %f550}, [%r65+4608];
	ld.shared.v4.f32 	{%f555, %f556, %f557, %f558}, [%r65+4864];
	fma.rn.f32 	%f563, %f531, %f547, %f467;
	fma.rn.f32 	%f564, %f532, %f547, %f468;
	fma.rn.f32 	%f565, %f533, %f547, %f469;
	fma.rn.f32 	%f566, %f534, %f547, %f470;
	fma.rn.f32 	%f567, %f531, %f548, %f471;
	fma.rn.f32 	%f568, %f532, %f548, %f472;
	fma.rn.f32 	%f569, %f533, %f548, %f473;
	fma.rn.f32 	%f570, %f534, %f548, %f474;
	fma.rn.f32 	%f571, %f531, %f549, %f475;
	fma.rn.f32 	%f572, %f532, %f549, %f476;
	fma.rn.f32 	%f573, %f533, %f549, %f477;
	fma.rn.f32 	%f574, %f534, %f549, %f478;
	fma.rn.f32 	%f575, %f531, %f550, %f479;
	fma.rn.f32 	%f576, %f532, %f550, %f480;
	fma.rn.f32 	%f577, %f533, %f550, %f481;
	fma.rn.f32 	%f578, %f534, %f550, %f482;
	fma.rn.f32 	%f579, %f531, %f555, %f483;
	fma.rn.f32 	%f580, %f532, %f555, %f484;
	fma.rn.f32 	%f581, %f533, %f555, %f485;
	fma.rn.f32 	%f582, %f534, %f555, %f486;
	fma.rn.f32 	%f583, %f531, %f556, %f487;
	fma.rn.f32 	%f584, %f532, %f556, %f488;
	fma.rn.f32 	%f585, %f533, %f556, %f489;
	fma.rn.f32 	%f586, %f534, %f556, %f490;
	fma.rn.f32 	%f587, %f531, %f557, %f491;
	fma.rn.f32 	%f588, %f532, %f557, %f492;
	fma.rn.f32 	%f589, %f533, %f557, %f493;
	fma.rn.f32 	%f590, %f534, %f557, %f494;
	fma.rn.f32 	%f591, %f531, %f558, %f495;
	fma.rn.f32 	%f592, %f532, %f558, %f496;
	fma.rn.f32 	%f593, %f533, %f558, %f497;
	fma.rn.f32 	%f594, %f534, %f558, %f498;
	fma.rn.f32 	%f595, %f539, %f547, %f499;
	fma.rn.f32 	%f596, %f540, %f547, %f500;
	fma.rn.f32 	%f597, %f541, %f547, %f501;
	fma.rn.f32 	%f598, %f542, %f547, %f502;
	fma.rn.f32 	%f599, %f539, %f548, %f503;
	fma.rn.f32 	%f600, %f540, %f548, %f504;
	fma.rn.f32 	%f601, %f541, %f548, %f505;
	fma.rn.f32 	%f602, %f542, %f548, %f506;
	fma.rn.f32 	%f603, %f539, %f549, %f507;
	fma.rn.f32 	%f604, %f540, %f549, %f508;
	fma.rn.f32 	%f605, %f541, %f549, %f509;
	fma.rn.f32 	%f606, %f542, %f549, %f510;
	fma.rn.f32 	%f607, %f539, %f550, %f511;
	fma.rn.f32 	%f608, %f540, %f550, %f512;
	fma.rn.f32 	%f609, %f541, %f550, %f513;
	fma.rn.f32 	%f610, %f542, %f550, %f514;
	fma.rn.f32 	%f611, %f539, %f555, %f515;
	fma.rn.f32 	%f612, %f540, %f555, %f516;
	fma.rn.f32 	%f613, %f541, %f555, %f517;
	fma.rn.f32 	%f614, %f542, %f555, %f518;
	fma.rn.f32 	%f615, %f539, %f556, %f519;
	fma.rn.f32 	%f616, %f540, %f556, %f520;
	fma.rn.f32 	%f617, %f541, %f556, %f521;
	fma.rn.f32 	%f618, %f542, %f556, %f522;
	fma.rn.f32 	%f619, %f539, %f557, %f523;
	fma.rn.f32 	%f620, %f540, %f557, %f524;
	fma.rn.f32 	%f621, %f541, %f557, %f525;
	fma.rn.f32 	%f622, %f542, %f557, %f526;
	fma.rn.f32 	%f623, %f539, %f558, %f527;
	fma.rn.f32 	%f624, %f540, %f558, %f528;
	fma.rn.f32 	%f625, %f541, %f558, %f529;
	fma.rn.f32 	%f626, %f542, %f558, %f530;

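Summary

The further the optimization goes, the more levels of for loops the code has, which is very close to the design philosophy of the CUTLASS library. The underlying reason is that CUDA is designed with a clear hierarchy, as in the figure below. We did not use a warp-level GEMM here, though, because the wmma instructions only apply to half-precision matrix multiplication.

[Figure: complete-hierarchy]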

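So is this the end of the road for matrix multiplication? I don't think so. For example, a single block can use up to 96 KB of shared memory and a single thread up to 255 32-bit registers (with 256 threads, since a block has only 64K registers), so I believe larger tiles could squeeze more performance out of the GPU, and non-square tiles or tile sizes that are not powers of 2 are worth considering too. Some of the communication could also go directly through warp shuffle instructions instead of shared memory. And readers who followed all the steps above will notice that transposing $B$ in advance would give another small gain (s_B would no longer need to be stored transposed). Beyond that, there is no way around hand-tuned assembly, where we would need to: reorder instructions to maximize useful instruction throughput; adjust the register mapping to minimize register bank conflicts (yes, registers have banks too!); and schedule every register carefully to avoid waste and make room for larger tiles.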
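The register budget mentioned above can be written down as a small sketch. The constants are the figures quoted in the text (64K 32-bit registers per block, at most 255 per thread); the helper names are hypothetical:

```cpp
#include <algorithm>

constexpr int kRegsPerBlock = 64 * 1024;  // 32-bit registers available to a block
constexpr int kMaxRegsPerThread = 255;    // per-thread limit

// Registers each thread can use at a given block size.
int regs_per_thread(int threads_per_block) {
    return std::min(kMaxRegsPerThread, kRegsPerBlock / threads_per_block);
}

// Accumulator registers each thread needs when a block of `threads` threads
// computes a block_m x block_n tile of C in single precision.
int accumulators_per_thread(int block_m, int block_n, int threads) {
    return block_m * block_n / threads;
}
```

A 128x128 tile with 256 threads needs 64 accumulators per thread out of at most 255 registers, so there is some headroom for larger or non-square tiles, at least by this rough count.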

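The way v2 was discovered also shows that the CUDA compiler is not all that smart; one could even say that high-performance CUDA code may look quite ugly at first sight. CUDA has proven itself on the market as a fairly successful GPGPU programming language (even its old rival AMD has all but abandoned the half-dead OpenCL and released ROCm and HIP, a near mirror-image clone, complete with a one-click CUDA-to-HIP conversion script), but it still feels to me like the barrier to entry is a bit high. You cannot write high-performance CUDA code as freely as C on a CPU; you need to know a lot more about the hardware design. And although I claim that the code contains no assembly, explaining the effect of each optimization still requires looking at the PTX or even the SASS underneath.

Half a year ago, during the 先导杯 competition, I gave up after getting as far as the v5 kernel. Half a year later I finally had time to work through everything properly, and writing this up taught me a lot. That's a wrap~

Source code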

slurm-696755.out

Experimental results.

cublasSgemm: 72.210464 ms, 1.522649e+13 FLOPS.
Alcanderian::cuda_kernel_sgemm_1: 2120.728271 ms, 5.184595e+11 FLOPS.
Alcanderian::cuda_kernel_sgemm_2: 614.487305 ms, 1.789315e+12 FLOPS.
Alcanderian::cuda_kernel_sgemm_3: 2119.019287 ms, 5.188776e+11 FLOPS.
fynv::g_fgemm: 77.550720 ms, 1.417797e+13 FLOPS.
wuk::gemm_32x32_v0: 4687.326172 ms, 2.345712e+11 FLOPS.
wuk::gemm_32x32_v1: 638.216309 ms, 1.722788e+12 FLOPS.
wuk::gemm_32x32_v2: 462.786011 ms, 2.375853e+12 FLOPS.
wuk::gemm_32x32_v3: 1565.291992 ms, 7.024323e+11 FLOPS.
wuk::gemm_32x32_v4: 326.478821 ms, 3.367789e+12 FLOPS.
wuk::gemm_32x32_v5: 203.944931 ms, 5.391218e+12 FLOPS.
wuk::gemm_64x64: 83.388191 ms, 1.318546e+13 FLOPS.
wuk::gemm_128x128: 76.045088 ms, 1.445868e+13 FLOPS.
Thu Dec 24 01:36:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   55C    P0    67W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   31C    P0    37W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   31C    P0    38W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
clocks.max.sm [MHz]
1530 MHz
-- CMake Version: 3.18.4
-- The CXX compiler identification is GNU 7.5.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-haswell/gcc-4.8.5/gcc-7.5.0-apfqefcs5zty75lid2nxwyh5f4uagvtp/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 10.1.243
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CUDART: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/lib64/libcudart.so
-- CUDA Driver: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/lib64/stubs/libcuda.so
-- NVRTC: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/lib64/libnvrtc.so
-- Default Install Location: install
-- The C compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-haswell/gcc-4.8.5/gcc-7.5.0-apfqefcs5zty75lid2nxwyh5f4uagvtp/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found PythonInterp: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/python-3.8.6-r32vgbf3inytviyulslxiwt3iyy53bff/bin/python (found version "3.8.6")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- CUDA Compilation Architectures: 70
fatal: Not a git repository (or any parent up to mount point /GPUFS)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
-- CUTLASS Revision: Unable to detect, Git returned code 128.
-- Configuring cublas ...
-- cuBLAS: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/lib64/libcublas.so
-- cuBLAS: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/cuda-10.1.243-w6eojdqsdkcs2bunnslnpyqqrifbjotd/include
-- Configuring cuBLAS ... done.
-- Configuring cuDNN ...
-- cuDNN not found.
-- Configuring cuDNN ... done.
-- Found Python3: /GPUFS/sysu_hpcedu_302/asc20/WuK/spack-0.16.0/opt/spack/linux-centos7-skylake_avx512/gcc-7.5.0/python-3.8.6-r32vgbf3inytviyulslxiwt3iyy53bff/bin/python3.8 (found suitable version "3.8.6", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done
CMake Warning (dev) in tools/library/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_library_static".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in tools/library/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_lib".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in tools/library/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_library_objs".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in tools/profiler/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_profiler".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/00_basic_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "00_basic_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/01_cutlass_utilities/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "01_cutlass_utilities".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/02_dump_reg_shmem/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "02_dump_reg_shmem".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/03_visualize_layout/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "03_visualize_layout".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/04_tile_iterator/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "04_tile_iterator".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/05_batched_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "05_batched_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/06_splitK_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "06_splitK_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/07_volta_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "07_volta_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/08_turing_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "08_turing_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/09_turing_tensorop_conv2dfprop/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "09_turing_tensorop_conv2dfprop".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/10_planar_complex/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "10_planar_complex".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/11_planar_complex_array/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "11_planar_complex_array".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/12_gemm_bias_relu/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "12_gemm_bias_relu".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/12_gemm_bias_relu/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "12_gemm_bias_relu".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/13_fused_two_gemms/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "13_fused_two_gemms".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/13_fused_two_gemms/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "13_fused_two_gemms".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/14_ampere_tf32_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "14_ampere_tf32_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/14_ampere_tf32_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "14_ampere_tf32_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/15_ampere_sparse_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "15_ampere_sparse_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/15_ampere_sparse_tensorop_gemm/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "15_ampere_sparse_tensorop_gemm".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/22_ampere_tensorop_conv2dfprop/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "22_ampere_tensorop_conv2dfprop".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in examples/22_ampere_tensorop_conv2dfprop/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "22_ampere_tensorop_conv2dfprop".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/core/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_core".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/thread/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_gemm_thread".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/thread/host/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_thread_host".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/warp/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_gemm_warp".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/threadblock/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_threadblock".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_sparse_tensorop_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_wmma".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_s32_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_f64".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_f32_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_simt".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_planar_complex".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_sm70".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_sm75".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_f16_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/gemm/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_gemm_device_tensorop_f32_tf32_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_s32_interleaved".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_simt".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_f32_sm70".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_f32_sm75".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_f16_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_f32_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_f32_tf32_sm80".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/conv/device/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_conv_device_tensorop_s32".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/layout/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_layout".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/transform/threadblock/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_transform_threadblock".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/epilogue/thread/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_epilogue_thread".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/epilogue/warp/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_epilogue_warp".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/epilogue/threadblock/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_epilogue_threadblock".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/reduction/thread/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_reduction_thread".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/reduction/kernel/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target
  "cutlass_test_unit_reduction_kernel".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/util/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_util".
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Warning (dev) in test/unit/nvrtc/thread/CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "cutlass_test_unit_nvrtc_thread".
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /GPUFS/sysu_hpcedu_302/asc20/WuK/gemm/cutlass-2.4.0/build
Scanning dependencies of target cutlass_library_objs
[  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o
[  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o
[  0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o
[  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o
[  0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o
[  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o
[  7%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o
[  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o
[  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o
[  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o
[  7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_sgemm_128x128_8x2_tt_align1.cu.o
[ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_cgemm_128x128_8x2_nn_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_dgemm_128x128_8x2_tt_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_dgemm_128x128_8x2_nn_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_dgemm_128x128_8x2_nt_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_cgemm_128x128_8x2_tt_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_hgemm_256x128_8x2_nn_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_cgemm_128x128_8x2_tn_align1.cu.o
[ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_cgemm_128x128_8x2_nt_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_hgemm_256x128_8x2_tn_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_hgemm_256x128_8x2_nt_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_dgemm_128x128_8x2_tn_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_hgemm_256x128_8x2_tt_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_igemm_s8_128x128_32x2_nn_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_igemm_s8_128x128_32x2_nt_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_igemm_s8_128x128_32x2_tn_align1.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_s8_igemm_s8_128x128_32x2_nn_align1.cu.o
[ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_igemm_s8_128x128_32x2_tt_align1.cu.o
[ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_s8_igemm_s8_128x128_32x2_tn_align1.cu.o
[ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_s8_igemm_s8_128x128_32x2_nt_align1.cu.o
[ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_simt_s8_igemm_s8_128x128_32x2_tt_align1.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_s884gemm_f16_256x128_32x2_nn_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_s884gemm_f16_256x128_32x2_tn_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_s884gemm_f16_256x128_32x2_tt_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tn_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nn_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nt_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_s884gemm_f16_256x128_32x2_nt_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tt_align8.cu.o
[ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_h884gemm_256x128_32x2_nn_align8.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_h884gemm_256x128_32x2_nt_align8.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/all_conv2d_operations.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_h884gemm_256x128_32x2_tn_align8.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/cutlass_tensorop_h884gemm_256x128_32x2_tt_align8.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_sfprop_optimized_128x128_8x2_nhwc.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_swgrad_optimized_128x128_8x2_nhwc.cu.o
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_sfprop_analytic_128x128_8x2_nhwc.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_sdgrad_analytic_128x128_8x2_nhwc.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_unity_stride.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_swgrad_analytic_128x128_8x2_nhwc.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cfprop_optimized_cf32_128x128_8x2_nhwc.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cdgrad_analytic_cf32_128x128_8x2_nhwc.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x2_nhwc_unity_stride.cu.o
[ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cfprop_analytic_cf32_128x128_8x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cwgrad_analytic_cf32_128x128_8x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_simt_cf32_cwgrad_optimized_cf32_128x128_8x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884fprop_analytic_f16_256x128_32x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884fprop_optimized_f16_256x128_32x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884wgrad_analytic_f16_256x128_32x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884dgrad_analytic_f16_256x128_32x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884fprop_analytic_f16_256x128_32x2_nhwc.cu.o
[ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884fprop_optimized_f16_256x128_32x2_nhwc.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884dgrad_analytic_f16_256x128_32x2_nhwc.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_s884wgrad_optimized_f16_256x128_32x2_nhwc.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884wgrad_analytic_f16_256x128_32x2_nhwc.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884fprop_optimized_256x128_32x2_nhwc.cu.o
[ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_f16_s884wgrad_optimized_f16_256x128_32x2_nhwc.cu.o
[ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884fprop_analytic_256x128_32x2_nhwc.cu.o
[ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_unity_stride.cu.o
[ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884dgrad_analytic_256x128_32x2_nhwc.cu.o
[ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884wgrad_analytic_256x128_32x2_nhwc.cu.o
[ 84%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/cutlass_tensorop_h884wgrad_optimized_256x128_32x2_nhwc.cu.o
[ 84%] Built target cutlass_library_objs
Scanning dependencies of target cutlass_lib
[ 84%] Linking CXX shared library libcutlass.so
[ 84%] Built target cutlass_lib
Scanning dependencies of target cutlass_profiler
[ 84%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/main.cpp.o
[ 84%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cutlass_profiler.cu.o
[ 84%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/options.cu.o
[ 84%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/performance_report.cpp.o
[ 84%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/enumerated_types.cpp.o
[ 92%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_allocation.cu.o
[ 92%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_context.cu.o
[ 92%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cublas_helpers.cpp.o
[ 92%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cudnn_helpers.cpp.o
[ 92%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/problem_space.cpp.o
[ 92%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gpu_timer.cpp.o
[ 92%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/operation_profiler.cu.o
[ 92%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv2d_operation_profiler.cu.o
[ 92%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/sparse_gemm_operation_profiler.cu.o
[100%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv3d_operation_profiler.cu.o
[100%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gemm_operation_profiler.cu.o
[100%] Linking CXX executable cutlass_profiler
[100%] Built target cutlass_profiler



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Passed

       Arguments: --gemm_kind=universal --m=8192 --n=8192 --k=8192 --A=f32:column --B=f32:column --C=f32:column --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8  \
                  --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024  \


           Bytes: 805306368  bytes
           FLOPs: 1099645845504  flops

         Runtime: 75.649  ms
          Memory: 9.91421 GiB/s

            Math: 14536.2 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,alpha,beta,split_k_slices,batch_count,op_class,accum,cta_m,cta_n,cta_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_simt_sgemm_128x128_8x2_nn_align1,passed,success,universal,8192,8192,8192,f32:column,f32:column,f32:column,1,0,1,1,simt,f32,128,128,8,2,4,2,1,1,1,1,50,1024,805306368,1099645845504,75.649,9.91421,14536.2
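
The derived figures in the report above can be cross-checked by hand. The sketch below is an inference from the printed numbers, not taken from the CUTLASS source: it assumes the profiler counts 2·m·n·k operations for the main loop plus 2·m·n for the alpha/beta epilogue, and one pass over A, B and C for the byte count. The names `GemmProblem`, `gemm_flops`, `gemm_bytes`, `gflops` and `gib_per_s` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical problem description; not part of CUTLASS or of gemm.cu.
struct GemmProblem
{
    std::int64_t m, n, k;
};

// Assumed operation count: 2*m*n*k for the multiply-adds of the main loop,
// plus 2*m*n for the alpha/beta epilogue (inferred from the report above).
inline std::int64_t gemm_flops(const GemmProblem &p)
{
    return 2 * p.m * p.n * p.k + 2 * p.m * p.n;
}

// Assumed traffic: A (m x k), B (k x n) and C (m x n) are each touched once.
inline std::int64_t gemm_bytes(const GemmProblem &p, std::int64_t elem_size)
{
    return (p.m * p.k + p.k * p.n + p.m * p.n) * elem_size;
}

// Convert an operation count and a runtime in milliseconds to GFLOP/s.
inline double gflops(std::int64_t flops, double runtime_ms)
{
    assert(runtime_ms > 0);
    return flops / (runtime_ms * 1e-3) / 1e9;
}

// Convert a byte count and a runtime in milliseconds to GiB/s.
inline double gib_per_s(std::int64_t bytes, double runtime_ms)
{
    assert(runtime_ms > 0);
    return bytes / (runtime_ms * 1e-3) / (1024.0 * 1024.0 * 1024.0);
}
```

With m = n = k = 8192, 4-byte elements and the reported runtime of 75.649 ms, these formulas reproduce the Bytes, FLOPs, Memory and Math fields of the report. The low Memory figure is expected: it divides the minimum traffic by the runtime of a compute-bound kernel.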

gemm.th2k.slurm

#!/bin/bash
#SBATCH -J WuK
#SBATCH -p gpu_v100
#SBATCH -N 1
#SBATCH --exclusive

DIR=/GPUFS/sysu_hpcedu_302/asc20/WuK
cd $DIR

source spack-0.16.0/share/spack/setup-env.sh
spack load python@3.8.6%gcc@7.5.0
spack load cuda@10.1.243%gcc@7.5.0
spack load cmake@3.18.4%gcc@7.5.0
spack load gcc@7.5.0

nvcc -arch=sm_70 -ptx -o gemm.ptx gemm.cu
nvcc -arch=sm_70 -lcublas -run -o gemm gemm.cu

nvidia-smi
nvidia-smi --query-gpu=clocks.max.sm --format=csv --id=0

if [[ ! -f cutlass-2.4.0.zip ]]; then
    wget https://github.com/NVIDIA/cutlass/archive/v2.4.0.zip -O cutlass-2.4.0.zip
    unzip cutlass-2.4.0.zip
fi

mkdir -p "$DIR/cutlass-2.4.0/build"
cd "$DIR/cutlass-2.4.0/build" || exit 1 # never run the rm below in the wrong directory
rm -fr *
cmake -DCUTLASS_NVCC_ARCHS=70 "$DIR/cutlass-2.4.0"
make cutlass_profiler -j
./tools/profiler/cutlass_profiler \
    --operation=Gemm \
    --m=8192 \
    --n=8192 \
    --k=8192 \
    --A=f32:column \
    --B=f32:column \
    --C=f32:column \
    --providers=cutlass,cublas
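
The script queries `clocks.max.sm` because the efficiency figures quoted in this post are ratios against a peak derived from that clock. As a minimal sketch, assuming each CUDA core retires one fused multiply-add (2 FLOPs) per cycle, the arithmetic looks like this; `peak_flops` and `efficiency` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>

// Theoretical peak in FLOPS: clock (MHz) x cores x FLOPs per core per cycle.
// The default of 2 assumes one fused multiply-add per core per cycle.
inline double peak_flops(double clock_mhz, int cuda_cores, int flops_per_core_per_cycle = 2)
{
    assert(clock_mhz > 0 && cuda_cores > 0);
    return clock_mhz * 1e6 * cuda_cores * flops_per_core_per_cycle;
}

// Fraction of the peak that a measured throughput reaches.
inline double efficiency(double achieved_flops, double peak)
{
    assert(peak > 0);
    return achieved_flops / peak;
}
```

For the V100 described in the setup section, `peak_flops(1530.0, 5120)` gives about 15.67 TFLOPS, which the post rounds to 15.7 TFLOPS, and `efficiency(14.46e12, 15.7e12)` gives roughly 0.921, matching the 92.1% quoted in the introduction.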

gemm.cu

#include <algorithm> // std::min
#include <cassert>   // assert
#include <cstdio>
#include <functional>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

namespace Alcanderian //https://github.com/Alcanderian/CUDA-tutorial/sgemm
{
    __global__ void cuda_kernel_sgemm_1(
        float *a,
        float *b,
        float *c,
        size_t N,
        size_t M,
        size_t K,
        float alpha,
        float beta)
    {
        int tr = threadIdx.x;                   // row idx in block
        int tc = threadIdx.y;                   // col idx in block
        int ir = blockIdx.x * 32 + threadIdx.x; // row idx in global
        int ic = blockIdx.y * 32 + threadIdx.y; // col idx in global

        __shared__ float a_sub[32][32];
        __shared__ float b_sub[32][32];

        int load_size = K / 32;
        if (K % 32 != 0)
        {
            load_size += 1;
        }
        float acc = 0.0f;
        int a_ir = ir;
        int b_ic = ic;
#define idx(ri, ci, nc) ((ri) * (nc) + (ci))
        for (int l = 0; l < load_size; ++l)
        {
            int a_ic = l * 32 + tc;
            int b_ir = l * 32 + tr;
            a_sub[tr][tc] = 0.0f;
            b_sub[tr][tc] = 0.0f;
            if (a_ir < M && a_ic < K)
                a_sub[tr][tc] = a[idx(a_ir, a_ic, K)];
            if (b_ir < K && b_ic < N)
                b_sub[tr][tc] = b[idx(b_ir, b_ic, N)];

            __syncthreads();

#pragma unroll
            for (int k = 0; k < 32; ++k)
            {
                acc += a_sub[tr][k] * b_sub[k][tc];
            }

            __syncthreads();
        }

        if (ir < M && ic < N)
            c[idx(ir, ic, N)] = alpha * acc + beta * c[idx(ir, ic, N)];
#undef idx
    }

    // use __ldg & avoid bank conflict
    __global__ void cuda_kernel_sgemm_2(
        float *__restrict__ a,
        float *__restrict__ b,
        float *__restrict__ c,
        size_t N,
        size_t M,
        size_t K,
        float alpha,
        float beta)
    {
        int tr = threadIdx.x;                   // row idx in block
        int tc = threadIdx.y;                   // col idx in block
        int ir = blockIdx.x * 32 + threadIdx.x; // row idx in global
        int ic = blockIdx.y * 32 + threadIdx.y; // col idx in global

        __shared__ float a_sub[32][32 + 1]; // avoid bank conflict
        __shared__ float b_sub[32][32 + 1];

        int load_size = K / 32;
        if (K % 32 != 0)
        {
            load_size += 1;
        }
        float acc = 0.0f;
        int a_ir = ir;
        int b_ic = ic;
#define idx(ri, ci, nc) ((ri) * (nc) + (ci))
        for (int l = 0; l < load_size; ++l)
        {
            int a_ic = l * 32 + tc;
            int b_ir = l * 32 + tr;
            a_sub[tr][tc] = 0.0f;
            b_sub[tr][tc] = 0.0f;
            if (a_ir < M && a_ic < K)
                a_sub[tr][tc] = a[idx(a_ir, a_ic, K)];
            if (b_ir < K && b_ic < N)
                b_sub[tr][tc] = b[idx(b_ir, b_ic, N)];

            __syncthreads();

#pragma unroll
            for (int k = 0; k < 32; ++k)
            {
                acc += a_sub[tr][k] * b_sub[k][tc];
            }

            __syncthreads();
        }

        if (ir < M && ic < N)
            c[idx(ir, ic, N)] = alpha * acc + beta * c[idx(ir, ic, N)];
#undef idx
    }

    // use __ldg without avoiding bank conflict
    __global__ void cuda_kernel_sgemm_3(
        float *__restrict__ a,
        float *__restrict__ b,
        float *__restrict__ c,
        size_t N,
        size_t M,
        size_t K,
        float alpha,
        float beta)
    {
        int tr = threadIdx.x;                   // row idx in block
        int tc = threadIdx.y;                   // col idx in block
        int ir = blockIdx.x * 32 + threadIdx.x; // row idx in global
        int ic = blockIdx.y * 32 + threadIdx.y; // col idx in global

        __shared__ float a_sub[32][32]; // no padding here, so bank conflicts are NOT avoided
        __shared__ float b_sub[32][32];

        int load_size = K / 32;
        if (K % 32 != 0)
        {
            load_size += 1;
        }
        float acc = 0.0f;
        int a_ir = ir;
        int b_ic = ic;
#define idx(ri, ci, nc) ((ri) * (nc) + (ci))
        for (int l = 0; l < load_size; ++l)
        {
            int a_ic = l * 32 + tc;
            int b_ir = l * 32 + tr;
            a_sub[tr][tc] = 0.0f;
            b_sub[tr][tc] = 0.0f;
            if (a_ir < M && a_ic < K)
                a_sub[tr][tc] = a[idx(a_ir, a_ic, K)];
            if (b_ir < K && b_ic < N)
                b_sub[tr][tc] = b[idx(b_ir, b_ic, N)];

            __syncthreads();

#pragma unroll
            for (int k = 0; k < 32; ++k)
            {
                acc += a_sub[tr][k] * b_sub[k][tc];
            }

            __syncthreads();
        }

        if (ir < M && ic < N)
            c[idx(ir, ic, N)] = alpha * acc + beta * c[idx(ir, ic, N)];
#undef idx
    }
}; // namespace Alcanderian

namespace fynv //https://github.com/fynv/optimal_sgemm_cuda_c
{
    struct f8
    {
        float4 a, b;
        __device__ inline f8()
        {
            memset(this, 0, sizeof(f8));
        }
    };

    struct f88
    {
        f8 a, b, c, d, e, f, g, h;
    };

    __device__ inline void d_load8(const float *p, f8 &c)
    {
        c.a = ((float4 *)p)[0];
        c.b = ((float4 *)p)[16];
    }

    __device__ inline void d_store8(float *p, const f8 &c)
    {
        ((float4 *)p)[0] = c.a;
        ((float4 *)p)[16] = c.b;
    }

    __device__ inline void d_mult8v(f8 &c, const f8 &a, float b)
    {
        c.a.x += a.a.x * b;
        c.a.y += a.a.y * b;
        c.a.z += a.a.z * b;
        c.a.w += a.a.w * b;
        c.b.x += a.b.x * b;
        c.b.y += a.b.y * b;
        c.b.z += a.b.z * b;
        c.b.w += a.b.w * b;
    }

    template <typename T>
    __device__ inline void Swap(T &a, T &b)
    {
        T t = a;
        a = b;
        b = t;
    }

    __global__ __launch_bounds__(256) //https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds
        void g_fgemm(
            float *d_C,
            const float *d_A,
            const float *d_B,
            int n,
            int lda,
            int ldb,
            int ldc)
    {
        int x_a = threadIdx.x & 31;
        int y_a = threadIdx.x >> 5;

        int x_b = threadIdx.x & 1;
        int y_b = threadIdx.x >> 1;

        int x_c = threadIdx.x & 15;
        int y_c = threadIdx.x >> 4;

        __shared__ float smem[4096];
        float *s_A1 = smem;
        float *s_A2 = smem + 1024;
        float *s_B1 = smem + 2048;
        float *s_B2 = smem + 3072;

        f88 l_C;

        const float *p_A = d_A + (blockIdx.x << 7);
        const float *p_B = d_B + (blockIdx.y << 7) * ldb;

        float4 p, q;
        p = ((float4 *)p_A)[y_a * (lda >> 2) + x_a];
        q = ((float4 *)p_B)[y_b * (ldb >> 2) + x_b];

        for (int i = 0; i < n; i += 8)
        {
            ((float4 *)s_A1)[threadIdx.x] = p;
            s_B1[(((x_b << 2) + 0) << 7) + y_b] = q.x;
            s_B1[(((x_b << 2) + 1) << 7) + y_b] = q.y;
            s_B1[(((x_b << 2) + 2) << 7) + y_b] = q.z;
            s_B1[(((x_b << 2) + 3) << 7) + y_b] = q.w;
            __syncthreads();

            if (i + 8 < n)
            {
                p_A += (lda << 3);
                p_B += 8;
                p = ((float4 *)p_A)[y_a * (lda >> 2) + x_a];
                q = ((float4 *)p_B)[y_b * (ldb >> 2) + x_b];
            }

            for (int j = 0; j < 8; j++)
            {
                float *p_s_A = s_A1 + (j << 7) + (x_c << 2);
                float *p_s_B = s_B1 + (j << 7) + (y_c << 2);

                f8 a, b;
                d_load8(p_s_A, a);
                d_load8(p_s_B, b);

                d_mult8v(l_C.a, a, b.a.x);
                d_mult8v(l_C.b, a, b.a.y);
                d_mult8v(l_C.c, a, b.a.z);
                d_mult8v(l_C.d, a, b.a.w);
                d_mult8v(l_C.e, a, b.b.x);
                d_mult8v(l_C.f, a, b.b.y);
                d_mult8v(l_C.g, a, b.b.z);
                d_mult8v(l_C.h, a, b.b.w);
            }

            Swap(s_A1, s_A2);
            Swap(s_B1, s_B2);
        }

        float *p_C = d_C + ((blockIdx.x << 7) + (x_c << 2)) + ((blockIdx.y << 7) + (y_c << 2)) * ldc;
        d_store8(p_C, l_C.a);
        p_C += ldc;
        d_store8(p_C, l_C.b);
        p_C += ldc;
        d_store8(p_C, l_C.c);
        p_C += ldc;
        d_store8(p_C, l_C.d);
        p_C += (ldc * 61);
        d_store8(p_C, l_C.e);
        p_C += ldc;
        d_store8(p_C, l_C.f);
        p_C += ldc;
        d_store8(p_C, l_C.g);
        p_C += ldc;
        d_store8(p_C, l_C.h);
    }
}; // namespace fynv

namespace wuk
{
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

    template <typename T>
    __global__ __launch_bounds__(1024) void gemm_32x32_v0(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            y_range = ((int)blockIdx.y + 1 << 5) - m,
            x_range = ((int)blockIdx.x + 1 << 5) - n;
        if (y_range > 0)
        {
            A -= y_range;
            C -= y_range;
        }
        if (x_range > 0)
        {
            B -= x_range * ldb;
            C -= x_range * ldc;
        }
        A += blockIdx.y << 5;
        B += (blockIdx.x << 5) * ldb;
        C += (blockIdx.y << 5) + (blockIdx.x << 5) * ldc;

        T resC = 0;
        for (int i = 0; i < k; ++i)
        {
            const T
                resA = A[IDX2C(threadIdx.y, i, lda)],
                resB = B[IDX2C(i, threadIdx.x, ldb)];
            resC += resA * resB;
        }
        resC = resC * alpha + C[IDX2C(threadIdx.y, threadIdx.x, ldc)] * beta;
        C[IDX2C(threadIdx.y, threadIdx.x, ldc)] = resC;
    }

    template <typename T>
    __global__ __launch_bounds__(1024) void gemm_32x32_v1(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            x_range = ((int)blockIdx.x + 1 << 5) - m,
            y_range = ((int)blockIdx.y + 1 << 5) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 5;
        B += (blockIdx.y << 5) * ldb;
        C += (blockIdx.x << 5) + (blockIdx.y << 5) * ldc;
        T resC = 0;
        for (int i = 0; i < k; ++i)
        {
            const T
                resA = A[IDX2C(threadIdx.x, i, lda)],
                resB = B[IDX2C(i, threadIdx.y, ldb)];
            resC += resA * resB;
        }
        resC = resC * alpha + C[IDX2C(threadIdx.x, threadIdx.y, ldc)] * beta;
        C[IDX2C(threadIdx.x, threadIdx.y, ldc)] = resC;
    }

    template <typename T>
    __global__ __launch_bounds__(1024) void gemm_32x32_v2(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            idx = threadIdx.x,
            idy = threadIdx.y,
            x_range = ((int)blockIdx.x + 1 << 5) - m,
            y_range = ((int)blockIdx.y + 1 << 5) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 5;
        B += (blockIdx.y << 5) * ldb;
        C += (blockIdx.x << 5) + (blockIdx.y << 5) * ldc;
        T resC = 0;
        for (int i = 0; i < k; ++i)
        {
            const T
                resA = A[IDX2C(idx, i, lda)],
                resB = B[IDX2C(i, idy, ldb)];
            resC += resA * resB;
        }
        resC = resC * alpha + C[IDX2C(idx, idy, ldc)] * beta;
        C[IDX2C(idx, idy, ldc)] = resC;
    }

    template <typename T, int TILE_WIDTH>
    __global__ __launch_bounds__(TILE_WIDTH *TILE_WIDTH) void gemm_32x32_v3(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            idx = threadIdx.x,
            idy = threadIdx.y,
            x_range = ((int)blockIdx.x + 1 << 5) - m,
            y_range = ((int)blockIdx.y + 1 << 5) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 5;
        B += (blockIdx.y << 5) * ldb;
        C += (blockIdx.x << 5) + (blockIdx.y << 5) * ldc;
        T resC = 0;
        __shared__ T sA[TILE_WIDTH][TILE_WIDTH];
        __shared__ T sB[TILE_WIDTH][TILE_WIDTH];
        for (int i = 0; i < k; i += TILE_WIDTH)
        {
            sA[idx][idy] =
                A[IDX2C(idx, i + idy, lda)];
            sB[idx][idy] =
                B[IDX2C(i + idx, idy, ldb)];
            __syncthreads();
            for (int j = 0; j < TILE_WIDTH; ++j)
                resC += sA[idx][j] * sB[j][idy];
            __syncthreads();
        }
        resC = resC * alpha + C[IDX2C(idx, idy, ldc)] * beta;
        C[IDX2C(idx, idy, ldc)] = resC;
    }

    template <typename T, int TILE_WIDTH>
    __global__ __launch_bounds__(TILE_WIDTH *TILE_WIDTH) void gemm_32x32_v4(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            idx = threadIdx.x,
            idy = threadIdx.y,
            x_range = ((int)blockIdx.x + 1 << 5) - m,
            y_range = ((int)blockIdx.y + 1 << 5) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 5;
        B += (blockIdx.y << 5) * ldb;
        C += (blockIdx.x << 5) + (blockIdx.y << 5) * ldc;
        T resC = 0;
        __shared__ T sA[TILE_WIDTH][TILE_WIDTH | 1];
        __shared__ T sB[TILE_WIDTH][TILE_WIDTH | 1];
        for (int i = 0; i < k; i += TILE_WIDTH)
        {
            sA[idx][idy] =
                A[IDX2C(idx, i + idy, lda)];
            sB[idx][idy] =
                B[IDX2C(i + idx, idy, ldb)];
            __syncthreads();
            for (int j = 0; j < TILE_WIDTH; ++j)
                resC += sA[idx][j] * sB[j][idy];
            __syncthreads();
        }
        resC = resC * alpha + C[IDX2C(idx, idy, ldc)] * beta;
        C[IDX2C(idx, idy, ldc)] = resC;
    }
#undef IDX2C

    template <
        typename T,
        typename Tv,
        int TWO_BUFFER>
    __global__ __launch_bounds__(256) void gemm_32x32_v5(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            TvSize = sizeof(Tv) / sizeof(T),
            idx = threadIdx.x,
            x_range = ((int)blockIdx.x + 1 << 5) - m,
            y_range = ((int)blockIdx.y + 1 << 5) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 5;
        B += (blockIdx.y << 5) * ldb;
        C += (blockIdx.x << 5) + (blockIdx.y << 5) * ldc;
        Tv ansC;
        memset(&ansC, 0, sizeof(ansC));

        __shared__ T s_buffer[2048];
        T *s_A = s_buffer;
        T *s_B = s_buffer + 1024;

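        // NOTE: TWO_BUFFER is a template parameter, which the preprocessor cannot see.
        // Inside #if, an identifier that is not a macro evaluates to 0, so this block is
        // skipped and the #else branch further down (the extra __syncthreads()) is compiled.
#if TWO_BUFFER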
        __shared__ T s_tbuffer[2048];
        T *s_tA = s_tbuffer;
        T *s_tB = s_tbuffer + 1024;
#endif

        Tv resA = *(Tv *)&A[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * lda],
           resB = *(Tv *)&B[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * ldb];

        for (int l = 0; l < k; l += 32)
        {
            ((Tv *)s_A)[idx] = resA;
            for (int i = 0; i < TvSize; ++i) // element i of resB belongs to row (k index) + i of the B tile
                s_B[((((idx * TvSize) & 31) + i) << 5) + ((idx * TvSize) >> 5)] = ((T *)&resB)[i];
            __syncthreads();

            if (l + 32 < k)
            {
                A += lda << 5;
                B += 32;
                resA = *(Tv *)&A[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * lda];
                resB = *(Tv *)&B[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * ldb];
            }

            for (int j = 0; j < 32; ++j)
            {
                Tv tmpA = *(Tv *)&s_A[((idx * TvSize) & 31) + (j << 5)];
                T tmpB = s_B[(j << 5) + ((idx * TvSize) >> 5)];
                for (int i = 0; i < TvSize; ++i)
                    ((T *)&ansC)[i] += ((T *)&tmpA)[i] * tmpB;
            }

#if TWO_BUFFER
            {
                T *tmp_A = s_A;
                s_A = s_tA;
                s_tA = tmp_A;
            }
            {
                T *tmp_B = s_B;
                s_B = s_tB;
                s_tB = tmp_B;
            }
#else
            __syncthreads();
#endif
        }

        {
            Tv *devC = (Tv *)&C[((idx * TvSize) & 31) + ((idx * TvSize) >> 5) * ldc],
               resC = *devC;
            for (int i = 0; i < TvSize; ++i)
                ((T *)&resC)[i] = alpha * ((T *)&ansC)[i] + beta * ((T *)&resC)[i];
            *devC = resC;
        }
    }

    template <
        typename T,
        typename Tv,
        int TWO_BUFFER>
    __global__ __launch_bounds__(256) void gemm_64x64(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            TvSize = sizeof(Tv) / sizeof(T),
            idx = threadIdx.x,
            x_range = ((int)blockIdx.x + 1 << 6) - m,
            y_range = ((int)blockIdx.y + 1 << 6) - n;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }
        A += blockIdx.x << 6;
        B += (blockIdx.y << 6) * ldb;
        C += (blockIdx.x << 6) + (blockIdx.y << 6) * ldc;
        Tv ansC[TvSize];
        memset(ansC, 0, sizeof(ansC));

        __shared__ T s_buffer[2048];
        T *s_A = s_buffer;
        T *s_B = s_buffer + 1024;

        // NOTE: TWO_BUFFER is a template parameter, not a macro, so this #if evaluates to 0.
#if TWO_BUFFER
        __shared__ T s_tbuffer[2048];
        T *s_tA = s_tbuffer;
        T *s_tB = s_tbuffer + 1024;
#endif

        Tv resA = *(Tv *)&(A[((idx * TvSize) & 63) + ((idx * TvSize) >> 6) * lda]),
           resB = *(Tv *)&(B[((idx * TvSize) & 15) + ((idx * TvSize) >> 4) * ldb]);
        for (int l = 0; l < k; l += 16)
        {
            ((Tv *)s_A)[idx] = resA;
            for (int i = 0; i < TvSize; ++i)
                s_B[((((idx * TvSize) & 15) + i) << 6) + ((idx * TvSize) >> 4)] = ((T *)&resB)[i];
            __syncthreads();

            if (l + 16 < k)
            {
                A += lda << 4;
                B += 16;
                resA = *(Tv *)&(A[((idx * TvSize) & 63) + ((idx * TvSize) >> 6) * lda]);
                resB = *(Tv *)&(B[((idx * TvSize) & 15) + ((idx * TvSize) >> 4) * ldb]);
            }

            for (int j = 0; j < 16; ++j)
            {
                const Tv
                    tmpA = *(Tv *)&(s_A[((idx * TvSize) & 63) + (j << 6)]),
                    tmpB = *(Tv *)&(s_B[(j << 6) + (idx >> 4) * TvSize]);
                for (int i = 0; i < TvSize; ++i)
                    for (int h = 0; h < TvSize; ++h)
                        ((T *)&ansC[i])[h] += ((T *)&tmpA)[i] * ((T *)&tmpB)[h];
            }

#if TWO_BUFFER
            {
                T *tmp_A = s_A;
                s_A = s_tA;
                s_tA = tmp_A;
            }
            {
                T *tmp_B = s_B;
                s_B = s_tB;
                s_tB = tmp_B;
            }
#else
            __syncthreads();
#endif
        }
        for (int i = 0; i < TvSize; ++i)
        {
            Tv *devC = (Tv *)&(C[(idx & 15) * 4 + ((idx >> 4) * 4 + i) * ldc]),
               resC = *devC;
            for (int h = 0; h < TvSize; ++h)
                ((T *)&resC)[h] = alpha * ((T *)&ansC[i])[h] + beta * ((T *)&resC)[h];
            *devC = resC;
        }
    }

    template <
        typename T,
        typename Tv,
        int TWO_BUFFER>
    __global__ __launch_bounds__(256) void gemm_128x128(
        int m,
        int n,
        int k,
        T alpha,
        const T *A,
        int lda,
        const T *B,
        int ldb,
        T beta,
        T *C,
        int ldc)
    {
        const int
            TvSize = sizeof(Tv) / sizeof(T),
            idx = threadIdx.x,
            x_range = ((int)blockIdx.x + 1 << 7) - m,
            y_range = ((int)blockIdx.y + 1 << 7) - n,
            x_a = idx & 31,
            y_a = idx >> 5,
            x_b = idx & 1,
            y_b = idx >> 1,
            x_c = idx & 15,
            y_c = idx >> 4;
        if (x_range > 0)
        {
            A -= x_range;
            C -= x_range;
        }
        if (y_range > 0)
        {
            B -= y_range * ldb;
            C -= y_range * ldc;
        }

        A += blockIdx.x << 7;
        B += (blockIdx.y << 7) * ldb;
        C += ((blockIdx.x << 7) + (x_c << 2)) + ((blockIdx.y << 7) + (y_c << 2)) * ldc;

        __shared__ T s_buffer[2048];
        T *s_A = s_buffer;
        T *s_B = s_buffer + 1024;

        // NOTE: TWO_BUFFER is a template parameter, not a macro, so this #if evaluates to 0.
#if TWO_BUFFER
        __shared__ T s_tbuffer[2048];
        T *s_tA = s_tbuffer;
        T *s_tB = s_tbuffer + 1024;
#endif
        Tv ansC[2][2][TvSize];
        memset(ansC, 0, sizeof(ansC));

        Tv resA = ((Tv *)A)[x_a + y_a * (lda >> 2)];
        Tv resB = ((Tv *)B)[x_b + y_b * (ldb >> 2)];

        for (int l = 0; l < k; l += 8)
        {
            ((Tv *)s_A)[idx] = resA;
            for (int i = 0; i < TvSize; ++i)
                s_B[(((x_b << 2) + i) << 7) + y_b] = ((T *)&resB)[i];
            __syncthreads();

            if (l + 8 < k)
            {
                A += lda << 3;
                B += 8;
                resA = ((Tv *)A)[x_a + y_a * (lda >> 2)];
                resB = ((Tv *)B)[x_b + y_b * (ldb >> 2)];
            }

#pragma unroll(4)
            for (int j = 0; j < 8; ++j)
            {
                Tv a[2], b[2];
                for (int p = 0; p < 2; ++p)
                    a[p] = ((Tv *)(s_A + (j << 7) + (x_c << 2)))[p << 4];
                for (int p = 0; p < 2; ++p)
                    b[p] = ((Tv *)(s_B + (j << 7) + (y_c << 2)))[p << 4];
                for (int p = 0; p < 2; ++p)
                    for (int q = 0; q < 2; ++q)
                        for (int i = 0; i < TvSize; ++i)
                            for (int h = 0; h < TvSize; ++h) // h rather than j, to avoid shadowing the outer loop index
                                ((T *)&ansC[p][q][i])[h] += ((T *)&a[p])[h] * ((T *)&b[q])[i];
            }

#if TWO_BUFFER
            {
                T *tmp_A = s_A;
                s_A = s_tA;
                s_tA = tmp_A;
            }
            {
                T *tmp_B = s_B;
                s_B = s_tB;
                s_tB = tmp_B;
            }
#else
            __syncthreads();
#endif
        }

        for (int p = 0; p < 2; ++p)
            for (int q = 0; q < 2; ++q)
                for (int i = 0; i < TvSize; ++i)
                {
                    Tv *devC = ((Tv *)(C + ldc * (q * 64 + i))) + (p << 4),
                       resC = *devC;
                    for (int j = 0; j < TvSize; ++j)
                        ((T *)&resC)[j] = alpha * ((T *)&ansC[p][q][i])[j] + beta * ((T *)&resC)[j];
                    *devC = resC;
                }
    }
}; // namespace wuk

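// Runs `kernel` test_time times and reports the fastest run.
// flo is the operation count of one run; CUDA event times are in milliseconds.
void WuK_Timer(const char *tag, float flo, const std::function<void()> &kernel, int test_time = 9)
{
    float min_time = -1.0f; // negative means "no sample yet" (9e99 does not fit in a float)
    while (test_time--)
    {
        cudaEvent_t beg, end;
        cudaEventCreate(&beg);
        cudaEventCreate(&end);
        cudaEventRecord(beg);
        kernel();
        cudaEventRecord(end);
        cudaEventSynchronize(beg);
        cudaEventSynchronize(end);
        float elapsed_time;
        cudaEventElapsedTime(&elapsed_time, beg, end);
        if (min_time < 0 || elapsed_time < min_time)
            min_time = elapsed_time;
        cudaEventDestroy(beg); // release the events created for this run
        cudaEventDestroy(end);
    }
    std::printf("%s: %f ms, %e FLOPS.\n", tag, min_time, flo * 1e3 / min_time);
}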

struct WuK_cublas
{
    cublasHandle_t handle;
    WuK_cublas() { cublasCreate(&handle); }
    ~WuK_cublas() { cublasDestroy(handle); }
} wuk_cublas;

const float
    alpha = 1,
    beta = 0;
const cublasOperation_t
    opA = CUBLAS_OP_N,
    opB = CUBLAS_OP_N,
    opC = CUBLAS_OP_N;
const int
    m = 1 << 13,
    n = 1 << 13,
    k = 1 << 13,
    // column-major leading dimensions: the row count of each matrix as stored
    // (identical here because m == n == k)
    lda = opA == CUBLAS_OP_N ? m : k,
    ldb = opB == CUBLAS_OP_N ? k : n,
    ldc = m;
thrust::device_vector<float>
dA(m *k, 1),
    dB(k *n, 1),
    dC(m *n, 0);

int main()
{
    WuK_Timer(
        "cublasSgemm", 2.0 * m * k * n,
        [&] {
            cublasSgemm(
                wuk_cublas.handle,
                opA,
                opB,
                m,
                n,
                k,
                &alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                &beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "Alcanderian::cuda_kernel_sgemm_1", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            Alcanderian::cuda_kernel_sgemm_1<<<
                gridDim,
                blockDim>>>(
                thrust::raw_pointer_cast(dA.data()),
                thrust::raw_pointer_cast(dB.data()),
                thrust::raw_pointer_cast(dC.data()),
                n,
                m,
                k,
                alpha,
                beta);
        });
    WuK_Timer(
        "Alcanderian::cuda_kernel_sgemm_2", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            Alcanderian::cuda_kernel_sgemm_2<<<
                gridDim,
                blockDim>>>(
                thrust::raw_pointer_cast(dA.data()),
                thrust::raw_pointer_cast(dB.data()),
                thrust::raw_pointer_cast(dC.data()),
                n,
                m,
                k,
                alpha,
                beta);
        });
    WuK_Timer(
        "Alcanderian::cuda_kernel_sgemm_3", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            Alcanderian::cuda_kernel_sgemm_3<<<
                gridDim,
                blockDim>>>(
                thrust::raw_pointer_cast(dA.data()),
                thrust::raw_pointer_cast(dB.data()),
                thrust::raw_pointer_cast(dC.data()),
                n,
                m,
                k,
                alpha,
                beta);
        });
    WuK_Timer(
        "fynv::g_fgemm", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 128;
            const dim3
                blockDim(256),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m % TILE_WIDTH == 0);
            assert(n % TILE_WIDTH == 0);
            assert(k % TILE_WIDTH == 0);
            fynv::g_fgemm<<<
                gridDim,
                blockDim>>>(
                thrust::raw_pointer_cast(dC.data()),
                thrust::raw_pointer_cast(dA.data()),
                thrust::raw_pointer_cast(dB.data()),
                k,
                lda,
                ldb,
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v0", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((n + TILE_WIDTH - 1) / TILE_WIDTH,
                        (m + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= blockDim.y);
            assert(n >= blockDim.x);
            wuk::gemm_32x32_v0<<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v1", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= blockDim.x);
            assert(n >= blockDim.y);
            wuk::gemm_32x32_v1<<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v2", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= blockDim.x);
            assert(n >= blockDim.y);
            wuk::gemm_32x32_v2<<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v3", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= TILE_WIDTH);
            assert(n >= TILE_WIDTH);
            assert(k >= TILE_WIDTH);
            wuk::gemm_32x32_v3<
                float,
                TILE_WIDTH><<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v4", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(TILE_WIDTH, TILE_WIDTH),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= TILE_WIDTH);
            assert(n >= TILE_WIDTH);
            assert(k >= TILE_WIDTH);
            wuk::gemm_32x32_v4<
                float,
                TILE_WIDTH><<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_32x32_v5", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 32;
            const dim3
                blockDim(256),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= TILE_WIDTH);
            assert(n >= TILE_WIDTH);
            assert(k % 32 == 0);
            wuk::gemm_32x32_v5<
                float,
                float4,
                0><<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_64x64", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 64;
            const dim3
                blockDim(256),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= TILE_WIDTH);
            assert(n >= TILE_WIDTH);
            assert(k % 16 == 0);
            wuk::gemm_64x64<
                float,
                float4,
                0><<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
    WuK_Timer(
        "wuk::gemm_128x128", 2.0 * m * k * n,
        [&] {
            const int TILE_WIDTH = 128;
            const dim3
                blockDim(256),
                gridDim((m + TILE_WIDTH - 1) / TILE_WIDTH,
                        (n + TILE_WIDTH - 1) / TILE_WIDTH);
            assert(opA == CUBLAS_OP_N);
            assert(opB == CUBLAS_OP_N);
            assert(opC == CUBLAS_OP_N);
            assert(m >= TILE_WIDTH);
            assert(n >= TILE_WIDTH);
            assert(k % 8 == 0);
            wuk::gemm_128x128<
                float,
                float4,
                1><<<
                gridDim,
                blockDim>>>(
                m,
                n,
                k,
                alpha,
                thrust::raw_pointer_cast(dA.data()),
                lda,
                thrust::raw_pointer_cast(dB.data()),
                ldb,
                beta,
                thrust::raw_pointer_cast(dC.data()),
                ldc);
        });
}

gemm.ptx

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//

.version 6.4
.target sm_70
.address_size 64

	// .globl	_ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv
// _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5a_sub has been demoted
// _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5b_sub has been demoted
// _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5a_sub has been demoted
// _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5b_sub has been demoted
// _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5a_sub has been demoted
// _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5b_sub has been demoted
// _ZZN4fynv7g_fgemmEPfPKfS2_iiiiE4smem has been demoted
// _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA has been demoted
// _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB has been demoted
// _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA has been demoted
// _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB has been demoted
// _ZZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer has been demoted
// _ZZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer has been demoted
// _ZZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer has been demoted
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust6system6detail10sequential3seqE[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust8cuda_cub3parE[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_1E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_2E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_3E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_4E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_5E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_6E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_7E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_8E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders2_9E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust12placeholders3_10E[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust3seqE[1];
.global .align 1 .b8 _ZN61_INTERNAL_39_tmpxft_0006b649_00000000_6_gemm_cpp1_ii_6c2907446thrust6deviceE[1];

.visible .entry _ZN6thrust8cuda_cub3cub11EmptyKernelIvEEvv(

)
{



	ret;
}

	// .globl	_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1_
.visible .entry _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1_(
	.param .align 8 .b8 _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1__param_0[16],
	.param .u64 _ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1__param_1
)
.maxntid 256, 1, 1
.minnctapersm 1
{
	.reg .pred 	%p<10>;
	.reg .f32 	%f<2>;
	.reg .b32 	%r<27>;
	.reg .b64 	%rd<36>;


	ld.param.u64 	%rd1, [_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1__param_0];
	ld.param.f32 	%f1, [_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1__param_0+8];
	ld.param.u64 	%rd5, [_ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_20__uninitialized_fill7functorINS_10device_ptrIfEEfEEmEES9_mEEvT0_T1__param_1];
	mov.u32 	%r3, %ctaid.x;
	shl.b32 	%r4, %r3, 9;
	cvt.u64.u32	%rd6, %r4;
	sub.s64 	%rd7, %rd5, %rd6;
	setp.lt.u64	%p1, %rd7, 512;
	cvt.u32.u64	%r5, %rd7;
	selp.b32	%r1, %r5, 512, %p1;
	setp.eq.s32	%p2, %r1, 512;
	@%p2 bra 	BB1_7;
	bra.uni 	BB1_1;

BB1_7:
	mov.u32 	%r20, %tid.x;
	cvt.s64.s32	%rd26, %r20;
	add.s64 	%rd28, %rd26, %rd6;
	shl.b64 	%rd29, %rd28, 2;
	add.s64 	%rd3, %rd1, %rd29;
	setp.eq.s64	%p8, %rd3, 0;
	@%p8 bra 	BB1_9;

	cvta.to.global.u64 	%rd30, %rd3;
	st.global.f32 	[%rd30], %f1;

BB1_9:
	add.s32 	%r24, %r20, 256;
	cvt.s64.s32	%rd31, %r24;
	add.s64 	%rd33, %rd31, %rd6;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd4, %rd1, %rd34;
	setp.eq.s64	%p9, %rd4, 0;
	@%p9 bra 	BB1_11;

	cvta.to.global.u64 	%rd35, %rd4;
	st.global.f32 	[%rd35], %f1;
	bra.uni 	BB1_11;

BB1_1:
	mov.u32 	%r6, %tid.x;
	setp.ge.s32	%p3, %r6, %r1;
	@%p3 bra 	BB1_4;

	cvt.s64.s32	%rd8, %r6;
	add.s64 	%rd10, %rd8, %rd6;
	shl.b64 	%rd11, %rd10, 2;
	add.s64 	%rd12, %rd1, %rd11;
	setp.eq.s64	%p4, %rd12, 0;
	@%p4 bra 	BB1_4;

	cvta.to.global.u64 	%rd18, %rd12;
	st.global.f32 	[%rd18], %f1;

BB1_4:
	add.s32 	%r2, %r6, 256;
	setp.ge.s32	%p6, %r2, %r1;
	@%p6 bra 	BB1_11;

	cvt.s64.s32	%rd21, %r2;
	add.s64 	%rd23, %rd21, %rd6;
	shl.b64 	%rd24, %rd23, 2;
	add.s64 	%rd2, %rd1, %rd24;
	setp.eq.s64	%p7, %rd2, 0;
	@%p7 bra 	BB1_11;

	cvta.to.global.u64 	%rd25, %rd2;
	st.global.f32 	[%rd25], %f1;

BB1_11:
	ret;
}

	// .globl	_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff
.visible .entry _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff(
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_0,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_1,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_2,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_3,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_4,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_5,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_6,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_7
)
{
	.reg .pred 	%p<13>;
	.reg .f32 	%f<110>;
	.reg .b32 	%r<44>;
	.reg .b64 	%rd<31>;
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5a_sub[4096];
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5b_sub[4096];

	ld.param.u64 	%rd5, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_0];
	ld.param.u64 	%rd6, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_1];
	ld.param.u64 	%rd7, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_2];
	ld.param.u64 	%rd8, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_3];
	ld.param.u64 	%rd9, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_4];
	ld.param.u64 	%rd10, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_5];
	ld.param.f32 	%f4, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_6];
	ld.param.f32 	%f5, [_ZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmff_param_7];
	mov.u32 	%r42, %tid.x;
	mov.u32 	%r13, %ctaid.x;
	shl.b32 	%r14, %r13, 5;
	add.s32 	%r15, %r14, %r42;
	mov.u32 	%r16, %ctaid.y;
	shl.b32 	%r17, %r16, 5;
	mov.u32 	%r41, %tid.y;
	add.s32 	%r18, %r17, %r41;
	shr.u64 	%rd11, %rd10, 5;
	cvt.u32.u64	%r19, %rd11;
	and.b64  	%rd12, %rd10, 31;
	setp.ne.s64	%p1, %rd12, 0;
	selp.u32	%r20, 1, 0, %p1;
	add.s32 	%r3, %r20, %r19;
	cvt.s64.s32	%rd1, %r15;
	cvt.s64.s32	%rd2, %r18;
	mov.f32 	%f109, 0f00000000;
	setp.lt.s32	%p2, %r3, 1;
	@%p2 bra 	BB2_7;

	shl.b32 	%r22, %r42, 7;
	mov.u32 	%r23, _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5a_sub;
	add.s32 	%r24, %r23, %r22;
	shl.b32 	%r25, %r41, 2;
	add.s32 	%r4, %r24, %r25;
	mov.u32 	%r26, _ZZN11Alcanderian19cuda_kernel_sgemm_1EPfS0_S0_mmmffE5b_sub;
	add.s32 	%r27, %r26, %r22;
	add.s32 	%r5, %r27, %r25;
	mov.f32 	%f109, 0f00000000;
	mov.u32 	%r21, 0;
	cvta.to.global.u64 	%rd16, %rd5;
	cvta.to.global.u64 	%rd23, %rd6;
	mov.u32 	%r43, %r21;

BB2_2:
	st.shared.u32 	[%r4], %r21;
	st.shared.u32 	[%r5], %r21;
	cvt.s64.s32	%rd3, %r41;
	setp.ge.u64	%p3, %rd3, %rd10;
	setp.ge.u64	%p4, %rd1, %rd9;
	or.pred  	%p5, %p4, %p3;
	@%p5 bra 	BB2_4;

	mov.u32 	%r31, %tid.x;
	add.s32 	%r32, %r14, %r31;
	cvt.s64.s32	%rd13, %r32;
	mul.lo.s64 	%rd14, %rd13, %rd10;
	add.s64 	%rd15, %rd3, %rd14;
	shl.b64 	%rd17, %rd15, 2;
	add.s64 	%rd18, %rd16, %rd17;
	ld.global.f32 	%f8, [%rd18];
	st.shared.f32 	[%r4], %f8;

BB2_4:
	setp.ge.u64	%p6, %rd2, %rd8;
	cvt.s64.s32	%rd19, %r42;
	setp.ge.u64	%p7, %rd19, %rd10;
	or.pred  	%p8, %p7, %p6;
	@%p8 bra 	BB2_6;

	mul.lo.s64 	%rd21, %rd19, %rd8;
	add.s64 	%rd22, %rd21, %rd2;
	shl.b64 	%rd24, %rd22, 2;
	add.s64 	%rd25, %rd23, %rd24;
	ld.global.f32 	%f9, [%rd25];
	st.shared.f32 	[%r5], %f9;

BB2_6:
	cvt.u32.u64	%r9, %rd3;
	bar.sync 	0;
	mov.u32 	%r33, %tid.x;
	shl.b32 	%r34, %r33, 7;
	add.s32 	%r36, %r23, %r34;
	mov.u32 	%r37, %tid.y;
	shl.b32 	%r38, %r37, 2;
	add.s32 	%r40, %r26, %r38;
	ld.shared.f32 	%f10, [%r40];
	ld.shared.f32 	%f11, [%r36];
	fma.rn.f32 	%f12, %f11, %f10, %f109;
	ld.shared.f32 	%f13, [%r40+128];
	ld.shared.f32 	%f14, [%r36+4];
	fma.rn.f32 	%f15, %f14, %f13, %f12;
	ld.shared.f32 	%f16, [%r40+256];
	ld.shared.f32 	%f17, [%r36+8];
	fma.rn.f32 	%f18, %f17, %f16, %f15;
	ld.shared.f32 	%f19, [%r40+384];
	ld.shared.f32 	%f20, [%r36+12];
	fma.rn.f32 	%f21, %f20, %f19, %f18;
	ld.shared.f32 	%f22, [%r40+512];
	ld.shared.f32 	%f23, [%r36+16];
	fma.rn.f32 	%f24, %f23, %f22, %f21;
	ld.shared.f32 	%f25, [%r40+640];
	ld.shared.f32 	%f26, [%r36+20];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	ld.shared.f32 	%f28, [%r40+768];
	ld.shared.f32 	%f29, [%r36+24];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	ld.shared.f32 	%f31, [%r40+896];
	ld.shared.f32 	%f32, [%r36+28];
	fma.rn.f32 	%f33, %f32, %f31, %f30;
	ld.shared.f32 	%f34, [%r40+1024];
	ld.shared.f32 	%f35, [%r36+32];
	fma.rn.f32 	%f36, %f35, %f34, %f33;
	ld.shared.f32 	%f37, [%r40+1152];
	ld.shared.f32 	%f38, [%r36+36];
	fma.rn.f32 	%f39, %f38, %f37, %f36;
	ld.shared.f32 	%f40, [%r40+1280];
	ld.shared.f32 	%f41, [%r36+40];
	fma.rn.f32 	%f42, %f41, %f40, %f39;
	ld.shared.f32 	%f43, [%r40+1408];
	ld.shared.f32 	%f44, [%r36+44];
	fma.rn.f32 	%f45, %f44, %f43, %f42;
	ld.shared.f32 	%f46, [%r40+1536];
	ld.shared.f32 	%f47, [%r36+48];
	fma.rn.f32 	%f48, %f47, %f46, %f45;
	ld.shared.f32 	%f49, [%r40+1664];
	ld.shared.f32 	%f50, [%r36+52];
	fma.rn.f32 	%f51, %f50, %f49, %f48;
	ld.shared.f32 	%f52, [%r40+1792];
	ld.shared.f32 	%f53, [%r36+56];
	fma.rn.f32 	%f54, %f53, %f52, %f51;
	ld.shared.f32 	%f55, [%r40+1920];
	ld.shared.f32 	%f56, [%r36+60];
	fma.rn.f32 	%f57, %f56, %f55, %f54;
	ld.shared.f32 	%f58, [%r40+2048];
	ld.shared.f32 	%f59, [%r36+64];
	fma.rn.f32 	%f60, %f59, %f58, %f57;
	ld.shared.f32 	%f61, [%r40+2176];
	ld.shared.f32 	%f62, [%r36+68];
	fma.rn.f32 	%f63, %f62, %f61, %f60;
	ld.shared.f32 	%f64, [%r40+2304];
	ld.shared.f32 	%f65, [%r36+72];
	fma.rn.f32 	%f66, %f65, %f64, %f63;
	ld.shared.f32 	%f67, [%r40+2432];
	ld.shared.f32 	%f68, [%r36+76];
	fma.rn.f32 	%f69, %f68, %f67, %f66;
	ld.shared.f32 	%f70, [%r40+2560];
	ld.shared.f32 	%f71, [%r36+80];
	fma.rn.f32 	%f72, %f71, %f70, %f69;
	ld.shared.f32 	%f73, [%r40+2688];
	ld.shared.f32 	%f74, [%r36+84];
	fma.rn.f32 	%f75, %f74, %f73, %f72;
	ld.shared.f32 	%f76, [%r40+2816];
	ld.shared.f32 	%f77, [%r36+88];
	fma.rn.f32 	%f78, %f77, %f76, %f75;
	ld.shared.f32 	%f79, [%r40+2944];
	ld.shared.f32 	%f80, [%r36+92];
	fma.rn.f32 	%f81, %f80, %f79, %f78;
	ld.shared.f32 	%f82, [%r40+3072];
	ld.shared.f32 	%f83, [%r36+96];
	fma.rn.f32 	%f84, %f83, %f82, %f81;
	ld.shared.f32 	%f85, [%r40+3200];
	ld.shared.f32 	%f86, [%r36+100];
	fma.rn.f32 	%f87, %f86, %f85, %f84;
	ld.shared.f32 	%f88, [%r40+3328];
	ld.shared.f32 	%f89, [%r36+104];
	fma.rn.f32 	%f90, %f89, %f88, %f87;
	ld.shared.f32 	%f91, [%r40+3456];
	ld.shared.f32 	%f92, [%r36+108];
	fma.rn.f32 	%f93, %f92, %f91, %f90;
	ld.shared.f32 	%f94, [%r40+3584];
	ld.shared.f32 	%f95, [%r36+112];
	fma.rn.f32 	%f96, %f95, %f94, %f93;
	ld.shared.f32 	%f97, [%r40+3712];
	ld.shared.f32 	%f98, [%r36+116];
	fma.rn.f32 	%f99, %f98, %f97, %f96;
	ld.shared.f32 	%f100, [%r40+3840];
	ld.shared.f32 	%f101, [%r36+120];
	fma.rn.f32 	%f102, %f101, %f100, %f99;
	ld.shared.f32 	%f103, [%r40+3968];
	ld.shared.f32 	%f104, [%r36+124];
	fma.rn.f32 	%f109, %f104, %f103, %f102;
	bar.sync 	0;
	add.s32 	%r42, %r42, 32;
	add.s32 	%r41, %r9, 32;
	add.s32 	%r43, %r43, 1;
	setp.lt.s32	%p9, %r43, %r3;
	@%p9 bra 	BB2_2;

BB2_7:
	setp.ge.u64	%p10, %rd2, %rd8;
	setp.ge.u64	%p11, %rd1, %rd9;
	or.pred  	%p12, %p11, %p10;
	@%p12 bra 	BB2_9;

	mul.lo.s64 	%rd26, %rd1, %rd8;
	add.s64 	%rd27, %rd26, %rd2;
	cvta.to.global.u64 	%rd28, %rd7;
	shl.b64 	%rd29, %rd27, 2;
	add.s64 	%rd30, %rd28, %rd29;
	ld.global.f32 	%f105, [%rd30];
	mul.f32 	%f106, %f105, %f5;
	fma.rn.f32 	%f107, %f109, %f4, %f106;
	st.global.f32 	[%rd30], %f107;

BB2_9:
	ret;
}

	// .globl	_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff
.visible .entry _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff(
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_0,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_1,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_2,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_3,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_4,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_5,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_6,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_7
)
{
	.reg .pred 	%p<13>;
	.reg .f32 	%f<110>;
	.reg .b32 	%r<43>;
	.reg .b64 	%rd<31>;
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5a_sub[4224];
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5b_sub[4224];
	// annotation (not compiler output): 4224 bytes = 32 x 33 floats, so each 32-float tile row is padded by one float (row stride 132 bytes, see `mul.lo.s32 ... 132` below); cuda_kernel_sgemm_1 uses unpadded 4096-byte tiles with a 128-byte row stride

	ld.param.u64 	%rd5, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_0];
	ld.param.u64 	%rd6, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_1];
	ld.param.u64 	%rd7, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_2];
	ld.param.u64 	%rd8, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_3];
	ld.param.u64 	%rd9, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_4];
	ld.param.u64 	%rd10, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_5];
	ld.param.f32 	%f4, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_6];
	ld.param.f32 	%f5, [_ZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmff_param_7];
	mov.u32 	%r41, %tid.x;
	mov.u32 	%r13, %ctaid.x;
	shl.b32 	%r14, %r13, 5;
	add.s32 	%r15, %r14, %r41;
	mov.u32 	%r16, %ctaid.y;
	shl.b32 	%r17, %r16, 5;
	mov.u32 	%r40, %tid.y;
	add.s32 	%r18, %r17, %r40;
	shr.u64 	%rd11, %rd10, 5;
	cvt.u32.u64	%r19, %rd11;
	and.b64  	%rd12, %rd10, 31;
	setp.ne.s64	%p1, %rd12, 0;
	selp.u32	%r20, 1, 0, %p1;
	add.s32 	%r3, %r20, %r19;
	cvt.s64.s32	%rd1, %r15;
	cvt.s64.s32	%rd2, %r18;
	mov.f32 	%f109, 0f00000000;
	setp.lt.s32	%p2, %r3, 1;
	@%p2 bra 	BB3_7;

	mul.lo.s32 	%r22, %r41, 132;
	mov.u32 	%r23, _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5a_sub;
	add.s32 	%r24, %r23, %r22;
	shl.b32 	%r25, %r40, 2;
	add.s32 	%r4, %r24, %r25;
	mov.u32 	%r26, _ZZN11Alcanderian19cuda_kernel_sgemm_2EPfS0_S0_mmmffE5b_sub;
	add.s32 	%r27, %r26, %r22;
	add.s32 	%r5, %r27, %r25;
	mov.f32 	%f109, 0f00000000;
	mov.u32 	%r21, 0;
	cvta.to.global.u64 	%rd16, %rd5;
	cvta.to.global.u64 	%rd23, %rd6;
	mov.u32 	%r42, %r21;

BB3_2:
	st.shared.u32 	[%r4], %r21;
	st.shared.u32 	[%r5], %r21;
	cvt.s64.s32	%rd3, %r40;
	setp.ge.u64	%p3, %rd3, %rd10;
	setp.ge.u64	%p4, %rd1, %rd9;
	or.pred  	%p5, %p4, %p3;
	@%p5 bra 	BB3_4;

	mov.u32 	%r31, %tid.x;
	add.s32 	%r32, %r14, %r31;
	cvt.s64.s32	%rd13, %r32;
	mul.lo.s64 	%rd14, %rd13, %rd10;
	add.s64 	%rd15, %rd3, %rd14;
	shl.b64 	%rd17, %rd15, 2;
	add.s64 	%rd18, %rd16, %rd17;
	ld.global.nc.f32 	%f8, [%rd18];
	st.shared.f32 	[%r4], %f8;

BB3_4:
	setp.ge.u64	%p6, %rd2, %rd8;
	cvt.s64.s32	%rd19, %r41;
	setp.ge.u64	%p7, %rd19, %rd10;
	or.pred  	%p8, %p7, %p6;
	@%p8 bra 	BB3_6;

	mul.lo.s64 	%rd21, %rd19, %rd8;
	add.s64 	%rd22, %rd21, %rd2;
	shl.b64 	%rd24, %rd22, 2;
	add.s64 	%rd25, %rd23, %rd24;
	ld.global.nc.f32 	%f9, [%rd25];
	st.shared.f32 	[%r5], %f9;

BB3_6:
	cvt.u32.u64	%r9, %rd3;
	bar.sync 	0;
	mov.u32 	%r33, %tid.x;
	mad.lo.s32 	%r35, %r33, 132, %r23;
	mov.u32 	%r36, %tid.y;
	shl.b32 	%r37, %r36, 2;
	add.s32 	%r39, %r26, %r37;
	ld.shared.f32 	%f10, [%r39];
	ld.shared.f32 	%f11, [%r35];
	fma.rn.f32 	%f12, %f11, %f10, %f109;
	ld.shared.f32 	%f13, [%r39+132];
	ld.shared.f32 	%f14, [%r35+4];
	fma.rn.f32 	%f15, %f14, %f13, %f12;
	ld.shared.f32 	%f16, [%r39+264];
	ld.shared.f32 	%f17, [%r35+8];
	fma.rn.f32 	%f18, %f17, %f16, %f15;
	ld.shared.f32 	%f19, [%r39+396];
	ld.shared.f32 	%f20, [%r35+12];
	fma.rn.f32 	%f21, %f20, %f19, %f18;
	ld.shared.f32 	%f22, [%r39+528];
	ld.shared.f32 	%f23, [%r35+16];
	fma.rn.f32 	%f24, %f23, %f22, %f21;
	ld.shared.f32 	%f25, [%r39+660];
	ld.shared.f32 	%f26, [%r35+20];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	ld.shared.f32 	%f28, [%r39+792];
	ld.shared.f32 	%f29, [%r35+24];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	ld.shared.f32 	%f31, [%r39+924];
	ld.shared.f32 	%f32, [%r35+28];
	fma.rn.f32 	%f33, %f32, %f31, %f30;
	ld.shared.f32 	%f34, [%r39+1056];
	ld.shared.f32 	%f35, [%r35+32];
	fma.rn.f32 	%f36, %f35, %f34, %f33;
	ld.shared.f32 	%f37, [%r39+1188];
	ld.shared.f32 	%f38, [%r35+36];
	fma.rn.f32 	%f39, %f38, %f37, %f36;
	ld.shared.f32 	%f40, [%r39+1320];
	ld.shared.f32 	%f41, [%r35+40];
	fma.rn.f32 	%f42, %f41, %f40, %f39;
	ld.shared.f32 	%f43, [%r39+1452];
	ld.shared.f32 	%f44, [%r35+44];
	fma.rn.f32 	%f45, %f44, %f43, %f42;
	ld.shared.f32 	%f46, [%r39+1584];
	ld.shared.f32 	%f47, [%r35+48];
	fma.rn.f32 	%f48, %f47, %f46, %f45;
	ld.shared.f32 	%f49, [%r39+1716];
	ld.shared.f32 	%f50, [%r35+52];
	fma.rn.f32 	%f51, %f50, %f49, %f48;
	ld.shared.f32 	%f52, [%r39+1848];
	ld.shared.f32 	%f53, [%r35+56];
	fma.rn.f32 	%f54, %f53, %f52, %f51;
	ld.shared.f32 	%f55, [%r39+1980];
	ld.shared.f32 	%f56, [%r35+60];
	fma.rn.f32 	%f57, %f56, %f55, %f54;
	ld.shared.f32 	%f58, [%r39+2112];
	ld.shared.f32 	%f59, [%r35+64];
	fma.rn.f32 	%f60, %f59, %f58, %f57;
	ld.shared.f32 	%f61, [%r39+2244];
	ld.shared.f32 	%f62, [%r35+68];
	fma.rn.f32 	%f63, %f62, %f61, %f60;
	ld.shared.f32 	%f64, [%r39+2376];
	ld.shared.f32 	%f65, [%r35+72];
	fma.rn.f32 	%f66, %f65, %f64, %f63;
	ld.shared.f32 	%f67, [%r39+2508];
	ld.shared.f32 	%f68, [%r35+76];
	fma.rn.f32 	%f69, %f68, %f67, %f66;
	ld.shared.f32 	%f70, [%r39+2640];
	ld.shared.f32 	%f71, [%r35+80];
	fma.rn.f32 	%f72, %f71, %f70, %f69;
	ld.shared.f32 	%f73, [%r39+2772];
	ld.shared.f32 	%f74, [%r35+84];
	fma.rn.f32 	%f75, %f74, %f73, %f72;
	ld.shared.f32 	%f76, [%r39+2904];
	ld.shared.f32 	%f77, [%r35+88];
	fma.rn.f32 	%f78, %f77, %f76, %f75;
	ld.shared.f32 	%f79, [%r39+3036];
	ld.shared.f32 	%f80, [%r35+92];
	fma.rn.f32 	%f81, %f80, %f79, %f78;
	ld.shared.f32 	%f82, [%r39+3168];
	ld.shared.f32 	%f83, [%r35+96];
	fma.rn.f32 	%f84, %f83, %f82, %f81;
	ld.shared.f32 	%f85, [%r39+3300];
	ld.shared.f32 	%f86, [%r35+100];
	fma.rn.f32 	%f87, %f86, %f85, %f84;
	ld.shared.f32 	%f88, [%r39+3432];
	ld.shared.f32 	%f89, [%r35+104];
	fma.rn.f32 	%f90, %f89, %f88, %f87;
	ld.shared.f32 	%f91, [%r39+3564];
	ld.shared.f32 	%f92, [%r35+108];
	fma.rn.f32 	%f93, %f92, %f91, %f90;
	ld.shared.f32 	%f94, [%r39+3696];
	ld.shared.f32 	%f95, [%r35+112];
	fma.rn.f32 	%f96, %f95, %f94, %f93;
	ld.shared.f32 	%f97, [%r39+3828];
	ld.shared.f32 	%f98, [%r35+116];
	fma.rn.f32 	%f99, %f98, %f97, %f96;
	ld.shared.f32 	%f100, [%r39+3960];
	ld.shared.f32 	%f101, [%r35+120];
	fma.rn.f32 	%f102, %f101, %f100, %f99;
	ld.shared.f32 	%f103, [%r39+4092];
	ld.shared.f32 	%f104, [%r35+124];
	fma.rn.f32 	%f109, %f104, %f103, %f102;
	bar.sync 	0;
	add.s32 	%r41, %r41, 32;
	add.s32 	%r40, %r9, 32;
	add.s32 	%r42, %r42, 1;
	setp.lt.s32	%p9, %r42, %r3;
	@%p9 bra 	BB3_2;

BB3_7:
	setp.ge.u64	%p10, %rd2, %rd8;
	setp.ge.u64	%p11, %rd1, %rd9;
	or.pred  	%p12, %p11, %p10;
	@%p12 bra 	BB3_9;

	mul.lo.s64 	%rd26, %rd1, %rd8;
	add.s64 	%rd27, %rd26, %rd2;
	cvta.to.global.u64 	%rd28, %rd7;
	shl.b64 	%rd29, %rd27, 2;
	add.s64 	%rd30, %rd28, %rd29;
	ld.global.f32 	%f105, [%rd30];
	mul.f32 	%f106, %f105, %f5;
	fma.rn.f32 	%f107, %f109, %f4, %f106;
	st.global.f32 	[%rd30], %f107;

BB3_9:
	ret;
}

	// .globl	_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff
.visible .entry _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff(
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_0,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_1,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_2,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_3,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_4,
	.param .u64 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_5,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_6,
	.param .f32 _ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_7
)
{
	.reg .pred 	%p<13>;
	.reg .f32 	%f<110>;
	.reg .b32 	%r<44>;
	.reg .b64 	%rd<31>;
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5a_sub[4096];
	// demoted variable
	.shared .align 4 .b8 _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5b_sub[4096];

	ld.param.u64 	%rd5, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_0];
	ld.param.u64 	%rd6, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_1];
	ld.param.u64 	%rd7, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_2];
	ld.param.u64 	%rd8, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_3];
	ld.param.u64 	%rd9, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_4];
	ld.param.u64 	%rd10, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_5];
	ld.param.f32 	%f4, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_6];
	ld.param.f32 	%f5, [_ZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmff_param_7];
	mov.u32 	%r42, %tid.x;
	mov.u32 	%r13, %ctaid.x;
	shl.b32 	%r14, %r13, 5;
	add.s32 	%r15, %r14, %r42;
	mov.u32 	%r16, %ctaid.y;
	shl.b32 	%r17, %r16, 5;
	mov.u32 	%r41, %tid.y;
	add.s32 	%r18, %r17, %r41;
	shr.u64 	%rd11, %rd10, 5;
	cvt.u32.u64	%r19, %rd11;
	and.b64  	%rd12, %rd10, 31;
	setp.ne.s64	%p1, %rd12, 0;
	selp.u32	%r20, 1, 0, %p1;
	add.s32 	%r3, %r20, %r19;
	cvt.s64.s32	%rd1, %r15;
	cvt.s64.s32	%rd2, %r18;
	mov.f32 	%f109, 0f00000000;
	setp.lt.s32	%p2, %r3, 1;
	@%p2 bra 	BB4_7;

	shl.b32 	%r22, %r42, 7;
	mov.u32 	%r23, _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5a_sub;
	add.s32 	%r24, %r23, %r22;
	shl.b32 	%r25, %r41, 2;
	add.s32 	%r4, %r24, %r25;
	mov.u32 	%r26, _ZZN11Alcanderian19cuda_kernel_sgemm_3EPfS0_S0_mmmffE5b_sub;
	add.s32 	%r27, %r26, %r22;
	add.s32 	%r5, %r27, %r25;
	mov.f32 	%f109, 0f00000000;
	mov.u32 	%r21, 0;
	cvta.to.global.u64 	%rd16, %rd5;
	cvta.to.global.u64 	%rd23, %rd6;
	mov.u32 	%r43, %r21;

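	// Main loop over 32-wide k tiles. Every iteration zero-fills this thread's
	// a_sub/b_sub slot, then re-evaluates both bounds checks (branches to BB4_4
	// and BB4_6) before loading from global memory.
BB4_2: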
	st.shared.u32 	[%r4], %r21;
	st.shared.u32 	[%r5], %r21;
	cvt.s64.s32	%rd3, %r41;
	setp.ge.u64	%p3, %rd3, %rd10;
	setp.ge.u64	%p4, %rd1, %rd9;
	or.pred  	%p5, %p4, %p3;
	@%p5 bra 	BB4_4;

	mov.u32 	%r31, %tid.x;
	add.s32 	%r32, %r14, %r31;
	cvt.s64.s32	%rd13, %r32;
	mul.lo.s64 	%rd14, %rd13, %rd10;
	add.s64 	%rd15, %rd3, %rd14;
	shl.b64 	%rd17, %rd15, 2;
	add.s64 	%rd18, %rd16, %rd17;
	ld.global.nc.f32 	%f8, [%rd18];
	st.shared.f32 	[%r4], %f8;

BB4_4:
	setp.ge.u64	%p6, %rd2, %rd8;
	cvt.s64.s32	%rd19, %r42;
	setp.ge.u64	%p7, %rd19, %rd10;
	or.pred  	%p8, %p7, %p6;
	@%p8 bra 	BB4_6;

	mul.lo.s64 	%rd21, %rd19, %rd8;
	add.s64 	%rd22, %rd21, %rd2;
	shl.b64 	%rd24, %rd22, 2;
	add.s64 	%rd25, %rd23, %rd24;
	ld.global.nc.f32 	%f9, [%rd25];
	st.shared.f32 	[%r5], %f9;

BB4_6:
	cvt.u32.u64	%r9, %rd3;
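	bar.sync 	0;
	// Fully unrolled 32-step inner product: a_sub is read with a 4-byte stride
	// (%r36), b_sub with a 128-byte stride (%r40).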
	mov.u32 	%r33, %tid.x;
	shl.b32 	%r34, %r33, 7;
	add.s32 	%r36, %r23, %r34;
	mov.u32 	%r37, %tid.y;
	shl.b32 	%r38, %r37, 2;
	add.s32 	%r40, %r26, %r38;
	ld.shared.f32 	%f10, [%r40];
	ld.shared.f32 	%f11, [%r36];
	fma.rn.f32 	%f12, %f11, %f10, %f109;
	ld.shared.f32 	%f13, [%r40+128];
	ld.shared.f32 	%f14, [%r36+4];
	fma.rn.f32 	%f15, %f14, %f13, %f12;
	ld.shared.f32 	%f16, [%r40+256];
	ld.shared.f32 	%f17, [%r36+8];
	fma.rn.f32 	%f18, %f17, %f16, %f15;
	ld.shared.f32 	%f19, [%r40+384];
	ld.shared.f32 	%f20, [%r36+12];
	fma.rn.f32 	%f21, %f20, %f19, %f18;
	ld.shared.f32 	%f22, [%r40+512];
	ld.shared.f32 	%f23, [%r36+16];
	fma.rn.f32 	%f24, %f23, %f22, %f21;
	ld.shared.f32 	%f25, [%r40+640];
	ld.shared.f32 	%f26, [%r36+20];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	ld.shared.f32 	%f28, [%r40+768];
	ld.shared.f32 	%f29, [%r36+24];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	ld.shared.f32 	%f31, [%r40+896];
	ld.shared.f32 	%f32, [%r36+28];
	fma.rn.f32 	%f33, %f32, %f31, %f30;
	ld.shared.f32 	%f34, [%r40+1024];
	ld.shared.f32 	%f35, [%r36+32];
	fma.rn.f32 	%f36, %f35, %f34, %f33;
	ld.shared.f32 	%f37, [%r40+1152];
	ld.shared.f32 	%f38, [%r36+36];
	fma.rn.f32 	%f39, %f38, %f37, %f36;
	ld.shared.f32 	%f40, [%r40+1280];
	ld.shared.f32 	%f41, [%r36+40];
	fma.rn.f32 	%f42, %f41, %f40, %f39;
	ld.shared.f32 	%f43, [%r40+1408];
	ld.shared.f32 	%f44, [%r36+44];
	fma.rn.f32 	%f45, %f44, %f43, %f42;
	ld.shared.f32 	%f46, [%r40+1536];
	ld.shared.f32 	%f47, [%r36+48];
	fma.rn.f32 	%f48, %f47, %f46, %f45;
	ld.shared.f32 	%f49, [%r40+1664];
	ld.shared.f32 	%f50, [%r36+52];
	fma.rn.f32 	%f51, %f50, %f49, %f48;
	ld.shared.f32 	%f52, [%r40+1792];
	ld.shared.f32 	%f53, [%r36+56];
	fma.rn.f32 	%f54, %f53, %f52, %f51;
	ld.shared.f32 	%f55, [%r40+1920];
	ld.shared.f32 	%f56, [%r36+60];
	fma.rn.f32 	%f57, %f56, %f55, %f54;
	ld.shared.f32 	%f58, [%r40+2048];
	ld.shared.f32 	%f59, [%r36+64];
	fma.rn.f32 	%f60, %f59, %f58, %f57;
	ld.shared.f32 	%f61, [%r40+2176];
	ld.shared.f32 	%f62, [%r36+68];
	fma.rn.f32 	%f63, %f62, %f61, %f60;
	ld.shared.f32 	%f64, [%r40+2304];
	ld.shared.f32 	%f65, [%r36+72];
	fma.rn.f32 	%f66, %f65, %f64, %f63;
	ld.shared.f32 	%f67, [%r40+2432];
	ld.shared.f32 	%f68, [%r36+76];
	fma.rn.f32 	%f69, %f68, %f67, %f66;
	ld.shared.f32 	%f70, [%r40+2560];
	ld.shared.f32 	%f71, [%r36+80];
	fma.rn.f32 	%f72, %f71, %f70, %f69;
	ld.shared.f32 	%f73, [%r40+2688];
	ld.shared.f32 	%f74, [%r36+84];
	fma.rn.f32 	%f75, %f74, %f73, %f72;
	ld.shared.f32 	%f76, [%r40+2816];
	ld.shared.f32 	%f77, [%r36+88];
	fma.rn.f32 	%f78, %f77, %f76, %f75;
	ld.shared.f32 	%f79, [%r40+2944];
	ld.shared.f32 	%f80, [%r36+92];
	fma.rn.f32 	%f81, %f80, %f79, %f78;
	ld.shared.f32 	%f82, [%r40+3072];
	ld.shared.f32 	%f83, [%r36+96];
	fma.rn.f32 	%f84, %f83, %f82, %f81;
	ld.shared.f32 	%f85, [%r40+3200];
	ld.shared.f32 	%f86, [%r36+100];
	fma.rn.f32 	%f87, %f86, %f85, %f84;
	ld.shared.f32 	%f88, [%r40+3328];
	ld.shared.f32 	%f89, [%r36+104];
	fma.rn.f32 	%f90, %f89, %f88, %f87;
	ld.shared.f32 	%f91, [%r40+3456];
	ld.shared.f32 	%f92, [%r36+108];
	fma.rn.f32 	%f93, %f92, %f91, %f90;
	ld.shared.f32 	%f94, [%r40+3584];
	ld.shared.f32 	%f95, [%r36+112];
	fma.rn.f32 	%f96, %f95, %f94, %f93;
	ld.shared.f32 	%f97, [%r40+3712];
	ld.shared.f32 	%f98, [%r36+116];
	fma.rn.f32 	%f99, %f98, %f97, %f96;
	ld.shared.f32 	%f100, [%r40+3840];
	ld.shared.f32 	%f101, [%r36+120];
	fma.rn.f32 	%f102, %f101, %f100, %f99;
	ld.shared.f32 	%f103, [%r40+3968];
	ld.shared.f32 	%f104, [%r36+124];
	fma.rn.f32 	%f109, %f104, %f103, %f102;
	bar.sync 	0;
	add.s32 	%r42, %r42, 32;
	add.s32 	%r41, %r9, 32;
	add.s32 	%r43, %r43, 1;
	setp.lt.s32	%p9, %r43, %r3;
	@%p9 bra 	BB4_2;

BB4_7:
	setp.ge.u64	%p10, %rd2, %rd8;
	setp.ge.u64	%p11, %rd1, %rd9;
	or.pred  	%p12, %p11, %p10;
	@%p12 bra 	BB4_9;

	mul.lo.s64 	%rd26, %rd1, %rd8;
	add.s64 	%rd27, %rd26, %rd2;
	cvta.to.global.u64 	%rd28, %rd7;
	shl.b64 	%rd29, %rd27, 2;
	add.s64 	%rd30, %rd28, %rd29;
	// Epilogue: C = param_6 * acc + param_7 * C.
	ld.global.f32 	%f105, [%rd30];
	mul.f32 	%f106, %f105, %f5;
	fma.rn.f32 	%f107, %f109, %f4, %f106;
	st.global.f32 	[%rd30], %f107;

BB4_9:
	ret;
}

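	// .globl	_ZN4fynv7g_fgemmEPfPKfS2_iiii
	// Demangled: fynv::g_fgemm(float*, float const*, float const*, int, int, int, int)
	// At most 256 threads per block (.maxntid); each block covers a 128x128 output
	// tile (%ctaid << 7) and each thread keeps 64 accumulators (%f549..%f612).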
.visible .entry _ZN4fynv7g_fgemmEPfPKfS2_iiii(
	.param .u64 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_0,
	.param .u64 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_1,
	.param .u64 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_2,
	.param .u32 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_3,
	.param .u32 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_4,
	.param .u32 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_5,
	.param .u32 _ZN4fynv7g_fgemmEPfPKfS2_iiii_param_6
)
.maxntid 256, 1, 1
{
	.reg .pred 	%p<5>;
	.reg .b16 	%rs<3>;
	.reg .f32 	%f<677>;
	.reg .b32 	%r<90>;
	.reg .b64 	%rd<51>;
	// demoted variable
	.shared .align 4 .b8 _ZZN4fynv7g_fgemmEPfPKfS2_iiiiE4smem[16384];

	ld.param.u64 	%rd11, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_1];
	ld.param.u64 	%rd12, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_2];
	ld.param.u32 	%r15, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_3];
	ld.param.u32 	%r16, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_4];
	ld.param.u32 	%r17, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_5];
	cvta.to.global.u64 	%rd13, %rd12;
	mov.u32 	%r19, %tid.x;
	and.b32  	%r20, %r19, 31;
	shr.u32 	%r21, %r19, 5;
	and.b32  	%r22, %r19, 1;
	shr.u32 	%r23, %r19, 1;
	mov.u32 	%r24, %ctaid.x;
	shl.b32 	%r25, %r24, 7;
	cvta.to.global.u64 	%rd14, %rd11;
	mul.wide.u32 	%rd15, %r25, 4;
	add.s64 	%rd48, %rd14, %rd15;
	mov.u32 	%r26, %ctaid.y;
	mul.lo.s32 	%r27, %r26, %r17;
	shl.b32 	%r28, %r27, 7;
	mul.wide.u32 	%rd16, %r28, 4;
	add.s64 	%rd47, %rd13, %rd16;
	shr.s32 	%r29, %r16, 2;
	mad.lo.s32 	%r30, %r21, %r29, %r20;
	mul.wide.s32 	%rd17, %r30, 16;
	add.s64 	%rd18, %rd48, %rd17;
	ld.global.v4.f32 	{%f485, %f486, %f487, %f488}, [%rd18];
	shr.s32 	%r31, %r17, 2;
	mad.lo.s32 	%r32, %r23, %r31, %r22;
	mul.wide.s32 	%rd19, %r32, 16;
	add.s64 	%rd3, %rd47, %rd19;
	mov.f32 	%f549, 0f00000000;
	setp.lt.s32	%p1, %r15, 1;
	mov.f32 	%f550, %f549;
	mov.f32 	%f551, %f549;
	mov.f32 	%f552, %f549;
	mov.f32 	%f553, %f549;
	mov.f32 	%f554, %f549;
	mov.f32 	%f555, %f549;
	mov.f32 	%f556, %f549;
	mov.f32 	%f557, %f549;
	mov.f32 	%f558, %f549;
	mov.f32 	%f559, %f549;
	mov.f32 	%f560, %f549;
	mov.f32 	%f561, %f549;
	mov.f32 	%f562, %f549;
	mov.f32 	%f563, %f549;
	mov.f32 	%f564, %f549;
	mov.f32 	%f565, %f549;
	mov.f32 	%f566, %f549;
	mov.f32 	%f567, %f549;
	mov.f32 	%f568, %f549;
	mov.f32 	%f569, %f549;
	mov.f32 	%f570, %f549;
	mov.f32 	%f571, %f549;
	mov.f32 	%f572, %f549;
	mov.f32 	%f573, %f549;
	mov.f32 	%f574, %f549;
	mov.f32 	%f575, %f549;
	mov.f32 	%f576, %f549;
	mov.f32 	%f577, %f549;
	mov.f32 	%f578, %f549;
	mov.f32 	%f579, %f549;
	mov.f32 	%f580, %f549;
	mov.f32 	%f581, %f549;
	mov.f32 	%f582, %f549;
	mov.f32 	%f583, %f549;
	mov.f32 	%f584, %f549;
	mov.f32 	%f585, %f549;
	mov.f32 	%f586, %f549;
	mov.f32 	%f587, %f549;
	mov.f32 	%f588, %f549;
	mov.f32 	%f589, %f549;
	mov.f32 	%f590, %f549;
	mov.f32 	%f591, %f549;
	mov.f32 	%f592, %f549;
	mov.f32 	%f593, %f549;
	mov.f32 	%f594, %f549;
	mov.f32 	%f595, %f549;
	mov.f32 	%f596, %f549;
	mov.f32 	%f597, %f549;
	mov.f32 	%f598, %f549;
	mov.f32 	%f599, %f549;
	mov.f32 	%f600, %f549;
	mov.f32 	%f601, %f549;
	mov.f32 	%f602, %f549;
	mov.f32 	%f603, %f549;
	mov.f32 	%f604, %f549;
	mov.f32 	%f605, %f549;
	mov.f32 	%f606, %f549;
	mov.f32 	%f607, %f549;
	mov.f32 	%f608, %f549;
	mov.f32 	%f609, %f549;
	mov.f32 	%f610, %f549;
	mov.f32 	%f611, %f549;
	mov.f32 	%f612, %f549;
	@%p1 bra 	BB5_7;

	ld.global.v4.f32 	{%f481, %f482, %f483, %f484}, [%rd3];
	mov.u32 	%r82, 0;
	mov.u32 	%r83, _ZZN4fynv7g_fgemmEPfPKfS2_iiiiE4smem;
	add.s32 	%r86, %r83, 12288;
	add.s32 	%r85, %r83, 8192;
	add.s32 	%r84, %r83, 4096;
	mov.f32 	%f549, 0f00000000;
	mov.f32 	%f550, %f549;
	mov.f32 	%f551, %f549;
	mov.f32 	%f552, %f549;
	mov.f32 	%f553, %f549;
	mov.f32 	%f554, %f549;
	mov.f32 	%f555, %f549;
	mov.f32 	%f556, %f549;
	mov.f32 	%f557, %f549;
	mov.f32 	%f558, %f549;
	mov.f32 	%f559, %f549;
	mov.f32 	%f560, %f549;
	mov.f32 	%f561, %f549;
	mov.f32 	%f562, %f549;
	mov.f32 	%f563, %f549;
	mov.f32 	%f564, %f549;
	mov.f32 	%f565, %f549;
	mov.f32 	%f566, %f549;
	mov.f32 	%f567, %f549;
	mov.f32 	%f568, %f549;
	mov.f32 	%f569, %f549;
	mov.f32 	%f570, %f549;
	mov.f32 	%f571, %f549;
	mov.f32 	%f572, %f549;
	mov.f32 	%f573, %f549;
	mov.f32 	%f574, %f549;
	mov.f32 	%f575, %f549;
	mov.f32 	%f576, %f549;
	mov.f32 	%f577, %f549;
	mov.f32 	%f578, %f549;
	mov.f32 	%f579, %f549;
	mov.f32 	%f580, %f549;
	mov.f32 	%f581, %f549;
	mov.f32 	%f582, %f549;
	mov.f32 	%f583, %f549;
	mov.f32 	%f584, %f549;
	mov.f32 	%f585, %f549;
	mov.f32 	%f586, %f549;
	mov.f32 	%f587, %f549;
	mov.f32 	%f588, %f549;
	mov.f32 	%f589, %f549;
	mov.f32 	%f590, %f549;
	mov.f32 	%f591, %f549;
	mov.f32 	%f592, %f549;
	mov.f32 	%f593, %f549;
	mov.f32 	%f594, %f549;
	mov.f32 	%f595, %f549;
	mov.f32 	%f596, %f549;
	mov.f32 	%f597, %f549;
	mov.f32 	%f598, %f549;
	mov.f32 	%f599, %f549;
	mov.f32 	%f600, %f549;
	mov.f32 	%f601, %f549;
	mov.f32 	%f602, %f549;
	mov.f32 	%f603, %f549;
	mov.f32 	%f604, %f549;
	mov.f32 	%f605, %f549;
	mov.f32 	%f606, %f549;
	mov.f32 	%f607, %f549;
	mov.f32 	%f608, %f549;
	mov.f32 	%f609, %f549;
	mov.f32 	%f610, %f549;
	mov.f32 	%f611, %f549;
	mov.f32 	%f612, %f549;

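	// Outer loop, advancing k by 8 per iteration: write the registers prefetched
	// in the previous iteration to shared memory, synchronize, then issue the
	// global loads for the next k slab before the compute loop at BB5_5.
BB5_2: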
	mov.u32 	%r4, %r85;
	mov.u32 	%r2, %r83;
	shr.u32 	%r76, %r19, 1;
	and.b32  	%r75, %r19, 1;
	shl.b32 	%r39, %r19, 4;
	add.s32 	%r40, %r2, %r39;
	st.shared.v4.f32 	[%r40], {%f485, %f486, %f487, %f488};
	shl.b32 	%r42, %r75, 9;
	add.s32 	%r44, %r42, %r76;
	shl.b32 	%r45, %r44, 2;
	add.s32 	%r46, %r4, %r45;
	st.shared.f32 	[%r46], %f481;
	st.shared.f32 	[%r46+512], %f482;
	st.shared.f32 	[%r46+1024], %f483;
	st.shared.f32 	[%r46+1536], %f484;
	bar.sync 	0;
	add.s32 	%r6, %r82, 8;
	setp.ge.s32	%p2, %r6, %r15;
	@%p2 bra 	BB5_4;

	ld.param.u32 	%r77, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_4];
	shl.b32 	%r47, %r77, 3;
	mul.wide.s32 	%rd20, %r47, 4;
	add.s64 	%rd48, %rd48, %rd20;
	add.s64 	%rd22, %rd48, %rd17;
	ld.global.v4.f32 	{%f485, %f486, %f487, %f488}, [%rd22];
	add.s64 	%rd47, %rd47, 32;
	add.s64 	%rd24, %rd47, %rd19;
	ld.global.v4.f32 	{%f481, %f482, %f483, %f484}, [%rd24];

BB5_4:
	cvt.u16.u32	%rs1, %r19;
	and.b16  	%rs2, %rs1, 15;
	mul.wide.u16 	%r88, %rs2, 16;
	and.b32  	%r89, %r19, -16;
	mov.u32 	%r87, -8;

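	// Inner loop, 8 iterations (%r87 runs from -8 to 0): each iteration reads
	// 8 values of each operand from shared memory (two v4 loads each) and
	// performs the 8x8 outer product as 64 FMAs.
BB5_5: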
	add.s32 	%r59, %r2, %r88;
	ld.shared.v4.f32 	{%f437, %f438, %f439, %f440}, [%r59];
	ld.shared.v4.f32 	{%f445, %f446, %f447, %f448}, [%r59+256];
	add.s32 	%r60, %r4, %r89;
	ld.shared.v4.f32 	{%f453, %f454, %f455, %f456}, [%r60];
	ld.shared.v4.f32 	{%f461, %f462, %f463, %f464}, [%r60+256];
	fma.rn.f32 	%f561, %f437, %f453, %f561;
	fma.rn.f32 	%f562, %f438, %f453, %f562;
	fma.rn.f32 	%f563, %f439, %f453, %f563;
	fma.rn.f32 	%f564, %f440, %f453, %f564;
	fma.rn.f32 	%f565, %f445, %f453, %f565;
	fma.rn.f32 	%f566, %f446, %f453, %f566;
	fma.rn.f32 	%f567, %f447, %f453, %f567;
	fma.rn.f32 	%f568, %f448, %f453, %f568;
	fma.rn.f32 	%f569, %f437, %f454, %f569;
	fma.rn.f32 	%f570, %f438, %f454, %f570;
	fma.rn.f32 	%f571, %f439, %f454, %f571;
	fma.rn.f32 	%f572, %f440, %f454, %f572;
	fma.rn.f32 	%f573, %f445, %f454, %f573;
	fma.rn.f32 	%f574, %f446, %f454, %f574;
	fma.rn.f32 	%f575, %f447, %f454, %f575;
	fma.rn.f32 	%f576, %f448, %f454, %f576;
	fma.rn.f32 	%f577, %f437, %f455, %f577;
	fma.rn.f32 	%f578, %f438, %f455, %f578;
	fma.rn.f32 	%f579, %f439, %f455, %f579;
	fma.rn.f32 	%f580, %f440, %f455, %f580;
	fma.rn.f32 	%f581, %f445, %f455, %f581;
	fma.rn.f32 	%f582, %f446, %f455, %f582;
	fma.rn.f32 	%f583, %f447, %f455, %f583;
	fma.rn.f32 	%f584, %f448, %f455, %f584;
	fma.rn.f32 	%f585, %f437, %f456, %f585;
	fma.rn.f32 	%f586, %f438, %f456, %f586;
	fma.rn.f32 	%f587, %f439, %f456, %f587;
	fma.rn.f32 	%f588, %f440, %f456, %f588;
	fma.rn.f32 	%f589, %f445, %f456, %f589;
	fma.rn.f32 	%f590, %f446, %f456, %f590;
	fma.rn.f32 	%f591, %f447, %f456, %f591;
	fma.rn.f32 	%f592, %f448, %f456, %f592;
	fma.rn.f32 	%f593, %f437, %f461, %f593;
	fma.rn.f32 	%f594, %f438, %f461, %f594;
	fma.rn.f32 	%f595, %f439, %f461, %f595;
	fma.rn.f32 	%f596, %f440, %f461, %f596;
	fma.rn.f32 	%f597, %f445, %f461, %f597;
	fma.rn.f32 	%f598, %f446, %f461, %f598;
	fma.rn.f32 	%f599, %f447, %f461, %f599;
	fma.rn.f32 	%f600, %f448, %f461, %f600;
	fma.rn.f32 	%f601, %f437, %f462, %f601;
	fma.rn.f32 	%f602, %f438, %f462, %f602;
	fma.rn.f32 	%f603, %f439, %f462, %f603;
	fma.rn.f32 	%f604, %f440, %f462, %f604;
	fma.rn.f32 	%f605, %f445, %f462, %f605;
	fma.rn.f32 	%f606, %f446, %f462, %f606;
	fma.rn.f32 	%f607, %f447, %f462, %f607;
	fma.rn.f32 	%f608, %f448, %f462, %f608;
	fma.rn.f32 	%f609, %f437, %f463, %f609;
	fma.rn.f32 	%f610, %f438, %f463, %f610;
	fma.rn.f32 	%f611, %f439, %f463, %f611;
	fma.rn.f32 	%f612, %f440, %f463, %f612;
	fma.rn.f32 	%f560, %f445, %f463, %f560;
	fma.rn.f32 	%f559, %f446, %f463, %f559;
	fma.rn.f32 	%f558, %f447, %f463, %f558;
	fma.rn.f32 	%f557, %f448, %f463, %f557;
	fma.rn.f32 	%f556, %f437, %f464, %f556;
	fma.rn.f32 	%f555, %f438, %f464, %f555;
	fma.rn.f32 	%f554, %f439, %f464, %f554;
	fma.rn.f32 	%f553, %f440, %f464, %f553;
	fma.rn.f32 	%f552, %f445, %f464, %f552;
	fma.rn.f32 	%f551, %f446, %f464, %f551;
	fma.rn.f32 	%f550, %f447, %f464, %f550;
	fma.rn.f32 	%f549, %f448, %f464, %f549;
	add.s32 	%r89, %r89, 512;
	add.s32 	%r88, %r88, 512;
	add.s32 	%r87, %r87, 1;
	setp.ne.s32	%p3, %r87, 0;
	@%p3 bra 	BB5_5;

	add.s32 	%r82, %r82, 8;
	setp.lt.s32	%p4, %r82, %r15;
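	// Swap the two shared-memory buffers of each operand (%r83 <-> %r84,
	// %r85 <-> %r86), i.e. double buffering.
	mov.u32 	%r83, %r84;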
	mov.u32 	%r84, %r2;
	mov.u32 	%r85, %r86;
	mov.u32 	%r86, %r4;
	@%p4 bra 	BB5_2;

BB5_7:
	mov.u32 	%r81, %ctaid.y;
	mov.u32 	%r80, %ctaid.x;
	shl.b32 	%r79, %r80, 7;
	ld.param.u64 	%rd46, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_0];
	ld.param.u32 	%r74, [_ZN4fynv7g_fgemmEPfPKfS2_iiii_param_6];
	and.b32  	%r62, %r19, 15;
	shl.b32 	%r63, %r62, 2;
	shr.u32 	%r64, %r19, 2;
	and.b32  	%r65, %r64, 1073741820;
	add.s32 	%r68, %r79, %r63;
	cvt.u64.u32	%rd25, %r68;
	shl.b32 	%r70, %r81, 7;
	add.s32 	%r71, %r70, %r65;
	mul.lo.s32 	%r72, %r71, %r74;
	cvt.u64.u32	%rd26, %r72;
	add.s64 	%rd27, %rd26, %rd25;
	cvta.to.global.u64 	%rd28, %rd46;
	shl.b64 	%rd29, %rd27, 2;
	add.s64 	%rd30, %rd28, %rd29;
	st.global.v4.f32 	[%rd30], {%f561, %f562, %f563, %f564};
	st.global.v4.f32 	[%rd30+256], {%f565, %f566, %f567, %f568};
	cvt.s64.s32	%rd31, %r74;
	add.s64 	%rd32, %rd27, %rd31;
	mul.wide.s32 	%rd33, %r74, 4;
	add.s64 	%rd34, %rd30, %rd33;
	st.global.v4.f32 	[%rd34], {%f569, %f570, %f571, %f572};
	st.global.v4.f32 	[%rd34+256], {%f573, %f574, %f575, %f576};
	add.s64 	%rd35, %rd32, %rd31;
	add.s64 	%rd36, %rd34, %rd33;
	st.global.v2.f32 	[%rd36], {%f577, %f578};
	st.global.v2.f32 	[%rd36+8], {%f579, %f580};
	st.global.v4.f32 	[%rd36+256], {%f581, %f582, %f583, %f584};
	add.s64 	%rd37, %rd35, %rd31;
	add.s64 	%rd38, %rd36, %rd33;
	st.global.v2.f32 	[%rd38], {%f585, %f586};
	st.global.v2.f32 	[%rd38+8], {%f587, %f588};
	st.global.v4.f32 	[%rd38+256], {%f589, %f590, %f591, %f592};
	mul.lo.s32 	%r73, %r74, 61;
	cvt.s64.s32	%rd39, %r73;
	add.s64 	%rd40, %rd37, %rd39;
	shl.b64 	%rd41, %rd40, 2;
	add.s64 	%rd42, %rd28, %rd41;
	st.global.v2.f32 	[%rd42], {%f593, %f594};
	st.global.v2.f32 	[%rd42+8], {%f595, %f596};
	st.global.v4.f32 	[%rd42+256], {%f597, %f598, %f599, %f600};
	add.s64 	%rd43, %rd42, %rd33;
	st.global.v2.f32 	[%rd43], {%f601, %f602};
	st.global.v2.f32 	[%rd43+8], {%f603, %f604};
	st.global.v4.f32 	[%rd43+256], {%f605, %f606, %f607, %f608};
	add.s64 	%rd44, %rd43, %rd33;
	st.global.v2.f32 	[%rd44], {%f609, %f610};
	st.global.v2.f32 	[%rd44+8], {%f611, %f612};
	st.global.v4.f32 	[%rd44+256], {%f560, %f559, %f558, %f557};
	add.s64 	%rd45, %rd44, %rd33;
	st.global.v2.f32 	[%rd45], {%f556, %f555};
	st.global.v2.f32 	[%rd45+8], {%f554, %f553};
	st.global.v4.f32 	[%rd45+256], {%f552, %f551, %f550, %f549};
	ret;
}

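	// .globl	_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i
	// Demangled: void wuk::gemm_32x32_v0<float>(int, int, int, float, float const*, int, float const*, int, float, float*, int)
	// Parameter roles as used below (cuBLAS-style): m, n, k, alpha, A, lda, B, ldb, beta, C, ldc.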
.visible .entry _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i(
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10
)
.maxntid 1024, 1, 1
{
	.reg .pred 	%p<9>;
	.reg .f32 	%f<41>;
	.reg .b32 	%r<65>;
	.reg .b64 	%rd<85>;


	ld.param.u32 	%r20, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0];
	ld.param.u32 	%r21, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1];
	ld.param.u32 	%r16, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2];
	ld.param.f32 	%f10, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3];
	ld.param.u64 	%rd14, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4];
	ld.param.u32 	%r17, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5];
	ld.param.u64 	%rd15, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6];
	ld.param.u32 	%r18, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7];
	ld.param.f32 	%f11, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8];
	ld.param.u64 	%rd16, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9];
	ld.param.u32 	%r19, [_ZN3wuk13gemm_32x32_v0IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10];
	cvta.to.global.u64 	%rd83, %rd15;
	cvta.to.global.u64 	%rd81, %rd14;
	cvta.to.global.u64 	%rd82, %rd16;
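	// v0 mapping: %ctaid.y selects the 32-row strip (compared against m),
	// %ctaid.x the 32-column strip (compared against n).
	mov.u32 	%r1, %ctaid.y;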
	add.s32 	%r22, %r1, 1;
	shl.b32 	%r23, %r22, 5;
	sub.s32 	%r2, %r23, %r20;
	mov.u32 	%r3, %ctaid.x;
	add.s32 	%r24, %r3, 1;
	shl.b32 	%r25, %r24, 5;
	sub.s32 	%r4, %r25, %r21;
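	// If this tile overruns the m boundary by %r2 rows, shift the A and C
	// pointers back by %r2 elements once, instead of checking bounds in the loop.
	setp.lt.s32	%p1, %r2, 1;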
	@%p1 bra 	BB6_2;

	neg.s32 	%r26, %r2;
	mul.wide.s32 	%rd17, %r26, 4;
	add.s64 	%rd81, %rd81, %rd17;
	add.s64 	%rd82, %rd82, %rd17;

BB6_2:
	// Same for the n boundary: shift B back by %r4 * ldb and C by %r4 * ldc elements.
	setp.lt.s32	%p2, %r4, 1;
	@%p2 bra 	BB6_4;

	neg.s32 	%r27, %r18;
	mul.lo.s32 	%r28, %r4, %r27;
	mul.wide.s32 	%rd18, %r28, 4;
	add.s64 	%rd83, %rd83, %rd18;
	neg.s32 	%r29, %r19;
	mul.lo.s32 	%r30, %r4, %r29;
	mul.wide.s32 	%rd19, %r30, 4;
	add.s64 	%rd82, %rd82, %rd19;

BB6_4:
	shl.b32 	%r31, %r1, 5;
	cvt.u64.u32	%rd12, %r31;
	shl.b32 	%r32, %r3, 5;
	mul.lo.s32 	%r33, %r32, %r18;
	cvt.u64.u32	%rd13, %r33;
	mad.lo.s32 	%r5, %r32, %r19, %r31;
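	// v0: %tid.y is the row index and %tid.x the column index. Assuming a 32x32
	// thread block, the threads of one warp differ only in %tid.x, so their B and
	// C addresses are ldb / ldc elements apart and the accesses do not coalesce.
	mov.u32 	%r6, %tid.y;
	mov.u32 	%r7, %tid.x;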
	mov.f32 	%f40, 0f00000000;
	setp.lt.s32	%p3, %r16, 1;
	@%p3 bra 	BB6_13;

	and.b32  	%r37, %r16, 3;
	mov.f32 	%f40, 0f00000000;
	mov.u32 	%r63, 0;
	setp.eq.s32	%p4, %r37, 0;
	@%p4 bra 	BB6_11;

	setp.eq.s32	%p5, %r37, 1;
	@%p5 bra 	BB6_10;

	setp.eq.s32	%p6, %r37, 2;
	@%p6 bra 	BB6_9;

	cvt.u64.u32	%rd20, %r6;
	add.s64 	%rd21, %rd20, %rd12;
	shl.b64 	%rd22, %rd21, 2;
	add.s64 	%rd23, %rd81, %rd22;
	mul.lo.s32 	%r40, %r7, %r18;
	cvt.u64.u32	%rd24, %r40;
	add.s64 	%rd25, %rd24, %rd13;
	shl.b64 	%rd26, %rd25, 2;
	add.s64 	%rd27, %rd83, %rd26;
	ld.global.f32 	%f16, [%rd27];
	ld.global.f32 	%f17, [%rd23];
	fma.rn.f32 	%f40, %f17, %f16, 0f00000000;
	mov.u32 	%r63, 1;

BB6_9:
	// %r63 is 0 or 1 here, so (-%r63 & lda) equals %r63 * lda without a multiply.
	neg.s32 	%r41, %r63;
	and.b32  	%r42, %r41, %r17;
	add.s32 	%r43, %r42, %r6;
	cvt.u64.u32	%rd28, %r43;
	add.s64 	%rd29, %rd28, %rd12;
	shl.b64 	%rd30, %rd29, 2;
	add.s64 	%rd31, %rd81, %rd30;
	mad.lo.s32 	%r45, %r7, %r18, %r63;
	cvt.u64.u32	%rd32, %r45;
	add.s64 	%rd33, %rd32, %rd13;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd35, %rd83, %rd34;
	ld.global.f32 	%f18, [%rd35];
	ld.global.f32 	%f19, [%rd31];
	fma.rn.f32 	%f40, %f19, %f18, %f40;
	add.s32 	%r63, %r63, 1;

BB6_10:
	mad.lo.s32 	%r46, %r63, %r17, %r6;
	cvt.u64.u32	%rd36, %r46;
	add.s64 	%rd37, %rd36, %rd12;
	shl.b64 	%rd38, %rd37, 2;
	add.s64 	%rd39, %rd81, %rd38;
	mad.lo.s32 	%r48, %r7, %r18, %r63;
	cvt.u64.u32	%rd40, %r48;
	add.s64 	%rd41, %rd40, %rd13;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd83, %rd42;
	ld.global.f32 	%f20, [%rd43];
	ld.global.f32 	%f21, [%rd39];
	fma.rn.f32 	%f40, %f21, %f20, %f40;
	add.s32 	%r63, %r63, 1;

BB6_11:
	mul.lo.s32 	%r13, %r7, %r18;
	setp.lt.u32	%p7, %r16, 4;
	@%p7 bra 	BB6_13;

	// Main k-loop, unrolled 4x by the compiler; BB6_9..BB6_11 above handle k mod 4.
BB6_12:
	mad.lo.s32 	%r49, %r63, %r17, %r6;
	cvt.u64.u32	%rd44, %r49;
	add.s64 	%rd45, %rd44, %rd12;
	shl.b64 	%rd46, %rd45, 2;
	add.s64 	%rd47, %rd81, %rd46;
	add.s32 	%r50, %r13, %r63;
	cvt.u64.u32	%rd48, %r50;
	add.s64 	%rd49, %rd48, %rd13;
	shl.b64 	%rd50, %rd49, 2;
	add.s64 	%rd51, %rd83, %rd50;
	ld.global.f32 	%f22, [%rd51];
	ld.global.f32 	%f23, [%rd47];
	fma.rn.f32 	%f24, %f23, %f22, %f40;
	add.s32 	%r51, %r63, 1;
	mad.lo.s32 	%r52, %r51, %r17, %r6;
	cvt.u64.u32	%rd52, %r52;
	add.s64 	%rd53, %rd52, %rd12;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd81, %rd54;
	add.s32 	%r53, %r13, %r51;
	cvt.u64.u32	%rd56, %r53;
	add.s64 	%rd57, %rd56, %rd13;
	shl.b64 	%rd58, %rd57, 2;
	add.s64 	%rd59, %rd83, %rd58;
	ld.global.f32 	%f25, [%rd59];
	ld.global.f32 	%f26, [%rd55];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	add.s32 	%r54, %r63, 2;
	mad.lo.s32 	%r55, %r54, %r17, %r6;
	cvt.u64.u32	%rd60, %r55;
	add.s64 	%rd61, %rd60, %rd12;
	shl.b64 	%rd62, %rd61, 2;
	add.s64 	%rd63, %rd81, %rd62;
	add.s32 	%r56, %r13, %r54;
	cvt.u64.u32	%rd64, %r56;
	add.s64 	%rd65, %rd64, %rd13;
	shl.b64 	%rd66, %rd65, 2;
	add.s64 	%rd67, %rd83, %rd66;
	ld.global.f32 	%f28, [%rd67];
	ld.global.f32 	%f29, [%rd63];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	add.s32 	%r57, %r63, 3;
	mad.lo.s32 	%r58, %r57, %r17, %r6;
	cvt.u64.u32	%rd68, %r58;
	add.s64 	%rd69, %rd68, %rd12;
	shl.b64 	%rd70, %rd69, 2;
	add.s64 	%rd71, %rd81, %rd70;
	add.s32 	%r59, %r13, %r57;
	cvt.u64.u32	%rd72, %r59;
	add.s64 	%rd73, %rd72, %rd13;
	shl.b64 	%rd74, %rd73, 2;
	add.s64 	%rd75, %rd83, %rd74;
	ld.global.f32 	%f31, [%rd75];
	ld.global.f32 	%f32, [%rd71];
	fma.rn.f32 	%f40, %f32, %f31, %f30;
	add.s32 	%r63, %r63, 4;
	setp.lt.s32	%p8, %r63, %r16;
	@%p8 bra 	BB6_12;

BB6_13:
	mad.lo.s32 	%r60, %r7, %r19, %r6;
	cvt.u64.u32	%rd76, %r60;
	cvt.u64.u32	%rd77, %r5;
	add.s64 	%rd78, %rd76, %rd77;
	shl.b64 	%rd79, %rd78, 2;
	add.s64 	%rd80, %rd82, %rd79;
	// Epilogue: C = alpha * acc + beta * C (%f10 = alpha, %f11 = beta).
	ld.global.f32 	%f33, [%rd80];
	mul.f32 	%f34, %f33, %f11;
	fma.rn.f32 	%f35, %f40, %f10, %f34;
	st.global.f32 	[%rd80], %f35;
	ret;
}

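	// .globl	_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i
	// Demangled: void wuk::gemm_32x32_v1<float>(int, int, int, float, float const*, int, float const*, int, float, float*, int)
	// Same instruction stream as gemm_32x32_v0 except that the x and y roles of %ctaid and %tid are swapped.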
.visible .entry _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i(
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10
)
.maxntid 1024, 1, 1
{
	.reg .pred 	%p<9>;
	.reg .f32 	%f<41>;
	.reg .b32 	%r<65>;
	.reg .b64 	%rd<85>;


	ld.param.u32 	%r20, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0];
	ld.param.u32 	%r21, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1];
	ld.param.u32 	%r16, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2];
	ld.param.f32 	%f10, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3];
	ld.param.u64 	%rd14, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4];
	ld.param.u32 	%r17, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5];
	ld.param.u64 	%rd15, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6];
	ld.param.u32 	%r18, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7];
	ld.param.f32 	%f11, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8];
	ld.param.u64 	%rd16, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9];
	ld.param.u32 	%r19, [_ZN3wuk13gemm_32x32_v1IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10];
	cvta.to.global.u64 	%rd83, %rd15;
	cvta.to.global.u64 	%rd81, %rd14;
	cvta.to.global.u64 	%rd82, %rd16;
	// v1 mapping: %ctaid.x selects the 32-row strip, %ctaid.y the 32-column strip.
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r22, %r1, 1;
	shl.b32 	%r23, %r22, 5;
	sub.s32 	%r2, %r23, %r20;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r24, %r3, 1;
	shl.b32 	%r25, %r24, 5;
	sub.s32 	%r4, %r25, %r21;
	setp.lt.s32	%p1, %r2, 1;
	@%p1 bra 	BB7_2;

	neg.s32 	%r26, %r2;
	mul.wide.s32 	%rd17, %r26, 4;
	add.s64 	%rd81, %rd81, %rd17;
	add.s64 	%rd82, %rd82, %rd17;

BB7_2:
	setp.lt.s32	%p2, %r4, 1;
	@%p2 bra 	BB7_4;

	neg.s32 	%r27, %r18;
	mul.lo.s32 	%r28, %r4, %r27;
	mul.wide.s32 	%rd18, %r28, 4;
	add.s64 	%rd83, %rd83, %rd18;
	neg.s32 	%r29, %r19;
	mul.lo.s32 	%r30, %r4, %r29;
	mul.wide.s32 	%rd19, %r30, 4;
	add.s64 	%rd82, %rd82, %rd19;

BB7_4:
	shl.b32 	%r31, %r1, 5;
	cvt.u64.u32	%rd12, %r31;
	shl.b32 	%r32, %r3, 5;
	mul.lo.s32 	%r33, %r32, %r18;
	cvt.u64.u32	%rd13, %r33;
	mad.lo.s32 	%r5, %r32, %r19, %r31;
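	// v1: %tid.x is the row index and %tid.y the column index. Assuming a 32x32
	// thread block, the threads of one warp read consecutive elements of a column
	// of A and write consecutive elements of C, so these accesses can coalesce.
	mov.u32 	%r6, %tid.x;
	mov.u32 	%r7, %tid.y;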
	mov.f32 	%f40, 0f00000000;
	setp.lt.s32	%p3, %r16, 1;
	@%p3 bra 	BB7_13;

	and.b32  	%r37, %r16, 3;
	mov.f32 	%f40, 0f00000000;
	mov.u32 	%r63, 0;
	setp.eq.s32	%p4, %r37, 0;
	@%p4 bra 	BB7_11;

	setp.eq.s32	%p5, %r37, 1;
	@%p5 bra 	BB7_10;

	setp.eq.s32	%p6, %r37, 2;
	@%p6 bra 	BB7_9;

	cvt.u64.u32	%rd20, %r6;
	add.s64 	%rd21, %rd20, %rd12;
	shl.b64 	%rd22, %rd21, 2;
	add.s64 	%rd23, %rd81, %rd22;
	mul.lo.s32 	%r40, %r7, %r18;
	cvt.u64.u32	%rd24, %r40;
	add.s64 	%rd25, %rd24, %rd13;
	shl.b64 	%rd26, %rd25, 2;
	add.s64 	%rd27, %rd83, %rd26;
	ld.global.f32 	%f16, [%rd27];
	ld.global.f32 	%f17, [%rd23];
	fma.rn.f32 	%f40, %f17, %f16, 0f00000000;
	mov.u32 	%r63, 1;

BB7_9:
	neg.s32 	%r41, %r63;
	and.b32  	%r42, %r41, %r17;
	add.s32 	%r43, %r42, %r6;
	cvt.u64.u32	%rd28, %r43;
	add.s64 	%rd29, %rd28, %rd12;
	shl.b64 	%rd30, %rd29, 2;
	add.s64 	%rd31, %rd81, %rd30;
	mad.lo.s32 	%r45, %r7, %r18, %r63;
	cvt.u64.u32	%rd32, %r45;
	add.s64 	%rd33, %rd32, %rd13;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd35, %rd83, %rd34;
	ld.global.f32 	%f18, [%rd35];
	ld.global.f32 	%f19, [%rd31];
	fma.rn.f32 	%f40, %f19, %f18, %f40;
	add.s32 	%r63, %r63, 1;

BB7_10:
	mad.lo.s32 	%r46, %r63, %r17, %r6;
	cvt.u64.u32	%rd36, %r46;
	add.s64 	%rd37, %rd36, %rd12;
	shl.b64 	%rd38, %rd37, 2;
	add.s64 	%rd39, %rd81, %rd38;
	mad.lo.s32 	%r48, %r7, %r18, %r63;
	cvt.u64.u32	%rd40, %r48;
	add.s64 	%rd41, %rd40, %rd13;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd83, %rd42;
	ld.global.f32 	%f20, [%rd43];
	ld.global.f32 	%f21, [%rd39];
	fma.rn.f32 	%f40, %f21, %f20, %f40;
	add.s32 	%r63, %r63, 1;

BB7_11:
	mul.lo.s32 	%r13, %r7, %r18;
	setp.lt.u32	%p7, %r16, 4;
	@%p7 bra 	BB7_13;

BB7_12:
	mad.lo.s32 	%r49, %r63, %r17, %r6;
	cvt.u64.u32	%rd44, %r49;
	add.s64 	%rd45, %rd44, %rd12;
	shl.b64 	%rd46, %rd45, 2;
	add.s64 	%rd47, %rd81, %rd46;
	add.s32 	%r50, %r13, %r63;
	cvt.u64.u32	%rd48, %r50;
	add.s64 	%rd49, %rd48, %rd13;
	shl.b64 	%rd50, %rd49, 2;
	add.s64 	%rd51, %rd83, %rd50;
	ld.global.f32 	%f22, [%rd51];
	ld.global.f32 	%f23, [%rd47];
	fma.rn.f32 	%f24, %f23, %f22, %f40;
	add.s32 	%r51, %r63, 1;
	mad.lo.s32 	%r52, %r51, %r17, %r6;
	cvt.u64.u32	%rd52, %r52;
	add.s64 	%rd53, %rd52, %rd12;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd81, %rd54;
	add.s32 	%r53, %r13, %r51;
	cvt.u64.u32	%rd56, %r53;
	add.s64 	%rd57, %rd56, %rd13;
	shl.b64 	%rd58, %rd57, 2;
	add.s64 	%rd59, %rd83, %rd58;
	ld.global.f32 	%f25, [%rd59];
	ld.global.f32 	%f26, [%rd55];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	add.s32 	%r54, %r63, 2;
	mad.lo.s32 	%r55, %r54, %r17, %r6;
	cvt.u64.u32	%rd60, %r55;
	add.s64 	%rd61, %rd60, %rd12;
	shl.b64 	%rd62, %rd61, 2;
	add.s64 	%rd63, %rd81, %rd62;
	add.s32 	%r56, %r13, %r54;
	cvt.u64.u32	%rd64, %r56;
	add.s64 	%rd65, %rd64, %rd13;
	shl.b64 	%rd66, %rd65, 2;
	add.s64 	%rd67, %rd83, %rd66;
	ld.global.f32 	%f28, [%rd67];
	ld.global.f32 	%f29, [%rd63];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	add.s32 	%r57, %r63, 3;
	mad.lo.s32 	%r58, %r57, %r17, %r6;
	cvt.u64.u32	%rd68, %r58;
	add.s64 	%rd69, %rd68, %rd12;
	shl.b64 	%rd70, %rd69, 2;
	add.s64 	%rd71, %rd81, %rd70;
	add.s32 	%r59, %r13, %r57;
	cvt.u64.u32	%rd72, %r59;
	add.s64 	%rd73, %rd72, %rd13;
	shl.b64 	%rd74, %rd73, 2;
	add.s64 	%rd75, %rd83, %rd74;
	ld.global.f32 	%f31, [%rd75];
	ld.global.f32 	%f32, [%rd71];
	fma.rn.f32 	%f40, %f32, %f31, %f30;
	add.s32 	%r63, %r63, 4;
	setp.lt.s32	%p8, %r63, %r16;
	@%p8 bra 	BB7_12;

BB7_13:
	mad.lo.s32 	%r60, %r7, %r19, %r6;
	cvt.u64.u32	%rd76, %r60;
	cvt.u64.u32	%rd77, %r5;
	add.s64 	%rd78, %rd76, %rd77;
	shl.b64 	%rd79, %rd78, 2;
	add.s64 	%rd80, %rd82, %rd79;
	ld.global.f32 	%f33, [%rd80];
	mul.f32 	%f34, %f33, %f11;
	fma.rn.f32 	%f35, %f40, %f10, %f34;
	st.global.f32 	[%rd80], %f35;
	ret;
}

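	// .globl	_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i
	// Demangled: void wuk::gemm_32x32_v2<float>(int, int, int, float, float const*, int, float const*, int, float, float*, int)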
.visible .entry _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i(
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10
)
.maxntid 1024, 1, 1
{
	.reg .pred 	%p<9>;
	.reg .f32 	%f<41>;
	.reg .b32 	%r<60>;
	.reg .b64 	%rd<69>;


	ld.param.u32 	%r23, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_0];
	ld.param.u32 	%r24, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_1];
	ld.param.u32 	%r19, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_2];
	ld.param.f32 	%f10, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_3];
	ld.param.u64 	%rd18, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_4];
	ld.param.u32 	%r20, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_5];
	ld.param.u64 	%rd19, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_6];
	ld.param.u32 	%r21, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_7];
	ld.param.f32 	%f11, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_8];
	ld.param.u64 	%rd20, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_9];
	ld.param.u32 	%r22, [_ZN3wuk13gemm_32x32_v2IfEEviiiT_PKS1_iS3_iS1_PS1_i_param_10];
	cvta.to.global.u64 	%rd66, %rd19;
	cvta.to.global.u64 	%rd64, %rd18;
	cvta.to.global.u64 	%rd65, %rd20;
	mov.u32 	%r1, %tid.x;
	mov.u32 	%r2, %tid.y;
	mov.u32 	%r3, %ctaid.x;
	add.s32 	%r25, %r3, 1;
	shl.b32 	%r26, %r25, 5;
	sub.s32 	%r4, %r26, %r23;
	mov.u32 	%r27, %ctaid.y;
	add.s32 	%r28, %r27, 1;
	shl.b32 	%r29, %r28, 5;
	sub.s32 	%r5, %r29, %r24;
	setp.lt.s32	%p1, %r4, 1;
	@%p1 bra 	BB8_2;

	neg.s32 	%r30, %r4;
	mul.wide.s32 	%rd21, %r30, 4;
	add.s64 	%rd64, %rd64, %rd21;
	add.s64 	%rd65, %rd65, %rd21;

BB8_2:
	setp.lt.s32	%p2, %r5, 1;
	@%p2 bra 	BB8_4;

	neg.s32 	%r31, %r21;
	mul.lo.s32 	%r32, %r5, %r31;
	mul.wide.s32 	%rd22, %r32, 4;
	add.s64 	%rd66, %rd66, %rd22;
	neg.s32 	%r33, %r22;
	mul.lo.s32 	%r34, %r5, %r33;
	mul.wide.s32 	%rd23, %r34, 4;
	add.s64 	%rd65, %rd65, %rd23;

BB8_4:
	shl.b32 	%r35, %r3, 5;
	cvt.u64.u32	%rd12, %r35;
	shl.b32 	%r37, %r27, 5;
	mul.lo.s32 	%r38, %r37, %r21;
	cvt.u64.u32	%rd13, %r38;
	mad.lo.s32 	%r6, %r37, %r22, %r35;
	mov.f32 	%f40, 0f00000000;
	setp.lt.s32	%p3, %r19, 1;
	@%p3 bra 	BB8_14;

	mul.lo.s32 	%r7, %r2, %r21;
	and.b32  	%r42, %r19, 3;
	mov.f32 	%f40, 0f00000000;
	mov.u32 	%r57, 0;
	setp.eq.s32	%p4, %r42, 0;
	@%p4 bra 	BB8_11;

	setp.eq.s32	%p5, %r42, 1;
	@%p5 bra 	BB8_10;

	setp.eq.s32	%p6, %r42, 2;
	@%p6 bra 	BB8_9;

	cvt.s64.s32	%rd24, %r1;
	add.s64 	%rd25, %rd24, %rd12;
	shl.b64 	%rd26, %rd25, 2;
	add.s64 	%rd27, %rd64, %rd26;
	cvt.s64.s32	%rd28, %r7;
	add.s64 	%rd29, %rd28, %rd13;
	shl.b64 	%rd30, %rd29, 2;
	add.s64 	%rd31, %rd66, %rd30;
	ld.global.f32 	%f16, [%rd31];
	ld.global.f32 	%f17, [%rd27];
	fma.rn.f32 	%f40, %f17, %f16, 0f00000000;
	mov.u32 	%r57, 1;

BB8_9:
	neg.s32 	%r44, %r57;
	and.b32  	%r45, %r44, %r20;
	add.s32 	%r46, %r45, %r1;
	cvt.s64.s32	%rd32, %r46;
	add.s64 	%rd33, %rd32, %rd12;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd35, %rd64, %rd34;
	add.s32 	%r47, %r57, %r7;
	cvt.s64.s32	%rd36, %r47;
	add.s64 	%rd37, %rd36, %rd13;
	shl.b64 	%rd38, %rd37, 2;
	add.s64 	%rd39, %rd66, %rd38;
	ld.global.f32 	%f18, [%rd39];
	ld.global.f32 	%f19, [%rd35];
	fma.rn.f32 	%f40, %f19, %f18, %f40;
	add.s32 	%r57, %r57, 1;

BB8_10:
	mad.lo.s32 	%r48, %r57, %r20, %r1;
	cvt.s64.s32	%rd40, %r48;
	add.s64 	%rd41, %rd40, %rd12;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd64, %rd42;
	add.s32 	%r49, %r57, %r7;
	cvt.s64.s32	%rd44, %r49;
	add.s64 	%rd45, %rd44, %rd13;
	shl.b64 	%rd46, %rd45, 2;
	add.s64 	%rd47, %rd66, %rd46;
	ld.global.f32 	%f20, [%rd47];
	ld.global.f32 	%f21, [%rd43];
	fma.rn.f32 	%f40, %f21, %f20, %f40;
	add.s32 	%r57, %r57, 1;

BB8_11:
	setp.lt.u32	%p7, %r19, 4;
	@%p7 bra 	BB8_14;

	mad.lo.s32 	%r50, %r2, %r21, %r57;
	cvt.s64.s32	%rd48, %r50;
	mul.lo.s32 	%r52, %r27, %r21;
	shl.b32 	%r53, %r52, 5;
	cvt.u64.u32	%rd49, %r53;
	add.s64 	%rd50, %rd48, %rd49;
	shl.b64 	%rd51, %rd50, 2;
	add.s64 	%rd68, %rd66, %rd51;
	shl.b32 	%r13, %r20, 2;
	mad.lo.s32 	%r58, %r57, %r20, %r1;
	mul.wide.s32 	%rd15, %r20, 4;

BB8_13:
	cvt.s64.s32	%rd52, %r58;
	add.s64 	%rd53, %rd52, %rd12;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd64, %rd54;
	ld.global.f32 	%f22, [%rd68];
	ld.global.f32 	%f23, [%rd55];
	fma.rn.f32 	%f24, %f23, %f22, %f40;
	add.s64 	%rd56, %rd55, %rd15;
	ld.global.f32 	%f25, [%rd68+4];
	ld.global.f32 	%f26, [%rd56];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	add.s64 	%rd57, %rd56, %rd15;
	ld.global.f32 	%f28, [%rd68+8];
	ld.global.f32 	%f29, [%rd57];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	add.s64 	%rd58, %rd57, %rd15;
	ld.global.f32 	%f31, [%rd68+12];
	ld.global.f32 	%f32, [%rd58];
	fma.rn.f32 	%f40, %f32, %f31, %f30;
	add.s64 	%rd68, %rd68, 16;
	add.s32 	%r58, %r58, %r13;
	add.s32 	%r57, %r57, 4;
	setp.lt.s32	%p8, %r57, %r19;
	@%p8 bra 	BB8_13;

BB8_14:
	mad.lo.s32 	%r54, %r2, %r22, %r1;
	cvt.s64.s32	%rd59, %r54;
	cvt.u64.u32	%rd60, %r6;
	add.s64 	%rd61, %rd60, %rd59;
	shl.b64 	%rd62, %rd61, 2;
	add.s64 	%rd63, %rd65, %rd62;
	ld.global.f32 	%f33, [%rd63];
	mul.f32 	%f34, %f33, %f11;
	fma.rn.f32 	%f35, %f40, %f10, %f34;
	st.global.f32 	[%rd63], %f35;
	ret;
}

	// .globl	_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i
.visible .entry _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i(
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_10
)
.maxntid 1024, 1, 1
{
	.reg .pred 	%p<5>;
	.reg .f32 	%f<110>;
	.reg .b32 	%r<52>;
	.reg .b64 	%rd<37>;
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA[4096];
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB[4096];

	ld.param.u32 	%r16, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_0];
	ld.param.u32 	%r17, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_1];
	ld.param.u32 	%r12, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_2];
	ld.param.f32 	%f4, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_3];
	ld.param.u64 	%rd14, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_4];
	ld.param.u32 	%r13, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_5];
	ld.param.u64 	%rd15, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_6];
	ld.param.u32 	%r14, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_7];
	ld.param.f32 	%f5, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_8];
	ld.param.u64 	%rd16, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_9];
	ld.param.u32 	%r15, [_ZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_10];
	cvta.to.global.u64 	%rd35, %rd15;
	cvta.to.global.u64 	%rd33, %rd14;
	cvta.to.global.u64 	%rd34, %rd16;
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r18, %r1, 1;
	shl.b32 	%r19, %r18, 5;
	sub.s32 	%r2, %r19, %r16;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r20, %r3, 1;
	shl.b32 	%r21, %r20, 5;
	sub.s32 	%r4, %r21, %r17;
	setp.lt.s32	%p1, %r2, 1;
	@%p1 bra 	BB9_2;

	neg.s32 	%r22, %r2;
	mul.wide.s32 	%rd17, %r22, 4;
	add.s64 	%rd33, %rd33, %rd17;
	add.s64 	%rd34, %rd34, %rd17;

BB9_2:
	setp.lt.s32	%p2, %r4, 1;
	@%p2 bra 	BB9_4;

	neg.s32 	%r23, %r14;
	mul.lo.s32 	%r24, %r4, %r23;
	mul.wide.s32 	%rd18, %r24, 4;
	add.s64 	%rd35, %rd35, %rd18;
	neg.s32 	%r25, %r15;
	mul.lo.s32 	%r26, %r4, %r25;
	mul.wide.s32 	%rd19, %r26, 4;
	add.s64 	%rd34, %rd34, %rd19;

BB9_4:
	mov.f32 	%f109, 0f00000000;
	setp.lt.s32	%p3, %r12, 1;
	@%p3 bra 	BB9_7;

	mov.u32 	%r28, %tid.x;
	mov.u32 	%r29, %tid.y;
	shl.b32 	%r30, %r1, 5;
	shl.b32 	%r31, %r28, 7;
	mov.u32 	%r32, _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA;
	add.s32 	%r8, %r32, %r31;
	shl.b32 	%r33, %r29, 2;
	add.s32 	%r5, %r8, %r33;
	mad.lo.s32 	%r6, %r29, %r14, %r28;
	mov.u32 	%r34, _ZZN3wuk13gemm_32x32_v3IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB;
	add.s32 	%r35, %r34, %r31;
	add.s32 	%r7, %r35, %r33;
	add.s32 	%r9, %r34, %r33;
	mul.lo.s32 	%r36, %r3, %r14;
	shl.b32 	%r37, %r36, 5;
	cvt.u64.u32	%rd12, %r37;
	cvt.u64.u32	%rd13, %r30;
	mov.f32 	%f109, 0f00000000;
	mov.u32 	%r51, 0;

BB9_6:
	add.s32 	%r39, %r51, %r29;
	mad.lo.s32 	%r41, %r39, %r13, %r28;
	cvt.s64.s32	%rd20, %r41;
	add.s64 	%rd21, %rd20, %rd13;
	shl.b64 	%rd22, %rd21, 2;
	add.s64 	%rd23, %rd33, %rd22;
	ld.global.f32 	%f8, [%rd23];
	st.shared.f32 	[%r5], %f8;
	add.s32 	%r42, %r6, %r51;
	cvt.s64.s32	%rd24, %r42;
	add.s64 	%rd25, %rd24, %rd12;
	shl.b64 	%rd26, %rd25, 2;
	add.s64 	%rd27, %rd35, %rd26;
	ld.global.f32 	%f9, [%rd27];
	st.shared.f32 	[%r7], %f9;
	bar.sync 	0;
	ld.shared.f32 	%f10, [%r9];
	ld.shared.f32 	%f11, [%r8];
	fma.rn.f32 	%f12, %f11, %f10, %f109;
	ld.shared.f32 	%f13, [%r9+128];
	ld.shared.f32 	%f14, [%r8+4];
	fma.rn.f32 	%f15, %f14, %f13, %f12;
	ld.shared.f32 	%f16, [%r9+256];
	ld.shared.f32 	%f17, [%r8+8];
	fma.rn.f32 	%f18, %f17, %f16, %f15;
	ld.shared.f32 	%f19, [%r9+384];
	ld.shared.f32 	%f20, [%r8+12];
	fma.rn.f32 	%f21, %f20, %f19, %f18;
	ld.shared.f32 	%f22, [%r9+512];
	ld.shared.f32 	%f23, [%r8+16];
	fma.rn.f32 	%f24, %f23, %f22, %f21;
	ld.shared.f32 	%f25, [%r9+640];
	ld.shared.f32 	%f26, [%r8+20];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	ld.shared.f32 	%f28, [%r9+768];
	ld.shared.f32 	%f29, [%r8+24];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	ld.shared.f32 	%f31, [%r9+896];
	ld.shared.f32 	%f32, [%r8+28];
	fma.rn.f32 	%f33, %f32, %f31, %f30;
	ld.shared.f32 	%f34, [%r9+1024];
	ld.shared.f32 	%f35, [%r8+32];
	fma.rn.f32 	%f36, %f35, %f34, %f33;
	ld.shared.f32 	%f37, [%r9+1152];
	ld.shared.f32 	%f38, [%r8+36];
	fma.rn.f32 	%f39, %f38, %f37, %f36;
	ld.shared.f32 	%f40, [%r9+1280];
	ld.shared.f32 	%f41, [%r8+40];
	fma.rn.f32 	%f42, %f41, %f40, %f39;
	ld.shared.f32 	%f43, [%r9+1408];
	ld.shared.f32 	%f44, [%r8+44];
	fma.rn.f32 	%f45, %f44, %f43, %f42;
	ld.shared.f32 	%f46, [%r9+1536];
	ld.shared.f32 	%f47, [%r8+48];
	fma.rn.f32 	%f48, %f47, %f46, %f45;
	ld.shared.f32 	%f49, [%r9+1664];
	ld.shared.f32 	%f50, [%r8+52];
	fma.rn.f32 	%f51, %f50, %f49, %f48;
	ld.shared.f32 	%f52, [%r9+1792];
	ld.shared.f32 	%f53, [%r8+56];
	fma.rn.f32 	%f54, %f53, %f52, %f51;
	ld.shared.f32 	%f55, [%r9+1920];
	ld.shared.f32 	%f56, [%r8+60];
	fma.rn.f32 	%f57, %f56, %f55, %f54;
	ld.shared.f32 	%f58, [%r9+2048];
	ld.shared.f32 	%f59, [%r8+64];
	fma.rn.f32 	%f60, %f59, %f58, %f57;
	ld.shared.f32 	%f61, [%r9+2176];
	ld.shared.f32 	%f62, [%r8+68];
	fma.rn.f32 	%f63, %f62, %f61, %f60;
	ld.shared.f32 	%f64, [%r9+2304];
	ld.shared.f32 	%f65, [%r8+72];
	fma.rn.f32 	%f66, %f65, %f64, %f63;
	ld.shared.f32 	%f67, [%r9+2432];
	ld.shared.f32 	%f68, [%r8+76];
	fma.rn.f32 	%f69, %f68, %f67, %f66;
	ld.shared.f32 	%f70, [%r9+2560];
	ld.shared.f32 	%f71, [%r8+80];
	fma.rn.f32 	%f72, %f71, %f70, %f69;
	ld.shared.f32 	%f73, [%r9+2688];
	ld.shared.f32 	%f74, [%r8+84];
	fma.rn.f32 	%f75, %f74, %f73, %f72;
	ld.shared.f32 	%f76, [%r9+2816];
	ld.shared.f32 	%f77, [%r8+88];
	fma.rn.f32 	%f78, %f77, %f76, %f75;
	ld.shared.f32 	%f79, [%r9+2944];
	ld.shared.f32 	%f80, [%r8+92];
	fma.rn.f32 	%f81, %f80, %f79, %f78;
	ld.shared.f32 	%f82, [%r9+3072];
	ld.shared.f32 	%f83, [%r8+96];
	fma.rn.f32 	%f84, %f83, %f82, %f81;
	ld.shared.f32 	%f85, [%r9+3200];
	ld.shared.f32 	%f86, [%r8+100];
	fma.rn.f32 	%f87, %f86, %f85, %f84;
	ld.shared.f32 	%f88, [%r9+3328];
	ld.shared.f32 	%f89, [%r8+104];
	fma.rn.f32 	%f90, %f89, %f88, %f87;
	ld.shared.f32 	%f91, [%r9+3456];
	ld.shared.f32 	%f92, [%r8+108];
	fma.rn.f32 	%f93, %f92, %f91, %f90;
	ld.shared.f32 	%f94, [%r9+3584];
	ld.shared.f32 	%f95, [%r8+112];
	fma.rn.f32 	%f96, %f95, %f94, %f93;
	ld.shared.f32 	%f97, [%r9+3712];
	ld.shared.f32 	%f98, [%r8+116];
	fma.rn.f32 	%f99, %f98, %f97, %f96;
	ld.shared.f32 	%f100, [%r9+3840];
	ld.shared.f32 	%f101, [%r8+120];
	fma.rn.f32 	%f102, %f101, %f100, %f99;
	ld.shared.f32 	%f103, [%r9+3968];
	ld.shared.f32 	%f104, [%r8+124];
	fma.rn.f32 	%f109, %f104, %f103, %f102;
	bar.sync 	0;
	add.s32 	%r51, %r51, 32;
	setp.lt.s32	%p4, %r51, %r12;
	@%p4 bra 	BB9_6;

BB9_7:
	mov.u32 	%r43, %tid.y;
	mov.u32 	%r44, %tid.x;
	mad.lo.s32 	%r45, %r43, %r15, %r44;
	cvt.s64.s32	%rd28, %r45;
	shl.b32 	%r47, %r3, 5;
	shl.b32 	%r49, %r1, 5;
	mad.lo.s32 	%r50, %r47, %r15, %r49;
	cvt.u64.u32	%rd29, %r50;
	add.s64 	%rd30, %rd29, %rd28;
	shl.b64 	%rd31, %rd30, 2;
	add.s64 	%rd32, %rd34, %rd31;
	ld.global.f32 	%f105, [%rd32];
	mul.f32 	%f106, %f105, %f5;
	fma.rn.f32 	%f107, %f109, %f4, %f106;
	st.global.f32 	[%rd32], %f107;
	ret;
}

	// .globl	_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i
.visible .entry _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i(
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_10
)
.maxntid 1024, 1, 1
{
	.reg .pred 	%p<5>;
	.reg .f32 	%f<110>;
	.reg .b32 	%r<52>;
	.reg .b64 	%rd<37>;
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA[4224];
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB[4224];

	ld.param.u32 	%r16, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_0];
	ld.param.u32 	%r17, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_1];
	ld.param.u32 	%r12, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_2];
	ld.param.f32 	%f4, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_3];
	ld.param.u64 	%rd14, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_4];
	ld.param.u32 	%r13, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_5];
	ld.param.u64 	%rd15, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_6];
	ld.param.u32 	%r14, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_7];
	ld.param.f32 	%f5, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_8];
	ld.param.u64 	%rd16, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_9];
	ld.param.u32 	%r15, [_ZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_i_param_10];
	cvta.to.global.u64 	%rd35, %rd15;
	cvta.to.global.u64 	%rd33, %rd14;
	cvta.to.global.u64 	%rd34, %rd16;
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r18, %r1, 1;
	shl.b32 	%r19, %r18, 5;
	sub.s32 	%r2, %r19, %r16;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r20, %r3, 1;
	shl.b32 	%r21, %r20, 5;
	sub.s32 	%r4, %r21, %r17;
	setp.lt.s32	%p1, %r2, 1;
	@%p1 bra 	BB10_2;

	neg.s32 	%r22, %r2;
	mul.wide.s32 	%rd17, %r22, 4;
	add.s64 	%rd33, %rd33, %rd17;
	add.s64 	%rd34, %rd34, %rd17;

BB10_2:
	setp.lt.s32	%p2, %r4, 1;
	@%p2 bra 	BB10_4;

	neg.s32 	%r23, %r14;
	mul.lo.s32 	%r24, %r4, %r23;
	mul.wide.s32 	%rd18, %r24, 4;
	add.s64 	%rd35, %rd35, %rd18;
	neg.s32 	%r25, %r15;
	mul.lo.s32 	%r26, %r4, %r25;
	mul.wide.s32 	%rd19, %r26, 4;
	add.s64 	%rd34, %rd34, %rd19;

BB10_4:
	mov.f32 	%f109, 0f00000000;
	setp.lt.s32	%p3, %r12, 1;
	@%p3 bra 	BB10_7;

	mov.u32 	%r28, %tid.x;
	mov.u32 	%r29, %tid.y;
	shl.b32 	%r30, %r1, 5;
	mul.lo.s32 	%r31, %r28, 132;
	mov.u32 	%r32, _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sA;
	add.s32 	%r8, %r32, %r31;
	shl.b32 	%r33, %r29, 2;
	add.s32 	%r5, %r8, %r33;
	mad.lo.s32 	%r6, %r29, %r14, %r28;
	mov.u32 	%r34, _ZZN3wuk13gemm_32x32_v4IfLi32EEEviiiT_PKS1_iS3_iS1_PS1_iE2sB;
	add.s32 	%r35, %r34, %r31;
	add.s32 	%r7, %r35, %r33;
	add.s32 	%r9, %r34, %r33;
	mul.lo.s32 	%r36, %r3, %r14;
	shl.b32 	%r37, %r36, 5;
	cvt.u64.u32	%rd12, %r37;
	cvt.u64.u32	%rd13, %r30;
	mov.f32 	%f109, 0f00000000;
	mov.u32 	%r51, 0;

BB10_6:
	add.s32 	%r39, %r51, %r29;
	mad.lo.s32 	%r41, %r39, %r13, %r28;
	cvt.s64.s32	%rd20, %r41;
	add.s64 	%rd21, %rd20, %rd13;
	shl.b64 	%rd22, %rd21, 2;
	add.s64 	%rd23, %rd33, %rd22;
	ld.global.f32 	%f8, [%rd23];
	st.shared.f32 	[%r5], %f8;
	add.s32 	%r42, %r6, %r51;
	cvt.s64.s32	%rd24, %r42;
	add.s64 	%rd25, %rd24, %rd12;
	shl.b64 	%rd26, %rd25, 2;
	add.s64 	%rd27, %rd35, %rd26;
	ld.global.f32 	%f9, [%rd27];
	st.shared.f32 	[%r7], %f9;
	bar.sync 	0;
	ld.shared.f32 	%f10, [%r9];
	ld.shared.f32 	%f11, [%r8];
	fma.rn.f32 	%f12, %f11, %f10, %f109;
	ld.shared.f32 	%f13, [%r9+132];
	ld.shared.f32 	%f14, [%r8+4];
	fma.rn.f32 	%f15, %f14, %f13, %f12;
	ld.shared.f32 	%f16, [%r9+264];
	ld.shared.f32 	%f17, [%r8+8];
	fma.rn.f32 	%f18, %f17, %f16, %f15;
	ld.shared.f32 	%f19, [%r9+396];
	ld.shared.f32 	%f20, [%r8+12];
	fma.rn.f32 	%f21, %f20, %f19, %f18;
	ld.shared.f32 	%f22, [%r9+528];
	ld.shared.f32 	%f23, [%r8+16];
	fma.rn.f32 	%f24, %f23, %f22, %f21;
	ld.shared.f32 	%f25, [%r9+660];
	ld.shared.f32 	%f26, [%r8+20];
	fma.rn.f32 	%f27, %f26, %f25, %f24;
	ld.shared.f32 	%f28, [%r9+792];
	ld.shared.f32 	%f29, [%r8+24];
	fma.rn.f32 	%f30, %f29, %f28, %f27;
	ld.shared.f32 	%f31, [%r9+924];
	ld.shared.f32 	%f32, [%r8+28];
	fma.rn.f32 	%f33, %f32, %f31, %f30;
	ld.shared.f32 	%f34, [%r9+1056];
	ld.shared.f32 	%f35, [%r8+32];
	fma.rn.f32 	%f36, %f35, %f34, %f33;
	ld.shared.f32 	%f37, [%r9+1188];
	ld.shared.f32 	%f38, [%r8+36];
	fma.rn.f32 	%f39, %f38, %f37, %f36;
	ld.shared.f32 	%f40, [%r9+1320];
	ld.shared.f32 	%f41, [%r8+40];
	fma.rn.f32 	%f42, %f41, %f40, %f39;
	ld.shared.f32 	%f43, [%r9+1452];
	ld.shared.f32 	%f44, [%r8+44];
	fma.rn.f32 	%f45, %f44, %f43, %f42;
	ld.shared.f32 	%f46, [%r9+1584];
	ld.shared.f32 	%f47, [%r8+48];
	fma.rn.f32 	%f48, %f47, %f46, %f45;
	ld.shared.f32 	%f49, [%r9+1716];
	ld.shared.f32 	%f50, [%r8+52];
	fma.rn.f32 	%f51, %f50, %f49, %f48;
	ld.shared.f32 	%f52, [%r9+1848];
	ld.shared.f32 	%f53, [%r8+56];
	fma.rn.f32 	%f54, %f53, %f52, %f51;
	ld.shared.f32 	%f55, [%r9+1980];
	ld.shared.f32 	%f56, [%r8+60];
	fma.rn.f32 	%f57, %f56, %f55, %f54;
	ld.shared.f32 	%f58, [%r9+2112];
	ld.shared.f32 	%f59, [%r8+64];
	fma.rn.f32 	%f60, %f59, %f58, %f57;
	ld.shared.f32 	%f61, [%r9+2244];
	ld.shared.f32 	%f62, [%r8+68];
	fma.rn.f32 	%f63, %f62, %f61, %f60;
	ld.shared.f32 	%f64, [%r9+2376];
	ld.shared.f32 	%f65, [%r8+72];
	fma.rn.f32 	%f66, %f65, %f64, %f63;
	ld.shared.f32 	%f67, [%r9+2508];
	ld.shared.f32 	%f68, [%r8+76];
	fma.rn.f32 	%f69, %f68, %f67, %f66;
	ld.shared.f32 	%f70, [%r9+2640];
	ld.shared.f32 	%f71, [%r8+80];
	fma.rn.f32 	%f72, %f71, %f70, %f69;
	ld.shared.f32 	%f73, [%r9+2772];
	ld.shared.f32 	%f74, [%r8+84];
	fma.rn.f32 	%f75, %f74, %f73, %f72;
	ld.shared.f32 	%f76, [%r9+2904];
	ld.shared.f32 	%f77, [%r8+88];
	fma.rn.f32 	%f78, %f77, %f76, %f75;
	ld.shared.f32 	%f79, [%r9+3036];
	ld.shared.f32 	%f80, [%r8+92];
	fma.rn.f32 	%f81, %f80, %f79, %f78;
	ld.shared.f32 	%f82, [%r9+3168];
	ld.shared.f32 	%f83, [%r8+96];
	fma.rn.f32 	%f84, %f83, %f82, %f81;
	ld.shared.f32 	%f85, [%r9+3300];
	ld.shared.f32 	%f86, [%r8+100];
	fma.rn.f32 	%f87, %f86, %f85, %f84;
	ld.shared.f32 	%f88, [%r9+3432];
	ld.shared.f32 	%f89, [%r8+104];
	fma.rn.f32 	%f90, %f89, %f88, %f87;
	ld.shared.f32 	%f91, [%r9+3564];
	ld.shared.f32 	%f92, [%r8+108];
	fma.rn.f32 	%f93, %f92, %f91, %f90;
	ld.shared.f32 	%f94, [%r9+3696];
	ld.shared.f32 	%f95, [%r8+112];
	fma.rn.f32 	%f96, %f95, %f94, %f93;
	ld.shared.f32 	%f97, [%r9+3828];
	ld.shared.f32 	%f98, [%r8+116];
	fma.rn.f32 	%f99, %f98, %f97, %f96;
	ld.shared.f32 	%f100, [%r9+3960];
	ld.shared.f32 	%f101, [%r8+120];
	fma.rn.f32 	%f102, %f101, %f100, %f99;
	ld.shared.f32 	%f103, [%r9+4092];
	ld.shared.f32 	%f104, [%r8+124];
	fma.rn.f32 	%f109, %f104, %f103, %f102;
	bar.sync 	0;
	add.s32 	%r51, %r51, 32;
	setp.lt.s32	%p4, %r51, %r12;
	@%p4 bra 	BB10_6;

BB10_7:
	mov.u32 	%r43, %tid.y;
	mov.u32 	%r44, %tid.x;
	mad.lo.s32 	%r45, %r43, %r15, %r44;
	cvt.s64.s32	%rd28, %r45;
	shl.b32 	%r47, %r3, 5;
	shl.b32 	%r49, %r1, 5;
	mad.lo.s32 	%r50, %r47, %r15, %r49;
	cvt.u64.u32	%rd29, %r50;
	add.s64 	%rd30, %rd29, %rd28;
	shl.b64 	%rd31, %rd30, 2;
	add.s64 	%rd32, %rd34, %rd31;
	ld.global.f32 	%f105, [%rd32];
	mul.f32 	%f106, %f105, %f5;
	fma.rn.f32 	%f107, %f109, %f4, %f106;
	st.global.f32 	[%rd32], %f107;
	ret;
}

	// .globl	_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i
.visible .entry _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i(
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0,
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1,
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2,
	.param .f32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3,
	.param .u64 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4,
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5,
	.param .u64 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6,
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7,
	.param .f32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8,
	.param .u64 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9,
	.param .u32 _ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10
)
.maxntid 256, 1, 1
{
	.reg .pred 	%p<7>;
	.reg .f32 	%f<141>;
	.reg .b32 	%r<74>;
	.reg .b64 	%rd<59>;
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer[8192];

	ld.param.u32 	%r15, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0];
	ld.param.u32 	%r16, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1];
	ld.param.u32 	%r11, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2];
	ld.param.f32 	%f37, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3];
	ld.param.u64 	%rd20, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4];
	ld.param.u32 	%r12, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5];
	ld.param.u64 	%rd21, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6];
	ld.param.u32 	%r13, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7];
	ld.param.f32 	%f38, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8];
	ld.param.u64 	%rd22, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9];
	ld.param.u32 	%r14, [_ZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10];
	cvta.to.global.u64 	%rd51, %rd22;
	cvta.to.global.u64 	%rd54, %rd21;
	cvta.to.global.u64 	%rd52, %rd20;
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r17, %r1, 1;
	shl.b32 	%r18, %r17, 5;
	sub.s32 	%r2, %r18, %r15;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r19, %r3, 1;
	shl.b32 	%r20, %r19, 5;
	sub.s32 	%r4, %r20, %r16;
	setp.lt.s32	%p2, %r2, 1;
	@%p2 bra 	BB11_2;

	neg.s32 	%r21, %r2;
	mul.wide.s32 	%rd23, %r21, 4;
	add.s64 	%rd52, %rd52, %rd23;
	add.s64 	%rd51, %rd51, %rd23;

BB11_2:
	setp.lt.s32	%p3, %r4, 1;
	@%p3 bra 	BB11_4;

	neg.s32 	%r22, %r13;
	mul.lo.s32 	%r23, %r4, %r22;
	mul.wide.s32 	%rd24, %r23, 4;
	add.s64 	%rd54, %rd54, %rd24;
	neg.s32 	%r24, %r14;
	mul.lo.s32 	%r25, %r4, %r24;
	mul.wide.s32 	%rd25, %r25, 4;
	add.s64 	%rd51, %rd51, %rd25;

BB11_4:
	mov.f32 	%f133, 0f00000000;
	setp.lt.s32	%p4, %r11, 1;
	mov.f32 	%f134, %f133;
	mov.f32 	%f135, %f133;
	mov.f32 	%f136, %f133;
	@%p4 bra 	BB11_11;

	mov.u32 	%r27, %tid.x;
	shl.b32 	%r28, %r1, 5;
	cvt.u64.u32	%rd26, %r28;
	mul.lo.s32 	%r29, %r3, %r13;
	shl.b32 	%r30, %r29, 5;
	cvt.u64.u32	%rd27, %r30;
	shl.b32 	%r31, %r27, 2;
	and.b32  	%r32, %r31, 28;
	shr.s32 	%r33, %r27, 3;
	mad.lo.s32 	%r34, %r33, %r12, %r32;
	cvt.s64.s32	%rd28, %r34;
	add.s64 	%rd29, %rd28, %rd26;
	shl.b64 	%rd30, %rd29, 2;
	add.s64 	%rd31, %rd52, %rd30;
	ld.global.v4.f32 	{%f124, %f125, %f126, %f127}, [%rd31];
	mov.u32 	%r71, 0;
	mad.lo.s32 	%r35, %r33, %r13, %r32;
	cvt.s64.s32	%rd32, %r35;
	add.s64 	%rd33, %rd27, %rd32;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd35, %rd34, %rd54;
	ld.global.f32 	%f119, [%rd35+12];
	mul.wide.u32 	%rd36, %r30, 4;
	add.s64 	%rd56, %rd54, %rd36;
	mul.wide.u32 	%rd37, %r28, 4;
	add.s64 	%rd55, %rd52, %rd37;
	mov.f32 	%f133, 0f00000000;
	mov.f32 	%f134, %f133;
	mov.f32 	%f135, %f133;
	mov.f32 	%f136, %f133;

BB11_6:
	shl.b32 	%r37, %r27, 4;
	mov.u32 	%r72, _ZZN3wuk13gemm_32x32_v5If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer;
	add.s32 	%r39, %r72, %r37;
	st.shared.v4.f32 	[%r39], {%f124, %f125, %f126, %f127};
	shl.b32 	%r41, %r27, 7;
	and.b32  	%r42, %r41, 896;
	add.s32 	%r43, %r42, %r33;
	shl.b32 	%r44, %r43, 2;
	add.s32 	%r45, %r44, %r72;
	st.shared.f32 	[%r45+4096], %f119;
	bar.sync 	0;
	add.s32 	%r71, %r71, 32;
	setp.ge.s32	%p5, %r71, %r11;
	@%p5 bra 	BB11_8;

	shl.b32 	%r46, %r12, 5;
	cvt.s64.s32	%rd38, %r46;
	mul.wide.s32 	%rd39, %r46, 4;
	add.s64 	%rd16, %rd55, %rd39;
	add.s64 	%rd17, %rd56, 128;
	add.s64 	%rd41, %rd28, %rd38;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd55, %rd42;
	ld.global.v4.f32 	{%f124, %f125, %f126, %f127}, [%rd43];
	mul.wide.s32 	%rd44, %r35, 4;
	add.s64 	%rd45, %rd44, %rd56;
	ld.global.f32 	%f119, [%rd45+140];
	mov.u64 	%rd56, %rd17;
	mov.u64 	%rd55, %rd16;

BB11_8:
	mov.u32 	%r73, 0;

BB11_9:
	mad.lo.s32 	%r58, %r32, 4, %r72;
	ld.shared.v4.f32 	{%f55, %f56, %f57, %f58}, [%r58];
	mad.lo.s32 	%r60, %r33, 4, %r72;
	ld.shared.f32 	%f63, [%r60+4096];
	fma.rn.f32 	%f64, %f63, %f55, %f133;
	fma.rn.f32 	%f65, %f63, %f56, %f134;
	fma.rn.f32 	%f66, %f63, %f57, %f135;
	fma.rn.f32 	%f67, %f63, %f58, %f136;
	ld.shared.v4.f32 	{%f68, %f69, %f70, %f71}, [%r58+128];
	ld.shared.f32 	%f76, [%r60+4224];
	fma.rn.f32 	%f77, %f76, %f68, %f64;
	fma.rn.f32 	%f78, %f76, %f69, %f65;
	fma.rn.f32 	%f79, %f76, %f70, %f66;
	fma.rn.f32 	%f80, %f76, %f71, %f67;
	ld.shared.v4.f32 	{%f81, %f82, %f83, %f84}, [%r58+256];
	ld.shared.f32 	%f89, [%r60+4352];
	fma.rn.f32 	%f90, %f89, %f81, %f77;
	fma.rn.f32 	%f91, %f89, %f82, %f78;
	fma.rn.f32 	%f92, %f89, %f83, %f79;
	fma.rn.f32 	%f93, %f89, %f84, %f80;
	ld.shared.v4.f32 	{%f94, %f95, %f96, %f97}, [%r58+384];
	ld.shared.f32 	%f102, [%r60+4480];
	fma.rn.f32 	%f133, %f102, %f94, %f90;
	fma.rn.f32 	%f134, %f102, %f95, %f91;
	fma.rn.f32 	%f135, %f102, %f96, %f92;
	fma.rn.f32 	%f136, %f102, %f97, %f93;
	add.s32 	%r72, %r72, 512;
	add.s32 	%r73, %r73, 4;
	setp.lt.s32	%p6, %r73, 32;
	@%p6 bra 	BB11_9;

	setp.lt.s32	%p1, %r71, %r11;
	bar.sync 	0;
	@%p1 bra 	BB11_6;

BB11_11:
	mov.u32 	%r61, %tid.x;
	shr.s32 	%r62, %r61, 3;
	shl.b32 	%r63, %r61, 2;
	and.b32  	%r64, %r63, 28;
	mad.lo.s32 	%r65, %r62, %r14, %r64;
	cvt.s64.s32	%rd46, %r65;
	shl.b32 	%r67, %r3, 5;
	shl.b32 	%r69, %r1, 5;
	mad.lo.s32 	%r70, %r67, %r14, %r69;
	cvt.u64.u32	%rd47, %r70;
	add.s64 	%rd48, %rd47, %rd46;
	shl.b64 	%rd49, %rd48, 2;
	add.s64 	%rd50, %rd51, %rd49;
	ld.global.v4.f32 	{%f103, %f104, %f105, %f106}, [%rd50];
	mul.f32 	%f111, %f103, %f38;
	mul.f32 	%f112, %f104, %f38;
	mul.f32 	%f113, %f105, %f38;
	mul.f32 	%f114, %f106, %f38;
	fma.rn.f32 	%f115, %f136, %f37, %f114;
	fma.rn.f32 	%f116, %f135, %f37, %f113;
	fma.rn.f32 	%f117, %f134, %f37, %f112;
	fma.rn.f32 	%f118, %f133, %f37, %f111;
	st.global.v4.f32 	[%rd50], {%f118, %f117, %f116, %f115};
	ret;
}

	// .globl	_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i
.visible .entry _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i(
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0,
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1,
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2,
	.param .f32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3,
	.param .u64 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4,
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5,
	.param .u64 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6,
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7,
	.param .f32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8,
	.param .u64 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9,
	.param .u32 _ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10
)
.maxntid 256, 1, 1
{
	.reg .pred 	%p<7>;
	.reg .b16 	%rs<3>;
	.reg .f32 	%f<323>;
	.reg .b32 	%r<82>;
	.reg .b64 	%rd<63>;
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer[8192];

	ld.param.u32 	%r19, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0];
	ld.param.u32 	%r20, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1];
	ld.param.u32 	%r15, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2];
	ld.param.f32 	%f97, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3];
	ld.param.u64 	%rd20, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4];
	ld.param.u32 	%r16, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5];
	ld.param.u64 	%rd21, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6];
	ld.param.u32 	%r17, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7];
	ld.param.f32 	%f98, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8];
	ld.param.u64 	%rd22, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9];
	ld.param.u32 	%r18, [_ZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10];
	cvta.to.global.u64 	%rd55, %rd22;
	cvta.to.global.u64 	%rd58, %rd21;
	cvta.to.global.u64 	%rd56, %rd20;
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r21, %r1, 1;
	shl.b32 	%r22, %r21, 6;
	sub.s32 	%r2, %r22, %r19;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r23, %r3, 1;
	shl.b32 	%r24, %r23, 6;
	sub.s32 	%r4, %r24, %r20;
	setp.lt.s32	%p2, %r2, 1;
	@%p2 bra 	BB12_2;

	neg.s32 	%r25, %r2;
	mul.wide.s32 	%rd23, %r25, 4;
	add.s64 	%rd56, %rd56, %rd23;
	add.s64 	%rd55, %rd55, %rd23;

BB12_2:
	setp.lt.s32	%p3, %r4, 1;
	@%p3 bra 	BB12_4;

	neg.s32 	%r26, %r17;
	mul.lo.s32 	%r27, %r4, %r26;
	mul.wide.s32 	%rd24, %r27, 4;
	add.s64 	%rd58, %rd58, %rd24;
	neg.s32 	%r28, %r18;
	mul.lo.s32 	%r29, %r4, %r28;
	mul.wide.s32 	%rd25, %r29, 4;
	add.s64 	%rd55, %rd55, %rd25;

BB12_4:
	mov.f32 	%f291, 0f00000000;
	setp.lt.s32	%p4, %r15, 1;
	mov.f32 	%f292, %f291;
	mov.f32 	%f293, %f291;
	mov.f32 	%f294, %f291;
	mov.f32 	%f295, %f291;
	mov.f32 	%f296, %f291;
	mov.f32 	%f297, %f291;
	mov.f32 	%f298, %f291;
	mov.f32 	%f299, %f291;
	mov.f32 	%f300, %f291;
	mov.f32 	%f301, %f291;
	mov.f32 	%f302, %f291;
	mov.f32 	%f303, %f291;
	mov.f32 	%f304, %f291;
	mov.f32 	%f305, %f291;
	mov.f32 	%f306, %f291;
	@%p4 bra 	BB12_11;

	mov.u32 	%r31, %tid.x;
	shl.b32 	%r32, %r1, 6;
	cvt.u64.u32	%rd26, %r32;
	mul.lo.s32 	%r33, %r3, %r17;
	shl.b32 	%r34, %r33, 6;
	cvt.u64.u32	%rd27, %r34;
	shl.b32 	%r35, %r31, 2;
	shr.s32 	%r36, %r31, 4;
	and.b32  	%r37, %r35, 60;
	mad.lo.s32 	%r38, %r36, %r16, %r37;
	cvt.s64.s32	%rd28, %r38;
	add.s64 	%rd29, %rd28, %rd26;
	shl.b64 	%rd30, %rd29, 2;
	add.s64 	%rd31, %rd56, %rd30;
	ld.global.v4.f32 	{%f279, %f280, %f281, %f282}, [%rd31];
	and.b32  	%r39, %r35, 12;
	shr.s32 	%r40, %r31, 2;
	mad.lo.s32 	%r41, %r40, %r17, %r39;
	cvt.s64.s32	%rd32, %r41;
	add.s64 	%rd33, %rd27, %rd32;
	shl.b64 	%rd34, %rd33, 2;
	add.s64 	%rd35, %rd58, %rd34;
	ld.global.v4.f32 	{%f266, %f265, %f264, %f263}, [%rd35];
	mov.u32 	%r78, 0;
	mul.wide.u32 	%rd36, %r34, 4;
	add.s64 	%rd60, %rd58, %rd36;
	mul.wide.u32 	%rd37, %r32, 4;
	add.s64 	%rd59, %rd56, %rd37;
	mov.f32 	%f291, 0f00000000;
	mov.f32 	%f292, %f291;
	mov.f32 	%f293, %f291;
	mov.f32 	%f294, %f291;
	mov.f32 	%f295, %f291;
	mov.f32 	%f296, %f291;
	mov.f32 	%f297, %f291;
	mov.f32 	%f298, %f291;
	mov.f32 	%f299, %f291;
	mov.f32 	%f300, %f291;
	mov.f32 	%f301, %f291;
	mov.f32 	%f302, %f291;
	mov.f32 	%f303, %f291;
	mov.f32 	%f304, %f291;
	mov.f32 	%f305, %f291;
	mov.f32 	%f306, %f291;

BB12_6:
	// Main loop of the 64x64 kernel: the k index %r78 advances by 16 per iteration. The tiles already held in registers are written to s_buffer; if another iteration remains, the next tiles are loaded from global memory before the FMAs below.
	shl.b32 	%r43, %r31, 4;
	mov.u32 	%r44, _ZZN3wuk10gemm_64x64If6float4Li0EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer;
	add.s32 	%r45, %r44, %r43;
	st.shared.v4.f32 	[%r45], {%f279, %f280, %f281, %f282};
	shr.u32 	%r46, %r31, 2;
	shl.b32 	%r47, %r31, 8;
	and.b32  	%r48, %r47, 768;
	add.s32 	%r49, %r48, %r46;
	shl.b32 	%r50, %r49, 2;
	add.s32 	%r51, %r50, %r44;
	st.shared.f32 	[%r51+4096], %f266;
	st.shared.f32 	[%r51+4352], %f265;
	st.shared.f32 	[%r51+4608], %f264;
	st.shared.f32 	[%r51+4864], %f263;
	bar.sync 	0;
	add.s32 	%r78, %r78, 16;
	setp.ge.s32	%p5, %r78, %r15;
	@%p5 bra 	BB12_8;

	shl.b32 	%r52, %r16, 4;
	cvt.s64.s32	%rd38, %r52;
	mul.wide.s32 	%rd39, %r52, 4;
	add.s64 	%rd16, %rd59, %rd39;
	add.s64 	%rd17, %rd60, 64;
	add.s64 	%rd41, %rd28, %rd38;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd59, %rd42;
	ld.global.v4.f32 	{%f279, %f280, %f281, %f282}, [%rd43];
	mul.wide.s32 	%rd44, %r41, 4;
	add.s64 	%rd45, %rd44, %rd60;
	ld.global.v4.f32 	{%f266, %f265, %f264, %f263}, [%rd45+64];
	mov.u64 	%rd60, %rd17;
	mov.u64 	%rd59, %rd16;

BB12_8:
	cvt.u16.u32	%rs1, %r35;
	and.b16  	%rs2, %rs1, 60;
	mul.wide.u16 	%r64, %rs2, 4;
	add.s32 	%r79, %r44, %r64;
	and.b32  	%r66, %r31, -16;
	add.s32 	%r80, %r44, %r66;
	mov.u32 	%r81, 0;

BB12_9:
	// Inner loop, unrolled by 2: per k-step each thread loads one float4 from each half of s_buffer and accumulates a 4x4 outer product (16 FMAs).
	ld.shared.v4.f32 	{%f147, %f148, %f149, %f150}, [%r79];
	ld.shared.v4.f32 	{%f155, %f156, %f157, %f158}, [%r80+4096];
	fma.rn.f32 	%f163, %f147, %f155, %f294;
	fma.rn.f32 	%f164, %f147, %f156, %f293;
	fma.rn.f32 	%f165, %f147, %f157, %f292;
	fma.rn.f32 	%f166, %f147, %f158, %f291;
	fma.rn.f32 	%f167, %f148, %f155, %f295;
	fma.rn.f32 	%f168, %f148, %f156, %f296;
	fma.rn.f32 	%f169, %f148, %f157, %f297;
	fma.rn.f32 	%f170, %f148, %f158, %f298;
	fma.rn.f32 	%f171, %f149, %f155, %f299;
	fma.rn.f32 	%f172, %f149, %f156, %f300;
	fma.rn.f32 	%f173, %f149, %f157, %f301;
	fma.rn.f32 	%f174, %f149, %f158, %f302;
	fma.rn.f32 	%f175, %f150, %f155, %f303;
	fma.rn.f32 	%f176, %f150, %f156, %f304;
	fma.rn.f32 	%f177, %f150, %f157, %f305;
	fma.rn.f32 	%f178, %f150, %f158, %f306;
	ld.shared.v4.f32 	{%f179, %f180, %f181, %f182}, [%r79+256];
	ld.shared.v4.f32 	{%f187, %f188, %f189, %f190}, [%r80+4352];
	fma.rn.f32 	%f294, %f179, %f187, %f163;
	fma.rn.f32 	%f293, %f179, %f188, %f164;
	fma.rn.f32 	%f292, %f179, %f189, %f165;
	fma.rn.f32 	%f291, %f179, %f190, %f166;
	fma.rn.f32 	%f295, %f180, %f187, %f167;
	fma.rn.f32 	%f296, %f180, %f188, %f168;
	fma.rn.f32 	%f297, %f180, %f189, %f169;
	fma.rn.f32 	%f298, %f180, %f190, %f170;
	fma.rn.f32 	%f299, %f181, %f187, %f171;
	fma.rn.f32 	%f300, %f181, %f188, %f172;
	fma.rn.f32 	%f301, %f181, %f189, %f173;
	fma.rn.f32 	%f302, %f181, %f190, %f174;
	fma.rn.f32 	%f303, %f182, %f187, %f175;
	fma.rn.f32 	%f304, %f182, %f188, %f176;
	fma.rn.f32 	%f305, %f182, %f189, %f177;
	fma.rn.f32 	%f306, %f182, %f190, %f178;
	add.s32 	%r80, %r80, 512;
	add.s32 	%r79, %r79, 512;
	add.s32 	%r81, %r81, 2;
	setp.lt.s32	%p6, %r81, 16;
	@%p6 bra 	BB12_9;

	setp.lt.s32	%p1, %r78, %r15;
	bar.sync 	0;
	@%p1 bra 	BB12_6;

BB12_11:
	mov.u32 	%r67, %tid.x;
	shr.s32 	%r68, %r67, 4;
	shl.b32 	%r69, %r68, 2;
	shl.b32 	%r70, %r67, 2;
	and.b32  	%r71, %r70, 60;
	mad.lo.s32 	%r72, %r69, %r18, %r71;
	cvt.s64.s32	%rd46, %r72;
	shl.b32 	%r74, %r3, 6;
	shl.b32 	%r76, %r1, 6;
	mad.lo.s32 	%r77, %r74, %r18, %r76;
	cvt.u64.u32	%rd47, %r77;
	add.s64 	%rd48, %rd46, %rd47;
	shl.b64 	%rd49, %rd48, 2;
	add.s64 	%rd50, %rd55, %rd49;
	ld.global.v4.f32 	{%f195, %f196, %f197, %f198}, [%rd50];
	mul.f32 	%f203, %f195, %f98;
	mul.f32 	%f204, %f196, %f98;
	mul.f32 	%f205, %f197, %f98;
	mul.f32 	%f206, %f198, %f98;
	fma.rn.f32 	%f207, %f291, %f97, %f206;
	fma.rn.f32 	%f208, %f292, %f97, %f205;
	fma.rn.f32 	%f209, %f293, %f97, %f204;
	fma.rn.f32 	%f210, %f294, %f97, %f203;
	st.global.v4.f32 	[%rd50], {%f210, %f209, %f208, %f207};
	mul.wide.s32 	%rd51, %r18, 4;
	add.s64 	%rd52, %rd50, %rd51;
	ld.global.v4.f32 	{%f211, %f212, %f213, %f214}, [%rd52];
	mul.f32 	%f219, %f211, %f98;
	mul.f32 	%f220, %f212, %f98;
	mul.f32 	%f221, %f213, %f98;
	mul.f32 	%f222, %f214, %f98;
	fma.rn.f32 	%f223, %f298, %f97, %f222;
	fma.rn.f32 	%f224, %f297, %f97, %f221;
	fma.rn.f32 	%f225, %f296, %f97, %f220;
	fma.rn.f32 	%f226, %f295, %f97, %f219;
	st.global.v4.f32 	[%rd52], {%f226, %f225, %f224, %f223};
	add.s64 	%rd53, %rd52, %rd51;
	ld.global.v4.f32 	{%f227, %f228, %f229, %f230}, [%rd53];
	mul.f32 	%f235, %f227, %f98;
	mul.f32 	%f236, %f228, %f98;
	mul.f32 	%f237, %f229, %f98;
	mul.f32 	%f238, %f230, %f98;
	fma.rn.f32 	%f239, %f302, %f97, %f238;
	fma.rn.f32 	%f240, %f301, %f97, %f237;
	fma.rn.f32 	%f241, %f300, %f97, %f236;
	fma.rn.f32 	%f242, %f299, %f97, %f235;
	st.global.v4.f32 	[%rd53], {%f242, %f241, %f240, %f239};
	add.s64 	%rd54, %rd53, %rd51;
	ld.global.v4.f32 	{%f243, %f244, %f245, %f246}, [%rd54];
	mul.f32 	%f251, %f243, %f98;
	mul.f32 	%f252, %f244, %f98;
	mul.f32 	%f253, %f245, %f98;
	mul.f32 	%f254, %f246, %f98;
	fma.rn.f32 	%f255, %f306, %f97, %f254;
	fma.rn.f32 	%f256, %f305, %f97, %f253;
	fma.rn.f32 	%f257, %f304, %f97, %f252;
	fma.rn.f32 	%f258, %f303, %f97, %f251;
	st.global.v4.f32 	[%rd54], {%f258, %f257, %f256, %f255};
	ret;
}

	// .globl	_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i
.visible .entry _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i(
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0,
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1,
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2,
	.param .f32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3,
	.param .u64 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4,
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5,
	.param .u64 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6,
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7,
	.param .f32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8,
	.param .u64 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9,
	.param .u32 _ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10
)
.maxntid 256, 1, 1
{
	.reg .pred 	%p<8>;
	.reg .f32 	%f<1221>;
	.reg .b32 	%r<95>;
	.reg .b64 	%rd<76>;
	// demoted variable
	.shared .align 4 .b8 _ZZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer[8192];

	ld.param.u32 	%r15, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_0];
	ld.param.u32 	%r16, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_1];
	ld.param.u32 	%r11, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_2];
	ld.param.u64 	%rd20, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_4];
	ld.param.u32 	%r12, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5];
	ld.param.u64 	%rd21, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_6];
	ld.param.u32 	%r13, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_7];
	ld.param.u64 	%rd22, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_9];
	ld.param.u32 	%r14, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10];
	cvta.to.global.u64 	%rd68, %rd22;
	cvta.to.global.u64 	%rd71, %rd21;
	cvta.to.global.u64 	%rd69, %rd20;
	mov.u32 	%r1, %ctaid.x;
	add.s32 	%r17, %r1, 1;
	shl.b32 	%r18, %r17, 7;
	sub.s32 	%r2, %r18, %r15;
	mov.u32 	%r3, %ctaid.y;
	add.s32 	%r19, %r3, 1;
	shl.b32 	%r20, %r19, 7;
	sub.s32 	%r4, %r20, %r16;
	// %r2 = (ctaid.x + 1) * 128 - param_0 and %r4 = (ctaid.y + 1) * 128 - param_1 measure how far this block's 128x128 tile overshoots the matrix edge. When positive, the base pointers are moved back by that amount below (param_4 and param_9 by %r2 elements; param_6 and param_9 by %r4 columns), so the main loop needs no bounds checks.
	setp.lt.s32	%p2, %r2, 1;
	@%p2 bra 	BB13_2;

	neg.s32 	%r21, %r2;
	mul.wide.s32 	%rd23, %r21, 4;
	add.s64 	%rd69, %rd69, %rd23;
	add.s64 	%rd68, %rd68, %rd23;

BB13_2:
	setp.lt.s32	%p3, %r4, 1;
	@%p3 bra 	BB13_4;

	neg.s32 	%r22, %r13;
	mul.lo.s32 	%r23, %r4, %r22;
	mul.wide.s32 	%rd24, %r23, 4;
	add.s64 	%rd71, %rd71, %rd24;
	neg.s32 	%r24, %r14;
	mul.lo.s32 	%r25, %r4, %r24;
	mul.wide.s32 	%rd25, %r25, 4;
	add.s64 	%rd68, %rd68, %rd25;

BB13_4:
	mov.f32 	%f1093, 0f00000000;
	setp.lt.s32	%p4, %r11, 1;
	mov.f32 	%f1094, %f1093;
	mov.f32 	%f1095, %f1093;
	mov.f32 	%f1096, %f1093;
	mov.f32 	%f1097, %f1093;
	mov.f32 	%f1098, %f1093;
	mov.f32 	%f1099, %f1093;
	mov.f32 	%f1100, %f1093;
	mov.f32 	%f1101, %f1093;
	mov.f32 	%f1102, %f1093;
	mov.f32 	%f1103, %f1093;
	mov.f32 	%f1104, %f1093;
	mov.f32 	%f1105, %f1093;
	mov.f32 	%f1106, %f1093;
	mov.f32 	%f1107, %f1093;
	mov.f32 	%f1108, %f1093;
	mov.f32 	%f1109, %f1093;
	mov.f32 	%f1110, %f1093;
	mov.f32 	%f1111, %f1093;
	mov.f32 	%f1112, %f1093;
	mov.f32 	%f1113, %f1093;
	mov.f32 	%f1114, %f1093;
	mov.f32 	%f1115, %f1093;
	mov.f32 	%f1116, %f1093;
	mov.f32 	%f1117, %f1093;
	mov.f32 	%f1118, %f1093;
	mov.f32 	%f1119, %f1093;
	mov.f32 	%f1120, %f1093;
	mov.f32 	%f1121, %f1093;
	mov.f32 	%f1122, %f1093;
	mov.f32 	%f1123, %f1093;
	mov.f32 	%f1124, %f1093;
	mov.f32 	%f1125, %f1093;
	mov.f32 	%f1126, %f1093;
	mov.f32 	%f1127, %f1093;
	mov.f32 	%f1128, %f1093;
	mov.f32 	%f1129, %f1093;
	mov.f32 	%f1130, %f1093;
	mov.f32 	%f1131, %f1093;
	mov.f32 	%f1132, %f1093;
	mov.f32 	%f1133, %f1093;
	mov.f32 	%f1134, %f1093;
	mov.f32 	%f1135, %f1093;
	mov.f32 	%f1136, %f1093;
	mov.f32 	%f1137, %f1093;
	mov.f32 	%f1138, %f1093;
	mov.f32 	%f1139, %f1093;
	mov.f32 	%f1140, %f1093;
	mov.f32 	%f1141, %f1093;
	mov.f32 	%f1142, %f1093;
	mov.f32 	%f1143, %f1093;
	mov.f32 	%f1144, %f1093;
	mov.f32 	%f1145, %f1093;
	mov.f32 	%f1146, %f1093;
	mov.f32 	%f1147, %f1093;
	mov.f32 	%f1148, %f1093;
	mov.f32 	%f1149, %f1093;
	mov.f32 	%f1150, %f1093;
	mov.f32 	%f1151, %f1093;
	mov.f32 	%f1152, %f1093;
	mov.f32 	%f1153, %f1093;
	mov.f32 	%f1154, %f1093;
	mov.f32 	%f1155, %f1093;
	mov.f32 	%f1156, %f1093;
	@%p4 bra 	BB13_11;

	mov.u32 	%r27, %tid.x;
	and.b32  	%r28, %r27, 1;
	shr.s32 	%r29, %r27, 1;
	shl.b32 	%r30, %r1, 7;
	mul.wide.u32 	%rd26, %r30, 4;
	add.s64 	%rd72, %rd69, %rd26;
	mul.lo.s32 	%r31, %r3, %r13;
	shl.b32 	%r32, %r31, 7;
	mul.wide.u32 	%rd27, %r32, 4;
	add.s64 	%rd73, %rd71, %rd27;
	and.b32  	%r33, %r27, 31;
	shr.s32 	%r34, %r27, 5;
	shr.s32 	%r35, %r12, 2;
	mad.lo.s32 	%r36, %r34, %r35, %r33;
	mul.wide.s32 	%rd28, %r36, 16;
	add.s64 	%rd29, %rd72, %rd28;
	ld.global.v4.f32 	{%f1081, %f1082, %f1083, %f1084}, [%rd29];
	shr.s32 	%r37, %r13, 2;
	mad.lo.s32 	%r38, %r29, %r37, %r28;
	mul.wide.s32 	%rd30, %r38, 16;
	add.s64 	%rd31, %rd73, %rd30;
	ld.global.v4.f32 	{%f1060, %f1059, %f1058, %f1057}, [%rd31];
	mov.u32 	%r92, 0;
	mov.f32 	%f1093, 0f00000000;
	mov.f32 	%f1094, %f1093;
	mov.f32 	%f1095, %f1093;
	mov.f32 	%f1096, %f1093;
	mov.f32 	%f1097, %f1093;
	mov.f32 	%f1098, %f1093;
	mov.f32 	%f1099, %f1093;
	mov.f32 	%f1100, %f1093;
	mov.f32 	%f1101, %f1093;
	mov.f32 	%f1102, %f1093;
	mov.f32 	%f1103, %f1093;
	mov.f32 	%f1104, %f1093;
	mov.f32 	%f1105, %f1093;
	mov.f32 	%f1106, %f1093;
	mov.f32 	%f1107, %f1093;
	mov.f32 	%f1108, %f1093;
	mov.f32 	%f1109, %f1093;
	mov.f32 	%f1110, %f1093;
	mov.f32 	%f1111, %f1093;
	mov.f32 	%f1112, %f1093;
	mov.f32 	%f1113, %f1093;
	mov.f32 	%f1114, %f1093;
	mov.f32 	%f1115, %f1093;
	mov.f32 	%f1116, %f1093;
	mov.f32 	%f1117, %f1093;
	mov.f32 	%f1118, %f1093;
	mov.f32 	%f1119, %f1093;
	mov.f32 	%f1120, %f1093;
	mov.f32 	%f1121, %f1093;
	mov.f32 	%f1122, %f1093;
	mov.f32 	%f1123, %f1093;
	mov.f32 	%f1124, %f1093;
	mov.f32 	%f1125, %f1093;
	mov.f32 	%f1126, %f1093;
	mov.f32 	%f1127, %f1093;
	mov.f32 	%f1128, %f1093;
	mov.f32 	%f1129, %f1093;
	mov.f32 	%f1130, %f1093;
	mov.f32 	%f1131, %f1093;
	mov.f32 	%f1132, %f1093;
	mov.f32 	%f1133, %f1093;
	mov.f32 	%f1134, %f1093;
	mov.f32 	%f1135, %f1093;
	mov.f32 	%f1136, %f1093;
	mov.f32 	%f1137, %f1093;
	mov.f32 	%f1138, %f1093;
	mov.f32 	%f1139, %f1093;
	mov.f32 	%f1140, %f1093;
	mov.f32 	%f1141, %f1093;
	mov.f32 	%f1142, %f1093;
	mov.f32 	%f1143, %f1093;
	mov.f32 	%f1144, %f1093;
	mov.f32 	%f1145, %f1093;
	mov.f32 	%f1146, %f1093;
	mov.f32 	%f1147, %f1093;
	mov.f32 	%f1148, %f1093;
	mov.f32 	%f1149, %f1093;
	mov.f32 	%f1150, %f1093;
	mov.f32 	%f1151, %f1093;
	mov.f32 	%f1152, %f1093;
	mov.f32 	%f1153, %f1093;
	mov.f32 	%f1154, %f1093;
	mov.f32 	%f1155, %f1093;
	mov.f32 	%f1156, %f1093;

BB13_6:
	// Main loop of the 128x128 kernel: the k index %r92 advances by 8 per iteration. The tiles prefetched into registers are stored to the 8192-byte s_buffer (first tile at offset 0, second at offset 4096); if another iteration remains, the next global loads are issued before the FMAs so they can overlap with computation.
	shl.b32 	%r40, %r27, 4;
	mov.u32 	%r41, _ZZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer;
	add.s32 	%r42, %r41, %r40;
	st.shared.v4.f32 	[%r42], {%f1081, %f1082, %f1083, %f1084};
	shr.u32 	%r43, %r27, 1;
	shl.b32 	%r45, %r28, 9;
	add.s32 	%r46, %r45, %r43;
	shl.b32 	%r47, %r46, 2;
	add.s32 	%r48, %r47, %r41;
	st.shared.f32 	[%r48+4096], %f1060;
	st.shared.f32 	[%r48+4608], %f1059;
	st.shared.f32 	[%r48+5120], %f1058;
	st.shared.f32 	[%r48+5632], %f1057;
	bar.sync 	0;
	add.s32 	%r6, %r92, 8;
	setp.ge.s32	%p5, %r6, %r11;
	@%p5 bra 	BB13_8;

	ld.param.u32 	%r89, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_5];
	shl.b32 	%r49, %r89, 3;
	mul.wide.s32 	%rd32, %r49, 4;
	add.s64 	%rd72, %rd72, %rd32;
	add.s64 	%rd34, %rd72, %rd28;
	ld.global.v4.f32 	{%f1081, %f1082, %f1083, %f1084}, [%rd34];
	add.s64 	%rd73, %rd73, 32;
	add.s64 	%rd36, %rd73, %rd30;
	ld.global.v4.f32 	{%f1060, %f1059, %f1058, %f1057}, [%rd36];

BB13_8:
	mov.u32 	%r93, _ZZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_iE8s_buffer;
	mov.u32 	%r94, 0;

BB13_9:
	.pragma "nounroll";
	// Each pass of this loop covers 4 k-steps (%r94 += 4, two passes per tile). Per k-step a thread loads two float4 from each half of s_buffer and accumulates an 8x8 outer product (64 FMAs).
	and.b32  	%r62, %r27, 15;
	mad.lo.s32 	%r63, %r62, 16, %r93;
	ld.shared.v4.f32 	{%f435, %f436, %f437, %f438}, [%r63];
	ld.shared.v4.f32 	{%f443, %f444, %f445, %f446}, [%r63+256];
	shr.s32 	%r64, %r27, 4;
	mad.lo.s32 	%r65, %r64, 16, %r93;
	ld.shared.v4.f32 	{%f451, %f452, %f453, %f454}, [%r65+4096];
	ld.shared.v4.f32 	{%f459, %f460, %f461, %f462}, [%r65+4352];
	fma.rn.f32 	%f467, %f435, %f451, %f1136;
	fma.rn.f32 	%f468, %f436, %f451, %f1135;
	fma.rn.f32 	%f469, %f437, %f451, %f1134;
	fma.rn.f32 	%f470, %f438, %f451, %f1133;
	fma.rn.f32 	%f471, %f435, %f452, %f1132;
	fma.rn.f32 	%f472, %f436, %f452, %f1131;
	fma.rn.f32 	%f473, %f437, %f452, %f1130;
	fma.rn.f32 	%f474, %f438, %f452, %f1129;
	fma.rn.f32 	%f475, %f435, %f453, %f1128;
	fma.rn.f32 	%f476, %f436, %f453, %f1127;
	fma.rn.f32 	%f477, %f437, %f453, %f1126;
	fma.rn.f32 	%f478, %f438, %f453, %f1125;
	fma.rn.f32 	%f479, %f435, %f454, %f1124;
	fma.rn.f32 	%f480, %f436, %f454, %f1123;
	fma.rn.f32 	%f481, %f437, %f454, %f1122;
	fma.rn.f32 	%f482, %f438, %f454, %f1121;
	fma.rn.f32 	%f483, %f435, %f459, %f1120;
	fma.rn.f32 	%f484, %f436, %f459, %f1119;
	fma.rn.f32 	%f485, %f437, %f459, %f1118;
	fma.rn.f32 	%f486, %f438, %f459, %f1117;
	fma.rn.f32 	%f487, %f435, %f460, %f1116;
	fma.rn.f32 	%f488, %f436, %f460, %f1115;
	fma.rn.f32 	%f489, %f437, %f460, %f1114;
	fma.rn.f32 	%f490, %f438, %f460, %f1113;
	fma.rn.f32 	%f491, %f435, %f461, %f1112;
	fma.rn.f32 	%f492, %f436, %f461, %f1111;
	fma.rn.f32 	%f493, %f437, %f461, %f1110;
	fma.rn.f32 	%f494, %f438, %f461, %f1109;
	fma.rn.f32 	%f495, %f435, %f462, %f1108;
	fma.rn.f32 	%f496, %f436, %f462, %f1107;
	fma.rn.f32 	%f497, %f437, %f462, %f1106;
	fma.rn.f32 	%f498, %f438, %f462, %f1105;
	fma.rn.f32 	%f499, %f443, %f451, %f1104;
	fma.rn.f32 	%f500, %f444, %f451, %f1103;
	fma.rn.f32 	%f501, %f445, %f451, %f1102;
	fma.rn.f32 	%f502, %f446, %f451, %f1101;
	fma.rn.f32 	%f503, %f443, %f452, %f1100;
	fma.rn.f32 	%f504, %f444, %f452, %f1099;
	fma.rn.f32 	%f505, %f445, %f452, %f1098;
	fma.rn.f32 	%f506, %f446, %f452, %f1097;
	fma.rn.f32 	%f507, %f443, %f453, %f1096;
	fma.rn.f32 	%f508, %f444, %f453, %f1095;
	fma.rn.f32 	%f509, %f445, %f453, %f1094;
	fma.rn.f32 	%f510, %f446, %f453, %f1093;
	fma.rn.f32 	%f511, %f443, %f454, %f1137;
	fma.rn.f32 	%f512, %f444, %f454, %f1138;
	fma.rn.f32 	%f513, %f445, %f454, %f1139;
	fma.rn.f32 	%f514, %f446, %f454, %f1140;
	fma.rn.f32 	%f515, %f443, %f459, %f1141;
	fma.rn.f32 	%f516, %f444, %f459, %f1142;
	fma.rn.f32 	%f517, %f445, %f459, %f1143;
	fma.rn.f32 	%f518, %f446, %f459, %f1144;
	fma.rn.f32 	%f519, %f443, %f460, %f1145;
	fma.rn.f32 	%f520, %f444, %f460, %f1146;
	fma.rn.f32 	%f521, %f445, %f460, %f1147;
	fma.rn.f32 	%f522, %f446, %f460, %f1148;
	fma.rn.f32 	%f523, %f443, %f461, %f1149;
	fma.rn.f32 	%f524, %f444, %f461, %f1150;
	fma.rn.f32 	%f525, %f445, %f461, %f1151;
	fma.rn.f32 	%f526, %f446, %f461, %f1152;
	fma.rn.f32 	%f527, %f443, %f462, %f1153;
	fma.rn.f32 	%f528, %f444, %f462, %f1154;
	fma.rn.f32 	%f529, %f445, %f462, %f1155;
	fma.rn.f32 	%f530, %f446, %f462, %f1156;
	ld.shared.v4.f32 	{%f531, %f532, %f533, %f534}, [%r63+512];
	ld.shared.v4.f32 	{%f539, %f540, %f541, %f542}, [%r63+768];
	ld.shared.v4.f32 	{%f547, %f548, %f549, %f550}, [%r65+4608];
	ld.shared.v4.f32 	{%f555, %f556, %f557, %f558}, [%r65+4864];
	fma.rn.f32 	%f563, %f531, %f547, %f467;
	fma.rn.f32 	%f564, %f532, %f547, %f468;
	fma.rn.f32 	%f565, %f533, %f547, %f469;
	fma.rn.f32 	%f566, %f534, %f547, %f470;
	fma.rn.f32 	%f567, %f531, %f548, %f471;
	fma.rn.f32 	%f568, %f532, %f548, %f472;
	fma.rn.f32 	%f569, %f533, %f548, %f473;
	fma.rn.f32 	%f570, %f534, %f548, %f474;
	fma.rn.f32 	%f571, %f531, %f549, %f475;
	fma.rn.f32 	%f572, %f532, %f549, %f476;
	fma.rn.f32 	%f573, %f533, %f549, %f477;
	fma.rn.f32 	%f574, %f534, %f549, %f478;
	fma.rn.f32 	%f575, %f531, %f550, %f479;
	fma.rn.f32 	%f576, %f532, %f550, %f480;
	fma.rn.f32 	%f577, %f533, %f550, %f481;
	fma.rn.f32 	%f578, %f534, %f550, %f482;
	fma.rn.f32 	%f579, %f531, %f555, %f483;
	fma.rn.f32 	%f580, %f532, %f555, %f484;
	fma.rn.f32 	%f581, %f533, %f555, %f485;
	fma.rn.f32 	%f582, %f534, %f555, %f486;
	fma.rn.f32 	%f583, %f531, %f556, %f487;
	fma.rn.f32 	%f584, %f532, %f556, %f488;
	fma.rn.f32 	%f585, %f533, %f556, %f489;
	fma.rn.f32 	%f586, %f534, %f556, %f490;
	fma.rn.f32 	%f587, %f531, %f557, %f491;
	fma.rn.f32 	%f588, %f532, %f557, %f492;
	fma.rn.f32 	%f589, %f533, %f557, %f493;
	fma.rn.f32 	%f590, %f534, %f557, %f494;
	fma.rn.f32 	%f591, %f531, %f558, %f495;
	fma.rn.f32 	%f592, %f532, %f558, %f496;
	fma.rn.f32 	%f593, %f533, %f558, %f497;
	fma.rn.f32 	%f594, %f534, %f558, %f498;
	fma.rn.f32 	%f595, %f539, %f547, %f499;
	fma.rn.f32 	%f596, %f540, %f547, %f500;
	fma.rn.f32 	%f597, %f541, %f547, %f501;
	fma.rn.f32 	%f598, %f542, %f547, %f502;
	fma.rn.f32 	%f599, %f539, %f548, %f503;
	fma.rn.f32 	%f600, %f540, %f548, %f504;
	fma.rn.f32 	%f601, %f541, %f548, %f505;
	fma.rn.f32 	%f602, %f542, %f548, %f506;
	fma.rn.f32 	%f603, %f539, %f549, %f507;
	fma.rn.f32 	%f604, %f540, %f549, %f508;
	fma.rn.f32 	%f605, %f541, %f549, %f509;
	fma.rn.f32 	%f606, %f542, %f549, %f510;
	fma.rn.f32 	%f607, %f539, %f550, %f511;
	fma.rn.f32 	%f608, %f540, %f550, %f512;
	fma.rn.f32 	%f609, %f541, %f550, %f513;
	fma.rn.f32 	%f610, %f542, %f550, %f514;
	fma.rn.f32 	%f611, %f539, %f555, %f515;
	fma.rn.f32 	%f612, %f540, %f555, %f516;
	fma.rn.f32 	%f613, %f541, %f555, %f517;
	fma.rn.f32 	%f614, %f542, %f555, %f518;
	fma.rn.f32 	%f615, %f539, %f556, %f519;
	fma.rn.f32 	%f616, %f540, %f556, %f520;
	fma.rn.f32 	%f617, %f541, %f556, %f521;
	fma.rn.f32 	%f618, %f542, %f556, %f522;
	fma.rn.f32 	%f619, %f539, %f557, %f523;
	fma.rn.f32 	%f620, %f540, %f557, %f524;
	fma.rn.f32 	%f621, %f541, %f557, %f525;
	fma.rn.f32 	%f622, %f542, %f557, %f526;
	fma.rn.f32 	%f623, %f539, %f558, %f527;
	fma.rn.f32 	%f624, %f540, %f558, %f528;
	fma.rn.f32 	%f625, %f541, %f558, %f529;
	fma.rn.f32 	%f626, %f542, %f558, %f530;
	ld.shared.v4.f32 	{%f627, %f628, %f629, %f630}, [%r63+1024];
	ld.shared.v4.f32 	{%f635, %f636, %f637, %f638}, [%r63+1280];
	ld.shared.v4.f32 	{%f643, %f644, %f645, %f646}, [%r65+5120];
	ld.shared.v4.f32 	{%f651, %f652, %f653, %f654}, [%r65+5376];
	fma.rn.f32 	%f659, %f627, %f643, %f563;
	fma.rn.f32 	%f660, %f628, %f643, %f564;
	fma.rn.f32 	%f661, %f629, %f643, %f565;
	fma.rn.f32 	%f662, %f630, %f643, %f566;
	fma.rn.f32 	%f663, %f627, %f644, %f567;
	fma.rn.f32 	%f664, %f628, %f644, %f568;
	fma.rn.f32 	%f665, %f629, %f644, %f569;
	fma.rn.f32 	%f666, %f630, %f644, %f570;
	fma.rn.f32 	%f667, %f627, %f645, %f571;
	fma.rn.f32 	%f668, %f628, %f645, %f572;
	fma.rn.f32 	%f669, %f629, %f645, %f573;
	fma.rn.f32 	%f670, %f630, %f645, %f574;
	fma.rn.f32 	%f671, %f627, %f646, %f575;
	fma.rn.f32 	%f672, %f628, %f646, %f576;
	fma.rn.f32 	%f673, %f629, %f646, %f577;
	fma.rn.f32 	%f674, %f630, %f646, %f578;
	fma.rn.f32 	%f675, %f627, %f651, %f579;
	fma.rn.f32 	%f676, %f628, %f651, %f580;
	fma.rn.f32 	%f677, %f629, %f651, %f581;
	fma.rn.f32 	%f678, %f630, %f651, %f582;
	fma.rn.f32 	%f679, %f627, %f652, %f583;
	fma.rn.f32 	%f680, %f628, %f652, %f584;
	fma.rn.f32 	%f681, %f629, %f652, %f585;
	fma.rn.f32 	%f682, %f630, %f652, %f586;
	fma.rn.f32 	%f683, %f627, %f653, %f587;
	fma.rn.f32 	%f684, %f628, %f653, %f588;
	fma.rn.f32 	%f685, %f629, %f653, %f589;
	fma.rn.f32 	%f686, %f630, %f653, %f590;
	fma.rn.f32 	%f687, %f627, %f654, %f591;
	fma.rn.f32 	%f688, %f628, %f654, %f592;
	fma.rn.f32 	%f689, %f629, %f654, %f593;
	fma.rn.f32 	%f690, %f630, %f654, %f594;
	fma.rn.f32 	%f691, %f635, %f643, %f595;
	fma.rn.f32 	%f692, %f636, %f643, %f596;
	fma.rn.f32 	%f693, %f637, %f643, %f597;
	fma.rn.f32 	%f694, %f638, %f643, %f598;
	fma.rn.f32 	%f695, %f635, %f644, %f599;
	fma.rn.f32 	%f696, %f636, %f644, %f600;
	fma.rn.f32 	%f697, %f637, %f644, %f601;
	fma.rn.f32 	%f698, %f638, %f644, %f602;
	fma.rn.f32 	%f699, %f635, %f645, %f603;
	fma.rn.f32 	%f700, %f636, %f645, %f604;
	fma.rn.f32 	%f701, %f637, %f645, %f605;
	fma.rn.f32 	%f702, %f638, %f645, %f606;
	fma.rn.f32 	%f703, %f635, %f646, %f607;
	fma.rn.f32 	%f704, %f636, %f646, %f608;
	fma.rn.f32 	%f705, %f637, %f646, %f609;
	fma.rn.f32 	%f706, %f638, %f646, %f610;
	fma.rn.f32 	%f707, %f635, %f651, %f611;
	fma.rn.f32 	%f708, %f636, %f651, %f612;
	fma.rn.f32 	%f709, %f637, %f651, %f613;
	fma.rn.f32 	%f710, %f638, %f651, %f614;
	fma.rn.f32 	%f711, %f635, %f652, %f615;
	fma.rn.f32 	%f712, %f636, %f652, %f616;
	fma.rn.f32 	%f713, %f637, %f652, %f617;
	fma.rn.f32 	%f714, %f638, %f652, %f618;
	fma.rn.f32 	%f715, %f635, %f653, %f619;
	fma.rn.f32 	%f716, %f636, %f653, %f620;
	fma.rn.f32 	%f717, %f637, %f653, %f621;
	fma.rn.f32 	%f718, %f638, %f653, %f622;
	fma.rn.f32 	%f719, %f635, %f654, %f623;
	fma.rn.f32 	%f720, %f636, %f654, %f624;
	fma.rn.f32 	%f721, %f637, %f654, %f625;
	fma.rn.f32 	%f722, %f638, %f654, %f626;
	ld.shared.v4.f32 	{%f723, %f724, %f725, %f726}, [%r63+1536];
	ld.shared.v4.f32 	{%f731, %f732, %f733, %f734}, [%r63+1792];
	ld.shared.v4.f32 	{%f739, %f740, %f741, %f742}, [%r65+5632];
	ld.shared.v4.f32 	{%f747, %f748, %f749, %f750}, [%r65+5888];
	fma.rn.f32 	%f1136, %f723, %f739, %f659;
	fma.rn.f32 	%f1135, %f724, %f739, %f660;
	fma.rn.f32 	%f1134, %f725, %f739, %f661;
	fma.rn.f32 	%f1133, %f726, %f739, %f662;
	fma.rn.f32 	%f1132, %f723, %f740, %f663;
	fma.rn.f32 	%f1131, %f724, %f740, %f664;
	fma.rn.f32 	%f1130, %f725, %f740, %f665;
	fma.rn.f32 	%f1129, %f726, %f740, %f666;
	fma.rn.f32 	%f1128, %f723, %f741, %f667;
	fma.rn.f32 	%f1127, %f724, %f741, %f668;
	fma.rn.f32 	%f1126, %f725, %f741, %f669;
	fma.rn.f32 	%f1125, %f726, %f741, %f670;
	fma.rn.f32 	%f1124, %f723, %f742, %f671;
	fma.rn.f32 	%f1123, %f724, %f742, %f672;
	fma.rn.f32 	%f1122, %f725, %f742, %f673;
	fma.rn.f32 	%f1121, %f726, %f742, %f674;
	fma.rn.f32 	%f1120, %f723, %f747, %f675;
	fma.rn.f32 	%f1119, %f724, %f747, %f676;
	fma.rn.f32 	%f1118, %f725, %f747, %f677;
	fma.rn.f32 	%f1117, %f726, %f747, %f678;
	fma.rn.f32 	%f1116, %f723, %f748, %f679;
	fma.rn.f32 	%f1115, %f724, %f748, %f680;
	fma.rn.f32 	%f1114, %f725, %f748, %f681;
	fma.rn.f32 	%f1113, %f726, %f748, %f682;
	fma.rn.f32 	%f1112, %f723, %f749, %f683;
	fma.rn.f32 	%f1111, %f724, %f749, %f684;
	fma.rn.f32 	%f1110, %f725, %f749, %f685;
	fma.rn.f32 	%f1109, %f726, %f749, %f686;
	fma.rn.f32 	%f1108, %f723, %f750, %f687;
	fma.rn.f32 	%f1107, %f724, %f750, %f688;
	fma.rn.f32 	%f1106, %f725, %f750, %f689;
	fma.rn.f32 	%f1105, %f726, %f750, %f690;
	fma.rn.f32 	%f1104, %f731, %f739, %f691;
	fma.rn.f32 	%f1103, %f732, %f739, %f692;
	fma.rn.f32 	%f1102, %f733, %f739, %f693;
	fma.rn.f32 	%f1101, %f734, %f739, %f694;
	fma.rn.f32 	%f1100, %f731, %f740, %f695;
	fma.rn.f32 	%f1099, %f732, %f740, %f696;
	fma.rn.f32 	%f1098, %f733, %f740, %f697;
	fma.rn.f32 	%f1097, %f734, %f740, %f698;
	fma.rn.f32 	%f1096, %f731, %f741, %f699;
	fma.rn.f32 	%f1095, %f732, %f741, %f700;
	fma.rn.f32 	%f1094, %f733, %f741, %f701;
	fma.rn.f32 	%f1093, %f734, %f741, %f702;
	fma.rn.f32 	%f1137, %f731, %f742, %f703;
	fma.rn.f32 	%f1138, %f732, %f742, %f704;
	fma.rn.f32 	%f1139, %f733, %f742, %f705;
	fma.rn.f32 	%f1140, %f734, %f742, %f706;
	fma.rn.f32 	%f1141, %f731, %f747, %f707;
	fma.rn.f32 	%f1142, %f732, %f747, %f708;
	fma.rn.f32 	%f1143, %f733, %f747, %f709;
	fma.rn.f32 	%f1144, %f734, %f747, %f710;
	fma.rn.f32 	%f1145, %f731, %f748, %f711;
	fma.rn.f32 	%f1146, %f732, %f748, %f712;
	fma.rn.f32 	%f1147, %f733, %f748, %f713;
	fma.rn.f32 	%f1148, %f734, %f748, %f714;
	fma.rn.f32 	%f1149, %f731, %f749, %f715;
	fma.rn.f32 	%f1150, %f732, %f749, %f716;
	fma.rn.f32 	%f1151, %f733, %f749, %f717;
	fma.rn.f32 	%f1152, %f734, %f749, %f718;
	fma.rn.f32 	%f1153, %f731, %f750, %f719;
	fma.rn.f32 	%f1154, %f732, %f750, %f720;
	fma.rn.f32 	%f1155, %f733, %f750, %f721;
	fma.rn.f32 	%f1156, %f734, %f750, %f722;
	add.s32 	%r93, %r93, 2048;
	add.s32 	%r94, %r94, 4;
	setp.lt.s32	%p6, %r94, 8;
	@%p6 bra 	BB13_9;

	add.s32 	%r91, %r92, 8;
	setp.lt.s32	%p7, %r91, %r11;
	add.s32 	%r92, %r92, 8;
	bar.sync 	0;
	@%p7 bra 	BB13_6;

BB13_11:
	// Epilogue: each thread writes back its 8x8 results as sixteen float4 read-modify-writes, computing acc * param_3 + C * param_8 (alpha and beta under the cuBLAS-style argument order).
	ld.param.f32 	%f1012, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_3];
	ld.param.f32 	%f1011, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_8];
	mov.u32 	%r88, %ctaid.y;
	mov.u32 	%r87, %ctaid.x;
	ld.param.u32 	%r86, [_ZN3wuk12gemm_128x128If6float4Li1EEEviiiT_PKS2_iS4_iS2_PS2_i_param_10];
	shl.b32 	%r67, %r87, 7;
	mov.u32 	%r68, %tid.x;
	shl.b32 	%r69, %r68, 2;
	and.b32  	%r70, %r69, 60;
	add.s32 	%r71, %r67, %r70;
	shl.b32 	%r73, %r88, 7;
	shr.s32 	%r74, %r68, 4;
	shl.b32 	%r75, %r74, 2;
	add.s32 	%r76, %r73, %r75;
	mad.lo.s32 	%r77, %r76, %r86, %r71;
	cvt.u64.u32	%rd37, %r77;
	mul.wide.u32 	%rd38, %r77, 4;
	add.s64 	%rd39, %rd68, %rd38;
	ld.global.v4.f32 	{%f755, %f756, %f757, %f758}, [%rd39];
	mul.f32 	%f763, %f755, %f1011;
	mul.f32 	%f764, %f756, %f1011;
	mul.f32 	%f765, %f757, %f1011;
	mul.f32 	%f766, %f758, %f1011;
	fma.rn.f32 	%f767, %f1133, %f1012, %f766;
	fma.rn.f32 	%f768, %f1134, %f1012, %f765;
	fma.rn.f32 	%f769, %f1135, %f1012, %f764;
	fma.rn.f32 	%f770, %f1136, %f1012, %f763;
	st.global.v4.f32 	[%rd39], {%f770, %f769, %f768, %f767};
	cvt.s64.s32	%rd40, %r86;
	add.s64 	%rd41, %rd40, %rd37;
	shl.b64 	%rd42, %rd41, 2;
	add.s64 	%rd43, %rd68, %rd42;
	ld.global.v4.f32 	{%f771, %f772, %f773, %f774}, [%rd43];
	mul.f32 	%f779, %f771, %f1011;
	mul.f32 	%f780, %f772, %f1011;
	mul.f32 	%f781, %f773, %f1011;
	mul.f32 	%f782, %f774, %f1011;
	fma.rn.f32 	%f783, %f1129, %f1012, %f782;
	fma.rn.f32 	%f784, %f1130, %f1012, %f781;
	fma.rn.f32 	%f785, %f1131, %f1012, %f780;
	fma.rn.f32 	%f786, %f1132, %f1012, %f779;
	st.global.v4.f32 	[%rd43], {%f786, %f785, %f784, %f783};
	shl.b32 	%r78, %r86, 1;
	cvt.s64.s32	%rd44, %r78;
	add.s64 	%rd45, %rd44, %rd37;
	shl.b64 	%rd46, %rd45, 2;
	add.s64 	%rd47, %rd68, %rd46;
	ld.global.v4.f32 	{%f787, %f788, %f789, %f790}, [%rd47];
	mul.f32 	%f795, %f787, %f1011;
	mul.f32 	%f796, %f788, %f1011;
	mul.f32 	%f797, %f789, %f1011;
	mul.f32 	%f798, %f790, %f1011;
	fma.rn.f32 	%f799, %f1125, %f1012, %f798;
	fma.rn.f32 	%f800, %f1126, %f1012, %f797;
	fma.rn.f32 	%f801, %f1127, %f1012, %f796;
	fma.rn.f32 	%f802, %f1128, %f1012, %f795;
	st.global.v4.f32 	[%rd47], {%f802, %f801, %f800, %f799};
	mul.lo.s32 	%r79, %r86, 3;
	cvt.s64.s32	%rd48, %r79;
	add.s64 	%rd49, %rd48, %rd37;
	shl.b64 	%rd50, %rd49, 2;
	add.s64 	%rd51, %rd68, %rd50;
	ld.global.v4.f32 	{%f803, %f804, %f805, %f806}, [%rd51];
	mul.f32 	%f811, %f803, %f1011;
	mul.f32 	%f812, %f804, %f1011;
	mul.f32 	%f813, %f805, %f1011;
	mul.f32 	%f814, %f806, %f1011;
	fma.rn.f32 	%f815, %f1121, %f1012, %f814;
	fma.rn.f32 	%f816, %f1122, %f1012, %f813;
	fma.rn.f32 	%f817, %f1123, %f1012, %f812;
	fma.rn.f32 	%f818, %f1124, %f1012, %f811;
	st.global.v4.f32 	[%rd51], {%f818, %f817, %f816, %f815};
	shl.b32 	%r80, %r86, 6;
	cvt.s64.s32	%rd52, %r80;
	add.s64 	%rd53, %rd52, %rd37;
	shl.b64 	%rd54, %rd53, 2;
	add.s64 	%rd55, %rd68, %rd54;
	ld.global.v4.f32 	{%f819, %f820, %f821, %f822}, [%rd55];
	mul.f32 	%f827, %f819, %f1011;
	mul.f32 	%f828, %f820, %f1011;
	mul.f32 	%f829, %f821, %f1011;
	mul.f32 	%f830, %f822, %f1011;
	fma.rn.f32 	%f831, %f1117, %f1012, %f830;
	fma.rn.f32 	%f832, %f1118, %f1012, %f829;
	fma.rn.f32 	%f833, %f1119, %f1012, %f828;
	fma.rn.f32 	%f834, %f1120, %f1012, %f827;
	st.global.v4.f32 	[%rd55], {%f834, %f833, %f832, %f831};
	mul.lo.s32 	%r81, %r86, 65;
	cvt.s64.s32	%rd56, %r81;
	add.s64 	%rd57, %rd56, %rd37;
	shl.b64 	%rd58, %rd57, 2;
	add.s64 	%rd59, %rd68, %rd58;
	ld.global.v4.f32 	{%f835, %f836, %f837, %f838}, [%rd59];
	mul.f32 	%f843, %f835, %f1011;
	mul.f32 	%f844, %f836, %f1011;
	mul.f32 	%f845, %f837, %f1011;
	mul.f32 	%f846, %f838, %f1011;
	fma.rn.f32 	%f847, %f1113, %f1012, %f846;
	fma.rn.f32 	%f848, %f1114, %f1012, %f845;
	fma.rn.f32 	%f849, %f1115, %f1012, %f844;
	fma.rn.f32 	%f850, %f1116, %f1012, %f843;
	st.global.v4.f32 	[%rd59], {%f850, %f849, %f848, %f847};
	mul.lo.s32 	%r82, %r86, 66;
	cvt.s64.s32	%rd60, %r82;
	add.s64 	%rd61, %rd60, %rd37;
	shl.b64 	%rd62, %rd61, 2;
	add.s64 	%rd63, %rd68, %rd62;
	ld.global.v4.f32 	{%f851, %f852, %f853, %f854}, [%rd63];
	mul.f32 	%f859, %f851, %f1011;
	mul.f32 	%f860, %f852, %f1011;
	mul.f32 	%f861, %f853, %f1011;
	mul.f32 	%f862, %f854, %f1011;
	fma.rn.f32 	%f863, %f1109, %f1012, %f862;
	fma.rn.f32 	%f864, %f1110, %f1012, %f861;
	fma.rn.f32 	%f865, %f1111, %f1012, %f860;
	fma.rn.f32 	%f866, %f1112, %f1012, %f859;
	st.global.v4.f32 	[%rd63], {%f866, %f865, %f864, %f863};
	mul.lo.s32 	%r83, %r86, 67;
	cvt.s64.s32	%rd64, %r83;
	add.s64 	%rd65, %rd64, %rd37;
	shl.b64 	%rd66, %rd65, 2;
	add.s64 	%rd67, %rd68, %rd66;
	ld.global.v4.f32 	{%f867, %f868, %f869, %f870}, [%rd67];
	mul.f32 	%f875, %f867, %f1011;
	mul.f32 	%f876, %f868, %f1011;
	mul.f32 	%f877, %f869, %f1011;
	mul.f32 	%f878, %f870, %f1011;
	fma.rn.f32 	%f879, %f1105, %f1012, %f878;
	fma.rn.f32 	%f880, %f1106, %f1012, %f877;
	fma.rn.f32 	%f881, %f1107, %f1012, %f876;
	fma.rn.f32 	%f882, %f1108, %f1012, %f875;
	st.global.v4.f32 	[%rd67], {%f882, %f881, %f880, %f879};
	ld.global.v4.f32 	{%f883, %f884, %f885, %f886}, [%rd39+256];
	mul.f32 	%f891, %f883, %f1011;
	mul.f32 	%f892, %f884, %f1011;
	mul.f32 	%f893, %f885, %f1011;
	mul.f32 	%f894, %f886, %f1011;
	fma.rn.f32 	%f895, %f1101, %f1012, %f894;
	fma.rn.f32 	%f896, %f1102, %f1012, %f893;
	fma.rn.f32 	%f897, %f1103, %f1012, %f892;
	fma.rn.f32 	%f898, %f1104, %f1012, %f891;
	st.global.v4.f32 	[%rd39+256], {%f898, %f897, %f896, %f895};
	ld.global.v4.f32 	{%f899, %f900, %f901, %f902}, [%rd43+256];
	mul.f32 	%f907, %f899, %f1011;
	mul.f32 	%f908, %f900, %f1011;
	mul.f32 	%f909, %f901, %f1011;
	mul.f32 	%f910, %f902, %f1011;
	fma.rn.f32 	%f911, %f1097, %f1012, %f910;
	fma.rn.f32 	%f912, %f1098, %f1012, %f909;
	fma.rn.f32 	%f913, %f1099, %f1012, %f908;
	fma.rn.f32 	%f914, %f1100, %f1012, %f907;
	st.global.v4.f32 	[%rd43+256], {%f914, %f913, %f912, %f911};
	ld.global.v4.f32 	{%f915, %f916, %f917, %f918}, [%rd47+256];
	mul.f32 	%f923, %f915, %f1011;
	mul.f32 	%f924, %f916, %f1011;
	mul.f32 	%f925, %f917, %f1011;
	mul.f32 	%f926, %f918, %f1011;
	fma.rn.f32 	%f927, %f1093, %f1012, %f926;
	fma.rn.f32 	%f928, %f1094, %f1012, %f925;
	fma.rn.f32 	%f929, %f1095, %f1012, %f924;
	fma.rn.f32 	%f930, %f1096, %f1012, %f923;
	st.global.v4.f32 	[%rd47+256], {%f930, %f929, %f928, %f927};
	ld.global.v4.f32 	{%f931, %f932, %f933, %f934}, [%rd51+256];
	mul.f32 	%f939, %f931, %f1011;
	mul.f32 	%f940, %f932, %f1011;
	mul.f32 	%f941, %f933, %f1011;
	mul.f32 	%f942, %f934, %f1011;
	fma.rn.f32 	%f943, %f1140, %f1012, %f942;
	fma.rn.f32 	%f944, %f1139, %f1012, %f941;
	fma.rn.f32 	%f945, %f1138, %f1012, %f940;
	fma.rn.f32 	%f946, %f1137, %f1012, %f939;
	st.global.v4.f32 	[%rd51+256], {%f946, %f945, %f944, %f943};
	ld.global.v4.f32 	{%f947, %f948, %f949, %f950}, [%rd55+256];
	mul.f32 	%f955, %f947, %f1011;
	mul.f32 	%f956, %f948, %f1011;
	mul.f32 	%f957, %f949, %f1011;
	mul.f32 	%f958, %f950, %f1011;
	fma.rn.f32 	%f959, %f1144, %f1012, %f958;
	fma.rn.f32 	%f960, %f1143, %f1012, %f957;
	fma.rn.f32 	%f961, %f1142, %f1012, %f956;
	fma.rn.f32 	%f962, %f1141, %f1012, %f955;
	st.global.v4.f32 	[%rd55+256], {%f962, %f961, %f960, %f959};
	ld.global.v4.f32 	{%f963, %f964, %f965, %f966}, [%rd59+256];
	mul.f32 	%f971, %f963, %f1011;
	mul.f32 	%f972, %f964, %f1011;
	mul.f32 	%f973, %f965, %f1011;
	mul.f32 	%f974, %f966, %f1011;
	fma.rn.f32 	%f975, %f1148, %f1012, %f974;
	fma.rn.f32 	%f976, %f1147, %f1012, %f973;
	fma.rn.f32 	%f977, %f1146, %f1012, %f972;
	fma.rn.f32 	%f978, %f1145, %f1012, %f971;
	st.global.v4.f32 	[%rd59+256], {%f978, %f977, %f976, %f975};
	ld.global.v4.f32 	{%f979, %f980, %f981, %f982}, [%rd63+256];
	mul.f32 	%f987, %f979, %f1011;
	mul.f32 	%f988, %f980, %f1011;
	mul.f32 	%f989, %f981, %f1011;
	mul.f32 	%f990, %f982, %f1011;
	fma.rn.f32 	%f991, %f1152, %f1012, %f990;
	fma.rn.f32 	%f992, %f1151, %f1012, %f989;
	fma.rn.f32 	%f993, %f1150, %f1012, %f988;
	fma.rn.f32 	%f994, %f1149, %f1012, %f987;
	st.global.v4.f32 	[%rd63+256], {%f994, %f993, %f992, %f991};
	ld.global.v4.f32 	{%f995, %f996, %f997, %f998}, [%rd67+256];
	mul.f32 	%f1003, %f995, %f1011;
	mul.f32 	%f1004, %f996, %f1011;
	mul.f32 	%f1005, %f997, %f1011;
	mul.f32 	%f1006, %f998, %f1011;
	fma.rn.f32 	%f1007, %f1156, %f1012, %f1006;
	fma.rn.f32 	%f1008, %f1155, %f1012, %f1005;
	fma.rn.f32 	%f1009, %f1154, %f1012, %f1004;
	fma.rn.f32 	%f1010, %f1153, %f1012, %f1003;
	st.global.v4.f32 	[%rd67+256], {%f1010, %f1009, %f1008, %f1007};
	ret;
}