
Performance Optimization of the Quantum Circuit Simulator QuEST on Multi-GPU Platforms

Posted on 11 May 2021
CC BY 4.0 (except where otherwise noted or for reposted articles)

Code repository

I was responsible for the GPU part of this ASC20 problem, though our final showing in the competition was not great. Here are some loose notes.

Summary of the implementation

  • Quantum circuit optimizer
    • Merge runs of consecutive gates that share the same targetQubit, or whose targetQubit is below 12
    • Merge probability calculations
  • Express every other gate with two generic gates (see the fusion sketch after this list)
  • Multi-GPU implementation based on CUDA-aware MPI
  • Overlap computation and communication using 15 streams and 16 device-memory buffers
  • Hand-written device-memory allocator (see the allocator sketch after this list)
    • Pre-allocate 92.5% of device memory as a stack
    • Serve allocations at run time by simply moving the stack-top pointer
    • Saves 200+ GB over the original multi-process QuEST when running 34 qubits
  • Better GPU utilization
    • Stage gates with targetQubit < 12 through shared memory
    • Keep merged gate parameters in constant memory (roughly 600 fit; see the sketch after this list)
    • Replace the original vector reduction and dot-product code with cuBLAS (see the sketch after this list)
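
To illustrate the gate fusion and the constant-memory staging, here is a minimal sketch, not the contest code; the names (FusedGate, fusedGates, applyFusedGate) and the interleaved cuDoubleComplex layout are my own simplifications, since QuEST itself keeps real and imaginary parts in separate arrays.

// Sketch only: fuse consecutive single-qubit gates that act on the same
// target qubit into one 2x2 unitary, and stage the merged matrices in
// __constant__ memory.
#include <cuComplex.h>
#include <cuda_runtime.h>

struct FusedGate {
    cuDoubleComplex m[4]; // row-major 2x2 unitary {m00, m01, m10, m11}
    int targetQubit;
};

// ~600 fused gates stay well inside the 64 KiB constant-memory budget.
__constant__ FusedGate fusedGates[600];

// C = B * A, i.e. the matrix of "apply A first, then B".
void fuse(const cuDoubleComplex A[4], const cuDoubleComplex B[4], cuDoubleComplex C[4])
{
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            C[2 * r + c] = cuCadd(cuCmul(B[2 * r + 0], A[0 * 2 + c]),
                                  cuCmul(B[2 * r + 1], A[1 * 2 + c]));
}

// Apply one fused 2x2 unitary to a (local) state-vector chunk; the index
// arithmetic follows the same pattern as QuEST's gate kernels (see the
// pauliY kernel later in this post).
__global__ void applyFusedGate(cuDoubleComplex *stateVec, long long int numAmps, int gateId)
{
    const FusedGate g = fusedGates[gateId];
    long long int sizeHalfBlock = 1LL << g.targetQubit;
    long long int thisTask = blockIdx.x * (long long int)blockDim.x + threadIdx.x;
    if (thisTask >= (numAmps >> 1))
        return;

    long long int thisBlock = thisTask / sizeHalfBlock;
    long long int indexUp = thisBlock * 2 * sizeHalfBlock + thisTask % sizeHalfBlock;
    long long int indexLo = indexUp + sizeHalfBlock;

    const cuDoubleComplex up = stateVec[indexUp], lo = stateVec[indexLo];
    stateVec[indexUp] = cuCadd(cuCmul(g.m[0], up), cuCmul(g.m[1], lo));
    stateVec[indexLo] = cuCadd(cuCmul(g.m[2], up), cuCmul(g.m[3], lo));
}

The host side would build each merged matrix with fuse() and upload a batch once via cudaMemcpyToSymbol(fusedGates, ...), so a long run of gates on the same targetQubit costs a single pass over the state vector.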
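
The hand-written allocator amounts to a stack ("bump-pointer") allocator on top of one large cudaMalloc. A minimal sketch of the idea, with hypothetical names and no error handling:

// Sketch: reserve most of the free GPU memory once, then serve allocations
// by moving a stack-top offset; buffers are released in LIFO order.
#include <cuda_runtime.h>
#include <cstddef>

struct DeviceStack {
    char *base = nullptr;
    size_t capacity = 0;
    size_t top = 0;

    // Reserve a fixed fraction (92.5% in our case) of the currently free memory.
    void init(double fraction = 0.925)
    {
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);
        capacity = (size_t)(freeBytes * fraction);
        cudaMalloc((void **)&base, capacity);
    }

    // Allocation is just a pointer bump (kept 256-byte aligned).
    void *alloc(size_t bytes)
    {
        size_t aligned = (bytes + 255) & ~(size_t)255;
        if (top + aligned > capacity)
            return nullptr; // out of reserved space
        void *p = base + top;
        top += aligned;
        return p;
    }

    // Roll the stack top back to a previously saved mark.
    size_t mark() const { return top; }
    void release(size_t savedMark) { top = savedMark; }
};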
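
The cuBLAS replacement is essentially the observation that a chunk's total probability is just two dot products over device vectors. A minimal sketch (hypothetical function, using the real/imag split that QuEST's deviceStateVec already has):

// Sketch: total probability of a state-vector chunk as sum(real^2) + sum(imag^2),
// in place of the hand-written two-level reduction
// (firstLevelReduction / secondLevelReduction) in QuEST.
#include <cublas_v2.h>

double chunkProbability(cublasHandle_t handle,
                        const double *dStateReal, const double *dStateImag,
                        long long int numAmps)
{
    double probReal = 0.0, probImag = 0.0;
    // cublasDdot takes an int length, so chunks with more than INT_MAX
    // amplitudes would have to be processed in pieces.
    cublasDdot(handle, (int)numAmps, dStateReal, 1, dStateReal, 1, &probReal);
    cublasDdot(handle, (int)numAmps, dStateImag, 1, dStateImag, 1, &probImag);
    return probReal + probImag;
}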

Final-round results

GPU(s)            1           2           4            8
main_HamExp       15.130841s  13.011804s  18.062945s   15.508143s
main_InvQFT       x           x           x            33.724309s
random            8.241649s   10.510134s  13.1650710s  9.797779s
GHZ               1.148370s   1.057206s   1.641490s    1.313547s
GHZ_QFT_N(N=29)   0.560380s   0.545665s   0.840117s    0.653872s

We got thoroughly beaten by Tsinghua and Beihang: on four nodes / eight GPUs their main_InvQFT case finished in 9s and 22s respectively. Their memory allocation was better than mine; I could only run two of the cases, while they completed almost all of them. I asked the Beihang students what to do when device memory runs out, and their answer was simply: spill it back to host memory. I also asked the Tsinghua team captain; they restructured how QuEST distributes the state vector. A HUST student said Tsinghua turned the communication into all-to-all, which makes a lot of sense: in my implementation many pairs of processes never communicate at all, so spreading the traffic evenly as an AlltoAll could well be more efficient.

I feel that when optimizing I always get stuck squeezing fractions of a second out of small details, and I still badly lack the ability to restructure the whole program at the algorithm level.

The Xinhai Revolution has failed.

Environment

Our finals platform, "Xinghe-1" (星河一号), consisted of four Inspur 5280M6 servers, each with two NVIDIA A100 PCIe GPUs. We found that the P2P communication rate between the two GPUs inside a node was far lower than the rate between two GPUs on different nodes; measurements with NVIDIA's CUDA P2P sample supported this observation. The slowdown was much larger than the gap between PCIe 4.0 and NVLink, so we suspected that intra-node GPU P2P was being staged through host memory. We therefore re-cabled the machine so that both GPUs hang off the same CPU, which doubled the rate, but it remained far below the cross-node rate.
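
Whether direct P2P is available at all can also be checked programmatically; below is a minimal sketch of the standard CUDA runtime queries (this is only the capability check, not the bandwidth test from the NVIDIA sample):

// Sketch: query and enable direct peer access between two GPUs. When peer
// access is unavailable or disabled, device-to-device copies are staged
// through host memory, which would match the low intra-node bandwidth we saw.
#include <cuda_runtime.h>
#include <cstdio>

void enablePeerAccess(int devA, int devB)
{
    int canAB = 0, canBA = 0;
    cudaDeviceCanAccessPeer(&canAB, devA, devB);
    cudaDeviceCanAccessPeer(&canBA, devB, devA);
    printf("P2P %d->%d: %d, %d->%d: %d\n", devA, devB, canAB, devB, devA, canBA);

    if (canAB) { cudaSetDevice(devA); cudaDeviceEnablePeerAccess(devB, 0); }
    if (canBA) { cudaSetDevice(devB); cudaDeviceEnablePeerAccess(devA, 0); }
}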

At run time we therefore mapped processes to nodes as 12213443, so that the two processes sharing a node communicate as little as possible in the algorithm. Even so, our code's eight-GPU performance on the finals machines was still slightly below its single-GPU performance.

$ nvidia-smi
Tue May 11 12:08:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0    36W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-PCIE-40GB      On   | 00000000:E3:00.0 Off |                    0 |
| N/A   31C    P0    36W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
        GPU0    GPU1    mlx5_0  CPU Affinity    NUMA Affinity
GPU0     X      NODE    NODE    32-63   1
GPU1    NODE     X      NODE    32-63   1
mlx5_0  NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Preliminary-round results

GPU(s)            1       2       4
random            x       9.62s   7.38s
GHZ               x       0.98s   0.79s
GHZ_QFT_N(N=29)   0.75s   0.47s   0.38s

Actually, we held back somewhat when submitting our preliminary results (to avoid being over-hyped by the organizers): we only submitted a copy of last year's code, which had just the multi-node multi-GPU support and none of the other optimizations; its random and GHZ cases took 8.53s and 7.45s. Clearly we were not the only ones doing this, and others held back far more…

Environment

The preliminary round was run on the V100 nodes of the TH-2K supercomputing center. Each node has four V100 16GB GPUs connected by NVLink in a rather awkward topology: the number of NVLinks differs between GPU pairs.

$ nvidia-smi
Wed May  5 18:58:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   39C    P0    56W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8B:00.0 Off |                    0 |
| N/A   36C    P0    61W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   36C    P0    62W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:B4:00.0 Off |                    0 |
| N/A   39C    P0    60W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity
GPU0     X      NV2     NV1     NODE    PIX     PIX     NODE    NODE    14-27
GPU1    NV2      X      NODE    NV2     PIX     PIX     NODE    NODE    14-27
GPU2    NV1     NODE     X      NV2     NODE    NODE    PIX     PIX     14-27
GPU3    NODE    NV2     NV2      X      NODE    NODE    PIX     PIX     14-27
mlx5_0  PIX     PIX     NODE    NODE     X      PIX     NODE    NODE
mlx5_1  PIX     PIX     NODE    NODE    PIX      X      NODE    NODE
mlx5_2  NODE    NODE    PIX     PIX     NODE    NODE     X      PIX
mlx5_3  NODE    NODE    PIX     PIX     NODE    NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Easter egg

After changing only two lines in statevec_pauliYKernel (see statevec_pauliYKernel_WuK below), the random case got 0.8s faster on a single V100 32GB.

Unfortunately, in the later refactored code I expressed every quantum gate with the two generic gates, so this little easter egg never made it in.

// nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70
typedef double qreal;

typedef struct
{

    char *buffer;   // generated QASM string
    int bufferSize; // maximum number of chars before overflow
    int bufferFill; // number of chars currently in buffer
    int isLogging;  // whether gates are being added to buffer

} QASMLogger;

/** Represents an array of complex numbers grouped into an array of
 * real components and an array of corresponding complex components.
 *
 * @ingroup type
 * @author Ania Brown
 */
typedef struct ComplexArray
{
    qreal *real;
    qreal *imag;
} ComplexArray;

/// \endcond

/** Represents a system of qubits.
 * Qubits are zero-based
 *
 * @ingroup type
 * @author Ania Brown
 * @author Tyson Jones (density matrix)
 */
typedef struct Qureg
{
    //! Whether this instance is a density-state representation
    int isDensityMatrix;
    //! The number of qubits represented in either the state-vector or density matrix
    int numQubitsRepresented;
    //! Number of qubits in the state-vector - this is double the number represented for mixed states
    int numQubitsInStateVec;
    //! Number of probability amplitudes held in stateVec by this process
    //! In the non-MPI version, this is the total number of amplitudes
    long long int numAmpsPerChunk;
    //! Total number of amplitudes, which are possibly distributed among machines
    long long int numAmpsTotal;
    //! The position of the chunk of the state vector held by this process in the full state vector
    int chunkId;
    //! Number of chunks the state vector is broken up into -- the number of MPI processes used
    int numChunks;

    //! Computational state amplitudes - a subset thereof in the MPI version
    ComplexArray stateVec;
    //! Temporary storage for a chunk of the state vector received from another process in the MPI version
    ComplexArray pairStateVec;
    ComplexArray pairDeviceStateVec;

    //! Storage for wavefunction amplitudes in the GPU version
    ComplexArray deviceStateVec;
    //! Storage for reduction of probabilities on GPU
    qreal *firstLevelReduction, *secondLevelReduction;
    int *lastPairRank;
    void *deviceStateVecRealHandle, *deviceStateVecImagHandle, *pairDeviceStateVecRealHandle, *pairDeviceStateVecImagHandle;
    //! Storage for generated QASM output
    QASMLogger *qasmLog;

} Qureg;
__global__ void statevec_pauliYKernel(Qureg qureg, const int targetQubit, const int conjFac)
{

    long long int sizeHalfBlock = 1LL << targetQubit;
    long long int sizeBlock = 2LL * sizeHalfBlock;
    long long int numTasks = qureg.numAmpsPerChunk >> 1;
    long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
    if (thisTask >= numTasks)
        return;

    long long int thisBlock = thisTask / sizeHalfBlock;
    long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
    long long int indexLo = indexUp + sizeHalfBlock;
    qreal stateRealUp, stateImagUp;

    qreal *stateVecReal = qureg.deviceStateVec.real;
    qreal *stateVecImag = qureg.deviceStateVec.imag;
    stateRealUp = stateVecReal[indexUp];
    stateImagUp = stateVecImag[indexUp];

    // update under +-{{0, -i}, {i, 0}}
    stateVecReal[indexUp] = conjFac * stateVecImag[indexLo];
    stateVecImag[indexUp] = conjFac * -stateVecReal[indexLo];
    stateVecReal[indexLo] = conjFac * -stateImagUp;
    stateVecImag[indexLo] = conjFac * stateRealUp;
}
__global__ void statevec_pauliYKernel_WuK(Qureg qureg, const int targetQubit, const int conjFac)
{

    long long int sizeHalfBlock = 1LL << targetQubit;
    long long int sizeBlock = 2LL * sizeHalfBlock;
    long long int numTasks = qureg.numAmpsPerChunk >> 1;
    long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
    if (thisTask >= numTasks)
        return;

    long long int thisBlock = thisTask / sizeHalfBlock;
    long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
    long long int indexLo = indexUp + sizeHalfBlock;
    qreal stateRealUp, stateImagUp;

    qreal *stateVecReal = qureg.deviceStateVec.real;
    qreal *stateVecImag = qureg.deviceStateVec.imag;
    stateRealUp = stateVecReal[indexUp];
    stateImagUp = stateVecImag[indexUp];

    const qreal
        stateRealLo = stateVecReal[indexLo],
        stateImagLo = stateVecImag[indexLo];

    // update under +-{{0, -i}, {i, 0}}
    stateVecReal[indexUp] = conjFac * stateImagLo;
    stateVecImag[indexUp] = conjFac * -stateRealLo;
    stateVecReal[indexLo] = conjFac * -stateImagUp;
    stateVecImag[indexLo] = conjFac * stateRealUp;
}

Run in the terminal

nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70

to get the corresponding PTX file:

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//

.version 6.4
.target sm_70
.address_size 64

	// .globl	_Z21statevec_pauliYKernel5Quregii

.visible .entry _Z21statevec_pauliYKernel5Quregii(
	.param .align 8 .b8 _Z21statevec_pauliYKernel5Quregii_param_0[168],
	.param .u32 _Z21statevec_pauliYKernel5Quregii_param_1,
	.param .u32 _Z21statevec_pauliYKernel5Quregii_param_2
)
{
	.reg .pred 	%p<3>;
	.reg .b32 	%r<11>;
	.reg .f64 	%fd<12>;
	.reg .b64 	%rd<29>;


	mov.b64	%rd8, _Z21statevec_pauliYKernel5Quregii_param_0;
	ld.param.u32 	%r2, [_Z21statevec_pauliYKernel5Quregii_param_1];
	ld.param.u32 	%r1, [_Z21statevec_pauliYKernel5Quregii_param_2];
	mov.u64 	%rd3, %rd8;
	cvt.u64.u32	%rd1, %r2;
	mov.u64 	%rd9, 1;
	shl.b64 	%rd2, %rd9, %r2;
	ld.param.u64 	%rd10, [_Z21statevec_pauliYKernel5Quregii_param_0+16];
	shr.s64 	%rd11, %rd10, 1;
	mov.u32 	%r3, %ntid.x;
	mov.u32 	%r4, %ctaid.x;
	mov.u32 	%r5, %tid.x;
	mad.lo.s32 	%r6, %r3, %r4, %r5;
	cvt.u64.u32	%rd4, %r6;
	setp.ge.s64	%p1, %rd4, %rd11;
	@%p1 bra 	BB0_5;

	and.b64  	%rd12, %rd2, -4294967296;
	setp.eq.s64	%p2, %rd12, 0;
	@%p2 bra 	BB0_3;

	rem.s64 	%rd28, %rd4, %rd2;
	bra.uni 	BB0_4;

BB0_3:
	cvt.u32.u64	%r7, %rd2;
	cvt.u32.u64	%r8, %rd4;
	rem.u32 	%r9, %r8, %r7;
	cvt.u64.u32	%rd28, %r9;

BB0_4:
	cvt.u32.u64	%r10, %rd1;
	shr.u64 	%rd13, %rd4, %r10;
	mul.lo.s64 	%rd14, %rd2, %rd13;
	shl.b64 	%rd15, %rd14, 1;
	add.s64 	%rd16, %rd28, %rd15;
	add.s64 	%rd17, %rd16, %rd2;
	ld.param.u64 	%rd18, [%rd3+88];
	cvta.to.global.u64 	%rd19, %rd18;
	ld.param.u64 	%rd20, [%rd3+96];
	cvta.to.global.u64 	%rd21, %rd20;
	shl.b64 	%rd22, %rd16, 3;
	add.s64 	%rd23, %rd19, %rd22;
	ld.global.f64 	%fd1, [%rd23];
	add.s64 	%rd24, %rd21, %rd22;
	ld.global.f64 	%fd2, [%rd24];
	shl.b64 	%rd25, %rd17, 3;
	add.s64 	%rd26, %rd21, %rd25;
	ld.global.f64 	%fd3, [%rd26];
	cvt.rn.f64.s32	%fd4, %r1;
	mul.f64 	%fd5, %fd4, %fd3;
	st.global.f64 	[%rd23], %fd5;
	add.s64 	%rd27, %rd19, %rd25;
	ld.global.f64 	%fd6, [%rd27];
	mul.f64 	%fd7, %fd4, %fd6;
	neg.f64 	%fd8, %fd7;
	st.global.f64 	[%rd24], %fd8;
	mul.f64 	%fd9, %fd4, %fd2;
	neg.f64 	%fd10, %fd9;
	st.global.f64 	[%rd27], %fd10;
	mul.f64 	%fd11, %fd4, %fd1;
	st.global.f64 	[%rd26], %fd11;

BB0_5:
	ret;
}

	// .globl	_Z25statevec_pauliYKernel_WuK5Quregii
.visible .entry _Z25statevec_pauliYKernel_WuK5Quregii(
	.param .align 8 .b8 _Z25statevec_pauliYKernel_WuK5Quregii_param_0[168],
	.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_1,
	.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_2
)
{
	.reg .pred 	%p<3>;
	.reg .b32 	%r<11>;
	.reg .f64 	%fd<12>;
	.reg .b64 	%rd<29>;


	mov.b64	%rd8, _Z25statevec_pauliYKernel_WuK5Quregii_param_0;
	ld.param.u32 	%r2, [_Z25statevec_pauliYKernel_WuK5Quregii_param_1];
	ld.param.u32 	%r1, [_Z25statevec_pauliYKernel_WuK5Quregii_param_2];
	mov.u64 	%rd3, %rd8;
	cvt.u64.u32	%rd1, %r2;
	mov.u64 	%rd9, 1;
	shl.b64 	%rd2, %rd9, %r2;
	ld.param.u64 	%rd10, [_Z25statevec_pauliYKernel_WuK5Quregii_param_0+16];
	shr.s64 	%rd11, %rd10, 1;
	mov.u32 	%r3, %ntid.x;
	mov.u32 	%r4, %ctaid.x;
	mov.u32 	%r5, %tid.x;
	mad.lo.s32 	%r6, %r3, %r4, %r5;
	cvt.u64.u32	%rd4, %r6;
	setp.ge.s64	%p1, %rd4, %rd11;
	@%p1 bra 	BB1_5;

	and.b64  	%rd12, %rd2, -4294967296;
	setp.eq.s64	%p2, %rd12, 0;
	@%p2 bra 	BB1_3;

	rem.s64 	%rd28, %rd4, %rd2;
	bra.uni 	BB1_4;

BB1_3:
	cvt.u32.u64	%r7, %rd2;
	cvt.u32.u64	%r8, %rd4;
	rem.u32 	%r9, %r8, %r7;
	cvt.u64.u32	%rd28, %r9;

BB1_4:
	cvt.u32.u64	%r10, %rd1;
	shr.u64 	%rd13, %rd4, %r10;
	mul.lo.s64 	%rd14, %rd2, %rd13;
	shl.b64 	%rd15, %rd14, 1;
	add.s64 	%rd16, %rd28, %rd15;
	add.s64 	%rd17, %rd16, %rd2;
	ld.param.u64 	%rd18, [%rd3+88];
	cvta.to.global.u64 	%rd19, %rd18;
	ld.param.u64 	%rd20, [%rd3+96];
	cvta.to.global.u64 	%rd21, %rd20;
	shl.b64 	%rd22, %rd16, 3;
	add.s64 	%rd23, %rd19, %rd22;
	ld.global.f64 	%fd1, [%rd23];
	add.s64 	%rd24, %rd21, %rd22;
	ld.global.f64 	%fd2, [%rd24];
	shl.b64 	%rd25, %rd17, 3;
	add.s64 	%rd26, %rd19, %rd25;
	ld.global.f64 	%fd3, [%rd26];
	add.s64 	%rd27, %rd21, %rd25;
	cvt.rn.f64.s32	%fd4, %r1;
	ld.global.f64 	%fd5, [%rd27];
	mul.f64 	%fd6, %fd4, %fd5;
	st.global.f64 	[%rd23], %fd6;
	mul.f64 	%fd7, %fd4, %fd3;
	neg.f64 	%fd8, %fd7;
	st.global.f64 	[%rd24], %fd8;
	mul.f64 	%fd9, %fd4, %fd2;
	neg.f64 	%fd10, %fd9;
	st.global.f64 	[%rd26], %fd10;
	mul.f64 	%fd11, %fd4, %fd1;
	st.global.f64 	[%rd27], %fd11;

BB1_5:
	ret;
}

After this naive modification, the relevant part of the memory-access code changes from

	add.s64 	%rd26, %rd21, %rd25;
	ld.global.f64 	%fd3, [%rd26];
	cvt.rn.f64.s32	%fd4, %r1;
	mul.f64 	%fd5, %fd4, %fd3;
	st.global.f64 	[%rd23], %fd5;
	add.s64 	%rd27, %rd19, %rd25;
	ld.global.f64 	%fd6, [%rd27];
	mul.f64 	%fd7, %fd4, %fd6;
	neg.f64 	%fd8, %fd7;
	st.global.f64 	[%rd24], %fd8;
	mul.f64 	%fd9, %fd4, %fd2;
	neg.f64 	%fd10, %fd9;
	st.global.f64 	[%rd27], %fd10;
	mul.f64 	%fd11, %fd4, %fd1;
	st.global.f64 	[%rd26], %fd11;

to

	add.s64 	%rd26, %rd19, %rd25;
	ld.global.f64 	%fd3, [%rd26];
	add.s64 	%rd27, %rd21, %rd25;
	cvt.rn.f64.s32	%fd4, %r1;
	ld.global.f64 	%fd5, [%rd27];
	mul.f64 	%fd6, %fd4, %fd5;
	st.global.f64 	[%rd23], %fd6;
	mul.f64 	%fd7, %fd4, %fd3;
	neg.f64 	%fd8, %fd7;
	st.global.f64 	[%rd24], %fd8;
	mul.f64 	%fd9, %fd4, %fd2;
	neg.f64 	%fd10, %fd9;
	st.global.f64 	[%rd26], %fd10;
	mul.f64 	%fd11, %fd4, %fd1;
	st.global.f64 	[%rd27], %fd11;

We can see that the kernel's ordering of global-memory reads and writes has changed: originally the ld.global and st.global instructions were interleaved to some extent, whereas after the adjustment all the reads come before the writes, which makes the accesses easier to coalesce at run time. nvprof can be used to verify that the optimization works. statevec_pauliYKernel is called 43 times in case 1; its Avg/Min/Max run times were 63.251ms/57.993ms/84.206ms, and it accounted for 14.91% of case 1's total time (2.71978s of 18.688107s). After the change, its Avg/Min/Max became 44.101ms/42.064ms/53.005ms, or 10.87% of the total time (1.89634s of 17.922671s). In other words, changing just two lines of the original CUDA code bought the single-GPU version nearly 1s, which is quite effective.

How the problem was found is also worth noting. In the nvprof results, this kernel's Avg (63.251ms) was much larger than that of statevec_pauliXKernel (43.516ms), which does the same amount of work, and its Max was far above its Avg, indicating large run-to-run variation and therefore room for optimization. Comparing statevec_pauliYKernel and statevec_pauliXKernel line by line, essentially the only difference was that the original author, to save two variables, had put the global-memory loads directly inside the arithmetic expressions; after pulling them out, the speedup duly appeared. Similar tweaks later squeezed out another 0.4s elsewhere. This suggests a good habit for CUDA code: avoid reading and writing global memory in the same expression, and keep global-memory reads and writes apart, each grouped together.