post on 03 Dec 2021 about 7314words require 25min
CC BY 4.0 (除特别声明或转载文章外)
如果这篇文章帮助到你,可以请我喝一杯咖啡~
Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each nondiverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that causes, on average, 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single precision arithmetic and 20% are load/store. Because not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.
- Compute the throughput, in GFLOP/s, for this kernel on this GPU.
- Assume that you have the following choices, what is speedup in throughput for each of these improvements?
- Increasing the number of single-precision lanes to 16.
- Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors) .
- Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95.
In this exercise, we will examine several loops and analyze their potential for parallelization.
- Does the following loop have a loop-carried dependency?
1 2 3 4 for (int i = 0; i < 100; ++i) { A[i] = B[2 * i + 4]; B[4 * i + 5] = A[i]; }
不存在。对于 A 数组,每轮循环处理的元素不同;对于 B 数组,每一轮循环只依赖偶数下标的元素,而只有奇数下标的元素被修改。
正解:
有四种可能存在的相关:S1 自身、S2 自身的循环间相关,S1 到 S2 的相关(S2 对 S1 的依赖),S2 到 S1 的相关。对于前两种:因为 S1 和 S2 的赋值符号前后对应不同的数组,因此这种相关不存在。
对于 S1 到 S2 可能存在的相关:S1 中计算数组 A 的元素,并在 S2 中加载,两次对 A 的引用在同一个迭代内,不是循环间相关。因此,如果它是仅有的相关,那么这个循环的多次迭代就能并行执行,只要一个迭代中的每对语句保持相对顺序即可。
对于 S2 到 S1,也即 B[2*i+4]
和 B[4*i+5]
可能存在的相关。用最大公约数(Greatest Common Divisor,GCD)测试判断:假定我们已经以索引值 a*x+b
存储了一个数组元素,并以索引值 c*d+d
从同一数组中载入,那么如果要存在循环间相关,必须满足 能够整除 。本例中, 不能整除 ,故不存在相关。
综上,此代码不存在循环带来的相关,可并行执行。
- In the following loop, find all the true dependencies, output dependencies, and anti-dependencies. Eliminate the output dependencies and anti-dependencies by renaming.
1 2 3 4 5 6 for (int i = 0; i < 100; ++i) { A[i] = A[i] * B[i]; /* S1 */ B[i] = A[i] + c; /* S2 */ A[i] = C[i] * c; /* S3 */ C[i] = D[i] * A[i]; /* S4 */ }
A[i]
,或修改后的 t_i
).A[i]
).A[i]
).C[i]
).1
2
3
4
5
6
for (int i = 0; i < 100; ++i) {
int t_i = A[i] * B[i]; /* S1 */
B[i] = t_i + c; /* S2 */
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}
正解:
改写代码,通过重命名去掉输出相关和反相关:
1
2
3
4
5
6
for (int i = 0; i < 100; ++i) {
A[i] = A[i] * B[i]; /* S1 */
B1[i] = A[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */
}
- Consider the following loop, Are there dependences between S1, S2 and S3? Is this loop parallel? If not, show how to make it parallel.
1 2 3 4 5 for (int i = 0; i < 100; ++i) { A[i] = B[i] + C[i]; /* S1 */ B[i + 1] = D[i] + E[i]; /* S2 */ C[i + 1] = D[i] * E[i]; /* S3 */ }
存在循环依赖,S1 依赖前一轮的 S2、S3。可以通过调整语句顺序,去掉循环依赖,效果如下。
1
2
3
4
5
6
7
8
9
A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];
当然也可以在调用处展开原来的计算,缺点是计算和访存都翻倍了。
1
2
3
4
5
6
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
A[i] = i ? D[i - 1] + E[i - 1] + D[i - 1] * E[i - 1] : B[0] + C[0];
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
}
正解:
循环内 S1, S2, S3 之间不存在名称相关或真数据相关。
循环间数组 B 和数组 C 可能存在循环间相关。用最大公约数法分析,
B[i]
和 B[i+1]
,,也即 可以整除 (d-b)=1
,因此数组 B 存在循环间相关,不是循环可并行的。数组 C 同理。
改写代码:
1
2
3
4
5
6
7
8
9
A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];
List and describe at least four factors that influence the performance of GPU kernels. In other words, which runtime behaviors are caused by the kernel code cause a reduction in resource utilization during kernel execution?
正解:
Assume that we have a function for an application of the form , which gives the fraction of time that exactly processors are usable given that a total of processors are available. This means that
Assume that when processors are in use, the applications run times faster.
- Rewrite Amdahl’s Law so that it gives the speedup as a function of for some application.
设串行运行整个程序的时间为 1,则使用 p 个个物理核心加速程序的各个部分所能达到的加速比为 。
- An application A runs on single processor for a time seconds. Different portions of its running time can improve if a larger number of processors is used. The figure below provides the details. How much speedup will A achieve when on 8 processors?
Fraction of T 20% 20% 10% 5% 15% 20% 10% Processors(P) 1 2 4 6 8 16 128
使用八个处理器时,多于八核并行的部分实际上只有八核运行。因此对应的加速比为
- Repeat for 32 processors and an infinite number of processors.
对于无限核数,可以认为是 128 核,则有
In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0 GHz, and the base CPI of an application with all references hitting in the cache is 0.75. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is ns, being the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume all communication links are bidirectional.
- Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an processor grid, or as a hypercube(Hint: longest communication path on a hypercube has n links).
正解:
最大通信网络跳数 h 为最长通信路径 p 的两倍(要折返)。
- Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in last part.
则该 CPU 单周期时间为 s,即 .
正解:
则该 CPU 单周期时间为 s,即 。
Related posts