Computer Architecture HW3

Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32, and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each nondiverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that cause, on average, 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single-precision arithmetic and 20% are load/store. Because not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.

  • Compute the throughput, in GFLOP/s, for this kernel on this GPU.

$10\ \text{processors}\times 8\ \text{lanes}\times 1.5\ \text{GHz}\times 80\%\times 70\%\times 0.85=57.12\ \text{GFLOP/s}$

  • Assume that you have the following choices. What is the speedup in throughput for each of these improvements?
    1. Increasing the number of single-precision lanes to 16.
    2. Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors).
    3. Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95.
  1. $10\ \text{processors}\times 16\ \text{lanes}\times 1.5\ \text{GHz}\times 80\%\times 70\%\times 0.85=114.24\ \text{GFLOP/s}$, for a speedup of $\frac{114.24}{57.12}=2$.
  2. $15\ \text{processors}\times 8\ \text{lanes}\times 1.5\ \text{GHz}\times 80\%\times 70\%\times 0.85=85.68\ \text{GFLOP/s}$, for a speedup of $\frac{85.68}{57.12}=1.5$.
  3. $10\ \text{processors}\times 8\ \text{lanes}\times 1.5\ \text{GHz}\times 80\%\times 70\%\times 0.95=63.84\ \text{GFLOP/s}$, for a speedup of $\frac{63.84}{57.12}\approx 1.12$.
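
As a sanity check, here is a minimal C sketch (variable names are illustrative) that recomputes the baseline throughput and the three speedups from the same factors:

#include <stdio.h>

/* Throughput model used above: processors * lanes * clock (GHz)
   * active-thread fraction * FP fraction * issue rate, in GFLOP/s. */
static double gflops(double procs, double lanes, double ghz,
                     double active, double fp_frac, double issue) {
  return procs * lanes * ghz * active * fp_frac * issue;
}

int main(void) {
  double base         = gflops(10,  8, 1.5, 0.80, 0.70, 0.85); /* 57.12  */
  double more_lanes   = gflops(10, 16, 1.5, 0.80, 0.70, 0.85); /* 114.24 */
  double more_procs   = gflops(15,  8, 1.5, 0.80, 0.70, 0.85); /* 85.68  */
  double better_issue = gflops(10,  8, 1.5, 0.80, 0.70, 0.95); /* 63.84  */

  printf("baseline: %.2f GFLOP/s\n", base);
  printf("16 lanes: %.2f GFLOP/s, speedup %.2f\n", more_lanes, more_lanes / base);
  printf("15 SIMD processors: %.2f GFLOP/s, speedup %.2f\n", more_procs, more_procs / base);
  printf("issue rate 0.95: %.2f GFLOP/s, speedup %.2f\n", better_issue, better_issue / base);
  return 0;
}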

In this exercise, we will examine several loops and analyze their potential for parallelization.

  • Does the following loop have a loop-carried dependency?
for (int i = 0; i < 100; ++i) {
  A[i] = B[2 * i + 4];
  B[4 * i + 5] = A[i];
}

No. For array A, each iteration writes a different element; for array B, each iteration reads only even-indexed elements, while only odd-indexed elements are written.


Reference solution

There are four dependences that could exist: a loop-carried dependence of S1 on itself, a loop-carried dependence of S2 on itself, a dependence from S1 to S2 (S2 depending on S1), and a dependence from S2 to S1. For the first two: because in both S1 and S2 the array on the left-hand side of the assignment differs from the array on the right-hand side, neither self-dependence exists.

For the possible dependence from S1 to S2: S1 computes an element of array A, which S2 then loads. Both references to A occur in the same iteration, so this is not a loop-carried dependence. Therefore, if this were the only dependence, multiple iterations of the loop could execute in parallel, as long as each pair of statements within an iteration keeps its relative order.

For the possible dependence from S2 to S1, that is, between B[2*i+4] and B[4*i+5], use the greatest common divisor (GCD) test: suppose an array element is stored with index a*x+b and loaded from the same array with index c*x+d; a loop-carried dependence can exist only if $\gcd(a,c)$ divides $(d-b)$. Here the store in S2 uses index 4*i+5 ($a=4,b=5$) and the load in S1 uses index 2*i+4 ($c=2,d=4$); $\gcd(4,2)=2$ does not divide $(d-b)=-1$, so no dependence exists.

In summary, this code has no loop-carried dependence, and its iterations can execute in parallel.
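
A minimal C sketch of the GCD test used above (function names are illustrative), applied to the store index 4*i+5 and the load index 2*i+4:

#include <stdio.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int x, int y) {
  while (y != 0) {
    int t = x % y;
    x = y;
    y = t;
  }
  return x;
}

/* Store index a*i+b, load index c*i+d.
   Returns 1 if the GCD test cannot rule out a dependence, 0 if it can.
   (d - b) may be negative; any nonzero remainder means no dependence. */
static int gcd_test(int a, int b, int c, int d) {
  return (d - b) % gcd(a, c) == 0;
}

int main(void) {
  /* Store B[4*i+5] vs. load B[2*i+4]: gcd(4,2)=2 does not divide -1. */
  printf("%s\n", gcd_test(4, 5, 2, 4) ? "dependence possible" : "no dependence");
  return 0;
}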

  • In the following loop, find all the true dependencies, output dependencies, and anti-dependencies. Eliminate the output dependencies and anti-dependencies by renaming.
for (int i = 0; i < 100; ++i) {
  A[i] = A[i] * B[i]; /* S1 */
  B[i] = A[i] + c;    /* S2 */
  A[i] = C[i] * c;    /* S3 */
  C[i] = D[i] * A[i]; /* S4 */
}
  • S2 has a true dependence on S1 (through A[i], or t_i after renaming).
  • S4 has a true dependence on S3 (through A[i]).
  • S1 and S3 have an output dependence (through A[i]).
  • S3 and S4 have an anti-dependence (through C[i]).
for (int i = 0; i < 100; ++i) {
  int t_i = A[i] * B[i]; /* S1 */
  B[i] = t_i + c;        /* S2 */
  A[i] = C[i] * c;       /* S3 */
  C[i] = D[i] * A[i];    /* S4 */
}

Reference solution

  • Output dependence (write after write): S1 and S3 through A[i].
  • Anti-dependences (write after read): S3 and S1 through A[i]; S2 and S1 through B[i]; S3 and S2 through A[i]; S4 and S3 through C[i].
  • True data dependences (read after write): S2 on S1 through A[i]; S4 on S3 through A[i].

Rewriting the code to remove the output and anti-dependences by renaming:

for (int i = 0; i < 100; ++i) {
  A[i] = A[i] * B[i];   /* S1 */
  B1[i] = A[i] + c;     /* S2 */
  A1[i] = C[i] * c;     /* S3 */
  C1[i] = D[i] * A1[i]; /* S4 */
}
  • Consider the following loop. Are there dependences between S1, S2, and S3? Is this loop parallel? If not, show how to make it parallel.
for (int i = 0; i < 100; ++i) {
  A[i] = B[i] + C[i];     /* S1 */
  B[i + 1] = D[i] + E[i]; /* S2 */
  C[i + 1] = D[i] * E[i]; /* S3 */
}

There is a loop-carried dependence: S1 depends on S2 and S3 from the previous iteration. The loop-carried dependence can be removed by reordering the statements, as shown below.

A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
  A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];

Alternatively, the original computation can be expanded inline at the point of use; the drawback is that both computation and memory accesses roughly double.

#pragma omp parallel for
for (int i = 0; i < 100; ++i) {
  A[i] = i ? D[i - 1] + E[i - 1] + D[i - 1] * E[i - 1] : B[0] + C[0];
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
}

Reference solution

Within a single iteration, there are no name dependences or true data dependences among S1, S2, and S3.

Across iterations, arrays B and C may carry loop-carried dependences. Applying the GCD test to B[i] and B[i+1]: the store B[i+1] has $a=1,b=1$ and the load B[i] has $c=1,d=0$, so $\gcd(a,c)=1$ divides $(d-b)=-1$. Therefore array B carries a loop-carried dependence, and the loop is not parallel as written. The same holds for array C.

Rewritten code:

A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
  A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];
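
A minimal sequential check in C (array sizes and synthetic inputs are assumed for illustration) that the reordered loop above produces the same A, B, and C as the original loop:

#include <assert.h>
#include <stdio.h>
#include <string.h>

#define N 100

int main(void) {
  /* B and C need N + 1 elements because the loops write B[i + 1] and C[i + 1]. */
  double Ao[N], Bo[N + 1], Co[N + 1];  /* original loop   */
  double Ar[N], Br[N + 1], Cr[N + 1];  /* reordered loop  */
  double D[N], E[N];

  /* Synthetic inputs; only B[0], C[0], D, and E are live-in. */
  for (int i = 0; i < N; ++i) { D[i] = 0.5 * i; E[i] = 3.0 - i; }
  Bo[0] = Br[0] = 1.0;
  Co[0] = Cr[0] = 2.0;

  /* Original loop. */
  for (int i = 0; i < N; ++i) {
    Ao[i] = Bo[i] + Co[i];
    Bo[i + 1] = D[i] + E[i];
    Co[i + 1] = D[i] * E[i];
  }

  /* Reordered loop; its body is now free of loop-carried dependences. */
  Ar[0] = Br[0] + Cr[0];
  for (int i = 0; i < N - 1; ++i) {
    Br[i + 1] = D[i] + E[i];
    Cr[i + 1] = D[i] * E[i];
    Ar[i + 1] = Br[i + 1] + Cr[i + 1];
  }
  Br[N] = D[N - 1] + E[N - 1];
  Cr[N] = D[N - 1] * E[N - 1];

  /* Both versions perform identical operations, so results match bit for bit. */
  assert(memcmp(Ao, Ar, sizeof Ao) == 0);
  assert(memcmp(Bo, Br, sizeof Bo) == 0);
  assert(memcmp(Co, Cr, sizeof Co) == 0);
  printf("results match\n");
  return 0;
}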

List and describe at least four factors that influence the performance of GPU kernels. In other words, which runtime behaviors caused by the kernel code lead to a reduction in resource utilization during kernel execution?

  • Threads within the same warp taking different branch paths (branch divergence).
  • Memory accesses that cannot be coalesced.
  • Threads within the same warp accessing elements in the same bank at the same time, causing bank conflicts.
  • A kernel using too much shared memory, reducing warp occupancy.
  • A single thread using too many registers, reducing warp occupancy.
  • Poorly chosen block/grid dimensions, increasing scheduling overhead.

Reference solution

  • Branch divergence: when threads follow different control paths, SIMD lanes must be masked off.
  • Covering memory latency: a sufficient number of active threads is needed to hide memory latency and keep the instruction issue rate up.
  • Coalesced off-chip memory references: memory accesses should be organized contiguously within a group of SIMD threads.
  • Use of on-chip memory: memory references with locality should take advantage of on-chip memory, and references to on-chip memory within a group of SIMD threads should be organized to avoid bank conflicts.

Assume that we have a function for an application of the form $F(i,p)$, which gives the fraction of time that exactly $i$ processors are usable given that a total of $p$ processors are available. This means that

\[\sum_{i=1}^pF\left(i,p\right)=1\]

Assume that when $i$ processors are in use, the application runs $i$ times faster.

  • Rewrite Amdahl’s Law so that it gives the speedup as a function of $p$ for some application.

Let the time to run the entire program serially be 1. Then the speedup achievable by accelerating the different portions of the program on $p$ physical processors is $k(p)=\frac{1}{\sum_{i=1}^p\frac{F\left(i,p\right)}{i}}$.

  • An application A runs on a single processor for a time of $T$ seconds. Different portions of its running time can improve if a larger number of processors is used. The figure below provides the details. How much speedup will A achieve when run on 8 processors?
Fraction of T:   20%  20%  10%   5%  15%  20%  10%
Processors (P):    1    2    4    6    8   16  128

With eight processors, the portions that could use more than eight processors in parallel actually run on only eight. The corresponding speedup is therefore

\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+(15\%+20\%+10\%)/8}\approx 2.57\]
  • Repeat for 32 processors and an infinite number of processors.
\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/32}\approx 2.72\]

For an infinite number of processors, no portion can use more than the 128 processors listed in the table, so this is effectively the 128-processor case:

\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/128}\approx 2.74\]
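
A short C sketch (the table is hard-coded, and the number of usable processors for each portion is capped at p) that reproduces the three speedups:

#include <stdio.h>

#define PORTIONS 7

/* Fraction of serial time T and the processor count each portion can use. */
static const double frac[PORTIONS]  = {0.20, 0.20, 0.10, 0.05, 0.15, 0.20, 0.10};
static const double procs[PORTIONS] = {1, 2, 4, 6, 8, 16, 128};

/* Generalized Amdahl's Law: each portion runs on min(procs[i], p) processors. */
static double speedup(double p) {
  double t = 0.0;
  for (int i = 0; i < PORTIONS; ++i)
    t += frac[i] / (procs[i] < p ? procs[i] : p);
  return 1.0 / t;
}

int main(void) {
  printf("p = 8:   %.2f\n", speedup(8));   /* ~2.57 */
  printf("p = 32:  %.2f\n", speedup(32));  /* ~2.72 */
  printf("p = inf: %.2f\n", speedup(1e9)); /* ~2.74, limited by the 128-processor portion */
  return 0;
}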

In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0 GHz, and the base CPI of an application with all references hitting in the cache is 0.75. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is $\left(100+10\times h\right)$ ns, where $h$ is the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume all communication links are bidirectional.

  • Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an $8\times 8$ processor grid, or as a hypercube (Hint: the longest communication path on a $2^n$ hypercube has $n$ links).
  • For processors organized as a ring, any two processors are at most 32 hops apart, so the worst-case cost is $\left(100+10\times 32\right)=420$ ns.
  • For processors organized as an $8\times 8$ grid, any two processors are at most 14 hops apart, so the worst-case cost is $\left(100+10\times 14\right)=240$ ns.
  • For processors organized as a hypercube, any two processors are at most 6 hops apart, so the worst-case cost is $\left(100+10\times 6\right)=160$ ns.

Reference solution

The maximum number of network hops $h$ is twice the length of the longest path, because the reference must travel to the remote memory and back.

  • For the ring, the longest path is 32 links, so the worst-case cost is $\left(100+10\times 32\times 2\right)=740$ ns.
  • For the $8\times 8$ grid, the longest path is 14 links, so the worst-case cost is $\left(100+10\times 14\times 2\right)=380$ ns.
  • For the hypercube, the longest path is 6 links, so the worst-case cost is $\left(100+10\times 6\times 2\right)=220$ ns.
  • Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in the previous part.

The processor cycle time is $1/2.0\ \text{GHz}=5\times10^{-10}$ s, i.e., $0.5\ \text{ns}$.

  • For the ring, the worst-case CPI is $0.75+0.2\%\times\frac{420\ \text{ns}}{0.5\ \text{ns}}=2.43$.
  • For the $8\times 8$ grid, the worst-case CPI is $0.75+0.2\%\times\frac{240\ \text{ns}}{0.5\ \text{ns}}=1.71$.
  • For the hypercube, the worst-case CPI is $0.75+0.2\%\times\frac{160\ \text{ns}}{0.5\ \text{ns}}=1.39$.

Reference solution

The processor cycle time is $1/2.0\ \text{GHz}=5\times10^{-10}$ s, i.e., $0.5\ \text{ns}$.

  • For the ring, the worst-case CPI is $0.75+0.2\%\times\frac{740\ \text{ns}}{0.5\ \text{ns}}=3.71$.
  • For the $8\times 8$ grid, the worst-case CPI is $0.75+0.2\%\times\frac{380\ \text{ns}}{0.5\ \text{ns}}=2.27$.
  • For the hypercube, the worst-case CPI is $0.75+0.2\%\times\frac{220\ \text{ns}}{0.5\ \text{ns}}=1.63$.
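
A short C sketch (round-trip hop counts from the reference solution hard-coded) that reproduces the worst-case remote reference costs and the resulting CPIs:

#include <stdio.h>

int main(void) {
  const double base_cpi = 0.75;
  const double remote_frac = 0.002;   /* 0.2% of instructions */
  const double cycle_ns = 1.0 / 2.0;  /* 2.0 GHz -> 0.5 ns */

  /* One-way worst-case path lengths for 64 processors. */
  const char *name[] = {"ring", "8x8 grid", "hypercube"};
  const int path[] = {32, 14, 6};

  for (int i = 0; i < 3; ++i) {
    int hops = 2 * path[i];                /* out to remote memory and back */
    double cost_ns = 100.0 + 10.0 * hops;  /* remote reference cost in ns */
    double cpi = base_cpi + remote_frac * (cost_ns / cycle_ns);
    printf("%-9s: %d hops, %.0f ns, CPI = %.2f\n", name[i], hops, cost_ns, cpi);
  }
  return 0;
}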