体系结构HW3

post on 03 Dec 2021 about 7314words require 25min
CC BY 4.0 （除特别声明或转载文章外）
如果这些文字帮助到你，可以请我喝一杯咖啡~

Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each nondiverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that causes, on average, 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single precision arithmetic and 20% are load/store. Because not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.

Compute the throughput, in GFLOP/s, for this kernel on this GPU.

$10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=57.12\text{GFLOP/s}$

Assume that you have the following choices, what is speedup in throughput for each of these improvements?

Increasing the number of single-precision lanes to 16.

Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors) .

Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95.

$10\text{Cores}\times 16\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=114.24\text{GFLOP/s}$，加速比为 $\frac{114.24}{57.12}=2$
$15\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=85.68\text{GFLOP/s}$，加速比为 $\frac{85.68}{57.12}=1.5$
$10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.95=63.84\text{GFLOP/s}$，加速比为 $\frac{63.84}{57.12}\approx 1.11$

In this exercise, we will examine several loops and analyze their potential for parallelization.

Does the following loop have a loop-carried dependency?

for (int i = 0; i < 100; ++i) {
  A[i] = B[2 * i + 4];
  B[4 * i + 5] = A[i];
}

不存在。对于 A 数组，每轮循环处理的元素不同；对于 B 数组，每一轮循环只依赖偶数下标的元素，而只有奇数下标的元素被修改。

正解：

有四种可能存在的相关：S1 自身、S2 自身的循环间相关，S1 到 S2 的相关（S2 对 S1 的依赖），S2 到 S1 的相关。对于前两种：因为 S1 和 S2 的赋值符号前后对应不同的数组，因此这种相关不存在。

对于 S1 到 S2 可能存在的相关：S1 中计算数组 A 的元素，并在 S2 中加载，两次对 A 的引用在同一个迭代内，不是循环间相关。因此，如果它是仅有的相关，那么这个循环的多次迭代就能并行执行，只要一个迭代中的每对语句保持相对顺序即可。

对于 S2 到 S1，也即 B[2*i+4] 和 B[4*i+5] 可能存在的相关。用最大公约数（Greatest Common Divisor，GCD）测试判断：假定我们已经以索引值 a*x+b 存储了一个数组元素，并以索引值 c*d+d 从同一数组中载入，那么如果要存在循环间相关，必须满足 $\gcd(c,a)$ 能够整除 $(d-b)$。本例中，$a=2,b=4,c=4,d=5,\gcd(4,2)=2$ 不能整除 $(5-4)=1$，故不存在相关。

综上，此代码不存在循环带来的相关，可并行执行。

In the following loop, find all the true dependencies, output dependencies, and anti-dependencies. Eliminate the output dependencies and anti-dependencies by renaming.

for (int i = 0; i < 100; ++i) {
  A[i] = A[i] * B[i]; /* S1 */
  B[i] = A[i] + c;    /* S2 */
  A[i] = C[i] * c;    /* S3 */
  C[i] = D[i] * A[i]; /* S4 */
}

S2 与 S1 存在 true dependence(A[i]，或修改后的 t_i).
S3 与 S4 存在 true dependence(A[i]).
S1 与 S3 存在 output dependence(A[i]).
S3 与 S4 存在 anti-dependence(C[i]).

for (int i = 0; i < 100; ++i) {
  int t_i = A[i] * B[i]; /* S1 */
  B[i] = t_i + c;        /* S2 */
  A[i] = C[i] * c;       /* S3 */
  C[i] = D[i] * A[i];    /* S4 */
}

正解：

输出相关（写后写）：S1 和 S3 通过 A[i] 产生了输出相关。
反相关（读后写）：S3 和 S1 通过 A[i]；S2 和 S1 通过 B[i]；S3 和 S2 通过 A[i]；S4 和 S3 通过 C[i] 产生了反相关。
真数据相关（写后读）：S2 和 S1 通过 A[i]；S4 和 S3 通过 A[i] 产生了真数据相关。

改写代码，通过重命名去掉输出相关和反相关：

for (int i = 0; i < 100; ++i) {
  A[i] = A[i] * B[i];   /* S1 */
  B1[i] = A[i] + c;     /* S2 */
  A1[i] = C[i] * c;     /* S3 */
  C1[i] = D[i] * A1[i]; /* S4 */
}

Consider the following loop, Are there dependences between S1, S2 and S3? Is this loop parallel? If not, show how to make it parallel.

for (int i = 0; i < 100; ++i) {
  A[i] = B[i] + C[i];     /* S1 */
  B[i + 1] = D[i] + E[i]; /* S2 */
  C[i + 1] = D[i] * E[i]; /* S3 */
}

存在循环依赖，S1 依赖前一轮的 S2、S3。可以通过调整语句顺序，去掉循环依赖，效果如下。

A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
  A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];

当然也可以在调用处展开原来的计算，缺点是计算和访存都翻倍了。

#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
  A[i] = i ? D[i - 1] + E[i - 1] + D[i - 1] * E[i - 1] : B[0] + C[0];
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
}

正解：

循环内 S1, S2, S3 之间不存在名称相关或真数据相关。

循环间数组 B 和数组 C 可能存在循环间相关。用最大公约数法分析， B[i] 和 B[i+1]，$a=1,b=0,c=1,d=1$，也即 $\gcd(a,c)=1$ 可以整除 (d-b)=1，因此数组 B 存在循环间相关，不是循环可并行的。数组 C 同理。

改写代码：

A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
  B[i + 1] = D[i] + E[i];
  C[i + 1] = D[i] * E[i];
  A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];

List and describe at least four factors that influence the performance of GPU kernels. In other words, which runtime behaviors are caused by the kernel code cause a reduction in resource utilization during kernel execution?

同一个 warp 执行不同的条件分支。
不能合并的访存。
同一个 warp 同时访问同一个 bank 中的元素，导致 bank conflict。
一个 kernel 使用过多 shared memory，导致 warp 占用率下降。
单个线程使用过多寄存器资源，导致 warp 占用率下降。
不合理的 block/grid 大小设置，增加调度开销。

正解：

分支发散：当线程遵循不同的控制路径时，导致 SIMD 通道被屏蔽。
覆盖内存延迟：足够数量的活动线程可以隐藏内存延迟并提高指令发布率。
合并的片外内存引用：内存访问应该在 SIMD 线程组内连续组织。
片上内存的使用：具有局部性的内存引用应利用片上内存，应组织 SIMD 线程组内对片上内存的引用以避免内存块冲突。

Assume that we have a function for an application of the form $F(i,p)$, which gives the fraction of time that exactly $i$ processors are usable given that a total of $p$ processors are available. This means that
\[\sum_{i=1}^pF\left(i,p\right)=1\]
Assume that when $i$ processors are in use, the applications run $i$ times faster.

Rewrite Amdahl’s Law so that it gives the speedup as a function of $p$ for some application.

设串行运行整个程序的时间为 1，则使用 p 个个物理核心加速程序的各个部分所能达到的加速比为 $k(p)=\frac{1}{\sum_{i=1}^p\frac{F\left(i,p\right)}{i}}$。

An application A runs on single processor for a time $T$ seconds. Different portions of its running time can improve if a larger number of processors is used. The figure below provides the details. How much speedup will A achieve when on 8 processors?

Fraction of T 20% 20% 10% 5% 15% 20% 10%

Processors(P) 1 2 4 6 8 16 128

Fraction of T	20%	20%	10%	5%	15%	20%	10%
Processors(P)	1	2	4	6	8	16	128

使用八个处理器时，多于八核并行的部分实际上只有八核运行。因此对应的加速比为

\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+(15\%+20\%+10\%)/8}\approx 2.57\]

Repeat for 32 processors and an infinite number of processors.

\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/32}\approx 2.72\]

对于无限核数，可以认为是 128 核，则有

\[\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/128}\approx 2.74\]

In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0 GHz, and the base CPI of an application with all references hitting in the cache is 0.75. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is $\left(100+10\times h\right)$ ns, being the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume all communication links are bidirectional.

Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an $8\times 8$ processor grid, or as a hypercube(Hint: longest communication path on a $2^n$ hypercube has n links).

对于 ring 方式组织成的处理器阵列，任意两个处理器之间至多有 32 跳，最坏耗时为 $\left(100+10\times 32\right)=420$ ns.
对于 $8\times 8$ 网格组织成的处理器阵列，任意两个处理器之间至多有 14 跳，最坏耗时为 $\left(100+10\times 14\right)=240$ ns.
对于超立方体方式组织成的处理器阵列，任意两个处理器之间至多有 6 跳，最坏耗时为 $\left(100+10\times 6\right)=160$ ns.

正解：

最大通信网络跳数 h 为最长通信路径 p 的两倍（要折返）。

对于 ring 方式组织成的处理器阵列，任意两个处理器之间至多有 32 跳，最坏耗时为 $\left(100+10\times 32\times 2\right)=740$ ns。
对于 $8\times 8$ 网格组织成的处理器阵列，任意两个处理器之间至多有 14 跳，最坏耗时为 $\left(100+10\times 14\times 2\right)=380$ ns。
对于超立方体方式组织成的处理器阵列，任意两个处理器之间至多有 6 跳，最坏耗时为 $\left(100+10\times 6\times 2\right)=220$ ns。

Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in last part.

则该 CPU 单周期时间为 $1/2.0\text{GHz}=5\times10^{-10}$ s，即 $0.5\text{ns}$.

对于 ring 方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+420)*0.2\%}{0.5}=2.43$。
对于 $8\times 8$ 网格组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+240)*0.2\%}{0.5}=1.71$。
对于超立方体方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+160)*0.2\%}{0.5}=1.39$。

正解：

则该 CPU 单周期时间为 $1/2.0\text{GHz}=5\times10^{-10}$ s，即 $0.5\text{ns}$。

对于 ring 方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+740)*0.2\%}{0.5}= 3.71$。
对于 $8\times 8$ 网格组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+380)*0.2\%}{0.5}= 2.27$。
对于超立方体方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+220)*0.2\%}{0.5}= 1.63$。

Related posts

Tilelang 入门-L2 友好的矩阵乘 05 Jul 2025

DAC25 GoPTX Script 25 Jun 2025

今日此时所想之事（四） 21 Jun 2025

Loading comments...