HW3

Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store instructions, meaning that each nondiverged SIMD instruction can produce 32 results every 4 cycles. Assume a kernel that has divergent branches that causes, on average, 80% of threads to be active. Assume that 70% of all SIMD instructions executed are single precision arithmetic and 20% are load/store. Because not all memory latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz.

• Compute the throughput, in GFLOP/s, for this kernel on this GPU.

$10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=57.12\text{GFLOP/s}$

• Assume that you have the following choices, what is speedup in throughput for each of these improvements?
1. Increasing the number of single-precision lanes to 16.
2. Increasing the number of SIMD processors to 15 (assume this change doesn’t affect any other performance metrics and that the code scales to the additional processors) .
3. Adding a cache that will effectively reduce memory latency by 40%, which will increase instruction issue rate to 0.95.
1. $10\text{Cores}\times 16\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=114.24\text{GFLOP/s}$，加速比为 $\frac{114.24}{57.12}=2$
2. $15\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.85=85.68\text{GFLOP/s}$，加速比为 $\frac{85.68}{57.12}=1.5$
3. $10\text{Cores}\times 8\times 1.5\text{GHz}\times 80\%\times 70\%\times 0.95=63.84\text{GFLOP/s}$，加速比为 $\frac{63.84}{57.12}\approx 1.11$

In this exercise, we will examine several loops and analyze their potential for parallelization.

• Does the following loop have a loop-carried dependency?
for (int i = 0; i < 100; ++i) {
A[i] = B[2 * i + 4];
B[4 * i + 5] = A[i];
}


• In the following loop, find all the true dependencies, output dependencies, and anti-dependencies. Eliminate the output dependencies and anti-dependencies by renaming.
for (int i = 0; i < 100; ++i) {
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c;    /* S2 */
A[i] = C[i] * c;    /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}

• S2 与 S1 存在 true dependence(A[i]，或修改后的 t_i).
• S3 与 S4 存在 true dependence(A[i]).
• S1 与 S3 存在 output dependence(A[i]).
• S3 与 S4 存在 anti-dependence(C[i]).
for (int i = 0; i < 100; ++i) {
int t_i = A[i] * B[i]; /* S1 */
B[i] = t_i + c;        /* S2 */
A[i] = C[i] * c;       /* S3 */
C[i] = D[i] * A[i];    /* S4 */
}


• 输出相关（写后写）：S1 和 S3 通过 A[i] 产生了输出相关。
• 反相关（读后写）：S3 和 S1 通过 A[i]；S2 和 S1 通过 B[i]；S3 和 S2 通过 A[i]；S4 和 S3 通过 C[i] 产生了反相关。
• 真数据相关（写后读）：S2 和 S1 通过 A[i]；S4 和 S3 通过 A[i] 产生了真数据相关。

for (int i = 0; i < 100; ++i) {
A[i] = A[i] * B[i];   /* S1 */
B1[i] = A[i] + c;     /* S2 */
A1[i] = C[i] * c;     /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */
}

• Consider the following loop, Are there dependences between S1, S2 and S3? Is this loop parallel? If not, show how to make it parallel.
for (int i = 0; i < 100; ++i) {
A[i] = B[i] + C[i];     /* S1 */
B[i + 1] = D[i] + E[i]; /* S2 */
C[i + 1] = D[i] * E[i]; /* S3 */
}


A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];


#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
A[i] = i ? D[i - 1] + E[i - 1] + D[i - 1] * E[i - 1] : B[0] + C[0];
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
}


A[0] = B[0] + C[0];
#pragma omp parallel for
for (int i = 0; i < 99; ++i) {
B[i + 1] = D[i] + E[i];
C[i + 1] = D[i] * E[i];
A[i + 1] = B[i + 1] + C[i + 1];
}
B[100] = D[99] + E[99];
C[100] = D[99] * E[99];


List and describe at least four factors that influence the performance of GPU kernels. In other words, which runtime behaviors are caused by the kernel code cause a reduction in resource utilization during kernel execution?

• 同一个 warp 执行不同的条件分支。
• 不能合并的访存。
• 同一个 warp 同时访问同一个 bank 中的元素，导致 bank conflict。
• 一个 kernel 使用过多 shared memory，导致 warp 占用率下降。
• 单个线程使用过多寄存器资源，导致 warp 占用率下降。
• 不合理的 block/grid 大小设置，增加调度开销。

• 分支发散：当线程遵循不同的控制路径时，导致 SIMD 通道被屏蔽。
• 覆盖内存延迟：足够数量的活动线程可以隐藏内存延迟并提高指令发布率。
• 合并的片外内存引用：内存访问应该在 SIMD 线程组内连续组织。
• 片上内存的使用：具有局部性的内存引用应利用片上内存，应组织 SIMD 线程组内对片上内存的引用以避免内存块冲突。

Assume that we have a function for an application of the form $F(i,p)$, which gives the fraction of time that exactly $i$ processors are usable given that a total of $p$ processors are available. This means that

$\sum_{i=1}^pF\left(i,p\right)=1$

Assume that when $i$ processors are in use, the applications run $i$ times faster.

• Rewrite Amdahl’s Law so that it gives the speedup as a function of $p$ for some application.

• An application A runs on single processor for a time $T$ seconds. Different portions of its running time can improve if a larger number of processors is used. The figure below provides the details. How much speedup will A achieve when on 8 processors?
Fraction of T 20% 20% 10% 5% 15% 20% 10%
Processors(P) 1 2 4 6 8 16 128

$\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+(15\%+20\%+10\%)/8}\approx 2.57$
• Repeat for 32 processors and an infinite number of processors.
$\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/32}\approx 2.72$

$\frac{1}{20\%/1+20\%/2+10\%/4+5\%/6+15\%/8+20\%/16+10\%/128}\approx 2.74$

In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0 GHz, and the base CPI of an application with all references hitting in the cache is 0.75. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is $\left(100+10\times h\right)$ ns, being the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume all communication links are bidirectional.

• Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an $8\times 8$ processor grid, or as a hypercube(Hint: longest communication path on a $2^n$ hypercube has n links).
• 对于 ring 方式组织成的处理器阵列，任意两个处理器之间至多有 32 跳，最坏耗时为 $\left(100+10\times 32\right)=420$ ns.
• 对于 $8\times 8$ 网格组织成的处理器阵列，任意两个处理器之间至多有 14 跳，最坏耗时为 $\left(100+10\times 14\right)=240$ ns.
• 对于超立方体方式组织成的处理器阵列，任意两个处理器之间至多有 6 跳，最坏耗时为 $\left(100+10\times 6\right)=160$ ns.

• 对于 ring 方式组织成的处理器阵列，任意两个处理器之间至多有 32 跳，最坏耗时为 $\left(100+10\times 32\times 2\right)=740$ ns。
• 对于 $8\times 8$ 网格组织成的处理器阵列，任意两个处理器之间至多有 14 跳，最坏耗时为 $\left(100+10\times 14\times 2\right)=380$ ns。
• 对于超立方体方式组织成的处理器阵列，任意两个处理器之间至多有 6 跳，最坏耗时为 $\left(100+10\times 6\times 2\right)=220$ ns。
• Compare the base CPI of the application with no remote communication to the CPI achieved with each of the three topologies in last part.

• 对于 ring 方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+420)*0.2\%}{0.5}=2.43$。
• 对于 $8\times 8$ 网格组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+240)*0.2\%}{0.5}=1.71$。
• 对于超立方体方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+160)*0.2\%}{0.5}=1.39$。

• 对于 ring 方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+740)*0.2\%}{0.5}= 3.71$。
• 对于 $8\times 8$ 网格组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+380)*0.2\%}{0.5}= 2.27$。
• 对于超立方体方式组织成的处理器阵列，最坏情况下 CPI 为 $0.75\times\frac{0.5\times 99.8\%+(0.5+220)*0.2\%}{0.5}= 1.63$。