HW1

In this exercise, assume that we are considering enhancing a quad-core machine by adding encryption hardware to it. When computing encryption operations, it is 20 times faster than the normal mode of execution. We will define percentage of encryption as the percentage of time in the original execution that is spent performing encryption operations. The specialized hardware increases power consumption by 2%.

• Draw a graph that plots the speedup as a percentage of the computation spent performing encryption. Label the y-axis “Net speedup” and label the x-axis “Percent encryption.”
{
"mark": "line",
"data": {
"sequence": {
"start": 0,
"stop": 101,
"as": "PercentEncryption"
}
},
"transform": [
{
"calculate": "100/((100-datum.PercentEncryption)+0.05*datum.PercentEncryption)",
"as": "NetSpeedup"
}
],
"encoding": {
"x": {
"field": "PercentEncryption",
"type": "quantitative"
},
"y": {
"field": "NetSpeedup",
"type": "quantitative"
}
}
}

• With what percentage of encryption will adding encryption hardware result in a speedup of 2?

• What percentage of time in the new execution will be spent on encryption operations if a speedup of 2 is achieved?

• Suppose you have measured the percentage of encryption to be 50%. The hardware design group estimates it can speed up the encryption hardware even more with significant additional investment. You wonder whether adding a second unit in order to support parallel encryption operations would be more useful. Imagine that in the original program, 90% of the encryption operations could be performed in parallel. What is the speedup of providing two or four encryption units, assuming that the parallelization allowed is limited to the number of encryption units? (Answer two and four respectively)

The ways of a set can be viewed as a priority list, ordered from high priority to low priority. Every time the set is touched, the list can be reorganized to change block priorities. With this view, cache management policies can be decomposed into three sub-policies: Insertion, Promotion, and Victim Selection. Insertion defines where newly fetched blocks are placed in the priority list. Promotion defines how a block’s position in the list is changed every time it is touched (a cache hit). Victim Selection defines which entry of the list is evicted to make room for a new block when there is a cache miss.

• Can you frame the LRU cache policy in terms of the Insertion, Promotion, and Victim Selection sub-policies?
1. 新数据插入（Insertion）到 set 头部；
2. 当缓存命中即缓存数据被访问，则将数据移到（Promotion）头部；
3. 当缓存满的时候，将尾部的数据丢弃（Victim）。

1. 插入：将新获取的块放在优先级最高的位置，也即优先级列表最前；同时，优先级列表最后的块从集合中删去。
2. 提升：如果在优先级列表中的块被触摸（缓存命中），则将该块与优先级比他大 N 级的块调换位置；如果该块在前 N 个，则提到最前。从而增加其优先级。
3. 受害者：同（1）中所述，当插入到来时，优先级最后块的成为受害者，被移出集合。
• Can you define other Insertion and Promotion policies that may be competitive and worth exploring further?

A cache acts as a filter. For example, for every 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality that they cannot be serviced by a 2 MB cache. The 2 MB cache is said to have an MPKI (misses per thousand instructions) of 20, and this will be largely true regardless of the smaller caches that precede the 2 MB cache. Assume the following cache/latency/MPKI values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. Assume that accessing the off-chip memory system requires 200 cycles on average.

• For the following cache configurations, calculate the average time(denote it by cycles) spent accessing the cache hierarchy.
1. 32 KB L1; 8 MB L2; off-chip memory
2. 32 KB L1; 512 KB L2; 8 MB L3; off-chip memory
3. 32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory
1. $1+\frac{100}{1000}\times\left(16+\frac{10}{1000}\times 200\right)=2.8$
2. $1+\frac{100}{1000}\times\left(4+\frac{50}{1000}\times\left(16+\frac{10}{1000}\times 200\right) \right)=1.49$
3. $1+\frac{100}{1000}\times\left(2+\frac{80}{1000}\times\left(8+\frac{40}{1000}\times\left(16+\frac{10}{1000}\times 200\right)\right)\right)=1.26976$

1. $1+\frac{100}{1000}\times\left(16+\frac{10}{100}\times 200\right)=4.6$
2. $1+\frac{100}{1000}\times\left(4+\frac{50}{100}\times\left(16+\frac{10}{50}\times 200\right) \right)=4.2$
3. $1+\frac{100}{1000}\times\left(2+\frac{80}{100}\times\left(8+\frac{40}{80}\times\left(16+\frac{10}{40}\times 200\right)\right)\right)=4.48$
• What do you observe about the downsides of a cache hierarchy that is too shallow or too deep?
• 缓存太浅，会导致 Cache Miss 频繁发生，导致内存墙。
• 缓存太深，会导致访存延迟增加，而性能提升效果却逐渐下降，同时成本与功耗都大大增加。

Consider a 16 MB 16-way L3 cache that is shared by two programs A and B. There is a mechanism in the cache that monitors cache miss rates for each program and allocates 1–15 ways to each program such that the overall number of cache misses is reduced. Assume that program A has an MPKI of 100 when it is assigned 1 MB of the cache. Each additional 1 MB assigned to program A reduces the MPKI by 1. Program B has an MPKI of 50 when it is assigned 1 MB of cache; each additional 1 MB assigned to program B reduces its MPKI by 2. What is the best allocation of ways to programs A and B? Please explain why this allocation is best.

Virtual machines (VMs) have the potential for adding many beneficial capabilities to computer systems, such as improved total cost of ownership (TCO) or availability. Could VMs be used to provide the following capabilities? If so, how could they facilitate this?

• Test applications in production environments using development machines?

• Quick redeployment of applications in case of disaster or failure?

• Higher performance in I/O-intensive applications?

• Fault isolation between different applications, resulting in higher availability for services?

• Performing software maintenance on systems while applications are running without significant interruption?