## 课后作业

Why is it difficult to construct a true shared-memory computer? What is the minimum number of switches for connecting p processors to a shared memory with b words (where each word can be accessed independently)?

Consider a memory system with a level 1 cache of 32 KB and DRAM of 512 MB with the processor operating at 1 GHz. The latency to L1 cache is one cycle and the latency to DRAM is 100 cycles. In each memory cycle, the processor fetches four words (cache line size is four words). What is the peak achievable performance of a dot product of two vectors? Note: Where necessary, assume an optimal cache placement policy.

/* dot product loop */
for (i = 0; i < dim; ++i)
dot_prod += a[i] * b[i];


### Homework-asst-1

Consider the following code where each line within the function represents a single function.

typedef struct
{
float x, y;
} Point;
inline void innerProduct(const Point *a, const Point *b, float *result)
{
float x1 = a->x, //Uses a load instruction
x2 = b->x,
product1 = x1 * x2,
y1 = a->y,
y2 = b->y,
product2 = y1 * y2,
inner = product1 + product2;
*result = inner; //Uses a store instruction
}
void computeInnerProduct(const Point a[], const Point b[], float result[], int n)
{
for (int i = 0; i < n; ++i)
innerProduct(&a[i], &b[i], &result[i]);
}


In the following questions, you can assume the following:

• $N$ is very large($>10^6$).
• The machines described have modern CPUs, providing out-of-order execution, speculative execution, branch prediction, etc.
• There are no cache misses.
• The overhead of updating the loopindex i is negligible.
• The overhead due to procedure calls, as well as starting and ending loops, is negligible.

Problem 1: Instruction-Level Parallelism Suppose you have a machine $M_1$ with two load/store unites that can each load or store a single value on each clock cycle, and one arithmetic unit can perform one arithmetic operation(e.g., multiplication or addition) on each clock cycle. A. Assume that the load/store and arithmetic have latencies of one cycle. How many clock cycles would be required to execute computeInnerProduct as a function of n? Explain what limits the performance.

B. Now assume that the load/store and arithmetic unit have latencies of 10 clock cycles, but they are fully pipelined, able to initiate new operations every clock cycle. How many cycles would be required to execute computeInnerProduction as a function of n?Explain how this relates your answer to Part-A.

Problem 2: SIMD with ISPC Consider running the following ISPC code.

export void computeInnerProductISPC(
uniform point[] a,
uniform point[] b,
uniform float[] result,
uniform int n
)
{
foreach (i = 0...n)
result[i] = a[i].x * b[i].x + a[i].y * b[i].y;
}


Suppose machine $M_2$ has one 8-wide SIMD load/store unit, and one 8-wide SIMD arithmetic unit. both have latencies of one clock cycle. A. How many clock cycles would be required to execute computeInnerProductISPC as a function of n? Explain what lmits the performance.

B. If we run the computeInnerProductISPC on a five-core machine $M_3$, where each core has the same SIMD capabilities as $M_2$, what would be the best speedup it could achieve over the single-core performance of Part-A? Explain.

### Homework-asst-2

Leverage the sample code provided in the course web site.

Build and run the code in the prob1_mandelbrot_threads directory of the Assignment 1 code base.This program produces the image file mandelbrot-vV -serial.ppm, where V is the view index. This image is a visualization of a famous set of complex numbers called the Mandelbrot set. As you can see in the images below, the result is a familiar and beautiful fractal. Each pixel in the image corresponds to a value in the complex plane, and the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set—white pixels required the maximum (256) number of iterations, dark ones only a few iterations, and colored pixels were somewhere in between. (See function mandel() defined in mandelbrot.cpp.) You can learn more about the definition of the Mandelbrot set at en.wikipedia.org/wiki/Mandelbrot set. Use the command option 「–view V 」 for V between 0 and 6 to get the different images. You can click the links below to see the different images on a browser. Take the time to do this—the images are quite striking. (View 0 is not shown— it is all white.)

1. Modify the code in mandelbrot.cpp to parallelize the Mandelbrot generation using two cores. Specifically, compute the top half of the image in thread 0, and the bottom half of the image in thread 1. This type of problem decomposition is referred to as spatial decomposition since different spatial regions of the image are computed by different processors.
2. Extend your code to utilize T threads for {2, 4, 8,16} , partitioning the image generation work into the appropriate number of horizontal blocks. You will need to modify the code in function workerThreadStart, to partition the work over the threads.
3. To confirm (or disprove) your hypothesis, measure the amount of time each thread requires to complete its work by inserting timing code at the beginning and end of workerThreadStart(). How do your measurements explain the speedup graph you previously created?

cd prog1_mandelbrot_threads
make
./mandelbrot -t 2


[mandelbrot serial]:            [708.508] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 0


ffmpeg -i mandelbrot-serial.ppm mandelbrot-serial.png


mandelbrot-serial.pngmandelbrot-thread.png

$./mandelbrot -t 4 [mandelbrot serial]: [713.990] ms Wrote image file mandelbrot-serial.ppmHello world from thread 1 Hello world from thread 2 Hello world from thread 0 Hello world from thread 3 Hello world from thread 1 Hello world from thread 2 Hello world from thread 0 Hello world from thread 3 Hello world from thread 1 Hello world from thread 0 Hello world from thread 2 Hello world from thread 3 Hello world from thread 1 Hello world from thread 2 Hello world from thread 0 Hello world from thread 3 Hello world from thread 1 Hello world from thread 2 Hello world from thread 0 Hello world from thread 3 [mandelbrot thread]: [291.995] ms Wrote image file mandelbrot-thread.ppm++++ (2.45x speedup from 4 threads)  4线程的时候只快了2.45倍。 $ ./mandelbrot -t 8
[mandelbrot serial]:            [707.881] ms
Wrote image file mandelbrot-serial.ppmHello world from thread 1


8线程的时候3.6倍。

$./mandelbrot -t 16 [mandelbrot serial]: [738.481] ms Wrote image file mandelbrot-serial.ppmHello world from thread 1 Hello world from thread 4 Hello world from thread 13 Hello world from thread 5 Hello world from thread 6 Hello world from thread 2 Hello world from thread 7 Hello world from thread 8 Hello world from thread 9 Hello world from thread 10 Hello world from thread 11 Hello world from thread 0 Hello world from thread 12 Hello world from thread 14 Hello world from thread 15 Hello world from thread 3 Hello world from thread 1 Hello world from thread 6 Hello world from thread 3 Hello world from thread 4 Hello world from thread 5 Hello world from thread 7 Hello world from thread 8 Hello world from thread 9 Hello world from thread 2 Hello world from thread 10 Hello world from thread 11 Hello world from thread 12 Hello world from thread 13 Hello world from thread 0 Hello world from thread 14 Hello world from thread 15 Hello world from thread 1 Hello world from thread 2 Hello world from thread 0 Hello world from thread 4 Hello world from thread 5 Hello world from thread 6 Hello world from thread 7 Hello world from thread 8 Hello world from thread 9 Hello world from thread 10 Hello world from thread 11 Hello world from thread 12 Hello world from thread 13 Hello world from thread 14 Hello world from thread 15 Hello world from thread 3 Hello world from thread 1 Hello world from thread 2 Hello world from thread 9 Hello world from thread 15 Hello world from thread 5 Hello world from thread 6 Hello world from thread 7 Hello world from thread 8 Hello world from thread 3 Hello world from thread 10 Hello world from thread 11 Hello world from thread 12 Hello world from thread 13 Hello world from thread 0 Hello world from thread 14 Hello world from thread 4 Hello world from thread 1 Hello world from thread 2 Hello world from thread 9 Hello world from thread 3 Hello world from thread 5 Hello world from thread 6 Hello world from thread 7 Hello world from thread 8 Hello world from thread 0 Hello world from thread 4 Hello world from thread 10 Hello world from thread 11 Hello world from thread 12 Hello world from thread 13 Hello world from thread 14 Hello world from thread 15 [mandelbrot thread]: [185.233] ms Wrote image file mandelbrot-thread.ppm++++ (3.99x speedup from 16 threads)  16线程的时候3.99倍。 可以看到，加速比随进程增加而增加，但是增加的量不是线性的。这是因为并行的时候会有额外的开销。 ### 源代码 #### mandelbrot.cpp 主要的代码， #include <stdio.h> #include <pthread.h> // Use this code to time your threads // 我加入的代码在126~134行和164~174行 #include "CycleTimer.h" /* 15418 Spring 2012 note: This code was modified from example code originally provided by Intel. To comply with Intel's open source licensing agreement, their copyright is retained below. ----------------------------------------------------------------- Copyright (c) 2010-2011, Intel Corporation All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ // Core computation of Mandelbrot set membershop // Iterate complex number c to determine whether it diverges static inline int mandel(float c_re, float c_im, int count) { float z_re = c_re, z_im = c_im; int i; for (i = 0; i < count; ++i) { if (z_re * z_re + z_im * z_im > 4.f) break; float new_re = z_re * z_re - z_im * z_im; float new_im = 2.f * z_re * z_im; z_re = c_re + new_re; z_im = c_im + new_im; } return i; } // // MandelbrotSerial -- // // Compute an image visualizing the mandelbrot set. The resulting // array contains the number of iterations required before the complex // number corresponding to a pixel could be rejected from the set. // // * x0, y0, x1, y1 describe the complex coordinates mapping // into the image viewport. // * width, height describe the size of the output image // * startRow, endRow describe how much of the image to compute void mandelbrotSerial( float x0, float y0, float x1, float y1, int width, int height, int startRow, int endRow, int maxIterations, int output[]) { float dx = (x1 - x0) / width; float dy = (y1 - y0) / height; for (int j = startRow; j < endRow; j++) { for (int i = 0; i < width; ++i) { float x = x0 + i * dx; float y = y0 + j * dy; int index = (j * width + i); output[index] = mandel(x, y, maxIterations); } } } // Struct for passing arguments to thread routine typedef struct { float x0, x1; float y0, y1; unsigned int width; unsigned int height; int maxIterations; int *output; int threadId; int numThreads; } WorkerArgs; // // workerThreadStart -- // // Thread entrypoint. void *workerThreadStart(void *threadArgs) { WorkerArgs *args = static_cast<WorkerArgs *>(threadArgs); // TODO: Implement worker thread here. printf("Hello world from thread %d\n", args->threadId); //以下是我加的第一部分内容 mandelbrotSerial( args->x0, args->y0, args->x1, args->y1, args->width, args->height, args->height / args->numThreads * args->threadId, //startRow args->threadId == args->numThreads - 1 ? args->height : args->height / args->numThreads * (args->threadId + 1), //endRow args->maxIterations, args->output); //增加完毕，只需要调用串行部分的代码即可 return NULL; } // // MandelbrotThread -- // // Multi-threaded implementation of mandelbrot set image generation. // Multi-threading performed via pthreads. void mandelbrotThread( int numThreads, float x0, float y0, float x1, float y1, int width, int height, int maxIterations, int output[]) { const static int MAX_THREADS = 32; if (numThreads > MAX_THREADS) { fprintf(stderr, "Error: Max allowed threads is %d\n", MAX_THREADS); exit(1); } pthread_t workers[MAX_THREADS]; WorkerArgs args[MAX_THREADS]; for (int i = 0; i < numThreads; i++) { // TODO: Set thread arguments here. args[i].threadId = i; //以下是我加的第二段内容 args[i].x0 = x0; args[i].x1 = x1; args[i].y0 = y0; args[i].y1 = y1; args[i].width = width; args[i].height = height; args[i].maxIterations = maxIterations; args[i].output = output; args[i].numThreads = numThreads; //增加完毕，把所有信息传给worker线程即可 } // Fire up the worker threads. Note that numThreads-1 pthreads // are created and the main app thread is used as a worker as // well. for (int i = 1; i < numThreads; i++) pthread_create(&workers[i], NULL, workerThreadStart, &args[i]); workerThreadStart(&args[0]); // wait for worker threads to complete for (int i = 1; i < numThreads; i++) pthread_join(workers[i], NULL); }  #### main.cpp #include <stdio.h> #include <algorithm> #include <getopt.h> #include "CycleTimer.h" extern void mandelbrotSerial( float x0, float y0, float x1, float y1, int width, int height, int startRow, int endRow, int maxIterations, int output[]); extern void mandelbrotThread( int numThreads, float x0, float y0, float x1, float y1, int width, int height, int maxIterations, int output[]); extern void writePPMImage( int *data, int width, int height, const char *filename, int maxIterations); void scaleAndShift(float &x0, float &x1, float &y0, float &y1, float scale, float shiftX, float shiftY) { x0 *= scale; x1 *= scale; y0 *= scale; y1 *= scale; x0 += shiftX; x1 += shiftX; y0 += shiftY; y1 += shiftY; } void usage(const char *progname) { printf("Usage: %s [options]\n", progname); printf("Program Options:\n"); printf(" -t --threads <N> Use N threads\n"); printf(" -v --view <INT> Use specified view settings (1-6)\n"); printf(" -? --help This message\n"); } bool verifyResult(int *gold, int *result, int width, int height) { int i, j; for (i = 0; i < height; i++) { for (j = 0; j < width; j++) { if (gold[i * width + j] != result[i * width + j]) { printf("Mismatch : [%d][%d], Expected : %d, Actual : %d\n", i, j, gold[i * width + j], result[i * width + j]); return 0; } } } return 1; } #define VIEWCNT 6 int main(int argc, char **argv) { const int width = 1440; const int height = 900; const int maxIterations = 256; int numThreads = 2; float x0 = -2.167; float x1 = 1.167; float y0 = -1; float y1 = 1; // Support VIEWCNT views float scaleValues[VIEWCNT + 1] = {1.0f, 1.0f, 0.015f, 0.02f, 0.02f, 0.02f, 0.002f}; float shiftXs[VIEWCNT + 1] = {0.0f, 0.0f, -0.98f, 0.35f, 0.0f, -1.5f, -1.4f}; float shiftYs[VIEWCNT + 1] = {0.0f, 0.0f, 0.30f, 0.05f, 0.73f, 0.0f, 0.0f}; // parse commandline options //////////////////////////////////////////// int opt; static struct option long_options[] = { {"threads", 1, 0, 't'}, {"view", 1, 0, 'v'}, {"help", 0, 0, '?'}, {0, 0, 0, 0}}; while ((opt = getopt_long(argc, argv, "t:v:?", long_options, NULL)) != EOF) { switch (opt) { case 't': { numThreads = atoi(optarg); break; } case 'v': { int viewIndex = atoi(optarg); // change view settings if (viewIndex >= 1 && viewIndex <= VIEWCNT) { float scaleValue = scaleValues[viewIndex]; float shiftX = shiftXs[viewIndex]; float shiftY = shiftYs[viewIndex]; scaleAndShift(x0, x1, y0, y1, scaleValue, shiftX, shiftY); } else { fprintf(stderr, "Invalid view index\n"); return 1; } break; } case '?': default: usage(argv[0]); return 1; } } // end parsing of commandline options int *output_serial = new int[width * height]; int *output_thread = new int[width * height]; // // Run the serial implementation. Run the code three times and // take the minimum to get a good estimate. // memset(output_serial, 0, width * height * sizeof(int)); double minSerial = 1e30; for (int i = 0; i < 3; ++i) { double startTime = CycleTimer::currentSeconds(); mandelbrotSerial(x0, y0, x1, y1, width, height, 0, height, maxIterations, output_serial); double endTime = CycleTimer::currentSeconds(); minSerial = std::min(minSerial, endTime - startTime); } printf("[mandelbrot serial]:\t\t[%.3f] ms\n", minSerial * 1000); writePPMImage(output_serial, width, height, "mandelbrot-serial.ppm", maxIterations); // // Run the threaded version // memset(output_thread, 0, width * height * sizeof(int)); double minThread = 1e30; for (int i = 0; i < 5; ++i) { double startTime = CycleTimer::currentSeconds(); mandelbrotThread(numThreads, x0, y0, x1, y1, width, height, maxIterations, output_thread); double endTime = CycleTimer::currentSeconds(); minThread = std::min(minThread, endTime - startTime); } printf("[mandelbrot thread]:\t\t[%.3f] ms\n", minThread * 1000); writePPMImage(output_thread, width, height, "mandelbrot-thread.ppm", maxIterations); if (!verifyResult(output_serial, output_thread, width, height)) { printf("ERROR : Output from threads does not match serial output\n"); delete[] output_serial; delete[] output_thread; return 1; } // compute speedup printf("++++\t\t\t\t(%.2fx speedup from %d threads)\n", minSerial / minThread, numThreads); delete[] output_serial; delete[] output_thread; return 0; }  #### Makefile CXX=g++ -m64 CXXFLAGS=-I../common -Iobjs/ -O3 -Wall APP_NAME=mandelbrot OBJDIR=objs COMMONDIR=../common PPM_CXX=$(COMMONDIR)/ppm.cpp
PPM_OBJ=$(addprefix$(OBJDIR)/, $(subst$(COMMONDIR)/,, $(PPM_CXX:.cpp=.o))) default:$(APP_NAME)

.PHONY: dirs clean

dirs:
/bin/mkdir -p $(OBJDIR)/ clean: /bin/rm -rf$(OBJDIR) *.ppm *~ $(APP_NAME) OBJS=$(OBJDIR)/main.o $(OBJDIR)/mandelbrot.o$(PPM_OBJ)

$(APP_NAME): dirs$(OBJS)
$(CXX)$(CXXFLAGS) -o $@$(OBJS) -lm -lpthread

$(OBJDIR)/%.o: %.cpp$(CXX) $<$(CXXFLAGS) -c -o $@$(OBJDIR)/%.o: $(COMMONDIR)/%.cpp$(CXX) $<$(CXXFLAGS) -c -o $@$(OBJDIR)/main.o: \$(COMMONDIR)/CycleTimer.h