CC BY 4.0 (除特别声明或转载文章外)
如果这篇博客帮助到你,可以请我喝一杯咖啡~
Consider a sparse matrix stored in the compressed row format (you may find a description of this format on the web or any suitable text on sparse linear algebra). Write an OpenMP program for computing the product of this matrix with a vector. Download sample matrics from the Matrix Market and test the performance of your implementation as a function of matrix size and number of threads.
代码如下,需要开c++11
。上面链接下载的矩阵是一个1138 x 1138, 2596 entries的稀疏矩阵,而向量是随机生成的一个1138维的稠密向量。一次乘法的时间可能不够明显,这里输出的是执行十万次矩阵乘法的时间。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#include <bits/stdc++.h>
#include <omp.h>
using namespace std;
typedef double lf;
typedef vector<lf> Vec;
typedef vector<vector<pair<int, lf>>> Mat;
int m, n, th;
Vec operator*(const Vec &v, const Mat &m)
{
Vec r(m.size(), 0);
#pragma omp parallel for num_threads(th)
for (int i = 0; i < m.size(); ++i)
for (const auto &p : m[i])
r[i] += v[p.first] * p.second;
return r;
}
int main()
{
ifstream fin("1138_bus.mtx");
while (fin.peek() == '%')
while (fin.get() != '\n')
;
fin >> m >> n >> th;
Mat ma(m);
for (int x, y, i = 0; i < th; ++i)
{
lf t;
fin >> x >> y >> t;
ma[x - 1].emplace_back(y - 1, t);
}
Vec ve(n);
for (int i = 0; i < n; ++i)
ve[i] = rand();
cout << "number of threads: ";
cin >> th;
auto begin = std::chrono::system_clock::now();
for (int i = 1e5; i; --i)
ve *ma;
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - begin;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}
number of threads | elapsed time |
---|---|
1 | 12.1646s |
2 | 8.52163s |
4 | 9.68788s |
8 | 9.74822s |
16 | 13.2045s |
32 | 19.4011s |
64 | 31.9306s |
我所用的VAIO Z Flip 2016的CPU是Intel(R) Core(TM) i7-6567U,尽管挂着i7的名号但仍然是双核四线程的弱鸡。根据上面测试的结果可以发现,在2个线程时这个并行优化的向量矩阵乘法具有最高的加速比,多于16个线程时反而比1个线程的还要慢。
Implement a producer-consumer framework in OpenMP using sections to create a single producer task and a single consumer task. Ensure appropriate synchronization using locks. Test your program for a varying number of producers and consumers.
使用上一个Project中实现的多线程访问的队列,已经加过锁了。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include <chrono>
#include <iostream>
#include "wkMultiAccessQueue.hpp"
wk::MultiAccessQueue<int> q;
void producer(int cnt)
{
for (int i = 0; i < cnt; ++i)
q.push(i);
}
void consumer(int cnt)
{
for (int i = 0; i < cnt; ++i)
q.pop();
}
int main()
{
int num;
std::cout << "number of producer-consumers: ";
std::cin >> num;
auto begin = std::chrono::system_clock::now();
#pragma omp parallel for
for (int i = 0; i < num; ++i)
producer(1000);
#pragma omp parallel for
for (int i = 0; i < num; ++i)
consumer(1000);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - begin;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}
number of producer-consumers | elapsed time |
---|---|
64 | 0.0644229s |
512 | 0.37481s |
4096 | 2.96336s |
32768 | 23.2893s |