并行与分布式计算（4）

post on 29 Apr 2019 about 2724words require 10min
CC BY 4.0 （除特别声明或转载文章外）
如果这些文字帮助到你，可以请我喝一杯咖啡~

Consider a sparse matrix stored in the compressed row format (you may find a description of this format on the web or any suitable text on sparse linear algebra). Write an OpenMP program for computing the product of this matrix with a vector. Download sample matrics from the Matrix Market and test the performance of your implementation as a function of matrix size and number of threads.

代码如下，需要开c++11。上面链接下载的矩阵是一个1138 x 1138, 2596 entries的稀疏矩阵，而向量是随机生成的一个1138维的稠密向量。一次乘法的时间可能不够明显，这里输出的是执行十万次矩阵乘法的时间。

#include <bits/stdc++.h>
#include <omp.h>
using namespace std;
typedef double lf;
typedef vector<lf> Vec;
typedef vector<vector<pair<int, lf>>> Mat;
int m, n, th;
Vec operator*(const Vec &v, const Mat &m)
{
	Vec r(m.size(), 0);
#pragma omp parallel for num_threads(th)
	for (int i = 0; i < m.size(); ++i)
		for (const auto &p : m[i])
			r[i] += v[p.first] * p.second;
	return r;
}
int main()
{
	ifstream fin("1138_bus.mtx");
	while (fin.peek() == '%')
		while (fin.get() != '\n')
			;
	fin >> m >> n >> th;
	Mat ma(m);
	for (int x, y, i = 0; i < th; ++i)
	{
		lf t;
		fin >> x >> y >> t;
		ma[x - 1].emplace_back(y - 1, t);
	}
	Vec ve(n);
	for (int i = 0; i < n; ++i)
		ve[i] = rand();
	cout << "number of threads: ";
	cin >> th;
	auto begin = std::chrono::system_clock::now();
	for (int i = 1e5; i; --i)
		ve *ma;
	auto end = std::chrono::system_clock::now();
	std::chrono::duration<double> elapsed_seconds = end - begin;
	std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

number of threads	elapsed time
1	12.1646s
2	8.52163s
4	9.68788s
8	9.74822s
16	13.2045s
32	19.4011s
64	31.9306s

我所用的VAIO Z Flip 2016的CPU是Intel(R) Core(TM) i7-6567U，尽管挂着i7的名号但仍然是双核四线程的弱鸡。根据上面测试的结果可以发现，在2个线程时这个并行优化的向量矩阵乘法具有最高的加速比，多于16个线程时反而比1个线程的还要慢。

Implement a producer-consumer framework in OpenMP using sections to create a single producer task and a single consumer task. Ensure appropriate synchronization using locks. Test your program for a varying number of producers and consumers.

使用上一个Project中实现的多线程访问的队列，已经加过锁了。

#include <chrono>
#include <iostream>
#include "wkMultiAccessQueue.hpp"
wk::MultiAccessQueue<int> q;
void producer(int cnt)
{
	for (int i = 0; i < cnt; ++i)
		q.push(i);
}
void consumer(int cnt)
{
	for (int i = 0; i < cnt; ++i)
		q.pop();
}
int main()
{
	int num;
	std::cout << "number of producer-consumers: ";
	std::cin >> num;
	auto begin = std::chrono::system_clock::now();
#pragma omp parallel for
	for (int i = 0; i < num; ++i)
		producer(1000);
#pragma omp parallel for
	for (int i = 0; i < num; ++i)
		consumer(1000);
	auto end = std::chrono::system_clock::now();
	std::chrono::duration<double> elapsed_seconds = end - begin;
	std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
}

number of producer-consumers	elapsed time
64	0.0644229s
512	0.37481s
4096	2.96336s
32768	23.2893s

Related posts

Tilelang 入门-L2 友好的矩阵乘 05 Jul 2025

DAC25 GoPTX Script 25 Jun 2025

今日此时所想之事（四） 21 Jun 2025

Loading comments...