Friday, November 23, 2007

matrix multiplication


Objective type questions

Code for matrix multiplication (questions 1 to 6 are based on this)

for (i = 0; i < l; i++)
{
    for (j = 0; j < n; j++)
    {
        c[i][j] = 0.0;
        for (k = 0; k < m; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}



Which loop do we parallelize in matrix multiplication?
a) Outermost
b) Innermost
c) Both
d) All of above

Ans) a,
because data dependencies arise in the innermost loop, i.e. two segments access/update a piece of common data there, so we parallelize the outermost loop.
How can the data dependency in line 3 (the initialization c[i][j] = 0.0;) be removed?
a) It cannot be removed
b) By parallelizing some other loop
c) By initializing the matrix C outside the loop
d) None of these
Ans) c,
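For option c), a minimal sketch (assuming c is declared as a true two-dimensional array, as in the code above, so sizeof c gives its full size):

#include <string.h>  /* for memset */

/* initialize C once, outside the loop nest, so the k loop only accumulates */
memset(c, 0, sizeof c);
for (i = 0; i < l; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < m; k++)
            c[i][j] += a[i][k] * b[k][j];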



If we parallelize the j loop, the parallel algorithm executes n synchronizations (per iteration of i) and the grain size of the parallel code is:
a) Θ(m n l / p)
b) Θ(l / p)
c) O(n l / p)
d) None of these
Ans) c

If we parallelize the i loop, the parallel algorithm executes only one synchronization, and the grain size of the parallel code is:
a) O(m n l / p)
b) Θ(m n l / p)
c) O(n l / p)
d) None of these
Ans) b

How many rows of the resultant matrix are calculated by each process on a tightly coupled multiprocessor?
a) m n l / p
b) l / p
c) n l / p
d) None of these
Ans) b

Time needed for computing a single row in matrix multiplication:
a) O(m n)
b) Θ(m / n)
c) Θ(m n)
d) None of these
Ans) c

All processes need to be synchronized once, so the synchronization overhead is:
a) Θ(l / p)
b) Θ(m / n)
c) Θ(p)
d) None of these
Ans) c

What is the complexity of the parallel matrix multiplication algorithm?
a) Θ(n³ / p + p)
b) Θ(n³ / p * p)
c) Θ(n³ / p / p)
d) None of these
Ans) a

Since for each element of C we must compute

c[i][j] = Σ (k = 1..N) a[i][k] * b[k][j]

matrix-matrix multiplication involves how many operations?
a) Θ(N³)
b) Θ(N²)
c) None of these

Ans) a

9. Assuming that p = (n/k)² processes are present, matrix multiplication is performed by dividing A and B into p blocks of size k×k. Each block multiplication requires k³ additions and k³ multiplications. How many memory fetches does each block multiplication require?
a) 3k memory fetches
b) 2k² memory fetches
c) 3k² memory fetches
d) None of these
Ans) b
Subjective type questions

Q 1. Which loop do we parallelize in matrix multiplication?
Ans) Sequential code of matrix multiplication:
for (i = 0; i < l; i++)
{
    for (j = 0; j < n; j++)
    {
        c[i][j] = 0.0;
        for (k = 0; k < m; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
In the above code there are three loops which could be parallelized, but which one should be parallelized?
Here, we can parallelize either the i or the j loop without causing any problem, because data dependence arises only in the innermost for loop (the initialization and accumulation of c[i][j]).
(A data dependency arises when two segments access/update common data. For example, consider a = a + 1 and b = a + 1, and assume a = 0, b = 0. Executing them in that order gives a = 1, b = 2; in the opposite order the result is a = 1, b = 1. When they are executed in parallel we cannot expect any particular order, so running the program again and again may generate different outputs.)
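The same effect in plain C (a tiny sketch restating the example above):

int a = 0, b = 0;
a = a + 1;      /* order 1: a == 1 */
b = a + 1;      /*          b == 2 */

a = 0; b = 0;
b = a + 1;      /* order 2: b == 1 */
a = a + 1;      /*          a == 1 */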
If we parallelize the j loop, the parallel algorithm executes n synchronizations (per iteration of i) and the grain size of the parallel code is O(n l / p).
If we parallelize the i loop, the parallel algorithm executes only one synchronization, and the grain size of the parallel code is Θ(m n l / p).
The rule: always try to parallelize the outermost loop, to maximize the grain size. A sketch of this is shown below.
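For instance, the outermost loop can be parallelized with OpenMP (a minimal sketch; OpenMP and the function signature are assumptions here, since the notes do not name a threading mechanism):

#include <omp.h>

/* a is l x m, b is m x n, c is l x n (C99 variable-length array parameters) */
void matmul(int l, int m, int n,
            double a[l][m], double b[m][n], double c[l][n])
{
    #pragma omp parallel for           /* split the i iterations over threads */
    for (int i = 0; i < l; i++)
        for (int j = 0; j < n; j++) {
            double t = 0.0;            /* thread-local accumulator */
            for (int k = 0; k < m; k++)
                t += a[i][k] * b[k][j];
            c[i][j] = t;
        }
}

Compiled with -fopenmp (gcc), the i iterations run across threads with a single implicit barrier at the end of the loop, which matches the one-synchronization claim above.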



Q 2. How is parallel matrix multiplication achieved? Discuss the one-dimensional and two-dimensional decompositions.

Ans. One-dimensional decomposition: In particular, we consider the problem of developing a library to compute C = A.B, where A, B, and C are dense matrices of size N × N. (A dense matrix is a matrix in which most of the entries are nonzero.) This matrix-matrix multiplication involves Θ(N³) operations, since for each element c[i][j] of C, we must compute

c[i][j] = Σ (k = 1..N) a[i][k] * b[k][j]


Figure 4.10:
Matrix-matrix multiplication A.B=C with matrices A , B , and C decomposed in one dimension. The components of A , B , and C allocated to a single task are shaded black. During execution, this task requires all of matrix A (shown stippled).

We start by examining algorithms for various distributions of A, B, and C. We first consider a one-dimensional, columnwise decomposition in which each task encapsulates corresponding columns from A, B, and C. One parallel algorithm makes each task responsible for all computation associated with its columns of C. As shown in Figure 4.10, each task requires all of matrix A in order to compute its columns. N²/P data are required from each of the P − 1 other tasks, giving the following per-processor communication cost (with t_s the per-message startup cost and t_w the per-word transfer cost):

T_comm = (P − 1) (t_s + t_w N²/P)

Note that each task performs N³/P multiply-add computations but receives about N² words, so if N ≈ P the algorithm will have to transfer roughly one word of data for each multiplication and addition performed. Hence, the algorithm can be expected to be efficient only when N is much larger than P, or when the cost of computation is much larger than t_w.
Two dimensional decomposition

Figure 4.11:
Matrix-matrix multiplication A.B=C with matrices A , B , and C decomposed in two dimensions. The components of A , B , and C allocated to a single task are shaded black. During execution, this task requires corresponding rows and columns of matrix A and B , respectively (shown stippled).
Less cost in two-dimensional over one-dimensional decomposition:
We consider a two-dimensional decomposition of A, B, and C. As in the one-dimensional algorithm, we assume that a task encapsulates corresponding elements of A, B, and C and that each task is responsible for all computation associated with its c[i][j] elements. The computation of a single element c[i][j] requires an entire row of A and an entire column of B. Hence, as shown in Figure 4.11, the computation performed within a single task requires the A and B submatrices allocated to tasks in the same row and column, respectively. This is a total of 2 N²/√P data, considerably less than the N² required in the one-dimensional algorithm.
Q 3. Algorithm of parallel matrix multiplication?
We present the matrix multiplication algorithm for a tightly coupled multiprocessor.
We know that it is not very expensive to share and exchange data across processes on tightly coupled machines.
The concern here is to achieve maximum speedup.
We choose to parallelize the outermost loop in this algorithm. We have p processes working in parallel, each computing l/p rows of the resultant matrix.
Process r computes every p-th row of the matrix (i.e., rows r, r+p, r+2p, and so on).

Begin
    /* we consider processors indexed from 1 to p, addressed as p(1) to p(p) */
    for all p(r), where 1 <= r <= p do begin
        /* i, j, k, t are local to p(r) */
        for i = r to l step p do begin
            for j = 1 to n do begin
                t = 0;
                for k = 1 to m do begin
                    t += a[i,k] * b[k,j];
                end;
                c[i,j] = t;
            end;
        end;
    end (for all)
end.
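A runnable C translation of this pseudocode (a sketch assuming POSIX threads and made-up fixed sizes; 0-indexed, so process r handles rows r, r+P, r+2P, ...):

#include <pthread.h>
#include <stddef.h>

enum { L = 8, M = 8, N = 8, P = 4 };      /* sizes assumed for the sketch */
static double a[L][M], b[M][N], c[L][N];  /* fill a and b before running */

static void *worker(void *arg)
{
    int r = *(int *)arg;                  /* process index, 0 <= r < P */
    for (int i = r; i < L; i += P)        /* every P-th row */
        for (int j = 0; j < N; j++) {
            double t = 0.0;
            for (int k = 0; k < M; k++)
                t += a[i][k] * b[k][j];
            c[i][j] = t;
        }
    return NULL;
}

int main(void)
{
    pthread_t tid[P];
    int idx[P];
    for (int r = 0; r < P; r++) {
        idx[r] = r;
        pthread_create(&tid[r], NULL, worker, &idx[r]);
    }
    for (int r = 0; r < P; r++)
        pthread_join(tid[r], NULL);       /* the single synchronization */
    return 0;
}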

We now present the matrix multiplication algorithm for a loosely coupled multiprocessor,
where some matrix elements may be much easier to access than others.
This is the reason the algorithm needs to be redesigned.
A process must access l/p rows of matrix A and all the elements of B (B is swept l/p times), yet only a single addition and multiplication take place for every element of B fetched.
Ignoring the memory access time is safe on tightly coupled multiprocessors, where global memory is equally accessible to each processor; this is not so with loosely coupled multicomputers.
On loosely coupled multiprocessors it is best to keep memory references local as far as possible.
Block matrix multiplication
Here A and B are both n×n matrices, where n = 2k.
Then A and B can be thought of as conglomerates of four smaller matrices, each of size k×k.


A = | A11  A12 |        B = | B11  B12 |        C = | C11  C12 |
    | A21  A22 |            | B21  B22 |            | C21  C22 |

Then C is written as:

C = | A11·B11 + A12·B21    A11·B12 + A12·B22 |
    | A21·B11 + A22·B21    A21·B12 + A22·B22 |

As seen from the above equations, the scheme partitions the problem into disjoint subtasks, which can be solved independently.
The multiplications of the smaller submatrices can be done in parallel, and these local multiplications can be further subdivided using the same scheme; a sketch follows.
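As a concrete illustration (a minimal sketch; the function name block_matmul and the row-major layout are assumptions), one level of this blocking for n = 2k:

/* C_rs = A_r0*B_0s + A_r1*B_1s for block indices r, s in {0, 1};
   each of the four (r, s) block products is independent, so each
   could be handed to a different process, and each k x k product
   can itself be blocked again in the same way. */
void block_matmul(int n, const double *A, const double *B, double *C)
{
    int k = n / 2;
    for (int r = 0; r < 2; r++)
        for (int s = 0; s < 2; s++)
            for (int i = 0; i < k; i++)
                for (int j = 0; j < k; j++) {
                    double t = 0.0;
                    for (int q = 0; q < 2; q++)     /* inner block index */
                        for (int x = 0; x < k; x++)
                            t += A[(r * k + i) * n + (q * k + x)]
                               * B[(q * k + x) * n + (s * k + j)];
                    C[(r * k + i) * n + (s * k + j)] = t;
                }
}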


ANALYSIS:
We can assign processes to do the multiplication tasks local to the respective blocks.
Assuming that p = (n/k)² processes are present, matrix multiplication is performed by dividing A and B into p blocks of size k×k. Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
The number of arithmetic operations per memory fetch has thus increased from 2 to 2k³/(2k²) = k = n/√p, a significant improvement.
The block matrix multiplication algorithm performs better on NUMA multiprocessors because it increases the number of computations performed per non-local memory fetch.
For similar reasons, a careful choice of block size to maximize the cache hit rate can yield a sequential block-oriented matrix multiplication algorithm that executes faster than the traditional algorithm illustrated earlier.
Matrix-matrix multiplication algorithm based on two-dimensional decompositions. Each step involves three stages:
(a) an A submatrix is broadcast to other tasks in the same row;
(b) local computation is performed; and
(c) the B submatrix is rotated upwards within each column.
To complete the second parallel algorithm, we need to design a strategy for communicating the submatrices between tasks. One approach is for each task to execute the following logic (Figure 4.12):

set B' = B_local
for j = 0 to √P − 1
    in each row i, the ((i + j) mod √P)-th task broadcasts A' = its A submatrix to the other tasks in the row
    accumulate C_local = C_local + A' · B'
    send B' to upward neighbor
endfor

Each of the √P steps in this algorithm involves a broadcast to √P tasks (for A') and a nearest-neighbor communication (for B'). Both communications involve N²/P data. Because the broadcast can be accomplished in log √P steps using a tree structure, the per-processor communication cost is

T_comm = √P (1 + log √P) (t_s + t_w N²/P)
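As a sketch of how stages (a)-(c) map onto a message-passing library (MPI, the function name fox_matmul, and the row-major k×k block layout are all assumptions here), each task could run:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Each of P = q*q ranks holds one k x k block of A, B, and C;
   Cloc is assumed zero-initialized. */
void fox_matmul(int k, const double *Aloc, double *Bloc, double *Cloc,
                MPI_Comm grid)
{
    int rank, P;
    MPI_Comm_rank(grid, &rank);
    MPI_Comm_size(grid, &P);
    int q = (int)(sqrt((double)P) + 0.5);
    int row = rank / q, col = rank % q;

    MPI_Comm rowc;                                 /* tasks in the same row */
    MPI_Comm_split(grid, row, col, &rowc);

    double *Abuf = malloc((size_t)k * k * sizeof *Abuf);
    int up   = ((row - 1 + q) % q) * q + col;      /* rank above this one */
    int down = ((row + 1) % q) * q + col;          /* rank below this one */

    for (int j = 0; j < q; j++) {
        int root = (row + j) % q;                  /* (a) broadcast A' in row */
        if (col == root)
            memcpy(Abuf, Aloc, (size_t)k * k * sizeof *Abuf);
        MPI_Bcast(Abuf, k * k, MPI_DOUBLE, root, rowc);

        for (int i = 0; i < k; i++)                /* (b) C += A' . B' */
            for (int jj = 0; jj < k; jj++) {
                double t = 0.0;
                for (int x = 0; x < k; x++)
                    t += Abuf[i * k + x] * Bloc[x * k + jj];
                Cloc[i * k + jj] += t;
            }

        MPI_Sendrecv_replace(Bloc, k * k, MPI_DOUBLE,   /* (c) rotate B' up */
                             up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }
    free(Abuf);
    MPI_Comm_free(&rowc);
}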
