Thursday, November 29, 2007

Message Passing Interface

Definition: MPI is a standard API (Application Programming Interface) that can be used to create parallel applications. MPI is designed primarily to support the SPMD (Single Program, Multiple Data) model. Because MPI is a standard, common library, programs written with MPI are highly portable.
The goal of MPI, simply stated, is to develop a widely used standard for writing message-passing programs. As such, the interface should establish a practical, portable, efficient and flexible standard for message passing. The standard is maintained by the MPI Forum. MPI is available for both C and Fortran programs and on a wide variety of parallel machines. The target platform is a distributed memory system such as the SP.
A set of library routines is used to design scalable parallel applications. These routines provide a wide range of operations that include computation, communication and synchronization. MPI 1.2 is the current standard supported by major vendors.
Features:
1. MPI has fully asynchronous communication: immediate send and receive operations can fully overlap computation (see the sketch after this list).
2. MPI groups are solid, efficient and deterministic: group membership is static, so there are no race conditions caused by processes independently entering and leaving a group. New group formation is collective, and group membership information is distributed, not centralized.
3. MPI efficiently manages message buffers: messages are sent and received from user data structures, not from staging buffers within the communication library.
4. The target platform is a distributed memory system, including massively parallel machines, SMP clusters and heterogeneous networks.
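
Feature 1 can be illustrated with nonblocking calls. A minimal sketch, assuming a simple ring exchange of one double per rank (the partner choice, tag 0 and buffer sizes are illustrative, not part of the original post):

/* Sketch: overlap computation with nonblocking MPI communication.
   Compile with an MPI wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double send_buf, recv_buf = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = (rank + 1) % size;   /* illustrative ring partner */
    send_buf = (double)rank;

    /* Post immediate (nonblocking) send and receive. */
    MPI_Isend(&send_buf, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_buf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation can proceed here while messages are in flight ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %.1f\n", rank, recv_buf);

    MPI_Finalize();
    return 0;
}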

Parallel Virtual Machine

Parallel Virtual Machine in detail.
The PVM system is composed of two parts. The first part is a daemon, called pvmd3 and sometimes abbreviated pvmd, that resides on all the computers making up the virtual machine. Pvmd3 is designed so that any user with a valid login can install this daemon on a machine. When a user wishes to run a PVM application, he first creates a virtual machine by starting up PVM. The PVM application can then be started from a Unix prompt on any of the hosts. Multiple users can configure overlapping virtual machines, and each user can execute several PVM applications simultaneously.
The second part of the system is a library of PVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between tasks of an application. This library contains user-callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.
The PVM computing model is based on the notion that an application consists of several tasks. Each task is responsible for a part of the application's computational workload. Sometimes an application is parallelized along its functions; that is, each task performs a different function, for example, input, problem setup, solution, output, and display. This process is often called functional parallelism . A more common method of parallelizing an application is called data parallelism . In this method all the tasks are the same, but each one only knows and solves a small part of the data. This is also referred to as the SPMD (single-program multiple-data) model of computing. PVM supports either or a mixture of these methods. Depending on their functions, tasks may execute in parallel and may need to synchronize or exchange data, although this is not always the case.
The PVM system currently supports the C, C++, and Fortran languages. The C and C++ language bindings for the PVM user interface library are implemented as functions, while the Fortran language bindings are implemented as subroutines rather than functions. This approach was taken because some compilers on the supported architectures would not reliably interface Fortran functions with C functions.
All PVM tasks are identified by an integer task identifier (TID) . Messages are sent to and received from tids. Since tids must be unique across the entire virtual machine, they are supplied by the local pvmd and are not user chosen. PVM contains several routines that return TID values so that the user application can identify other tasks in the system.
PVM includes the concept of user named groups. When a task joins a group, it is assigned a unique "instance'' number in that group. Instance numbers start at 0 and count up. In keeping with the PVM philosophy, the group functions are designed to be very general and transparent to the user. For example, any PVM task can join or leave any group at any time without having to inform any other task in the affected groups. Also, groups can overlap, and tasks can broadcast messages to groups of which they are not a member. To use any of the group functions, a program must be linked with libgpvm3.a .
The general paradigm for application programming with PVM is as follows. A user writes one or more sequential programs in C, C++, or Fortran 77 that contain embedded calls to the PVM library. Each program corresponds to a task making up the application. These programs are compiled for each architecture in the host pool, and the resulting object files are placed at a location accessible from machines in the host pool. To execute an application, a user typically starts one copy of one task by hand from a machine within the host pool. This process subsequently starts other PVM tasks, eventually resulting in a collection of active tasks that then compute locally and exchange messages with each other to solve the problem.
PVM program hello.c
/* hello.c: spawns hello_other and prints the message it sends back. */
#include <stdio.h>
#include "pvm3.h"

int main()
{
    int cc, tid, msgtag;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    /* Spawn one copy of the companion task hello_other. */
    cc = pvm_spawn("hello_other", (char**)0, 0, "", 1, &tid);
    if (cc == 1) {
        msgtag = 1;
        pvm_recv(tid, msgtag);   /* blocking receive from the spawned task */
        pvm_upkstr(buf);         /* unpack the string from the receive buffer */
        printf("from t%x: %s\n", tid, buf);
    } else {
        printf("can't start hello_other\n");
    }

    pvm_exit();
    return 0;
}
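
The program above spawns a companion task that is not listed in the post. A minimal sketch of what hello_other.c conventionally looks like (the greeting text and the use of gethostname are illustrative):

/* hello_other.c: sends a greeting string back to its parent task. */
#include <string.h>
#include <unistd.h>
#include "pvm3.h"

int main()
{
    int ptid;            /* task id of the parent that spawned us */
    char buf[100];

    ptid = pvm_parent();

    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);

    pvm_initsend(PvmDataDefault);   /* clear and prepare the send buffer */
    pvm_pkstr(buf);                 /* pack the string */
    pvm_send(ptid, 1);              /* msgtag 1 matches the receive in hello.c */

    pvm_exit();
    return 0;
}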

Monday, November 26, 2007

RSS Feed Reader : .net Project


RSS (Really Simple Syndication) is a family of web feed formats used to publish frequently updated content such as blog entries, news headlines or podcasts. An RSS document, which is called a "feed", "web feed", or "channel", contains either a summary of content from an associated web site or the full text. RSS makes it possible for people to keep up with their favorite web sites in an automated manner that's easier than checking them manually.

RSS content can be read using software called an "RSS reader", "feed reader" or an "aggregator". The user subscribes to a feed by entering the feed's link into the reader or by clicking an RSS icon in a browser that initiates the subscription process. The reader checks the user's subscribed feeds regularly for new content, downloading any updates that it finds.

The initials "RSS" are used to refer to the following formats:

  • Really Simple Syndication (RSS 2.0)
  • RDF Site Summary (RSS 1.0 and RSS 0.90)
  • Rich Site Summary (RSS 0.91)
Click here to download the project (zipped)

You need "Microsoft Visual Studio .net 2003 with .net famework1.1" as minimum requirement to runt this project.

You might need to complete its GUI and browser parts yourself, as I did not work on them.
Screenshot


Friday, November 23, 2007

Matrix Multiplication

Objective type questions

Code for matrix multiplication(question 1 to 6 based on this )

for (i=0 ; i<1 ; i++)
{
for (j=0 ; j
{
c[i][j] = 0.0;
for (k=0 ; k
c[i][j] + = a[i][k] * b[k][j];
}
}



Which loop do we parallelize in matrix multiplication?
a) Outermost
b) Innermost
c) Both
d) All of the above

Ans) a,
because data dependencies arise in the innermost loop (two segments access/update a piece of common data), so we parallelize the outermost loop.
How can the data dependency involving line 3 (the initialization c[i][j] = 0.0) be removed?
a) It cannot be removed
b) By parallelizing some other loop
c) By initializing the matrix C outside the loop
d) None of these
Ans) c



If we parallelize the j loop, the parallel algorithm executes n synchronizations (per iteration of i) and the grain size of the parallel code is:
a) Θ(m n l / p)
b) Θ(l / p)
c) O(n l / p)
d) None of these
Ans) c

If we parallelize the i loop, the parallel algorithm executes only one synchronization, and the grain size of the parallel code is:
a) O(m n l / p)
b) Θ(m n l / p)
c) O(n l / p)
d) None of these
Ans) b

How many rows of the resultant matrix are calculated by each process in a tightly coupled multiprocessor?
a) m n l / p
b) l / p
c) n l / p
d) None of these
Ans) b

Time needed for computing a single row of the product matrix is:
a) O(m n)
b) Θ(m / n)
c) Θ(m n)
d) None of these
Ans) c

All processes need to be synchronized once, so the synchronization overhead is:
a) Θ(l / p)
b) Θ(m / n)
c) Θ(p)
d) None of these
Ans) c

What is the complexity of the parallel matrix multiplication algorithm?
a) Θ(n³ / p + p)
b) Θ(n³ / p * p)
c) Θ(n³ / p / p)
d) None of these
Ans) a

For each element of C we must compute c[i][j] = Σk a[i][k]·b[k][j], so matrix-matrix multiplication involves how many operations?
a) Θ(N³)
b) Θ(N²)
c) None of these

Ans) a

9. Assuming that p = (n/k)² processes are present, matrix multiplication is performed by dividing A and B into p blocks of size k×k. Each block multiplication requires k³ additions, k³ multiplications, and how many memory fetches?
a) 3k memory fetches
b) 2k² memory fetches
c) 3k² memory fetches
d) None of these

Ans) b


Subjective type questions

Q 1. Which loop do we parallelize in matrix multiplication?
Ans) Sequential code of matrix multiplication:
for (i=0 ; i<1 ; i++)
{
for (j=0 ; j
{
c[i][j] = 0.0;
for (k=0 ; k
c[i][j] + = a[i][k] * b[k][j];
}
}
In the above code there are three loops that could be parallelized, but which one should be parallelized?
Here we can parallelize either the i loop or the j loop without causing any problem, because data dependence arises only in the innermost (k) loop, i.e., in the last three lines above.
(Data dependency arises when two segments access/update common data. For example, consider a = a+1 and b = a+1, with a = 0, b = 0 initially. Executing them in order gives a = 1, b = 2; in the opposite order the result is a = 1, b = 1. When they are executed in parallel we cannot expect any particular order, so running the program again and again may generate different output.)
If we parallelize the j loop, the parallel algorithm executes n synchronizations (per iteration of i) and the grain size of the parallel code is O(n l / p).
If we parallelize the i loop, the parallel algorithm executes only one synchronization, and the grain size of the parallel code is Θ(m n l / p).
The rule: always try to parallelize the outermost loop, to maximize the grain size. A minimal sketch follows.
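
A minimal sketch of this rule applied to the code above using OpenMP (the choice of OpenMP rather than explicit processes, and passing the dimensions l, m, n as parameters, are assumptions for illustration):

/* c (l x n) = a (l x m) * b (m x n); only the outermost loop is parallelized,
   so each thread owns whole rows of c and no synchronization is needed inside. */
void matmul(int l, int m, int n,
            double a[l][m], double b[m][n], double c[l][n])
{
    #pragma omp parallel for
    for (int i = 0; i < l; i++) {
        for (int j = 0; j < n; j++) {
            double t = 0.0;
            for (int k = 0; k < m; k++)
                t += a[i][k] * b[k][j];
            c[i][j] = t;
        }
    }
}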



Q 2. How is parallel multiplication achieved? Discuss the one-dimensional and two-dimensional decompositions.

Ans. One-dimensional decomposition: In particular, we consider the problem of developing a library to compute C = A·B, where A, B, and C are dense matrices of size N×N. (A dense matrix is a matrix in which most of the entries are nonzero.) This matrix-matrix multiplication involves O(N³) operations, since for each element of C we must compute c[i][j] = Σk a[i][k]·b[k][j].


Figure 4.10:
Matrix-matrix multiplication A.B=C with matrices A , B , and C decomposed in one dimension. The components of A , B , and C allocated to a single task are shaded black. During execution, this task requires all of matrix A (shown stippled).

We start by examining algorithms for various distributions of A, B, and C. We first consider a one-dimensional, columnwise decomposition in which each task encapsulates corresponding columns from A, B, and C. One parallel algorithm makes each task responsible for all computation associated with its N/P columns. As shown in Figure 4.10, each task requires all of matrix A in order to compute its N/P columns of C. N²/P data are required from each of the P-1 other tasks, so the per-processor communication volume is about N²(P-1)/P, roughly N² words.

Note that each task performs about N³/P computation. If N is close to P, the algorithm has to transfer roughly one word of data for each multiplication and addition performed. Hence, the algorithm can be expected to be efficient only when N is much larger than P, or when the cost of computation is much larger than the cost of communicating a word of data.
Two dimensional decomposition

Figure 4.11
Matrix-matrix multiplication A.B=C with matrices A , B , and C decomposed in two dimensions. The components of A , B , and C allocated to a single task are shaded black. During execution, this task requires corresponding rows and columns of matrix A and B , respectively (shown stippled).
Less cost in the two-dimensional decomposition than in the one-dimensional one:
We consider a two-dimensional decomposition of A, B, and C. As in the one-dimensional algorithm, we assume that a task encapsulates corresponding (N/√P)×(N/√P) submatrices of A, B, and C and that each task is responsible for all computation associated with its submatrix of C. The computation of a single element of C requires an entire row and column of A and B, respectively. Hence, as shown in Figure 4.11, the computation performed within a single task requires the A and B submatrices allocated to tasks in the same row and column, respectively. This is a total of about 2N²/√P data, considerably less than the roughly N² required by the one-dimensional algorithm.
Q3. Algorithm for parallel matrix multiplication?
We present the matrix multiplication algorithm for a tightly coupled multiprocessor.
We know that it is not very expensive to share and exchange data across processes in tightly coupled machines.
The concern here is to achieve maximum speed-up.
We choose to parallelize the outermost loop of this algorithm, with p processes working in parallel, each computing l/p rows of the resultant matrix.
Process r works on every p-th row (i.e., rows r, r+p, r+2p, and so on) of the matrix.

Begin
/* we consider processors indexed from 1 to p, addressed as p(1) to p(p) */
for all p(r), where 1 <= r <= p, do begin
    /* i, j, k, t are local to p(r) */
    for i = r to l step p do begin
        for j = 1 to n do begin
            t = 0;
            for k = 1 to m do begin
                t += a[i,k] * b[k,j];
            end;
            c[i,j] = t;
        end;
    end;
end (for all)
end.

We now consider the matrix multiplication algorithm for a loosely coupled multiprocessor, where some matrix elements may be much easier to access than others. This is one reason the algorithm needs to be redesigned.
A process must access l/p rows of matrix A and all elements of B (l/p times); only a single addition and multiplication take place for every element of B fetched.
Ignoring the memory access time is safe on tightly coupled multiprocessors, where global memory is equally accessible to each processor; this is not so with loosely coupled multicomputers.
On loosely coupled multiprocessors it is best to keep most memory references local as far as possible.
Block matrix multiplication
Here A and B are both n×n matrices, where n = 2k.
Then A and B can be thought of as conglomerates of four smaller matrices, each of size k×k.


A = | A11  A12 |        B = | B11  B12 |        C = | C11  C12 |
    | A21  A22 |            | B21  B22 |            | C21  C22 |

Then, C is written as:

C = | A11·B11 + A12·B21    A11·B12 + A12·B22 |
    | A21·B11 + A22·B21    A21·B12 + A22·B22 |

As seen from the above equations, the scheme partitions the problem into disjoint subtasks, which can be solved independently.
The multiplications of the smaller sub matrices can be done in parallel. These local multiplications can be further subdivided using the same scheme.


ANALYSIS:
We can assign processes to do the multiplication task local to the respective blocks.
Assuming that p = (n/k)² processes are present, matrix multiplication is performed by dividing A and B into p blocks of size k×k. Each block multiplication requires 2k² memory fetches, k³ additions, and k³ multiplications.
The number of arithmetic operations per memory fetch has increased from 2 to k = n/√p, a significant improvement.
The block matrix multiplication algorithm performs better on NUMA multiprocessors because it increases the number of computations performed per non-local memory fetch.
For similar reasons, a careful choice of block size to maximize the cache hit-rate can yield a sequential block-oriented matrix multiplication algorithm that executes faster than the traditional algorithm illustrated earlier; a sketch follows.
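
A sequential block-oriented (tiled) multiplication sketch of the kind described above; the block size BK is an illustrative tuning parameter, and the caller is assumed to have zero-initialized c:

#define BK 64   /* illustrative block size; tune to the cache */

/* c (n x n) += a (n x n) * b (n x n), computed block by block so that the
   working set of each block multiplication stays cache-resident.
   c must be zero-initialized by the caller. */
void block_matmul(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int ii = 0; ii < n; ii += BK)
        for (int jj = 0; jj < n; jj += BK)
            for (int kk = 0; kk < n; kk += BK)
                for (int i = ii; i < ii + BK && i < n; i++)
                    for (int j = jj; j < jj + BK && j < n; j++) {
                        double t = c[i][j];
                        for (int k = kk; k < kk + BK && k < n; k++)
                            t += a[i][k] * b[k][j];
                        c[i][j] = t;
                    }
}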
Matrix-matrix multiplication algorithm based on two-dimensional decompositions. Each step involves three stages:
(a) an A submatrix is broadcast to other tasks in the same row;
(b) local computation is performed; and
(c) the B submatrix is rotated upwards within each column.
To complete the second parallel algorithm, we need to design a strategy for communicating the submatrices between tasks. One approach is for each task to execute the following logic (Figure 4.12):

set B' = the local submatrix of B
for j = 0 to √P - 1
    in each row i, the ((i + j) mod √P)-th task broadcasts its A submatrix A' to the other tasks in the row
    each task accumulates C = C + A'·B'
    send B' to the upward neighbor
endfor

Each of the √P steps in this algorithm involves a broadcast to √P tasks (for A') and a nearest-neighbor communication (for B'). Both communications involve N²/P data. Because the broadcast can be accomplished in log √P steps using a tree structure, the per-processor communication cost is roughly that of √P·(1 + log √P) messages of N²/P words each.

Multi-processor systems OS

OBJECTIVE-TYPE QUESTIONS
Q1. _____ is a model of one address computer.
P-RAM
RAM
ROM
P-ROM

Ans: B

Q2. All algorithms for ___________________ can be expressed in terms of RAM model and its instruction set.

Parallel Machines
SIMD/ Vector Processing Systems
Sequential Machines
Every type of machines

Ans: C

Q3. The default behavior of P-RAM is running multiple instruction streams on different processors. However, is it possible to constrain all the processors to fetch and execute the same set of instructions? (True / False)

Ans: True.

Q4. A multi-computer system is a _____________ Whereas Multi-Processor System is a _____________

Collection of computer systems that include multiple processors, single autonomous computer that is interconnected with other systems for achieving parallelism.
Single autonomous computer that is interconnected with other systems for achieving parallelism, Collection of computer systems that include multiple processors.
Single Computer that include multiple processors, Collection of multiple interconnected autonomous computer systems.
Collection of multiple interconnected autonomous computer systems, Single Computer that include multiple processors.
Ans: D

Q5. The most suitable interconnection network for small multiprocessor systems (i.e. having up to 10 processing units) is:

Common Bus
Cross Bar Switch
Hierarchical Organization of Processors
None Of the Above
Ans: A

Q6. By organizing processors and memory modules hierarchically, one can obtain _________ architecture.

UMA
NUMA
Symmetric
Asymmetric
Ans: B

Q7. _______________ , which is used for keeping latest updated value in cache for single processor systems, is neither necessary nor sufficient for achieving cache coherence in multi-processor systems.

Static Coherence Checking Algorithms
Dynamic Coherence Checking Algorithms
Write through fashion
Static & Dynamic Coherence Checking Algorithm and Write through mechanism
Ans: C

Q8. List possible organizations in design of operating systems for multi-processor systems:

Ans: 1. Master Slave Configuration
2. Separate Supervisor Configuration
3. Floating Supervisor Configuration

Q9. In which message passing model is the sender also blocked until the receiver receives and acknowledges the message?

Synchronous
Asynchronous
Both Synchronous and Asynchronous
In no message passing model does the sender get blocked

Ans: A

Q10. Does the message passing model require a mechanism for mutual exclusion to access shared data? (Yes/No)

Ans. Does not require (NO)

Q11. Multithreading is implementation of which parallel programming model?

Message Passing Model
Functional and Logical Models
Data Parallel Model
Shared Memory Model
Ans: D

Q12. Give an example (programming language) of Data parallel programming model for SIMD processors.

Ans: Fortran 90



SUBJECTIVE-TYPE QUESTIONS

Q1. Depending upon the sharing of resources, multiprocessor systems can be classified into which categories? Explain each briefly. What are the desirable properties of multiprocessor systems?
Ans: Depending upon the sharing of resources such as memory, clock, and computer bus, and upon the way of communication, multiprocessor systems can be classified into the following two categories:
Tightly Coupled Systems
Loosely Coupled Systems

· Tightly Coupled Systems:
The Tightly Coupled Multi Processor Systems are those in which the Processors are kept in close communication to share the processing tasks. These systems can further be divided in to two sub-categories:
1. Shared Memory Multi-Processor systems
2. Message Passing Multi-Computer Systems

Shared Memory Multi-Processor systems contain multiple processors that are connected at the system bus level. These systems allow as many as 1000 CPUs to communicate via a shared memory. Every CPU has equal access to the entire physical memory. These systems may also participate in a memory hierarchy with both local and shared memory.

Depending upon the way the Operating System used in Multiple Processors, two Organizations of Multi-Processor system are possible. These are:
Asymmetric Multiprocessor
Symmetric Multiprocessor

Asymmetric Multiprocessor: In an asymmetrical multiprocessor system, one processor, called the Master, is dedicated to executing the Operating System. The remaining processors are usually identical and form a pool of computational processors; all the processors in this pool are called Slaves. The Master processor schedules the work and controls the activities of the slave processors depending upon the resources available to them. Due to this arrangement of multiple processors, it is also known as Master-Slave Multiprocessing. In some asymmetrical multiprocessor systems, the processors are assigned different roles such that they do not depend upon the master for their processing. For example, one processor may handle all trivial requests and jobs like editing, calculating, etc.


Symmetric Multi Processor: In a symmetrical multiprocessor system, all of the processors are essentially identical and perform identical functions. There is one copy of Operating System in Memory. The CPU, on which the system call was made, traps to the Kernel and processes the system calls. In such an arrangement there is said to be a floating Master as different processors execute operating system at different times.


Message Passing Multi-Computer Systems allow a number of CPU-memory pairs (called nodes) to be connected by some kind of high-speed interconnection. The basic node of a multicomputer system consists of a CPU, memory, a network interface and sometimes a hard disk, but the graphics adapter, monitor, keyboard and mouse are nearly always absent. Each memory is local to a single CPU and can only be directly accessed by that CPU. There is no shared memory in this design, so without a shared address space the processors communicate by sending multiword messages over the interconnection (message passing) in order to share computational tasks. These systems are also known by a variety of names including Cluster Computers and COWs (Clusters of Workstations).


These Multi-Computer systems can be either
Asymmetric Multi-Computers or
Symmetric Multi-Computers

1. In Asymmetric Multi-Computers, a front-end computer interacts with users: users log into the front-end computer, which executes a full, multi-programmed operating system and provides all functions for program development. Other than this front-end computer, the other nodes of the multicomputer system are back-end computers, which are reserved for executing the parallel programs.
2. In Symmetric Multi-Computers, every node executes the same OS (a multi-programmed OS) and has identical functionality. Users may log onto any computer to edit and compile their programs, and any other computer may be called upon to execute a parallel program. This configuration of a multicomputer system solves the problem of the single point of failure in asymmetric multicomputer systems.

· Loosely Coupled Systems:
These types of multiprocessor systems are also referred to as Distributed Systems. They are based on multiple standalone single- or dual-processor computers interconnected via a high-speed communication system (gigabit Ethernet is common). The intent of the distributed system is to turn a loosely connected bunch of machines into a coherent system based upon one concept. This is done by having another layer of software on top of the Operating System. This layer is called Middleware.


Desirable Properties of Multiprocessor Systems:

Process Recoverability: - If a processor fails, the process running on it should be recoverable: another processor must be assigned to it. This can be achieved by maintaining a shared register file that keeps a record of the state of each active process in the system.

Efficient Context Switching: - While executing parallel programs, it is needed to swap the processes in and out of the memory for efficient utilization of the resources. The processors must have efficient mechanisms to support Operating System in switching the context.

Large Virtual and Physical address space: - As the size of the problems being solved on parallel machines increases, there is a need for large memory space and addressing capacity.

Effective Synchronization Primitives: - Multiple processes that constitute a parallel program need to cooperate to compute results. This is possible by sharing data and providing access to shared data. To maintain the integrity of the data, access needs to be allowed only in exclusive mode.

Interprocess Communication Mechanism: - Multiprocessor systems must provide communication between cooperating processes in the form of signals, interrupts and messages.


Q2. What do you mean by programming model? Which are various Parallel Programming Models available?
Ans:
A programming model is a collection of program abstractions providing the programmer a simplified and transparent view of the computer hardware/software system. Parallel programming models are specifically designed for multiprocessors, multicomputers, and SIMD or vector computers. There are five such models, described below, which differ in the way processes share data and achieve synchronization and communication:

1. Shared Memory Programming: -
Multiprocessor programming is based on the use of shared variables in commonly accessible memory for communication and sharing of data. Besides sharing variables in a common address space, communication also takes place through software signals and interrupts. As multiple processors may attempt to access the shared data, processes must ensure exclusive access to the critical sections (the parts of the program accessing shared resources or variables); a minimal sketch of this appears after this list of models.
The implementation of the model is supported in various ways on different platforms:
· On the UNIX platform it is available in the form of linkable libraries for the creation of processes, shared memory blocks, semaphores and IPC mechanisms.
· This model is also available in the form of multithreading libraries.

2. Message Passing: -
Multicomputers employ message passing as the mechanism for interprocess communication. In this model, two processes residing on two different nodes communicate with each other by passing messages over a communication channel. The message may be an instruction, data, a synchronisation signal or an interrupt signal.
The message passing model does not require a mechanism for mutual exclusion for access to shared data, as there is no way for processes to share each other's address space.
The implementation of this model can be done in two ways, as explained below:

a. Synchronous Message Passing: - It requires both the sender and receiver to be synchronized just like in telephone call communication. No buffering of message is done by communication channel.
The receiver is always blocked waiting for the message to arrive.
The sender is also blocked till receiver receives and acknowledges the message.
This mode of communication is suitable for tightly coupled message passing multi computer systems where communication delay overhead is sufficiently small.

b. Asynchronous Message Passing: - This mode of the model does not impose blocking on the sender. The outgoing messages get buffered in the communication channel and are delivered to the target when the target process chooses to look for them.
This mode of communication is suitable for loosely coupled systems that are made up of networked autonomous machines.

3. Data-parallel Model: -
This model for SIMD processors is an extension of sequential programming.
The language compiler for a specific machine must translate the source code to executable object code, which will exploit the parallelism in the hardware. For example, Fortran 90 is tailored for data parallelism.
The compiler must be aware of the underlying interconnection topology to generate optimal code for an array processor.
Synchronization of data parallel programs is done at compile time rather than at run time.

4. Object Oriented Model: -
In this model mapping of execution units to objects is achieved. These objects are dynamically created and manipulated. Sending and receiving messages among the objects achieve communication.
Concurrent programs are built from low level objects such as processes, threads, and semaphores into high level objects like monitors and program modules.



5. Functional and Logical Models: -
A functional programming language emphasizes the functionality of a program. There is no concept of storage, assignment and branching in functional programming. The evaluation of a function produces the same result regardless of the order in which its arguments are evaluated.
Logical programming is based upon predicate logic. This model is suitable for knowledge processing dealing with large databases. It adopts an implicit search strategy and supports parallelism.
Both of these models are used in Artificial Intelligence Applications.
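
Returning to model 1 (shared memory programming), a minimal pthreads sketch of guarding a critical section with a mutex; the shared counter, loop count and thread count are illustrative:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                     /* illustrative number of threads */

static long counter = 0;               /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* enter critical section */
        counter++;                     /* exclusive access to shared data */
        pthread_mutex_unlock(&lock);   /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always NTHREADS * 100000 */
    return 0;
}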



Q3. What are Abstract Models available for Sequential and Parallel Machines/Computers?
Ans:
The purposes of using such models are:
To get a feel for the capabilities of the machines without bothering about the specific constraints of real-life machines.
It is convenient to develop an algorithm for a general model and then to map it to the actual machine.

The abstract models available for sequential as well as parallel machines are:

RAM (Random Access Machine) → abstract model of a sequential computer.
P-RAM (Parallel RAM) → abstract model of a parallel computer.

Assignment Issues in parallel programming

Q1. How can parallelism be introduced in sequential machines? Describe any two techniques briefly.

Answer: The sequential machines have been made faster by incorporating various schemes. The main idea in all these schemes is to match the speed of various components so as to utilize the resources to their peak performances.

There are various parallelism techniques that we can use in uniprocessor or sequential machines.
Some of them are:
1. Multiplicity of Functional Units.
2. Overlapped CPU and I/O operations.
3. Pipelining within the CPU.
4. Hierarchical Memory Systems.
5. Multiprogramming and Time Sharing.

Multiplicity of functional units:

Most computers have only one Arithmetic and Logic Unit (ALU), which can perform only one function at a time.
The practical machines used today have multiple and specialized units that can operate in parallel.
In sequential machines now it is possible to have multiple functional units for addition, multiplication, division, increment, decrement, Boolean operations and shift instructions.



Example:
CDC-6600(designed in 1964) had 10 functional units.
A scoreboard was used to keep track of availability of the functional units and the registers being demanded.

With 10 functional units and 24 registers the machine could achieve a significant increase rate of instruction execution.

2. Overlapped CPU and I/O operations

As we know, a program contains a computational phase as well as an I/O phase. I/O operations can be performed simultaneously with the computational tasks by using separate I/O controllers or I/O processors.
Computers requiring multiple I/O devices employ multiple I/O subsystems which can operate concurrently. This type of multiprocessing speeds up the data transfer between the external devices and memory.

Bus Contention
The design of such systems depends upon bus contention: if the device interface and the computational task require the same bus to transfer data, then there will be no speed-up in the data transfer rate.
To minimize bus contention, some systems employ a redundant bus architecture.
There is a trade-off: the design becomes more complicated, but at the same time the utilization of the CPU and other resources is maximized.

Processor Intervention
Data is transferred between memory and I/O devices or vice versa, and the processor intervenes in this data transfer.
This intervention decreases the data transfer rate.
To avoid it we use the Direct Memory Access (DMA) technique.
DMA is used to provide direct information transfer between the external devices and the primary memory.




Q2. How does pipelining within the CPU introduce parallelism in sequential machines?

Answer: Today every sequential processor manufactured is taking advantage of parallelism in the form of pipelining instructions.

The purpose of pipeline parallelism is to increase the speed of the program and to decrease I/O operations.

Pipeline parallelism arises when multiple steps depend on each other, but their execution can overlap and the output of one step is streamed as input to the next step.

The different stages of pipelining are

Fetch Stage
Decode Stage
Issue stage
Execute stage
Write back Stage

The fetch stage (F) fetches instructions from cache memory, one per cycle.

The decode stage (D) reveals the instruction function to be performed and identifies the resources needed. Resources include general purpose registers, buses and functional units.

The issue stage (I) reserves resources. Pipeline interlock controls are maintained at this stage; the operands are also read from registers during the issue stage.

The instructions are executed in one or several execute stages (E).

The last write back stage (W) is used to write results into the registers. Memory load and store operations are treated as part of execution.

To facilitate instruction execution through the pipe instruction prefetching is used.

The execution of multiple instructions is overlapped in time – even before an instruction gets completely executed, another instruction may be in the process of being decoded, yet another instruction may be getting fetched, and so on.
Pipelining of tasks gives us temporal parallelism.

Figure shows the flow of machine instructions through a typical sequential machine. These eight instructions are for execution of the statements X=Y+Z and A=B*C.






The shaded boxes correspond to idle cycles when instruction issues are blocked due to resources latency or conflicts or due to data dependences.

The first two load instructions issue on consecutive cycles. The add is dependent on both loads and must wait three cycles before the data (Y& Z) are loaded in.

Similarly, the store of the sum to memory location X must wait three cycles for add to finish due to a flow dependence. There are similar blockages during the calculation of A.

The total time required is 17 clock cycles. This time is measured beginning at cycle 4, when the first instruction starts execution, until cycle 20, when the last instruction starts execution.



The figure shows the improved timing after the instruction issue order is changed to eliminate unnecessary delays due to dependences. The idea is to issue all four load operations in the beginning. Both the add and multiply instructions are blocked for fewer cycles because of this data prefetching. The reordering should not change the end results. The time required is reduced to 11 cycles, measured from cycle 4 to cycle 14.

Q3. What are the various issues in parallel programming?

Answer: The various issues in parallel programming are:

1. Load balancing:

The operating system must utilize the resources efficiently. This is commonly expressed in terms of achieving a uniform balance of loads across the processors. The operating system should schedule the subtasks such that there are no idle resources including the processors.

2. Scheduling cooperating processes:

Parallel programs which consist of concurrently executing tasks must be scheduled such that collectively they are able to use the resources in the machine required to solve the given problem. Scheduling half the subtask of one parallel program and half the subtasks of another parallel program which can not cooperate can be wasteful.
If a parallel program needs all its subtasks to be running and cooperating, then scheduling half of the tasks can lead to starvation. We can achieve this by maintaining a hierarchy of processes in the sense of parent-child relationship within the operating system’s data structure. Scheduling of processes then can be done such that the processes belonging to a single parallel program are scheduled for execution together.

3. Graceful degradation in case of failure of one of the resources:

Given that a parallel machine has multiple resources of the same kind, one must expect a higher degree of fault tolerance from it. Failure of one of its resources should not result in a catastrophic system crash. The operating system should be able to reschedule the task that had been running on the failed resource and continue the parallel program. Ideally, there should be only a fractional degradation in the performance of the parallel machine.

4. Communication schemes:

Most parallel programs need to share data and intermediate result across subtasks during the processing towards the solution of the problem. To achieve effective cooperation among the subtasks of a parallel program, the operating system must provide adequate facility for communication between tasks. These facilities vary depending on whether the machine is of a shared memory type or of a distributed memory type.

5. Synchronization mechanisms:

To ensure integrity of the shared data across the subtask of a parallel program, synchronization between tasks is required. The shared data must be accessed under mutual exclusion (implemented using a mechanism like semaphores). The task may need to wait till some state is reached across all the tasks of the parallel programs. The operating systems need to provide signaling mechanisms for such synchronization requirements.

Assignment DATA DEPENDENCY ANALYSIS

1. Certain code segments, like a=b+c; a=c+d; cannot be run in parallel because:
a) Lack of hardware.
b) They show dependency / unpredictable results.
c) Lack of knowledge.
d) Ease of running sequentially.

2. Which of the following does not show dependency?
a) a=b+c; d=a*a;
b) a=b+c; b=a*a;
c) a=b+c; a=c+d;
d) a=b+c; e=x-y;

3. What is the source of dependency in a program segment?
a) Control.
b) Data.
c) Resource.
d) All of the above.

4. The sets used to formalize the notion of dependency are _________ and _________.

5. Which of the following shows a real dependence?
a) Output dependence.
b) Input dependence.
c) Flow dependence.
d) Anti-dependence.

6. If the intersection of DEF(S2) and USE(S1) is not an empty set, then it satisfies which dependency?
a) True dependence.
b) Output dependence.
c) Anti-dependence.
d) Input dependence.

7. __________ dependence and __________ dependence exist in imperative programming.

8. If S1 & S2 are dependent and S2 & S3 are dependent, then it implies that S1 & S3 are also dependent.
a) True.
b) False.

9. ________ are a direct source of parallelism.

10. Equations requiring integer solutions are called ______________.

11. X[k+1] = X[k] + A[k] shows which type of dependence?
a) Output dependence.
b) Flow dependence.
c) Anti-dependence.
d) None of the above.

12. A test used to check loop dependence is ___________.

13. Diophantine equations can be solved using a direct mathematical solution strategy.
a) True.
b) False.

14. While computing dependency in loops, we consider arrays as:
a) Variables.
b) Subscripts.
c) Both (a) & (b).
d) None of the above.

15. Loop dependence analysis may determine:
a) If different statements in the loop body can be executed in parallel.
b) If different iterations of a loop can be executed in parallel.
c) Both (a) & (b).
d) None of the above.

16. If a test states that there is no solution of the equation, then it implies there is no dependence, and vice versa.
a) True.
b) False.

17. The relation DEF(S1) intersection USE(S2) != { } implies:
a) Output dependence.
b) True dependence.
c) Input dependence.
d) Anti-dependence.

18. Code segments can be parallelized without any threat in:
a) Flow dependence.
b) Output dependence.
c) Input dependence.
d) Anti-dependence.

19. Using suitable variable names in simple statements, we can avoid:
a) Flow & output dependence.
b) Anti & output dependence.
c) Flow & anti dependence.
d) None of the above.


ANSWERS TO OBJECTIVE QUESTIONS

1. (b)
2. (d)
3. (d)
4. DEF, USE
5. (c)
6. (c)
7. Output, Anti-dependence
8. (b)
9. Loops
10. Diophantine
11. (b)
12. GCD / Banerjee
13. (b)
14. (b)
15. (c)
16. (a)
17. (b)
18. (b)
19. (b)


SUBJECTIVE QUESTIONS

1. (A) Define DEF & USE set of statements with examples.
Ans: DEF & USE set of statements are used to identify the type of dependency that exists in a given code of segment.
DEF: It is a set of variables modified by the statement.
USE: It is a set of variables accessed/used by the statement.
Example: let there be a statement, A=5.2
By definition, DEF(A)={A}
& USE (A)={ } (as no variable is used by statement)
let another statement be: A=B+C-2.4
By definition, DEF(A)={A}
& USE(A)={B,C}
DEF & USE are used to formalize the notion of dependence

1. (B) Illustrate with an example: “Dependence is not Transitive”.
Ans: Dependence relation is not transitive. By this we mean, if there are 3 sets of statements S1, S2 & S3, then if S1 & S2 are dependent and S2 & S3 are dependent then it does not imply that S1 & S3 are also dependent.
Let us prove it with an example.
Let S1: A=B+C;
S2: A=C-D;
S3: D=0;
DEF & USE sets of S1, S2 & S3 are:
DEF (S1) = {A}   USE (S1) = {B, C}   (1)
DEF (S2) = {A}   USE (S2) = {C, D}   (2)
DEF (S3) = {D}   USE (S3) = { }      (3)

Now we know that, Output Dependence arises if:
DEF (S1) intersection DEF (S2)! = { } (A)
From (1) & (2),
DEF (S1) intersection DEF (S2) = {A}
Therefore, Output dependence exists between S1 & S2

Also, Anti- Dependence arises if:
DEF (S2) intersection Use (S1)! = { } (B)
From (2) & (3),
DEF (S3) intersection USE (S2) = {D}
Therefore, Anti-dependence exists between S2 & S3

Now, Flow dependence arises if:
DEF (S1) intersection USE (S2)! = { } (C)
But from (1) & (3)
None of the conditions ((A), (B), (C)) holds true
Therefore, there is no dependence between S1 & S3
Hence dependence is not transitive.

2. (A) what are the various sources of Dependence? Explain with examples.
Ans: Dependency among program segments arises primarily from three sources:
CONTROL DEPENDENCE: It is imposed by the language constructs such as if-then, case etc. even if the segments corresponding to different execution branches are independent, there is no use of executing them in parallel because result of only one of those segments needs to be visible based on the condition being tested.
RESOURCE DEPENDENCE: It arises from the need to share resources among instructions. E.g. if 2 instructions require the use of a floating point processor and only one such floating point unit is available in CPU, then these 2 instructions can not be executed in parallel.
DATA DEPENDENCE: It is the most important source of dependency. It arises when 2 segments of code accesses or updates the common piece of data. E.g. consider 2 statements:
A=A+1;
B=A+1;
Let initial value be A=0, B=0
Result of sequential execution will be A=1, B=2
Result of parallel execution would depend on the order in which
statements would be executed & therefore, will be different every time. This is because statement 2 is referring to a variable being used by statement 1.

2. (B) explain different types of dependencies with examples. Which one
of these is unavoidable?
Ans: To explain the different types of dependencies, let us consider a pair of Statements S1 & S2. There is a dependence across S1 & S2 if there is some
common element in the DEF & USE sets of these statements.
The 4 types of dependencies are:

TRUE/FLOW DEPENDENCE: it arises when
DEF (S1) intersection USE (S2)! = { }
· It is the most common & most difficult to avoid.
· It arises because a value computed by S1 is used in S2 for some processing.
· When a program is solved in a sequence of steps then intermediate or partial results are computed for further use.
Example: Let S1: A=B+C
S2: D=2*A
DEF (S1) = {A}
USE (S2) = {A}
According to definition,
DEF (S1) intersection USE (S2) = {A}
Therefore, Flow dependence arises in these statements.

ANTI-DEPENDENCE: It arises when
DEF (S2) intersection USE (S1)! = { }
· It is opposite of flow dependence.
· It arises when we reuse variable names.
· A variable whose value is used in S1 is redefined in S2.
· It is to be ensured that the old value of variable is used in S1.
· If we execute S1 & S2 in parallel, it might be possible that S2 gets executed first, which results in a new value being used in S1.
Example: Let S1: A=B+C; B is used in S1 & is redefined in S2
S2: B=0;
USE (S1) = {B, C}
DEF (S2) = {B}
And DEF (S2) intersection USE (S1) = {B}
Therefore, Anti-dependence arises.

OUTPUT DEPENDENCE: It arises when
DEF (S1) intersection DEF (S2)! = { }
It arises because of 2 reasons:
· Reuse of variable names for convenience.
· Incremental computation of a variable.
Example: Let S1: A=B+C;
S2: A=A-D;
DEF (S1) = {A}
DEF (S2) = {A}
And DEF (S1) intersection DEF (S2) = {A}
Therefore, Output dependence arises.

INPUT DEPENDENCE: It arises when
USE (S1) intersection USE (S2)! = { }
· There is no threat to parallelism as a commonly accessed or used variable is present.
· Since accessing a value by a process does not change its value, therefore, any number of processes accessing the value simultaneously is acceptable.
Example: Let S1: A=B+C; No harm in executing S1 & S2 in parallel
S2: D=B+5;
USE (S1) = {B, C}
USE (S2) = {B}
And USE (S1) intersection USE (S2) = {B}
Among all 4 types of dependencies, FLOW Dependence is real & is unavoidable because it is inherent in the style of computation. It is bound to occur even irrespective of the programming paradigm being used.


3. Can different iterations of a loop be done in parallel? Explain with an example.
Ans: "Can different iterations of a loop be done in parallel?" is the same as asking, "Can processor A work on statements 1 and 2 for i = 1 while processor B works on the same set of statements for i = 2?"
There is every possibility that second iteration ( i=2 ) may get completed before first ( i=1 ). Will this create any difficulty in executing these iterations in parallel?
Let us analyze this situation by using DEF & USE sets and looking at their intersections.
Let us consider an example:
do I = 1 to n
    A[I] = A[I-1] + 1
This loop can be unfolded & written as:
A [1] = A [0] +1
A [2] = A [1] +1
A [3] = A [2] +1 & so on…
Let us consider two separate iterations with I = i and I = j.
Assume i < j.
Loop corresponding to these iterations would be:
A [i] = A [i-1] + 1
A [j] = A [j-1] + 1
If there is no dependency between these 2 statements, then these iterations are independent & can be executed in parallel.
To check the dependency, let us compute DEF & USE sets
DEF i : A [i] USE i : A [i-1]
DEF j : A [j] USE j : A [j-1]
These sets contain subscripts & not variables.

There is a flow dependence in the loop if DEF_i and USE_j can have a common element, i.e. when i = j - 1.
The equation i = j - 1 has infinitely many solutions.
Hence dependency arises and the loop iterations cannot be parallelized as they are.
Determining dependency in this manner is a difficult task as all equations need not be trivial.


4. Define Diophantine equations. Give a method of solving Diophantine equations, with an example.
Ans: Diophantine equations are those which require an integer solution.
· There is no direct mathematical solution strategy for solving such equations.
· Different techniques have to be tried depending on the type of the equation.
· 2 popular tests to check dependencies are GCD and Banerjee test.
· If test results indicate there is no solution to the equation then there is no dependency.

Solving Diophantine equations:
The general form of an equation is:
Σ Ai·Xi = C, where the Xi are integers.
The equation has a solution in the integer domain only if GCD(A1, A2, …, An) divides C.
If this condition is true and the loop constraints are satisfied by the solution, then there is dependence.

Diophantine equations for 2-variables
General form of equation is
Ax + By = C (1)
Let g = GCD (A, B)
If g divides C, then we can write g as:
g = Au + Bv
Solutions of (1) are given by:
Xt = uC/g + Bt/g,   Yt = vC/g - At/g   (I)

Example:
Let S1: A[2x-1] = …   (executed in iteration x)
    S2: … = A[4y-7]   (executed in iteration y)
There will be a flow dependence between the x-th and y-th iterations if the L.H.S. of S1 and the R.H.S. of S2 refer to the same array element,
i.e. 2x - 1 = 4y - 7
i.e. 4y - 2x = 6



Comparing it with Ax + By = C, we get
A=4, B=-2, C=6
g = GCD (4, 2) = 2
Since g divides C (2 divides 6)
Therefore equation has a solution.
Writing g = Au + Bv
i.e. 2 = 4*(-1) + (-2)*(-3)
u=-1, v=-3 (1 set of values, other solutions are also possible)
using (I)
solution is:
Xt = -3-t
Yt = -9-2t
The constraint equation is 4y-2x = 6
Multiple solutions of x & y satisfy the solution.
x = 17, y = 10----referring to array element A [33]
x = 19, y = 11----referring to array element A [37]
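
A small C sketch of the GCD existence test applied to the dependence equation 4y - 2x = 6 from this example (the function names are illustrative; only the existence check is shown, not the general solution or the loop-bound check):

#include <stdio.h>
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers. */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for the Diophantine equation A*x + B*y = C:
   an integer solution exists only if gcd(|A|, |B|) divides C. */
static int may_depend(int A, int B, int C)
{
    return C % gcd(abs(A), abs(B)) == 0;
}

int main(void)
{
    /* Example from the text: S1 writes A[2x-1], S2 reads A[4y-7],
       giving the dependence equation 4y - 2x = 6, i.e. A = 4, B = -2, C = 6. */
    if (may_depend(4, -2, 6))
        printf("GCD test: a dependence may exist (loop bounds must still be checked)\n");
    else
        printf("GCD test: no integer solution, hence no dependence\n");
    return 0;
}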

Assignment Parallel Programming

Long questions

Q1. Explain parallel reduction with an example.

Given a set of n values a0, a1, …, an-1 and an associative operator ○, reduction is the process of computing a0 ○ a1 ○ … ○ an-1. Addition, multiplication, and finding the maximum/minimum of a set are examples of associative operators.
The process of reduction is viewed as a tree-structured operation. The elements of input to the reduction problem are placed at the leaf nodes of a binary tree. The reduction operation is applied to children of each parent node, and the result is propagated towards the root of the tree.
Parallel summation is an example of a reduction operation.







PRAM algorithm to sum n elements using ⌊n/2⌋ processors

SUM:
/* Initial condition: list of n ≥ 1 elements stored in A[0…(n-1)]
   Final condition: sum of the elements stored in A[0] */
Global variables: n, A[0…(n-1)], j
Begin
    spawn(P0, P1, P2, …, P⌊n/2⌋-1)
    for all Pi, where 0 ≤ i ≤ ⌊n/2⌋ - 1 do
        for j ← 0 to ⌈log n⌉ - 1 do
            if i modulo 2^j = 0 and 2i + 2^j < n then
                A[2i] ← A[2i] + A[2i + 2^j]
            endif
        endfor
    endfor
end

Analysis of Parallel Reduction

•There is a dependency across levels of reduction. Unless reduction is ready at a lower level, a higher level cannot proceed with the operation.
• Overall time complexity of the algorithm is Θ(log n), given ⌊n/2⌋ processes.
• The spawn routine requires ⌈log ⌊n/2⌋⌉ process doubling steps.
• The sequential for loop executes ⌈log n⌉ times.
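
A shared-memory sketch of the same pairwise (tree) reduction, with OpenMP threads standing in for the PRAM processors; the input array is illustrative:

#include <stdio.h>

/* Pairwise (tree) summation: at the step with stride s, element i accumulates
   element i+s, mirroring A[2i] <- A[2i] + A[2i + 2^j] in the PRAM algorithm. */
double tree_sum(double *a, int n)
{
    for (int stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
        /* implicit barrier here: the next level needs the results of this one */
    }
    return a[0];
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative input */
    printf("sum = %.1f\n", tree_sum(a, 8));   /* prints 36.0 */
    return 0;
}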

O2. Explain Odd-Even Transposition Sort

It is designed for the processor array model in which the processing elements are organized into one-dimensional mesh.
Assume that A = (a0, a1, …, an-1) is the set of n elements to be sorted. Each of the n processing elements contains two local variables: a, a unique element of array A, and t, a variable containing a value retrieved from a neighboring processing element. The algorithm performs n/2 iterations, and each iteration has two phases. In the first phase, called odd-even exchange, the value of a in every odd-numbered processor (except processor n-1) is compared with the value of a stored in the successor processor. The values are exchanged, if necessary, so that the lower-numbered processor contains the smaller value. In the second phase, called even-odd exchange, the value of a in every even-numbered processor is compared with the value of a in the successor processor. The values are exchanged, if necessary, so that the lower-numbered processor contains the smaller value. After n/2 iterations the values must be sorted.
Odd-Even Transposition Sort Algorithm for the one-dimensional mesh processor array model

Parameter n
Global i {Iteration counter}
Local a {Element to be sorted}
      t {Element taken from adjacent processor}
Begin
    for i ← 1 to n/2 do
        for all Pj, where 0 ≤ j ≤ n-1 do

            if j < n-1 and odd(j) then      {Odd-even exchange}
                t ⇐ successor(a)            {Get value from successor}
                successor(a) ⇐ max(a,t)     {Give away larger value}
                a ← min(a,t)                {Keep smaller value}
            endif

            if even(j) then                 {Even-odd exchange}
                t ⇐ successor(a)            {Get value from successor}
                successor(a) ⇐ max(a,t)     {Give away larger value}
                a ← min(a,t)                {Keep smaller value}
            endif
        endfor
    endfor
end

Example: Odd-Even Transposition Sort for eight values

Indices: 0 1 2 3 4 5 6 7
Initial Values: G H F D E C B A
After odd-even exchange: G F < H D < E B < C A
After even-odd exchange: F < G D < H B < E A < C
After odd-even exchange: F D < G B < H A < E C
After even-odd exchange: D < F B < G A < H C < E
After odd-even exchange: D B < F A < G C < H E
After even-odd exchange: B < D A < F C < G E < H
After odd-even exchange: B A < D C < F E < G H
After even-odd exchange: A < B C < D E < F G < H


Analysis of Odd-Even Transposition Sort

• The complexity of sorting n elements on a one- dimensional mesh processor array with n processors using odd-even transposition sort is Θ(n).
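
A shared-memory sketch of the same idea, with OpenMP loop parallelism standing in for the mesh processors and in-place compare-exchange replacing the successor() communication; the input values reuse the example above:

#include <stdio.h>

/* Odd-even transposition sort: n/2 (rounded up) iterations, each consisting of
   an odd phase (pairs starting at 1, 3, 5, ...) and an even phase (pairs at
   0, 2, 4, ...). Each phase touches disjoint pairs, so it can run in parallel. */
static void compare_exchange(int *a, int i)
{
    if (a[i] > a[i + 1]) { int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t; }
}

void odd_even_sort(int *a, int n)
{
    for (int it = 0; it < (n + 1) / 2; it++) {
        #pragma omp parallel for
        for (int j = 1; j + 1 < n; j += 2)   /* odd-even exchange */
            compare_exchange(a, j);
        #pragma omp parallel for
        for (int j = 0; j + 1 < n; j += 2)   /* even-odd exchange */
            compare_exchange(a, j);
    }
}

int main(void)
{
    int a[8] = {'G','H','F','D','E','C','B','A'};   /* the example input */
    odd_even_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%c ", a[i]); /* prints A B C D E F G H */
    printf("\n");
    return 0;
}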

Q3. Explain Enumeration Sort
Assume that we are given a table of n elements, denoted a0, a1, …, an-1, on which a linear order has been defined. Thus for any two elements ai and aj, exactly one of the following cases must be true: ai < aj, ai = aj, or ai > aj. The goal of sorting is to find a permutation (π0, π1, …, πn-1) such that aπ0 ≤ aπ1 ≤ … ≤ aπn-1.
An enumeration sort computes the final position of each element in the sorted list by comparing it with the other elements and counting the number of elements having a smaller value. If j elements have a smaller value than ai, then πj = i; i.e., element ai
is the (j+1)-th element of the sorted list, following aπ0, …, aπj-1.

Algorithm for Enumeration Sort

ENUMERATION SORT (CRCW PRAM)
Parameter n {Number of elements}
Global a[0…(n-1)] {Elements to be sorted}
position[0…(n-1)] {Sorted positions}
sorted[0…(n-1)] {Contains sorted elements}
Begin
spawn(Pi,j for all 0 ≤ i, j < n)
for all Pi,j, where 0 ≤ i, j < n do
position[i] ← 0
if a[i] < a[j] or (a[i] = a[j] and i < j) then
position[j] ← 1
endif
endfor
for all Pi,0, where 0 ≤ i < n do
sorted[position[i]] ← a[i]
endfor
end

Analysis of Enumeration Sort
• A set of n elements can be sorted in Θ(log n) time with n² processors, given a CRCW PRAM model in which simultaneous writes to the same memory location cause the sum of the values to be assigned.
• If the time needed to spawn the processors is not counted, the algorithm executes in constant time.
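
A sequential C sketch of the same counting idea, where the doubly nested loop plays the role of the n² processors; the tie-breaking on the index mirrors the algorithm above:

/* Enumeration (rank) sort: position[i] counts how many elements must precede
   a[i] in sorted order; ties are broken by index so the positions are unique. */
void enumeration_sort(const int *a, int *sorted, int n)
{
    int position[n];   /* C99 variable-length array for the ranks */

    for (int i = 0; i < n; i++) {
        position[i] = 0;
        for (int j = 0; j < n; j++)
            if (a[j] < a[i] || (a[j] == a[i] && j < i))
                position[i]++;
    }
    for (int i = 0; i < n; i++)
        sorted[position[i]] = a[i];
}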

Q4. Explain Parallel Quick Sort.
Quick sort is a divide and conquer algorithm that lends itself easily to parallelisation.
The array is partitioned into two parts using a pivot element, such that all elements in the left partition are smaller than the pivot element and the elements to the right are larger. The pivot element is inserted between the two partitions and hence is in its sorted position after partitioning. The same algorithm is applied recursively to the left and right parts of the array; the recursion stops when a part contains just one element.
To parallelise this algorithm, a set of processes (a pool of workers) is created. The extents of the array to be sorted are placed on a stack to indicate the presence of work to be done. The first process to fetch the work partitions the array and places the extents of the resulting two partitions on the stack. The workers fetch the extents of a partition to be sorted from the stack and create new partitions; these are added to the stack.

Algorithm for Parallel Quick Sort

Global n {Size of array of unsorted elements}
a[0…(n-1)] {array of elements to be sorted}
sorted {Number of elements in sorted position}
min.partition {Smallest subarray that is partitioned rather than sorted directly}
Local bounds {Indices of unsorted subarray}
median {Final position in subarray of partitioning key}
Begin
sorted ← 0
INITIALIZE.STACK()

for all Pi, where 0 ≤ i < p do
while( sorted < n) do
bounds ← STACK.DELETE()
while(bounds.low < bounds.high) do
if(bounds.high - bounds.low < min.partition) then
INSERTION.SORT(a, bounds.low, bounds.high)
ADD.TO.SORTED(bounds.high-bounds.low +1 )
exit while
else
median ← PARTITION(bounds.low, bounds.high)
STACK.INSERT(median + 1, bounds.high)
bounds.high ← median - 1

if bounds.low = bounds.high then
ADD.TO.SORTED(2)
else
ADD.TO.SORTED(1)
endif
endif
endwhile
endwhile
endfor
end

In the above algorithm, the function INITIALIZE.STACK initializes the shared stack containing the indices of unsorted subarrays. When a process calls STACK.DELETE, it receives the indices of an unsorted subarray if the stack contains any; otherwise, there is no useful work for it to do at that point. STACK.INSERT adds the indices of an unsorted subarray to the stack. Since all these functions access the same shared data structure, their execution must be mutually exclusive. ADD.TO.SORTED increases the count of elements that are in their correct positions, and its execution, too, must be mutually exclusive.
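As a rough illustration of this structure, here is a sequential, stack-driven C sketch of the same skeleton. It is not the parallel program itself: in the parallel version the stack and the count of sorted elements are shared, a pool of worker processes executes the while loop, and STACK.INSERT, STACK.DELETE and ADD.TO.SORTED run under mutual exclusion. The helper names (stack_push, partition), the sample data, and the omission of the min.partition/insertion-sort threshold are all simplifications made for this sketch.

#include <stdio.h>

#define MAXN 1000

static int a[] = { 9, 3, 7, 1, 8, 2, 6, 5, 4, 0 };
static struct { int low, high; } stack[MAXN];
static int top = 0;

static void stack_push(int low, int high)    /* plays the role of STACK.INSERT */
{
    stack[top].low = low;
    stack[top].high = high;
    top++;
}

static int partition(int low, int high)
{
    int pivot = a[high], i = low - 1, j, tmp;
    for (j = low; j < high; j++)
        if (a[j] <= pivot) { i++; tmp = a[i]; a[i] = a[j]; a[j] = tmp; }
    tmp = a[i + 1]; a[i + 1] = a[high]; a[high] = tmp;
    return i + 1;                            /* final position of the pivot */
}

int main(void)
{
    int n = sizeof a / sizeof a[0], i, low, high, median;

    stack_push(0, n - 1);
    while (top > 0) {                        /* a worker would loop until all elements are sorted */
        top--;                               /* plays the role of STACK.DELETE */
        low = stack[top].low;
        high = stack[top].high;
        while (low < high) {
            median = partition(low, high);
            stack_push(median + 1, high);    /* hand the right part back to the pool */
            high = median - 1;               /* keep working on the left part */
        }
    }
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}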
Read full story

Assignment Message Passing Interface

0 comments;Click here for request info on this topic
SUBJECTIVE QUESTIONS

Q1: Write the features of Message Passing Interface (MPI).

The goal of the Message Passing Interface simply stated is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and flexible standard for message passing. The standard is maintained by the MPI Forum.
MPI is a standard, portable message-passing library developed in 1993 by a group of parallel computer vendors, software writers, and application scientists. It is available to both Fortran and C programs and on a wide variety of parallel machines. The target platform is a distributed-memory system such as the IBM SP.
MPI provides a set of library routines used to design scalable parallel applications. These routines provide a wide range of operations that include computation, communication, and synchronization. MPI 1.2 is the current standard supported by major vendors.
Features:
MPI has full asynchronous communication.
Immediate send and receive operations can fully overlap computation (a short sketch of this appears after this list).
MPI groups are solid, efficient, and deterministic.
Group membership is static. There are no race conditions caused by processes independently entering and leaving a group. New group formation is collective and group membership information is distributed, not centralized.
MPI efficiently manages message buffers.
Messages are sent and received from user data structures, not from staging buffers within the communication library. Buffering may, in some cases, be totally avoided.
MPI is a standard.
Its features and behaviour were arrived at by consensus in an open forum. It can change only by the same process.
Reasons for using MPI:
o Standardization - MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms.
o Portability - there is no need to modify your source code when you port your application to a different platform which supports MPI.
o Performance - vendor implementations should be able to exploit native hardware features to optimize performance.
o Functionality (over 115 routines)
o Availability - a variety of implementations are available, both vendor and public domain.
Target platform is a distributed memory system including massively parallel machines, SMP clusters, workstation clusters and heterogeneous networks.
All parallelism is explicit: the programmer is responsible for correctly identifying parallelism and implementing the resulting algorithm using MPI constructs.
The number of tasks dedicated to running a parallel program is static. New tasks cannot be dynamically spawned during run time. (MPI-2 is attempting to address this issue.)
MPI can be used with C and Fortran programs. C++ and Fortran 90 language bindings are being addressed by MPI-2.
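As an illustration of the asynchronous-communication feature listed above, here is a minimal C sketch that overlaps some computation with a nonblocking exchange using MPI_Isend, MPI_Irecv and MPI_Wait. It assumes it is run with exactly two processes, and the dummy loop merely stands in for useful work.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, other, outbuf = 42, inbuf = 0, i;
    double work = 0.0;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (size != 2) {                             /* this sketch assumes exactly two processes */
        fprintf(stderr, "Run with two processes.\n");
        MPI_Finalize();
        return 1;
    }
    other = 1 - rank;

    /* post the immediate (nonblocking) receive and send */
    MPI_Irecv(&inbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(&outbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &sreq);

    for (i = 0; i < 1000000; i++)                /* computation overlapped with the transfer */
        work += i;

    MPI_Wait(&sreq, &status);                    /* complete both operations */
    MPI_Wait(&rreq, &status);

    printf("rank %d received %d (dummy work %.0f)\n", rank, inbuf, work);
    MPI_Finalize();
    return 0;
}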

Q2: Give the principles on which the concept of Parallel Virtual Machine
(PVM) is based.

PVM was developed by Oak Ridge National Laboratory in conjunction with several universities, principal among them being the University of Tennessee at Knoxville and Emory University. The original intent was to facilitate high performance scientific computing by exploiting parallelism whenever possible. By utilizing existing heterogeneous networks (Unix at first) and existing software languages (FORTRAN, C and C++), there was no cost for new hardware and the costs for design and implementation were minimized.
Briefly, the principles upon which PVM is based include the following:
· User-configured host pool : The application's computational tasks execute on a set of machines that are selected by the user for a given run of the PVM program. Both single-CPU machines and hardware multiprocessors (including shared-memory and distributed-memory computers) may be part of the host pool. The host pool may be altered by adding and deleting machines during operation (an important feature for fault tolerance).
· Translucent access to hardware: Application programs either may view the hardware environment as an attributeless collection of virtual processing elements or may choose to exploit the capabilities of specific machines in the host pool by positioning certain computational tasks on the most appropriate computers.
· Process-based computation: The unit of parallelism in PVM is a task (often but not always a Unix process), an independent sequential thread of control that alternates between communication and computation. No process-to-processor mapping is implied or enforced by PVM; in particular, multiple tasks may execute on a single processor.
· Explicit message-passing model: Collections of computational tasks, each performing a part of an application's workload using data-, functional-, or hybrid decomposition, cooperate by explicitly sending and receiving messages to one another. Message size is limited only by the amount of available memory.
· Heterogeneity support: The PVM system supports heterogeneity in terms of machines, networks, and applications. With regard to message passing, PVM permits messages containing more than one datatype to be exchanged between machines having different data representations.
· Multiprocessor support: PVM uses the native message-passing facilities on multiprocessors to take advantage of the underlying hardware. Vendors often supply their own optimized PVM for their systems, which can still communicate with the public PVM version.


Q3: Explain the Parallel Virtual Machine in detail.
The PVM system is composed of two parts. The first part is a daemon, called pvmd3 and sometimes abbreviated pvmd, that resides on all the computers making up the virtual machine. (An example of a daemon program is the mail program that runs in the background and handles all the incoming and outgoing electronic mail on a computer.) Pvmd3 is designed so any user with a valid login can install this daemon on a machine. When a user wishes to run a PVM application, he first creates a virtual machine by starting up PVM. The PVM application can then be started from a Unix prompt on any of the hosts. Multiple users can configure overlapping virtual machines, and each user can execute several PVM applications simultaneously.
The second part of the system is a library of PVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between tasks of an application. This library contains user-callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.
The PVM computing model is based on the notion that an application consists of several tasks. Each task is responsible for a part of the application's computational workload. Sometimes an application is parallelized along its functions; that is, each task performs a different function, for example, input, problem setup, solution, output, and display. This process is often called functional parallelism . A more common method of parallelizing an application is called data parallelism . In this method all the tasks are the same, but each one only knows and solves a small part of the data. This is also referred to as the SPMD (single-program multiple-data) model of computing. PVM supports either or a mixture of these methods. Depending on their functions, tasks may execute in parallel and may need to synchronize or exchange data, although this is not always the case. An exemplary diagram of the PVM computing model is shown in Figure and an architectural view of the PVM system, highlighting the heterogeneity of the computing platforms supported by PVM, is also shown.
The PVM system currently supports C, C++, and Fortran languages. This set of language interfaces have been included based on the observation that the predominant majority of target applications are written in C and Fortran, with an emerging trend in experimenting with object-based languages and methodologies.
The C and C++ language bindings for the PVM user interface library are implemented as functions, following the general conventions used by most C systems, including Unix-like operating systems.
Fortran language bindings are implemented as subroutines rather than as functions. This approach was taken because some compilers on the supported architectures would not reliably interface Fortran functions with C functions. One immediate implication of this is that an additional argument is introduced into each PVM library call for status results to be returned to the invoking program.
All PVM tasks are identified by an integer task identifier (TID). Messages are sent to and received from tids. Since tids must be unique across the entire virtual machine, they are supplied by the local pvmd and are not user chosen. Although PVM encodes information into each TID, the user is expected to treat the tids as opaque integer identifiers. PVM contains several routines that return TID values so that the user application can identify other tasks in the system.
There are applications where it is natural to think of a group of tasks. And there are cases where a user would like to identify his tasks by the numbers 0 - (p - 1), where p is the number of tasks. PVM includes the concept of user-named groups. When a task joins a group, it is assigned a unique "instance" number in that group. Instance numbers start at 0 and count up. In keeping with the PVM philosophy, the group functions are designed to be very general and transparent to the user. For example, any PVM task can join or leave any group at any time without having to inform any other task in the affected groups. Also, groups can overlap, and tasks can broadcast messages to groups of which they are not a member. To use any of the group functions, a program must be linked with libgpvm3.a.
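A minimal C sketch of these group functions (linked with libgpvm3.a) is given below. The group name "workers", the expected group size of 4, and the message tag 99 are assumptions made for this example.

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int inum, nmemb, value;

    inum = pvm_joingroup("workers");         /* instance numbers start at 0 */
    printf("task t%x got instance %d\n", pvm_mytid(), inum);

    pvm_barrier("workers", 4);               /* wait until 4 members have arrived (assumed size) */

    if (inum == 0) {                         /* instance 0 broadcasts a value to the group */
        value = 123;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&value, 1, 1);
        pvm_bcast("workers", 99);            /* bcast does not deliver to the sender itself */
    } else {
        pvm_recv(-1, 99);                    /* receive from any task with tag 99 */
        pvm_upkint(&value, 1, 1);
        printf("instance %d received %d\n", inum, value);
    }

    nmemb = pvm_gsize("workers");
    printf("group size is %d\n", nmemb);

    pvm_lvgroup("workers");
    pvm_exit();
    return 0;
}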
The general paradigm for application programming with PVM is as follows. A user writes one or more sequential programs in C, C++, or Fortran 77 that contain embedded calls to the PVM library. Each program corresponds to a task making up the application. These programs are compiled for each architecture in the host pool, and the resulting object files are placed at a location accessible from machines in the host pool. To execute an application, a user typically starts one copy of one task (usually the "master" or "initiating" task) by hand from a machine within the host pool. This process subsequently starts other PVM tasks, eventually resulting in a collection of active tasks that then compute locally and exchange messages with each other to solve the problem. Note that while the above is a typical scenario, as many tasks as appropriate may be started manually. As mentioned earlier, tasks interact through explicit message passing, identifying each other with a system-assigned, opaque TID.
Figure: PVM program hello.c
#include <stdio.h>
#include "pvm3.h"

main()
{
int cc, tid, msgtag;
char buf[100];

printf("i'm t%x\n", pvm_mytid());

cc = pvm_spawn("hello_other", (char**)0, 0, "", 1, &tid);

if (cc == 1) {
msgtag = 1;
pvm_recv(tid, msgtag);
pvm_upkstr(buf);
printf("from t%x: %s\n", tid, buf);
} else
printf("can't start hello_other\n");
pvm_exit();
}
Figure: PVM program hello_other.c
#include "pvm3.h"

main()
{
int ptid, msgtag;
char buf[100];

ptid = pvm_parent();

strcpy(buf, "hello, world from ");
gethostname(buf + strlen(buf), 64);
msgtag = 1;
pvm_initsend(PvmDataDefault);
pvm_pkstr(buf);
pvm_send(ptid, msgtag);

pvm_exit();
}


Q4: Differences between MPI and PVM.

When your program must be able to use the resources of multiple systems, you choose between MPI and PVM. In many ways, MPI and PVM are similar:
· Each is designed, specified, and implemented by third parties that have no direct interest in selling hardware.
· Support for each is available over the Internet at low or no cost.
· Each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium.
· Each supports C and Fortran 77.
· Each provides for automatic conversion between different representations of the same kind of data so that processes can be distributed over a heterogeneous computer network.
The chief differences between the current versions of PVM and MPI libraries are as follows:
· PVM supports dynamic spawning of tasks, whereas MPI does not.
· PVM supports dynamic process groups; that is, groups whose membership can change dynamically at any time during a computation. MPI does not support dynamic process groups.
MPI does not provide a mechanism to build a group from scratch, but only from other groups that have been defined previously. Closely related to groups in MPI are communicators, which specify the communication context for a communication operation and an ordered process group that shares this communication context. The chief difference between PVM groups and MPI communicators is that any PVM task can join/leave a group independently, whereas in MPI all communicator operations are collective.
· A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on. This is not available in MPI.
· A PVM program (or any of its tasks) can request various kinds of information from the PVM library about the collection of hosts on which it is running, the tasks that make up the program, and a task's parent. The MPI library does not provide such calls.
· Some of the collective communication calls in PVM (for instance, pvm_reduce()) are nonblocking. The MPI collective communication routines are blocking: they do not return until the calling process's participation in the collective communication is complete.
· PVM provides two methods of signaling other PVM tasks: sending a UNIX signal to another task, and notifying a task about an event (from a set of predefined events) by sending it a message with a user-specified tag that the application can check. A PVM call is also provided through which a task can kill another PVM task. These functions are not available in MPI.
· A task can leave/unenroll from a PVM session as many times as it wants, whereas an MPI task must initialize/finalize exactly once.
· A PVM task need not explicitly enroll: the first PVM call enrolls the calling task into a PVM session. An MPI task must call MPI_Init() before calling any other MPI routine and it must call this routine only once.
· A PVM task can be registered by another task as responsible for adding new PVM hosts, or as a PVM resource manager, or as responsible for starting new PVM tasks. These features are not available in MPI.
· A PVM task can multicast data to a set of tasks. As opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.
· PVM tasks can be started in debug mode (that is, under the control of a debugger of the user's choice). This capability is not specified in the MPI standard, although it can be provided on top of MPI in some cases.
· In PVM, a user can use the pvm_catchout() routine to specify collection of task outputs in various ways. The MPI standard does not specify any means to do this.
· PVM includes a receive routine with a timeout capability, which allows the user to block on a receive for a user-specified amount of time. MPI does not have a corresponding call.
· PVM includes a routine that allows users to define their own receive contexts to be used by subsequent PVM receive routines. Communicators in MPI provide this type of functionality to a limited extent.
On the other hand, MPI provides several features that are not available in PVM, including a variety of communication modes, communicators, derived data types, additional group management facilities, and virtual process topologies, as well as a larger set of collective communication calls.


Q5: Give an example of a MPI program using different routines and derived
data types.

A simple master-slave program that evaluates the expression (a + b) * (c - d): the master reads the values of a, b, c, and d from the user, one slave calculates (a + b), and the other calculates (c - d). The program is as follows.
mpi_demo.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h> /* for MPI constants and functions */

#define MSG_DATA 100 /* message from master to slaves */
#define MSG_RESULT 101 /* message from slave to master */

#define MASTER 0 /* rank of master */
#define SLAVE_1 1 /* rank of first slave */
#define SLAVE_2 2 /* rank of second slave */

/* functions to handle the tasks of master, and the two slaves */
void master(void);
void slave_1(void);
void slave_2(void);

int main(int argc, char** argv)
{
int myrank, size;
/* initialize the MPI system */
MPI_Init(&argc, &argv);

/* get the size of the communicator i.e. number of processes */
MPI_Comm_size(MPI_COMM_WORLD, &size);

/* check for proper number of processes */
if(size != 3)
{
fprintf(stderr, "Error: Three copies of the program should be run.\n");
MPI_Finalize();
exit(EXIT_FAILURE);
}
/* get the rank of the process */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

/* perform the tasks according to the rank */
if(myrank == MASTER)
master();
else if(myrank == SLAVE_1)
slave_1();
else
slave_2();

/* clean up and exit from the MPI system */
MPI_Finalize();

exit(EXIT_SUCCESS);
} /* end main() */

/* function to carry out the masters tasks */
void master(void)
{
int a, b, c, d;
int buf[2];
int result1, result2;
MPI_Status status;

printf("Enter the values of a, b, c, and d: ");
scanf("%d %d %d %d", &a, &b, &c, &d);

/* send a and b to the first slave */
buf[0] = a;
buf[1] = b;
MPI_Send(buf, 2, MPI_INT, SLAVE_1, MSG_DATA, MPI_COMM_WORLD);

/* send c and d to the second slave */
buf[0] = c;
buf[1] = d;
MPI_Send(buf, 2, MPI_INT, SLAVE_2, MSG_DATA, MPI_COMM_WORLD);

/* receive results from the slaves */
MPI_Recv(&result1, 1, MPI_INT, SLAVE_1, MSG_RESULT,
MPI_COMM_WORLD, &status);
MPI_Recv(&result2, 1, MPI_INT, SLAVE_2, MSG_RESULT,
MPI_COMM_WORLD, &status);

/* final result */
printf("Value of (a + b) * (c - d) is %d\n", result1 * result2);
} /* end master() */

/* function to carry out the tasks of the first slave */
void slave_1(void)
{
int buf[2];
int result;
MPI_Status status;
/* receive the two values from the master */
MPI_Recv(buf, 2, MPI_INT, MASTER, MSG_DATA, MPI_COMM_WORLD, &status);
/* find a + b */
result = buf[0] + buf[1];

/* send result to the master */
MPI_Send(&result, 1, MPI_INT, MASTER, MSG_RESULT, MPI_COMM_WORLD);
} /* end slave_1() */

/* function to carry out the tasks of the second slave */
void slave_2(void)
{
int buf[2];
int result;
MPI_Status status;
/* receive the two values from the master */
MPI_Recv(buf, 2, MPI_INT, MASTER, MSG_DATA, MPI_COMM_WORLD, &status);
/* find c - d */
result = buf[0] - buf[1];

/* send result to master */
MPI_Send(&result, 1, MPI_INT, MASTER, MSG_RESULT, MPI_COMM_WORLD);
} /* end slave_2() */

/* end mpi_demo.c */

To use the MPI system and functions, you first need to include the header file mpi.h, as is done at the top of the program. The MPI system assigns each process a unique integer called its rank, beginning with 0. The rank is used to identify a process and communicate with it. Secondly, each process is a member of some communicator. A communicator can be thought of as a group of processes that may exchange messages with each other. By default, every process is a member of the communicator called MPI_COMM_WORLD.
Any MPI program must first call the MPI_Init() function, which the process uses to enter the MPI system and to perform any system-specific initialization. Next, we get the size of the MPI_COMM_WORLD communicator, i.e. the number of processes in it, using the MPI_Comm_size() function. The first parameter is the communicator and the second is a pointer to an integer in which the size will be returned. Here we need exactly 3 processes: one master and two slaves. After that, we get the rank by calling MPI_Comm_rank(). The three processes will have ranks 0, 1, and 2. These processes are essentially identical, i.e. there is no inherent master-slave relationship between them, so it is up to us to decide who will be the master and who will be the slaves. We choose rank 0 as the master and ranks 1 and 2 as the slaves. Depending on its rank, each process executes the appropriate function. Note that there is no spawning of processes as in PVM: the number of processes is specified on the command line when the program is launched (for example, via mpirun) rather than the program spawning slaves itself. Once the execution is finished, we must call the MPI_Finalize() function to perform the final clean-up.
Let us now consider the master function. After reading the values of a, b, c, and d from the user, the master must send a and b to slave 1, and c and d to slave 2. Instead of sending the variables individually, we pack them into an array and send the array of two integers. Once the buffer is ready, unlike PVM, we do not need to pack or encode the data; MPI manages these details internally. So we can directly call MPI_Send() to send the data. The first parameter is the address of the buffer, the second is the number of elements in the message, and the third is the data type of the buffer, which here is MPI_INT, specifying that the buffer is an array of integers. Next comes the rank of the process to which we want to send the message, here SLAVE_1. Next is the message tag, similar to the tag in PVM. The final parameter is the communicator of which the receiver is a member, which in this case is MPI_COMM_WORLD.
Once the data is distributed among the slaves, the master must wait for them to send back the results. For simplicity, we first collect the message from slave 1 and then from slave 2. To receive a message, we use the MPI_Recv() function. Again, packing and decoding are handled by MPI internally. The first argument is the address of the buffer in which to receive the data. The second is the size of the buffer in terms of the number of elements, which in this case is 1. Next is the data type, MPI_INT here. The next three parameters specify the rank of the source of the message, the tag of the expected message, and the communicator of which the source is a member. The final argument is a pointer to a structure of type MPI_Status in which some status information is returned (here we ignore this information).
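Note that mpi_demo.c sends the pair of integers as a count of 2 with type MPI_INT and does not actually use a derived data type. A minimal sketch of the same kind of exchange using a derived datatype built with MPI_Type_contiguous is shown below; it is a separate illustrative program, assumed to be run with exactly two processes, and is not part of the original assignment code.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[2];
    MPI_Datatype pair_t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(2, MPI_INT, &pair_t);   /* derived type: two consecutive MPI_INTs */
    MPI_Type_commit(&pair_t);

    if (rank == 0) {                            /* assumes at least two processes */
        buf[0] = 10; buf[1] = 20;
        MPI_Send(buf, 1, pair_t, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, pair_t, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d and %d\n", buf[0], buf[1]);
    }

    MPI_Type_free(&pair_t);
    MPI_Finalize();
    return 0;
}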
MULTIPLE CHOICE QUESTIONS::

1) The keyword MPI_COMM_WORLD signifies::
a) Rank
b) System Buffer
c) Communicator
d) Application Buffer

Ans:: Communicator

2) The current Message Passing Interface (MPI) library supports which two languages::
a) Fortran & C
b) C & C++
c) Java & C
d) Java & Fortran

Ans:: Fortran & C

3) Message Passing Interface targets distributed memory system including::
a) Parallel Machines
b) Workstation Clusters
c) Heterogeneous Networks
d) All of the above

Ans:: All of the above

4) What are the five basic MPI routines?

Ans:: (i) MPI_Init (*argc,*argv)
(ii) MPI_Comm_size (comm,*size)
(iii) MPI_Comm_rank (comm,*rank)
(iv) MPI_Abort (comm,errorcode)
(v) MPI_Finalize ()


5) PVM was developed by ______________in conjunction with several universities.

Ans:: Oak Ridge National Laboratory

6) PVM permits messages containing more than one datatype to be exchanged between machines having different data representations.
True or False?

Ans:: True

7) The PVM system is composed of two parts. Name them.
Ans:: (i) a daemon, called pvmd3 (sometimes abbreviated pvmd)
(ii) a library of PVM interface routines

8) All PVM tasks are identified by an integer___________

Ans:: task identifier (TID)

9) Dynamic spawning of tasks is supported by::
(a) PVM
(b) MPI
(c) Both
(d) None

Ans:: PVM

10) What does a Rank signify within a communicator in MPI?

Ans:: Within a communicator, every process has its own unique integer
identifier assigned by the system when the process initializes. A rank is sometimes also called a "process ID". Ranks are contiguous and begin at zero.
Read full story

Assignment General Model of Shared Memory Programming

0 comments;Click here for request info on this topic
OBJECTIVE QUESTIONS

Q1. Two processes A and B start execution simultaneously. A is executing a 100-instruction program while B is executing a 50-instruction program. Which of the two processes will finish execution first?

B
A
Both at same time
Can’t say


Q2. id=create_process(N)
If the value of 'id' returned by this primitive is 0, it indicates

any child process
first child process
parent process
No process created


Q3. In parallel processing, if Unix processes are used as independent units of execution, any computation or memory update that a process does is, by default, _______________ other processes.

visible to
not visible to


Q4. Special constructs are needed to ___________________data from/with other threads while using threads as independent units of execution in shared memory parallelism.

share
hide
update
correct

Q5. The variables that can be modified by any process, with the update immediately visible to all other processes, are called
local
shared
special
parallel

Q6. The mechanism to ensure that certain blocks of statements are executed by only one process at a time is called

deadlock
block scheduling
self scheduling
mutual exclusion


Q7. To ensure mutual exclusion, locks are used to protect regions of statements; these locks are declared as
pointers
array
lists
integer variables

Q8. Locks are allotted from _______________ since more than one process needs access to them.

Q9. To parallelise the processing of loops, the mechanism of dividing a single loop into multiple loops and assigning them to the respective processes is called

loop scheduling
self splitting
loop splitting
self scheduling

Q10. The method of loop parallelization in which processes choose their work dynamically at run time is called

loop scheduling
self splitting
loop splitting
self scheduling

Q11. The drawback of loop splitting arises when the processing of elements involved in a loop is__________.

uniform
non-uniform
complex
simple

Q12. The advantage of self-scheduling is good _____________ and its drawback is __________________.

Q13. The common pool of work in the self-scheduling method is protected as a ________________.

Q14. For overall synchronization, the meeting point where all processes meet and then proceed to their own tasks is called

Point
Barrier Point
Barrier
Meet Point

Q15. A barrier is a shared lock, initialized to N (the number of processes), on which every process waits till the lock value becomes

Zero
One
N/2
Very small


ANSWERS

Q1. d
Q2. c
Q3. b
Q4. b
Q5. b
Q6. d
Q7. a
Q8. Shared Memory Pool
Q9. c
Q10. d
Q11. b
Q12. i) load balancing
ii) overhead in pool management
Q13. Critical section
Q14. c
Q15. a


SUBJECTIVE QUESTIONS

Q1. Explain the process creation and destruction with primitives in shared memory programming.

Ans. Processes are generated as per the requirements of the problem. These processes are destroyed after the parallel part of the processing is completed, so that system resources are not wasted and the remaining sequential processing can be done by a single process without interference from other processes.
So process management involves:
Generating the required number of processes
Destroying these processes when the parallel part of the processing is completed, so that system resources are not wasted and the sequential processing is carried on by a single process.

For this we require a number of primitives:
1. id=create_process(N);

The execution of this primitive results in the creation of N new processes. The N+1 processes (including the parent process) are identical; only the id is different for each process.
create_process returns integers 0 to N.
id=0 indicates the parent process.

2. join_process(N,id)

N is the number of processes.
id is the integer returned by create_process.
This statement is executed by all processes, and only a single process remains alive after the call.
If we put an instruction after the join statement:
→ The instruction will not be executed till all processes have executed join_process; hence no process is still in the compute stage.
→ Only one process will be active beyond the join statement, and hence parallelism and its consequent problems will not arise.




3. shared
This primitive allocates shared memory. It also provides an id so that the memory can be discarded after use.
shared() returns a pointer to the shared memory allocated. That is why sum0 is declared as a pointer in the following example.
Example: to sum up 1+2+3+4 in parallel.

int *sum0, *sum1, id1, id2, id;
sum0 = (int *) shared(sizeof(int), &id1);
sum1 = (int *) shared(sizeof(int), &id2);
*sum0 = 0; *sum1 = 0;
id = create_process(1);
if (id == 1)
*sum0 = 1 + 2;
else
*sum1 = 3 + 4;
join_process(2, id);
printf("%d", *sum0 + *sum1);
free_shm(id1); free_shm(id2);


4. free_shm
The free_shm primitive takes the id and frees the shared memory allocated by the shared primitive.


Q2. How should the access to shared areas be coordinated? What primitives are used for this purpose?

Ans. In shared memory programming, access to shared areas must be properly coordinated. If one process reads a location with the intention of updating it, it must ensure that nobody else touches that area till it finishes the update. So we use primitives that ensure that a certain block of statements is executed by only one process at a time. If a process is within such a block, no other process is allowed to enter the block. This mechanism is called mutual exclusion.

Mutual exclusion is ensured by using locks. We lock such a region when we enter it and unlock it when we leave.
The locking primitives are:
→ init_lock(id)
It initializes the lock to a known state: locked or unlocked.
→ lock(id)
It attempts to acquire the lock. If the lock is already held by some other process, the calling process is made to wait; it resumes only when the lock has been released by the other process and it manages to acquire it.
→ unlock(id)
It releases the lock and returns. It does not check or wait for any condition.

The argument id is required because we may need different locks for different areas. The locks should be allotted in shared memory, since more than one process needs access to them; so each lock is declared as a pointer and allotted space from the shared memory pool.
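A minimal sketch of these primitives in use is given below: two processes increment a shared counter under a lock, in the style of the earlier summation example. The exact header and prototypes of shared(), init_lock(), lock() and unlock() depend on the shared-memory library being used, so treat this as an illustrative fragment rather than a drop-in program.

/* Two processes increment a shared counter under mutual exclusion. */
int *count, *lck, cid, lid, id, i;

count = (int *) shared(sizeof(int), &cid);   /* shared counter */
lck   = (int *) shared(sizeof(int), &lid);   /* the lock itself lives in shared memory */
*count = 0;
init_lock(lck);

id = create_process(1);                      /* parent plus one child */
for (i = 0; i < 1000; i++) {
    lock(lck);                               /* enter the critical section */
    *count = *count + 1;
    unlock(lck);                             /* leave the critical section */
}
join_process(2, id);                         /* only one process survives the join */

printf("count = %d\n", *count);              /* prints 2000 */
free_shm(cid); free_shm(lid);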


Q3. Explain the following terms:
i) Loop splitting
ii) Self scheduling
iii) Barrier


Ans. i) Loop splitting
Loop splitting is a method of loop parallelization. It involves splitting a single loop into multiple separate loops. We divide the N elements processed by a loop into P partitions of N/P elements each and assign them to the respective processes.
We can assign a contiguous block of elements to each process, or we can interleave the access to the elements.
For example, if N=15 and P=5, we can assign elements to processes in the following ways:

I) Contiguous blocks:
P1 → 1,2,3
P2 → 4,5,6
P3 → 7,8,9
P4 → 10,11,12
P5 → 13,14,15

II) Interleaved:
P1 → 1,6,11
P2 → 2,7,12
P3 → 3,8,13
P4 → 4,9,14
P5 → 5,10,15
These methods of loop splitting are easy to implement. Deciding which elements a process is going to handle is done statically, so there is no run-time overhead.
Since the processes access disjoint areas, no coordination or mutual exclusion is required.
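A minimal C fragment contrasting the two schemes is given below; id is the value returned by create_process, and process_element() is a placeholder for the per-element work (both are assumptions made for the sketch).

/* Illustrative fragment of the two loop-splitting schemes for N elements and P processes. */
int N = 15, P = 5, chunk, i;

/* (I) contiguous blocks: process id handles N/P consecutive elements
 * (this sketch assumes P divides N evenly) */
chunk = N / P;
for (i = id * chunk; i < (id + 1) * chunk; i++)
    process_element(i);

/* (II) interleaved: process id handles elements id, id+P, id+2P, ... */
for (i = id; i < N; i += P)
    process_element(i);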

ii) Self-scheduling

Self-scheduling is a method in which processes choose their work dynamically at run time.

When elements are assigned to processes statically, a difficulty arises if the processing of the elements is not uniform.
For example, if the real work happens to fall only on the elements given to some of the processes, the other processes will have nothing to do.
In this model, all the work is considered to be available in a common pool. Each process goes to the pool, picks up work if available, executes it, and then goes back for more. Depending upon the complexity of processing the various elements of the loop, the number of iterations handled by each process will vary. Self-scheduling achieves good load balancing among the available processes, but pool management is an overhead of this method.
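A minimal C fragment of the self-scheduling loop is given below, using the same illustrative primitives as before; the shared index acts as the pool of work, N is the number of loop iterations, and process_element() is again a placeholder.

/* Self-scheduling: fetching the next piece of work is a critical section. */
int *next, nid, *lck, lid, i, N = 15;

next = (int *) shared(sizeof(int), &nid);    /* shared pool index */
lck  = (int *) shared(sizeof(int), &lid);
*next = 0;
init_lock(lck);

/* the loop below is executed by every process created with create_process() */
while (1) {
    lock(lck);                               /* access to the pool is mutually exclusive */
    i = (*next)++;                           /* take the next unprocessed element */
    unlock(lck);
    if (i >= N)
        break;                               /* the pool is empty */
    process_element(i);                      /* the work per element may vary in cost */
}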

iii) Barrier

For overall synchronization of processes, a meeting point where all processes meet and then proceed to their own tasks again is called a barrier. The meeting ensures that all processes have completed the tasks assigned to them up to that point.
The various primitives involved are:
i) bar=barrier_init(nproc)
It creates a barrier structure and returns a pointer to it. Every call to barrier_init creates a new barrier. The reference returned is used to identify the barrier both when invoking it and when cleaning it up.

ii) barrier(bar)
This call is made by each process that has to go through the barrier.

iii) clean_barrier(bar)
It is invoked for every barrier created, to clean up any shared memory and semaphores allotted to implement the barrier.
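A minimal usage sketch of these barrier primitives follows; the type of the value returned by barrier_init is treated as opaque here, since it depends on the particular library.

/* 'bar' holds the reference returned by barrier_init (exact type depends on the library). */
bar = barrier_init(nproc);        /* create one barrier for nproc processes */

/* ... every process does its share of phase 1 in parallel ... */

barrier(bar);                     /* no process proceeds until all nproc have arrived */

/* ... phase 2 can now safely use the results of phase 1 ... */

clean_barrier(bar);               /* release the shared memory and semaphores used by the barrier */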

Content Credit: J.S.
Read full story