Answer: The sequential machines have been made faster by incorporating various schemes. The main idea in all these schemes is to match the speed of various components so as to utilize the resources to their peak performances.
There are various parallelism techniques that we can use in uniprocessor or sequential machines.
Some of them are:
1. Multiplicity of Functional Units.
2. Overlapped CPU and I/O operations.
3. Pipelining within the CPU.
4. Hierarchical Memory Systems.
5. Multiprogramming and Time Sharing.
Multiplicity of functional units:
Mostly computers have only one Arithmetic & Logic Unit (ALU). ALU could perform one function at a time.
The practical machines used today have multiple and specialized units that can operate in parallel.
In sequential machines now it is possible to have multiple functional units for addition, multiplication, division, increment, decrement, Boolean operations and shift instructions.
Example:
CDC-6600(designed in 1964) had 10 functional units.
A scoreboard was used to keep track of availability of the functional units and the registers being demanded.
With 10 functional units and 24 registers the machine could achieve a significant increase rate of instruction execution.
2. Overlapped CPU and I/O operations
As we know each instruction contains Computational phase as well as I/O phase. I/O operations can be performed simultaneously with the computational tasks by using separate I/O controllers, I/O processors.
Computers requiring multiple I/O devices employ multiple I/O subsystems which can operate concurrently. This type of multiprocessing speeds up the data transfer between the external devices and memory.
BUS Contention
The design of such systems depends upon Bus Contention.
Because if device interface and computational task require same bus to transfer data then there will be no speed up in data transfer rate.
To minimize bus contention, some systems employ redundant bus architecture.
Tradeoff is there,
Consequent complications in the design but at the same time the utilization of CPU and other resources maximize.
Processor Intervention
Data is transferred between memory and I/O devices or vice versa.
Processor interferes in this data transfer.
This interference decreases data transfer rate.
To avoid this we use Direct Memory Access (DMA) technique.
DMA is used to provide direct information transfer between the external devices and the primary memory
Q2. How pipelining within the CPU introduces parallelism in sequential machines?
Answer: Today every sequential processor manufactured is taking advantage of parallelism in the form of pipelining instructions.
The purpose of pipeline parallelism is to increase the speed of program and to decrease I/O operations.
Pipeline parallelism is when multiple steps depend on each other but the extension can overlap and the output of one step is streamed as input to the next step.
The different stages of pipelining are
Fetch Stage
Decode Stage
Issue stage
Execute stage
Write back Stage
The fetch stage (F) fetches instructions from cache memory, one per cycle.
The decode stage (D) reveals the instruction function to be performed and identifies the resources needed. Resources include general purpose registers, buses and functional units.
The issue stage (I) reserves resources. Pipeline interlock controls are maintained at this stage the operands are also read from registers during the issue stage.
The instructions are executed in one or several execute stages (E).
The last write back stage (W) is used to write results into the registers. Memory load and store operations are treated as part of execution.
To facilitate instruction execution through the pipe instruction prefetching is used.
The execution of multiple instructions is overlapped in time – even before an instruction gets completely executed, another instruction may be in the process of being decoded, yet another instruction may be getting fetched, and so on.
Pipelining of tasks gives us temporal parallelism.
Figure shows the flow of machine instructions through a typical sequential machine. These eight instructions are for execution of the statements X=Y+Z and A=B*C.
The shaded boxes correspond to idle cycles when instruction issues are blocked due to resources latency or conflicts or due to data dependences.
The first two load instructions issue on consecutive cycles. The add is dependent on both loads and must wait three cycles before the data (Y& Z) are loaded in.
Similarly, the store of the sum to memory location X must wait three cycles for add to finish due to a flow dependence. There are similar blockages during the calculation of A.
The total time required is 17 clock cycles. This time is measured beginning at cycle four when the first instruction starts execution until cycle 20 the last instruction starts execution.
Figure shows an improved time after the instruction issuing order is change to eliminate unnecessary delays due to dependence. The idea is to issue all four load operations in the beginning. Both add and multiply instructions are blocked fewer due to fewer cycles due to this data prefetching. The reordering should not change the end results. The time required is being reduced to 11 cycles, measured from 4 to 14.
Q3. What are the various issues in parallel programming?
Answer: The various issues in parallel programming are:
1. Load balancing:
The operating system must utilize the resources efficiently. This is commonly expressed in terms of achieving a uniform balance of loads across the processors. The operating system should schedule the subtasks such that there are no idle resources including the processors.
2. Scheduling cooperating processes:
Parallel programs which consist of concurrently executing tasks must be scheduled such that collectively they are able to use the resources in the machine required to solve the given problem. Scheduling half the subtask of one parallel program and half the subtasks of another parallel program which can not cooperate can be wasteful.
If a parallel program needs all its subtasks to be running and cooperating, then scheduling half of the tasks can lead to starvation. We can achieve this by maintaining a hierarchy of processes in the sense of parent-child relationship within the operating system’s data structure. Scheduling of processes then can be done such that the processes belonging to a single parallel program are scheduled for execution together.
3. Graceful degradation in case of failure of one of the resources:
Given that a parallel machine has multiple resources of the same kind, one must expect a higher degree of fault tolerance from it. Failure of one of its resources should not result in a catastrophic system crash. The operating system should be able to reschedule the task that had been running on the failed resource and continue the parallel program. Ideally, there should be only a fractional degradation in the performance of the parallel machine.
4. Communication schemes:
Most parallel programs need to share data and intermediate result across subtasks during the processing towards the solution of the problem. To achieve effective cooperation among the subtasks of a parallel program, the operating system must provide adequate facility for communication between tasks. These facilities vary depending on whether the machine is of a shared memory type or of a distributed memory type.
5. Synchronization mechanisms:
To ensure integrity of the shared data across the subtask of a parallel program, synchronization between tasks is required. The shared data must be accessed under mutual exclusion (implemented using a mechanism like semaphores). The task may need to wait till some state is reached across all the tasks of the parallel programs. The operating systems need to provide signaling mechanisms for such synchronization requirements.
0 comments;Click here for request info on this topic:
Post a Comment