The 101 of Instruction-level Parallelism

Advait Kumar
7 min read · Dec 8, 2022

Instruction-level parallelism (ILP) is a measure of how many of the instructions in a program can be executed simultaneously. The term refers to processor and compiler design techniques that overlap operations such as memory loads and stores, integer additions, and floating-point multiplications in order to boost processing performance.

At its core, ILP means the simultaneous execution of multiple instructions from a single program. While pipelining is one form of ILP, the general application of ILP goes much further, into more aggressive techniques for achieving parallel execution of the instructions in the instruction stream.

EXPLOITING ILP FOR BENEFIT

Two basic approaches:

1: rely on hardware to discover and exploit parallelism dynamically,

2: rely on software to restructure programs to statically facilitate parallelism.

These methods complement each other and can be used together to boost performance.

Micro-architectural techniques that are used to exploit ILP include:

  • Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
  • Superscalar execution, VLIW, and the closely related concepts of explicitly parallel instruction computing, which use multiple execution units to carry out multiple instructions concurrently.
  • Out-of-order execution where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution. Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs.
  • Register renaming, a technique that enables out-of-order execution by removing the needless serialization of program operations that would otherwise be imposed when those operations reuse the same registers.
  • Speculative execution, in which instructions are executed in full or in part before it is known whether that execution is actually needed. A common form is control-flow speculation, where instructions following a control-flow instruction (such as a branch) are executed before the target of that instruction is resolved.
  • Several other forms of speculative execution have been proposed and are in use including speculative execution driven by value prediction, memory dependence prediction and cache latency prediction.
  • Branch prediction, which is used to avoid stalling while control dependencies are resolved. Branch prediction is used together with speculative execution (a short illustration follows this list).
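As a quick illustration of why branch prediction matters, here is a minimal C sketch (the function and array names are my own, not from the original article). The loop contains a data-dependent branch: if the input is sorted, the branch becomes highly predictable and the processor can speculate past it; on random input, frequent mispredictions repeatedly flush the pipeline.

#include <stddef.h>

/* Sums only the elements at or above a threshold. The `if` inside
 * the loop is the branch whose predictability governs performance. */
long sum_above_threshold(const int *data, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)   /* hard to predict on random data */
            sum += data[i];
    }
    return sum;
}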

Program Challenges and Opportunities for ILP

Basic block: a straight-line code sequence with a single entry point and a single exit point. Remember that the average branch frequency is 15%–25%, so there are only about 3–6 instructions on average between a pair of branches.
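For example, even a few lines of C break into several small basic blocks (this function is a hypothetical illustration):

/* Each branch ends a basic block, so this short function already
 * contains several of them. */
int clamp_abs(int x, int limit)
{
    if (x < 0)       /* block 1 ends at this branch */
        x = -x;      /* block 2 */
    if (x > limit)   /* block 3 ends at this branch */
        x = limit;   /* block 4 */
    return x;        /* block 5 */
}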

Loop-level parallelism: the opportunity to use multiple iterations of a loop to expose additional parallelism. Vectorization, data parallelism, and loop unrolling are some techniques for utilizing loop-level parallelism.
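For instance, in the loop below every iteration reads x[i] and writes y[i] only, so no iteration depends on another; a compiler can vectorize or unroll it, and hardware can overlap the resulting instructions (a minimal sketch with arbitrary names):

void scale(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];    /* iterations are mutually independent */
}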

Dependencies are artifacts of programs, and hazards are artifacts of pipeline organization.

Not all dependencies turn into hazards down the line.

3 types of dependencies:

1: data dependencies (or true data dependencies),

2: name dependencies, and

3: control dependencies

Whether a dependence actually becomes a hazard thus depends on the pipeline's organization.

Data Dependencies

Assume that instruction i executes before instruction j. Then instruction j is data dependent on instruction i if:

instruction i produces a result that may be used by instruction j;

or instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences).
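A minimal C sketch of a chain of true data dependences:

/* Each statement consumes the result of the previous one, so the
 * three multiplications form a serial chain: extra functional units
 * cannot execute them in parallel. */
float chain(float x)
{
    float a = x * 2.0f;   /* independent */
    float b = a * 3.0f;   /* data dependent on a */
    float c = b * 4.0f;   /* data dependent on b, transitively on a */
    return c;
}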

Name Dependencies (not true dependencies)

A name dependence occurs when two instructions use the same register or memory location (the “name”) but there is no flow of data between the instructions associated with that name.

There are 2 types of name dependencies between an instruction i that precedes an instruction j in the execution order:

Antidependence: when instruction j writes a register or memory location that instruction i reads.

Output dependence: when instruction i and instruction j write the same register or memory location.

Because these are not true dependencies, the instructions i and j could potentially be executed in parallel if these dependencies were somehow removed (by using distinct registers or memory locations).
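A sketch of how renaming removes both kinds of name dependence (the variable names are purely illustrative):

/* Before renaming, both additions write the same name `t`:
 *
 *   t = a + b;
 *   x = t * 2;   // reads t (antidependence with the next write)
 *   t = c + d;   // rewrites t (output dependence with the first write)
 *
 * Renaming `t` into `t1` and `t2` removes both dependences, and the
 * two additions become independent: */
int renamed(int a, int b, int c, int d, int *x)
{
    int t1 = a + b;
    *x = t1 * 2;
    int t2 = c + d;   /* now independent of t1 */
    return t2;
}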

A hazard exists whenever there is a name or data dependence between two instructions and they are close enough that their overlapped execution would violate the order implied by the dependence.

Possible data hazards:

  • RAW (read after write)
  • WAW (write after write)
  • WAR (write after read)

Note that RAR (read after read) is not a hazard.

Control Dependencies

A control dependence ties an instruction to the sequential flow of execution and preserves the branch (or any other flow-altering) behavior of the program.

In general, two constraints are imposed by control dependencies:

An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.

An instruction that is not control dependent on a branch cannot be moved after the branch so that the execution is controlled by the branch.
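A sketch of the first constraint (hypothetical function): the load below is control dependent on the null check and cannot be hoisted above the branch without risking a fault.

#include <stddef.h>

int safe_first(const int *p)
{
    if (p != NULL)    /* the branch */
        return *p;    /* control dependent: moving this load before
                         the branch could dereference NULL */
    return 0;
}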

Compiler Techniques for Exposing ILP:

1: Loop unrolling (a sketch follows this list)

2: Instruction scheduling
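Here is a minimal sketch of loop unrolling, done by hand with an arbitrary unroll factor of 4 (in practice the compiler performs this transformation):

/* Original loop: one element per iteration, with loop overhead
 * (increment, compare, branch) paid on every element. */
void daxpy(double *y, const double *x, double a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Unrolled by 4: loop overhead is paid once per four elements, and
 * the four independent multiply-adds give the scheduler (and the
 * hardware) more instructions to overlap. */
void daxpy_unrolled(double *y, const double *x, double a, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)   /* cleanup when n is not a multiple of 4 */
        y[i] = a * x[i] + y[i];
}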

ILP exploits parallelism within a single thread of execution. Concurrency, by contrast, assigns many threads to a CPU core in strict alternation or, if there are enough CPU cores, runs them in true parallelism, preferably with one core for each runnable thread.

Hardware and software are the two methods used to implement instruction-level parallelism.

The software approach uses static parallelism, while the hardware approach uses dynamic parallelism: with static parallelism the compiler decides which instructions to execute in parallel, whereas with dynamic parallelism the processor makes that decision at runtime.

Let's understand this with a very simple example.

Example:
Suppose that four operations can be completed in a single clock cycle. The ILP execution hardware therefore has four functional units, all connected to a common register file, plus a branch unit.
The functional units are capable of carrying out the following sub-operations: integer ALU, integer multiplication, floating-point operations, load, and store.

Let the corresponding delays/latencies be 1, 2, 3, and 1 cycles, with load and store sharing the last.

Let the instructions go in this order:

y1 = x1*1010
y2 = x2*1100
z1 = y1+0010
z2 = y2+0101
t1 = t1+1
p = q*1000
clr = clr+0010
r = r+0001

[Figure: sequential record of execution vs. instruction-level parallel record of execution. Fig. a shows the sequential execution of the operations; Fig. b shows the use of ILP to improve the processor's performance.]

In the figure, the “nop”s (no-operations) represent the processor's idle time. The multiplications take three cycles because of the floating-point unit's three-cycle latency, which would otherwise leave the CPU idle for that time. The processor in Fig. b, however, can fill those slots with other operations while the earlier ones are still in progress.

In contrast to sequential execution, where each cycle carries out only one operation, cycles 1 and 2 in a processor with ILP carry out four and two operations, respectively. Cycle 3 contains only a “nop”, since the next two operations depend on the results of the first two multiplications. In total, the ILP processor completes all 8 operations in just 4 cycles, whereas a sequential processor needs at least 8.
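Since the original figures are not reproduced here, the following is my reconstruction of a schedule consistent with the description above (assuming the two multiplies complete at the end of cycle 3, so their results are usable in cycle 4):

Cycle | Unit 1         | Unit 2         | Unit 3    | Unit 4
  1   | y1 = x1*1010   | y2 = x2*1100   | t1 = t1+1 | p = q*1000
  2   | clr = clr+0010 | r = r+0001     | nop       | nop
  3   | nop            | nop            | nop       | nop
  4   | z1 = y1+0010   | z2 = y2+0101   | nop       | nop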

Architecture:
Parallelism is achieved when many operations are carried out in a single cycle. This can be done either by executing the operations concurrently or by exploiting the gaps between two sequential operations that latencies would otherwise leave idle.

Nowadays it is often the compiler, rather than the hardware, that determines when to perform an operation. The amount of parallelism information that the compiler conveys to the hardware through the program varies, though, depending on the type of ILP architecture.

ILP architectures can be categorized as follows:

Sequential architecture: the program conveys no explicit information about parallelism to the hardware, which must discover it on its own (e.g., superscalar architectures).

Dependence architecture: the program explicitly specifies the dependences between operations (e.g., dataflow architectures).

Independence architecture: the program states which operations are independent of one another, so that they can be executed in slots that would otherwise hold “nop”s (e.g., VLIW architectures).

Before ILP can be applied, the compiler and the hardware must determine the data dependences, identify the independent operations, schedule those operations, assign them to functional units, and assign registers to hold the data.

Now let's look at a practical example of how ILP can be exploited in real code.
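A classic real-life illustration (my sketch; the function names are arbitrary) is a reduction. A single running sum forms one long dependence chain, while splitting it across two accumulators exposes ILP that pipelined, out-of-order hardware can exploit:

/* One accumulator: every addition depends on the previous one, so
 * throughput is limited by the latency of a single chain of adds. */
double sum(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two accumulators: the two chains are independent, so the processor
 * can keep two additions in flight at once. (With floating point this
 * reorders the additions, which may change rounding slightly.) */
double sum_ilp(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)
        s0 += x[i];   /* leftover element when n is odd */
    return s0 + s1;
}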
