#LyX 1.2 created this file. For more info see http://www.lyx.org/
\lyxformat 220
\textclass article
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\papersize Default
\paperpackage a4
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default

\layout Title

Architecture
\layout Section*

2002
\layout Subsection*

A1 
\layout Standard

Suppose you have a computer, say the DLX, with 5 phases: Instruction Fetch(IF),
 Instruction Decode(ID), Execution(EX), Memory(MEM), and Write-Back(WB).
 You are considering pipelining the execution of these stages to improve
 performance.
 Assume that you will have a 5-stage pipeline with one stage for each of
 the above phases.
 Also assume only an integer pipeline (no need to worry about multiple EX
 stages for floating-point operations.
 ) 
\layout Enumerate

Give as many reasons as you can for why one cannot expect to obtain a speedup
 of 5 using this 5-Stage pipeline.
 
\layout Enumerate

Explain what is meant by 
\begin_deeper 
\layout Enumerate

Structural hazards, 
\layout Enumerate

Data hazards, and 
\layout Enumerate

Control hazards 
\layout Standard

and the effect of such hazards upon pipeline speedup.
 
\end_deeper 
\layout Enumerate

For each of the hazards listed in (2), describe the hardware support required
 by the pipeline in order to minimize the effects of such hazards.
 If multiple techniques are available, describe any one technique in detail
 for each of the hazards types.
 What are the strengths and weaknesses of the techniques described by you?
 
\layout Subsection*

A2 
\layout Enumerate

Distinguish between the precise and imprecise interrupts.
 Why is it desirable to have precise interrupts in a pipelined processor?
 What is the problem that precise interrupts solve? Why are precise interrupts
 difficult to achieve? 
\layout Enumerate

What is meant by "speculative execution"? What are the potential benefits
 of this approach? Construct a concrete example to explain your answer.
 
\layout Enumerate

What our problems does speculative execution cause for providing precise
 interrupts? 
\layout Enumerate

What is the reorder buffer, what does it contain, and how is it used? How
 does a reorder buffer help to implement precise interrupts in the presence
 of speculative execution? How is a reorder buffer similar to the reservation
 stations of Tomasulo's algorithm, and how is it different? 
\layout Subsection*

A3 
\layout Enumerate

12 pt
\begin_deeper 
\layout Enumerate

Explain 3 cache optimization strategies.
 
\layout Enumerate

Explain a strategy in detail where CPU is involved in the cache miss penalty.
 
\layout Enumerate

Suppose that in 500 memory references there are 50 misses in the first-level
 cash and 10 misses in the second-level cache.
 What are the local and global missrates for first-level and second-level
 caches? 
\end_deeper 
\layout Enumerate

13 pt
\begin_deeper 
\layout Enumerate

Explain 3 techniques to improve memory bandwidth.
 
\layout Enumerate

What is the miss penalty for cache block of 4 words with the following performan
ce of the basic memory organization? 
\begin_deeper 
\layout Itemize

4 clock cycles to send the address 
\layout Itemize

32 clock cycles for the access time for word 
\layout Itemize

4 clock cycles to send a word of data 
\end_deeper 
\layout Enumerate

Give the clock cycle per instruction(CPI) for 16-bit bus and memory, 32-bit
 bus and memory with no interleaving, and 32-bit bus and memory with interleavin
g,.
 In addition to (2) above, assume that 
\begin_deeper 
\layout Itemize

Memory accesses per instruction is 1.4 
\layout Itemize

Average cycles per instruction (ignoring cache misses) is 2 
\layout Itemize

Miss rate is 4% for 16-bit block size and 2% for the 32-bit blocks size.
 
\end_deeper 
\end_deeper 
\layout Subsection*

A4
\layout Enumerate

15 pt
\begin_deeper 
\layout Enumerate

What is Amdahl's law? 
\layout Enumerate

Explain how it can be used to predict speedup for a parallel processor?
\layout Enumerate

Explain other factors that influence the performance of applications on
 parallel machines? 
\end_deeper 
\layout Enumerate

What is multiprocessor cache coherence? Explain 2 different protocols for
 enforcing this coherence.
 
\layout Section*

2001
\layout Subsection*

A1
\layout Enumerate

Explain, with the help of a brief example, how increasing the cache block
 size can increase the number of coherence misses in centralized shared
 memory machines.
 
\layout Enumerate

Show, by example, how contention can cause latency of a write-update cache
 coherence protocol to be higher than that of a write-invalidate protocol.
 
\layout Enumerate

Consider a two-way set associative cache.
 Show, by a clear example, call increasing the cache block size can increase
 the number of conflict misses.
 
\layout Enumerate

Explain a strategy in which a large number of the registers can be effectively
 utilized, without increasing instruction length or the time required for
 procedure calls.
 
\layout Enumerate

DLX supports displacement addressing mode in which a register number R_i
 and an 16-bit offset n is used for addressing a memory location M(R_i +
 n).
 Suppose we change this instruction set to exclude displacement addressing
 mode, and to include register indirect M(R_i).
 Explain how this change can influence the ability to perform instruction
 scheduling and loop unrolling.
 
\layout Enumerate

Suppose we change a system architecture to allow issuing 8 instructions
 per cycle, instead of previously supported 4.
 List what additional resources may be required in the system to allow it
 to benefit from the increased multiple issue ability, clearly explaining
 why.
 
\layout Subsection*

A2
\layout Enumerate

you are given a system with the following parameters: 
\begin_deeper 
\layout Itemize

The clock speed is 100 MHz
\layout Itemize

the CPI of the system, after including the memory stalls, is 4
\layout Itemize

each instruction generates an average of two memory accesses 
\layout Itemize

30% of memory references are writes 
\layout Itemize

Each cache block is 8 4-byte words; the entire block is read on a miss 
\layout Itemize

Cache miss rate is 5% 
\layout Itemize

Cache uses write-allocate 
\layout Itemize

Assume 25% of cache blocks are dirty at any given time.
 
\layout Standard

Calculate the total bankwidth with requirements of the memory bus, if the
 cache is a) write-back, and b) write-through.
\end_deeper 
\layout Enumerate

Suppose we have a 2 processor cache coherent shared memory multiprocessor.
 The size of main memory is 64 bytes, and each processor has a 16 byte cache.
 Cache are directly mapped, with a block size of 2 bytes.
 Given the following sequence of read and write operations (byte numbers)
 from the two processors, determine if each operation is a
\begin_deeper 
\layout Standard

1) cache bit, 
\layout Standard

2) cold cache miss, 
\layout Standard

3) conflict cache miss, 
\layout Standard

4) capacity cache miss, 
\layout Standard

5) true coherence cache misses, or 
\layout Standard

6) false coherence cache miss.
 
\layout Standard

Assume a write back, write-allocate cache.
 
\layout Standard

Read 0 from P0, Read 8 from P0, Read 1 from P0, 
\layout Standard

Read 17 from P0, Read 0 from P0, Read 0 from P1, 
\layout Standard

Read 1 from P0, Write 8 from P1, Read 9 from P0, 
\layout Standard

Read 9 from P1, Write 8 from P0, Read 8 from P1.
\end_deeper 
\layout Subsection*

A3
\layout Enumerate

Explain how exceptions make implementations of a pipelined architecture
 harder.
 Describe different classifications of exceptions, including synchronous
 vs asynchronous, user requested vs coerced, user maskable vs user non-maksable,
 within vs between instructions, and resume vs terminate.
 
\layout Enumerate

Explain the following models of consistency, using brief examples to illustrate
 the differences between them: sequential consistency, total store order
 (TSO), partial store order, weak ordering, and release consistency.
 Also describe how each of these models influences the performance and programma
bility of a shared memory multiprocessor.
 
\layout Subsection*

A4
\layout Enumerate

Discuss the impact of each of the following factors on the branch penalty
 for conditional branches and for unconditional branches in a pipelined
 processor: 
\begin_deeper 
\layout Enumerate

The extent and amount of pre-fetching of instructions in the processor
\layout Enumerate

The processor architecture being VLIW contrasted with superscalar 
\layout Enumerate

The extent to which the processor has "speculative" execution of both instructio
n streams following a conditional branch.
\end_deeper 
\layout Enumerate

As the chief design architecture of a new pipelined superscalar processor,
 one of your tasks is to decide the strategy for handling branch instructions.
 One of your colleagues on the design team is pushing for use sophisticated,
 expensive branch prediction techniques.
 Another colleague opposes this approach by asserting that "Correct prediction
 of a conditional branch outcome merely changes the branch penalty from
 that of a conditional branch to that of an conditional branch.
 
\begin_deeper 
\layout Standard

Do you agree or disagree with the assertion of the second colleague? Give
 logical arguments to support your case.
 
\end_deeper 
\layout Enumerate

Suppose a pipelined processor has an 8-stage instruction execution pipeline,
 with each stage taking one cycle.
 10% far of all instructions are unconditional branches and 20% are conditional
 branches.
 Conditional branches are "successful" (jump to the target) 80% of the time
 on average.
 You may assume that there are around no delays due to other causes (e.g.,
 data hazards).
\begin_deeper 
\layout Standard

The penalty for an unconditional branch is 2 cycles.
 For a conditional branch, if the target is predicted, the penalty for an
 incorrect prediction is 6 cycles while that for a correct prediction is
 2 cycles.
 If the next sequential instruction is predicted, the penalty for incorrect
 and correct predictions is 8 and 0 cycles and respectively.
 
\layout Enumerate

Suppose the strategy is to always predict that conditional branches will
 go to the target.
 What will be the effective CPI (cycles per instruction) of the processor?
 
\layout Enumerate

Suppose dynamic prediction is being used.
 What should be the minimum success rate of the prediction strategy so that
 the instruction execution rate is higher than for strategy (a)?
\end_deeper 
\layout Section*

2000
\layout Subsection*

A1
\layout Enumerate

Explain the limitations of using clocks-rate and MIPS measure to evaluate
 the performance of a computer system.
\layout Enumerate

State the difference between stack based, accumulator based, register based,
 and load-store architectures.
 
\layout Enumerate

What is the difference between static issue (VLIW) and the dynamic issue
 (Superscalar) processors? Compare these two designs with the following
 four considerations.
 
\begin_deeper 
\layout Enumerate

Complexity of the hardware 
\layout Enumerate

Complexity of the compiler 
\layout Enumerate

Code Density 
\layout Enumerate

Backward Compatibility 
\end_deeper 
\layout Enumerate

Give examples to show RAW, WAW, and WAR dependencies in code sequences.
 Explain what instruction set design features will cause WAW hazards and
 WAR hazards.
 
\layout Subsection*

A2
\layout Standard

Consider the implementation of pipelined DLX, a simple instruction set described
 by Hennessey and Patterson.
 For integer operations, there are 5 pipeline stages Instruction Fetch(IF),
 Instruction Decode(ID), Execution(EX), Memory(MEM), and Write-Back(WB).
 Note that in DLX, add R1, R4, R5 means add R4 and R5 and store in R1.
 Floating point multiply and addition operations spend 5 cycles in the EX
 stage.
 Full forwarding is done to reduce the number of data hazards.
 Also, assume that there are no structural hazards in the system.
 A single delayed branch is used to mask stalls due to control hazards.
\layout Standard

Consider the following code sequence:
\layout LyX-Code

for (i=0; i< 1000; i++) 
\newline 
    Y[i] += s * X[i]
\layout Standard

This loop is implemented in DLX as follows:
\layout LyX-Code

loop: LD F0, O(R1)
\begin_deeper 
\layout LyX-Code

MULTD F0, F0, F2
\layout LyX-Code

LD F0, O(R2)
\layout LyX-Code

ADDD F4, F0, F4
\layout LyX-Code

SUBI R1, R1, 8
\layout LyX-Code

SUBI F2, R2, 8
\layout LyX-Code

BNEZ R1, loop
\end_deeper 
\layout Enumerate

Schedule the instructions to minimize the number of data and control hazards,
 without performing any loop unrolling.
 Calculate the CPI of the resulting execution of the loop.
 
\layout Enumerate

Now, unroll the loop so that each new iteration does the work of two iterations
 of the original loop.
 Schedule the instructions to minimize the number of data and control hazards
 and calculate the CPI of the resulting execution of the loop.
 Use the number of instructions in the original loop as the basis for calculatin
g the CPI.
 
\layout Enumerate

Now, suppose the machine has superscalar abilities, with the following features:
\begin_deeper 
\layout Enumerate

Up to 2 instructions can be started in the same cycle, provided they are
 aligned at the 65 bit boundary.
 
\layout Enumerate

One of the instructions can be a load, store, branch, or integer ALU operation,
 the other instruction can be a floating point operation.
 
\begin_deeper 
\layout Standard

Unroll the loop so that each new iteration does the work of four iterations
 of the original loop (i.e., the loop used in part a) ).
 Schedule the instructions to maximize the use of superscalar abilities
 and minimize the number of data and control hazards, Calculate the CPI
 of the resulting execution of the loop.
 As in part b), use the number of instructions in the original loop as the
 basis for calculating the CPI.
\end_deeper 
\end_deeper 
\layout Subsection*

A3
\layout Enumerate

(12 pt) One approach to reducing misses in a cache is to prefetch the next
 bloc.
 A simple but effective strategy is: one block "i" is referenced, make sure
 block "i+1" is in the cache, and if not, prefetch it.
 
\begin_deeper 
\layout Enumerate

Do you think this automatic prefetching strategy is more or less effective
 with the increasing block size? Why? 
\layout Enumerate

Is it more or less effective with increasing cache size? Why?
\layout Enumerate

Is it more or less effective in an instruction cache compared to a data
 cache? Why? 
\layout Enumerate

Is it more or less effective in a TLB compared to a regular cache? Why?
\end_deeper 
\layout Enumerate

(13 pt) Compare and contrast each of the following schemes, assuming that
 each line can hold either the one instruction or one data item.
 
\begin_deeper 
\layout Enumerate

A fully associative cache of size "2n" lines using the LRU replacement policy.
\layout Enumerate

Separate instruction and data caches, both fully associative using the LRU
 replacement policy, and each of size "n" lines
\layout Enumerate

An instructions look-behind buffer that holds the "n" most recently executed
 instructions with an associative search to determine whether a desired
 instruction is present in the buffer, and a fully associative cache of
 size "n lines" (for both instructions and data ) using LRU replacement
 policy.
 
\layout Standard

Compare the above schemes with respect to performance, complexity, and types
 of programs (eg., code structure or data access patterns) that are speeded
 up by each scheme.
 
\end_deeper 
\layout Subsection*

A4
\layout Enumerate

(15pt) Describe a write-invalidate write-back cache-coherence protocol for
 centralized shared memory machines in which four states are used: invalid,
 shared (read only), exclusive (read/write), and clean private (read only).
 Draw the state transition diagram, showing the transitions and actions
 taken.
 Explain each transition taken in 1-2 sentences, classifying them on the
 basis of the initial state.
 
\layout Enumerate

(4 pt) Explain why a write-update or write-through coherence protocol are
 not commonly implemented, especially for larger centralized shared memory
 configurations.
 
\layout Enumerate

(6 pt) What is Amdahl's Law? What does it tell us about the maximum speed
 up that can be attained by a parallel processor? Give an expression for
 the speed up as stated by the law.
\layout Section*

1999
\layout Subsection*

A1
\layout Enumerate

Explain the term "backward compatibility" as it applies to instruction set
 by designs.
 Give 2 examples where can certain kind of architectures have been more
 successful because of backward compatibility.
 
\layout Enumerate

Briefly explain accumulator, stack, and register based architectures, clearly
 marking their relative advantages and disadvantages.
 Give possible reasons that Java Virtual Machine(JVM) is stack based and
 not register based.
 
\layout Enumerate

State any 3 properties of any instruction set that makes it easier to pipeline
 to.
 
\layout Enumerate

What is a load-store architecture? From the currently available chips, give
 one example of a load-store design and one example of a design that is
 not load-store.
 
\layout Subsection*

A2 
\layout Enumerate

Explain the differences between centralized shared memory, distributed shared
 memory and message passing architectures.
 
\layout Enumerate

Explain the significance of coherence protocols for centralized shared memory
 and distributed shared memory machines.
 Also explain why a coherence protocol is not required for message passing
 machine.
 
\layout Enumerate

Explain the difference between snooping and directories based a coherence
 protocols.
 State which one of this is more suitable for centralized shared memory
 system and which one is more successful for distributed shared memory systems?
 
\layout Enumerate

Describe all write-invalidate, write-back snooping coherence protocol for
 centralized shared memory system.
 Assume that 3 states are possible for any cache block: invalid, shared,
 and exclusive.
 Draw the coherence diagram and explain each transaction in one sentence.
 
\layout Subsection*

A3 
\layout Enumerate

Give an example of each of the following types of interactions between the
 operands of successive instructions: 
\begin_deeper 
\layout Standard

Read/Write, Write/Read, and Write/Write 
\layout Standard

For each of these scenes interactions, explain the effect on the performance
 of a pipelined processor executing these instructions.
 
\end_deeper 
\layout Enumerate

Described as clearly as you can the technique of internal forwarding.
 In this technique, how are operand interactions detected and what action
 is taken to reduce their impact on the performance of the processor? For
 each of the 3 interactions in part (1), explain how internal fording eliminates
 or reduces potential delays.
 
\layout Enumerate

How does dynamic scheduling help in further eliminating or reducing delays
 due to the dependences between instructions.
 
\layout Subsection*

A4
\layout Standard

A computer system that uses Virtual Memory with Paging uses a Translation
 Lookaside Buffer (TLB) to speed up address translation.
 It also has a cache which uses physical memory addresses.
 Both Virtue and Physical addresses (byte-addressable) are 32-bit long,
 and page size is 4 kilobytes.
 The TLB can store 16 page table entries, where each entry is 4 bytes long.
 The TLB uses a direct-mapped organization.
 The cache is of size 1 kilobytes and is set-associative with four frames,
 each containing 64 lines of 4 bytes each.
 
\layout Enumerate

Show all the different fields into which a Virtual Addresses and a Physical
 Address are divided for the purpose of looking up the TLB and the cache
 respectively.
 Clearly indicate the position of each field in the address, its size, and
 the purpose it is used for.
\layout Enumerate

List in detail all the steps required to fetch a data item in this system,
 given the Virtual Address of the item.
 Use a diagram to show how the different views of the Virtue and Physical
 Addresses are used in the various steps.
 Be sure to consider all possible cases: hit/miss in the TLB, hit/miss in
 the cache, etc.
 
\layout Section*

1998
\layout Subsection*

A1
\layout Enumerate

Comment on the issues involved in using virtual addresses in a cache as
 compared to using physical addresses.
 
\layout Enumerate

State and explain briefly and 2 techniques for achieving higher main memory
 bandwidth.
 Compare and contrast between the 2 techniques.
 
\layout Enumerate

13 pt
\begin_deeper 
\layout Enumerate

Specify all the portions of a cache address assuming 32-bit physical address,
 32-byte cache block site, and a single cache of the size 1024 kilobytes.
 Consider the following 2 cases: direct mapping, 4-way associative.
 
\layout Enumerate

What are the possible advantages/disadvantages of increasing the associate
 unity in the set-associative cache organization?
\end_deeper 
\layout Subsection*

A2
\layout Enumerate

State and explain amdahl's Law.
 In the same context, comment if superlinear speed up is possible in real
 situations.
 
\layout Enumerate

What are the limitations in achieving a linear speedup in the real parallel
 machine for real application? 
\layout Enumerate

12 pt
\begin_deeper 
\layout Enumerate

Compare and contrast the centralized the shared-memory of architecture vs.
 distributed shared-memory architecture.
 
\layout Enumerate

Explain the use of multiple caches in centralized the shared-memory architecture
 for multiprocessors.
 What problems does the use of the cache cause? How are they resolved?
\end_deeper 
\layout Subsection*

A3
\layout Enumerate

In the context of pipelined processors, explain the terms dynamic scheduling
 and static scheduling, clearly describing the role that the compiler and
 the hardware plays in each.
 
\layout Enumerate

What is the difference between static issue(VLIW) and dynamic issue (super
 scalar) processors? Compare these 2 designs with the following for consideratio
ns: 
\begin_deeper 
\layout Enumerate

Complexity of the hardware 
\layout Enumerate

Complexity of the compiler 
\layout Enumerate

Code density 
\layout Enumerate

Backward compatibility 
\end_deeper 
\layout Enumerate

Explain how loop unrolling can help in improving performance in pipelined
 and multiple issue machines.
 
\layout Subsection*

A4
\layout Standard

Consider the simplest the implementation of pipelined integer DLX, an instructio
n set described by Hennessy and Patterson.
 There are 5 pipelines stages, Instruction Fetch(IF), Instruction Decode
 (ID), Execution (EX), Access Data Memory(MEM), and Write-Back into the
 Registers (WB).
 Assume that all the registers are written only during the last cycle in
 the pipeline.
 Also, note that in DLX, add R1, R3, R5 means add R4 and R5 and store in
 R1.
 Load R3, O(R1) means that O is added to R1, and the result is used as the
 source address; R3 is the destination of the load.
 No structural hazards are possible.
\layout Standard

Consider the following code sequence.
 
\layout LyX-Code

Load R3, O(R1)
\layout LyX-Code

Add R5, R3, R4
\layout LyX-Code

Load R6, O(R2)
\layout LyX-Code

Add R7, R6, R4
\layout Enumerate

Initially, assume that no forwarding is possible.
 Draw the resulting pipeline timing chart and calculate the number of cycles
 it takes to finish in the above code sequence.
 
\layout Enumerate

Assume that the forwarding is done in the implementation of the machine.
 Show the resulting pipeline timing chart and calculate the number of cycles
 required.
 
\layout Enumerate

How can the performance of the above code be improved through compiler re-orderi
ng of the instructions? Show the new pipeline timing chart and calculate
 the number of cycles required.
 
\layout Section*

1997
\layout Subsection*

A1
\layout Enumerate

Consider the following code sequence 
\begin_deeper 
\layout LyX-Code

for (j=0; j<4; j++)
\layout LyX-Code

  for (i=0; i<256; i++)
\layout LyX-Code

    x[i][j] += y[i][j]
\layout Standard

Assume that 
\layout Itemize

The arrays x and y are dimension 256*4.
 The storage allocation is row major.
 
\layout Itemize

Each data item is 4 bytes.
 
\layout Itemize

Cache block size is 32 bytes.
 
\layout Itemize

Total cache size is 8 kilobytes.
 
\layout Itemize

The cache is fully associative with least recently used replacement policy.
 
\layout Itemize

The cache is write-back, with write-allocate.
 
\layout Enumerate

Determine the number of cache misses during the execution of the above loop-nest.
 Assume that no elements of x or y are in cache initially.
 
\layout Enumerate

Based upon your result in the previous part, will there be any advantage
 to applying the loop transformation to the previous code? 
\layout Enumerate

How will you answer be part (b) changed if the cache was directly mapped?
 
\end_deeper 
\layout Enumerate

Explain the Amdashl's law? How does Amdashl's law predict performance for
 the parallel machines? What on the other factors which influence the performanc
e of applications on parallel machines? Explain the effect of these factors
 separately for message passing machines, centralized shared memory machines
 and distributed shared memory machines (8+8+3+3+3).
 
\layout Enumerate

25 pt
\begin_deeper 
\layout Enumerate

Explain the difference between RISC and CISC architecture.
 State 2 important reasons for the emergence of RISC designs.
 
\layout Enumerate

State the difference between stack based, accumulator based, register days
 and load-store architectures.
 Give 1 example of a processor which is registered based, but not load-store.
 
\layout Enumerate

State 2 important considerations in design of an instruction set, from the
 view-point of pipe Line implementation and performance.
 
\end_deeper 
\layout Enumerate

Consider the simplest implementation of pipelined integer DLX, a simple
 instruction set described by Henson and Patterson.
 For integer operations, there are 5 pipeline stages Instruction Fetch(IF),
 Instruction Decode(ID), Execution(EX), Memory(MEM), and Write-Back(WB).
 Assume that all the registers are written only during the last cycle in
 the pipeline.
 Note that in DLX, add R1, R4, R5 means add R4 and R5 and store in R1.
 For any of the ALU operations (like add or sub), the operands are read
 during the decode(ID) stage and the calculations are performed during the
 EX stage.
\begin_deeper 
\layout Standard

Consider the following code sequence: 
\layout LyX-Code

add R1, R4, R5
\layout LyX-Code

sub R6, R1, R5
\layout LyX-Code

add R7, R1, R6
\layout LyX-Code

add R8, R4, R1
\layout Enumerate

If we stall the pipeline whenever there's a date hazard, how many cycles
 with it take to complete the above instructions sequence? Draw the resulting
 pipeline.
 
\layout Enumerate

Can the performance of the above code be improved by compiler-recordering
 of the instructions? Redraw the resulting pipeline and calculate the number
 of cycles required.
 
\layout Enumerate

Assume that we now have a machine which is able to input data from any pipeline
 register rather than having to wait till the data is written to the real
 registers.
 Show the new pipeline and calculate the number of cycles required.
 
\the_end