Computer Science 281
Computer Organization

Denison

Project 4 (HW 10)

SimpleScalar Simulators

Project Binaries and Benchmarks

The simulators that you will be executing for this assignment have been compiled for Linux. They reside on the filesystem served by sunshine, and you can access the executables by appending the following directory to the PATH environment variable initialized in your .bash_profile login initialization file.

Add the following line to your .bash_profile:

export PATH=$PATH:/usr/sunshine/cache/simplesim-3.0

You could also create a new environment variable that holds a string with the directory path for the benchmark programs provided. To do this, add the following line to your .bash_profile file:

export BENCH=/usr/sunshine/cache/benchmarks

After changing your .bash_profile, integrate the changes into the current Terminal shell by "sourcing" the script file:

prompt> . ~/.bash_profile

Project Problems

Problem 1: Cache simulation and associativity.

You can learn about the "go" SPEC95 benchmark by looking at the web page http://www.spec.org/cpu95/news/099go.html.

This problem introduces the sim-cache cache simulator. In order to simulate several cache configurations at one time, the simulator utilizes an algorithm devised by Rabin Sugumar and Santosh Abraham while they were at the University of Michigan. You can look at the sim-cache summary by entering the command sim-cache by itself.

Use a single run of sim-cache to simulate the performance of the following cache configurations for the SPEC95 benchmark "go".

Static (i.e. unchanging) part of the configuration

* least-recently-used (LRU) replacement policy
* 16-byte cache lines
* Unified instruction and data cache
* Same benchmark program and input configuration (the 'go.alpha 2 8 go.in' part of the command line below).

Dynamic part of the configuration, changing from run to run:

* Cache size (determined by the two parameters below)
* Number of cache sets, from 128 to 2048
* Associativity, from 1 to 8

The command line will be:

sim-cache <arguments> <benchdir>/go.alpha 2 8 go.in

where

* <arguments> is the list of sim-cache parameters needed to produce results for the specified cache configurations.
* go.alpha is the SimpleScalar binary for the "go" benchmark compiled for the alpha architecture.
* "2 8" are the play quality and the board size for the go simulation.
* "go.in" is a file that specifies the starting board position (in this case you need to create an empty file named go.in in the directory where you will be performing the experiments).

You can watch the two computer players as they present their moves. The game (and therefore the simulation) will end at move number 40.

Your objective is to design a set of experiments to answer the question of the relative merit of increasing associativity versus increasing the number of sets in improving performance.

Start by understanding your workload. Run simulations to find the point of diminishing returns as the cache size increases.

Using the output from sim-cache, for caches of equivalent size, verify using simple calculations whether increasing associativity or increasing the number of sets gives the greater benefit. Explain how you did your computations, give your results, and discuss the relative benefits of cache associativity versus the number of sets in the cache. You should include at least four comparisons. Is the trend always the same? Suggest reasons for the trends or the lack of trends.

Problem 2: Cache simulation and unified versus separate instruction and data caches.

This problem continues in its use of the sim-cache simulator.

(a) Use sim-cache for the following sets/associativity combinations using a unified cache:

* 128/1
* 128/2
* 128/4
* 2048/1
* 2048/2
* 2048/4

Note: You can make the simulation run faster by including parameters telling sim-cache not to simulate parts of the memory hierarchy that it normally would. These parameters are:

-cache:il2 none -cache:dl2 none -tlb:dtlb none -tlb:itlb none

(b) Using the flexibility of sim-cache, compare the performance of separate instruction and data caches to the unified cache. The total cache size should be the same as the size of the unified cache. The simulator will give you the miss rates for the instruction and data caches. Use these to compute an overall miss
rate for the two combined. Do this for each of the configurations in part (a).

Discuss the relationship between the miss rates from part (a) and the separate and combined miss rates. What do your results suggest about using combined/separate instruction and data caches? What other factors might affect this choice?

Note: Each simulation will now contain a single L1 instruction cache and a single L1 data cache. The size of each of these will be half that of the unified cache to which you are comparing results. The sizes should be adjusted by halving the number of sets in each cache.

Problem 3: Cache replacement policies.

Recall that the LRU replacement policy "guesses" the block in a set that will be used farthest in the future by using information about when the blocks were used in the past. An unimplementable but useful comparison algorithm known as MIN knows which block actually will be used farthest in the future. Therefore, using a MIN replacement policy would result in better performance than using the LRU replacement policy.

For a direct-mapped cache (set size 1), would you expect the results for LRU and MIN replacement policies to be different? Why or why not?

The sim-cache simulator also supports the replacement policies FIFO and RAN (random). FIFO replaces the cache line that has been in the cache the longest. Random arbitrarily picks a line to evict.

Redo the simulation in Problem 1 using the other two replacement policies. Discuss the results for associativity 1.

Define differences in miss-rate as

(original miss-rate - new miss-rate) / original miss-rate.

Compute the differences in miss-rate when changing from the LRU to the FIFO and to RAN replacement policies for all cache sizes and associativities 2 and 4.

What do your results suggest about the cache replacement policies?

Problem 4: Generating profile information.

The sim-profile simulator allows you to find out how often certain events happen during the execution of a program. Enter sim-profile by itself to see a list of all of the parameters that allow you to see profile information.

Run each of the following (either one at a time or all at once using the -all parameter) for the SPEC95 benchmark "go". In each case, describe the information that is given and what you learned about the go program from it (e.g., it does no floating-point computations.) Be creative and be specific. The command line you will use is:

sim-profile <arguments> <benchdir>/go.alpha 2 8 go.in

where

* <arguments> is the list of sim-profile parameters.
* go.alpha is the SimpleScalar binary for the "go" benchmark.
* "2 8" are the play quality and the board size for the go simulation.
* "go.in" is a file that specifies the starting board position (in this case an empty file.)

You should generate and discuss the output for the following profile parameters:

* -iclass Counts of each instruction type.
* -iprof Counts of each instruction.
* -brprof Counts of each branch type.
* -amprof Counts of each addressing mode.
* -segprof Counts of accesses to each data segment.

Generate the output for these profile parameters and describe the information that they provide.

* -tsymprof Count of accesses to text segment symbols.
* -tsymprof -internal Same as above, but include compiler symbols.
* -taddrprof Count of accesses to addresses in the text segment.
* -dsymprof Count of accesses to data segment symbols.
* -dsymprof -internal Same as above, but include compiler symbols.

Problem 5: Introduction to timing (i.e., execution-driven) simulation.

The sim-cache simulator simulates only the state of the memory system; it does not keep track of the timing of events. The sim-outorder simulator does. In fact, it simulates everything that happens in a superscalar processor pipeline, including out-of-order instruction issue, the latency of the different execution units, the effects of using a branch predictor, etc. Because of this, sim-outorder runs more slowly, but it also generates much more information about what happens in a processor. Enter sim-outorder by itself to see a list of all of the parameters that you can set. Look through this list carefully, and try to identify what each one is for.

Because sim-outorder keeps track of timing, it can report the number of clock cycles that are needed to execute the given program for the simulated processor with the given configuration. Run the SPEC95 benchmark "go" with the default parameters for sim-outorder using the command line:

sim-outorder <arguments> <benchdir>/go.alpha 2 8 go.in

where

* <arguments> is the list of sim-outorder parameters.
* go.alpha is the SimpleScalar binary for the "go" benchmark.
* "2 8" are the play quality and the board size for the go simulation.
* "go.in" is a file that specifies the starting board position (in this case an empty file.)

with <arguments> empty, so that the defaults are used. Record the instructions simulated per second, instructions-per-cycle (IPC), and the cache miss rates for the L1 and L2 caches.

Now execute the same simulation with a "perfect" memory system. (Note: This is not a perfect memory system, in that the memory access delay will be from 1 to 4 cycles, but it is as close to perfect as you can get without modifying the simulator source code.) By setting the delay for every memory unit (L1 cache, L2 cache, memory, and TLB) to 1 cycle, you can simulate what would happen if the processor did not have to wait for the memory system to deliver or accept data. Record the instructions simulated per second, the IPC, and the cache miss rates for the L1 and L2 caches.

Verify that your second simulation set the latency for all memory devices to 1 cycle by looking at the output generated before the simulation started. Also verify that the same memory activity took place in both simulations by comparing the cache miss rates.

Compute the change in instructions per cycle (IPC) from the first run to the second. Define the percent change in IPC as

% change = 100 x (new IPC - old IPC) / old IPC.

What does this tell you about the importance of the memory system in creating fast processors (at least for the go benchmark)? Could you have come to this conclusion using the non-timed simulator, sim-cache? Explain your answer.

Speed is almost always a concern when building a simulator. Your go simulations simulated about 3 million instructions. Researchers commonly simulate billions of instructions using the same simulators. Using the instructions simulated per second, calculate the percent change between sim-outorder and sim-cache. Your answer should be larger than 1. What is your opinion as to the difference in simulation speed? How many times slower would sim-outorder have to be before it would not be easy to do this problem using it?

Problem 6: The need for branch prediction.

Run sim-profile to find out the distribution of instruction types for the SPEC95 benchmark "go". You will use the command line:

sim-profile -iclass <benchdir>/go.alpha 2 8 go.in

where

* go.alpha is the SimpleScalar binary for the "go" benchmark.
* "2 8" are the play quality and the board size for the go simulation.
* "go.in" is a file that specifies the starting board position (in this case an empty file.)

What percentage of the instructions executed are conditional branches? Given this percentage, how many instructions, on average, does the processor execute between each pair of conditional branch instructions (do not count the conditional branch instructions themselves)?

Using your textbook, class notes, other references, and your own opinion, list and explain several reasons why the performance of a processor like the one simulated by sim-outorder (e.g., out-of-order-issue superscalar) will suffer because of conditional branches. For each reason, also explain how--if at all--a branch predictor could help the situation.

Problem 7: Branch predictors.

The sim-outorder simulator allows you to simulate 6 different types of branch predictors. You can see the list of them by looking at the -bpred parameter listed in the output generated when you enter sim-outorder by itself. For each of the 6 default predictors, run the same go simulation as you did above and note the branch-direction prediction rate and the IPC for each. Each command line will be similar to:

sim-outorder -bpred nottaken <benchdir>/go.alpha 2 8 go.in

For at least 4 of the branch predictors, describe the predictor. Your description should include what information the predictor stores (if any), the amount of storage (in bits) that is required, how the prediction is made, and what the relative accuracy of the predictor is compared to the others (use the branch-direction measurements for this). Finally, describe how the prediction rate affects the processor IPC for the go benchmark. Do you believe that the results for go will generalize to all programs? If not, describe a program for which the prediction rates or the IPCs would be different.

Extra Credit:

Problem 8: Cache efficiency.

In his Ph.D. thesis written at the University of Wisconsin, Doug Burger defined cache efficiency to be the portion of time that data in the cache is considered to be live, divided by the total time that data could be live.

A cache line is live if it contains valid data that will be read again before the program halts, the data is overwritten, or the line is evicted from the cache. The total time that a cache line could be live is the entire execution time of the program. Thus, the efficiency of a cache is computed as the sum of the live times of all of the lines in the cache divided by the execution time multiplied by the number of lines in the cache.

Burger found that cache efficiencies tend to be lower than we might expect. He suggested that because much of the data in the cache will never be used again (this is what efficiency measures), the miss-rate of the cache might be tied to this measurement. If this is the case, then finding a better way to decide what will be kept in the cache will raise the efficiency and also lower the miss-rate of the cache.

Modify the source code of the sim-cache simulator to include a calculation of cache efficiency. The way to do it is to keep track of the total time (define the time of event X as the number of cache accesses that have occurred before X in the simulation) that data is live in the cache. When a cache line becomes valid, mark it with a time-stamp. If it is used later, compute the difference between the current time and the saved time-stamp, then update the time-stamp so that a future reference to the line will add to the live time of this line from this point. Keep a sum of all of the live times for all cache lines. When the simulation ends, use this sum to calculate the cache efficiency as defined above.

Note: This modification will include adding a place to keep the sum and adding timestamps to each cache line.

The sum will be a variable of type SS_COUNTER_TYPE. Like the cache-hits counter, it will be defined in cache.h. It will also be initialized, updated, and output in the same places in cache.c that the cache-hits counter is.

The timestamps will be of type SS_TIME_TYPE. They will be defined in the same place in cache.h as the cache-line status variable. They will also be initialized, updated, and used to update the sum in the same places in cache.c that the cache-line status variable is modified.

Run the experiments in Problem 2(a) using your modified version of the sim-cache simulator. For each cache size, plot the hit rate (defined as 1 - miss rate) and the efficiency of the cache on the same plot. Does the efficiency of the cache predict the miss rate?

For each of the cache sizes 128 cache lines and 2048 cache lines, run the same experiment as in Problem 2(a), but make the cache fully associative (a single set with 128 and 2048 lines, respectively). For each cache size, do two runs: one using the LRU replacement policy and one using the "random" replacement policy. Discuss the following:

* Compare the results for each of the two fully-associative cache sizes using the LRU replacement policy with each of the results for a cache with the same total capacity from Problem 1. Do the results make sense?

* Compare the results for each fully-associative cache using the LRU replacement policy with the results for the same cache using the random replacement policy. What could have caused the results?

Problem 9: _Always_ mispredicting branches

The sim-outorder simulator does not include a way to disable branch prediction. This is because out-of-order processors always benefit from predicting branches as taken or not taken, even if a more resource-intensive predictor is not implemented.

Modify the sim-outorder simulator to include the parameter -bpred none. This parameter will cause the simulator to model a processor with no branch prediction by treating every conditional branch as a mispredicted branch. To do this, modify the file sim-outorder.c. Look at the code that handles the perfect branch prediction command-line parameter, and create new code for the parameter none.

Run the simulator with no branch prediction. How do the IPC results compare to those for the other 6 branch prediction schemes?