Computer Organization

Cache Performance

Purpose: Learn how caches affect program performance.
Method: Analyze the effects of different cache organizations using two instruction streams and using a functional cache simulator on an executable program. You may work in teams of 1 or 2 for this assignment.
Preparation: Read chapter 5 in the textbook.
Files to Use: None for Problems 1 and 2. See below for Problem 3.
What to Hand In: Each submission must include a written report with diagrams.

The attached diagrams (pdf, doc) show three types of cache: (1) a direct mapped cache with one word per line, (2) a direct mapped cache with four words per line and (3) a two-way set associative cache with four words per line. The code segments below give two possible instruction streams, each with a variable number of instructions. We will analyze how each of these caches is used with each instruction stream as described below, in order to determine how the caches affect performance of the code.

The caches are all the same size, 128 Bytes or 32 Words. Since we consider only instruction caches, the least significant two bits of each address are always zero and are not used in addressing individual bytes within the words (instructions) in the cache, as they might be in a data cache.

The caches have the following additional characteristics:

Each cache takes one clock cycle for a hit.
Cache (1), the direct mapped cache with one word per line, takes 5 clock cycles for a cache miss.
Cache (2), the direct mapped cache with four words per line, takes 7 clock cycles for a cache miss.
Cache (3), the two-way set associative cache, takes 7 clock cycles for a cache miss.
The clock cycle for the CPU with the set associative cache is 10% longer than the clock cycle for the two direct mapped caches.
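
As a sketch, the characteristics above can be combined into a simple timing model. The symbols below are introduced here for illustration and do not appear in the handout: N_hit and N_miss are the hit and miss counts, c_miss is the miss cost for the cache in question, and t_clk is the clock period. The model assumes that a miss costs the listed number of cycles in total, rather than the hit time plus an extra penalty; check that reading against the textbook before relying on it.

```latex
\begin{align*}
C &= N_{\text{hit}} \cdot 1 + N_{\text{miss}} \cdot c_{\text{miss}},
  \qquad c_{\text{miss}} = 5 \text{ for cache (1)},\; 7 \text{ for caches (2) and (3)} \\
T &= C \cdot t_{\text{clk}},
  \qquad t_{\text{clk}}^{(3)} = 1.10 \, t_{\text{clk}}^{(1,2)}
\end{align*}
```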

Problem 1: Comparing Direct Mapped Caches

Consider cache (1) which is direct mapped and contains 32 lines of one word each. Five bits (bits 2 - 6) of the instruction address are used to index the cache line. The remaining 25 bits are stored in the tag field of the cache.
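
The field split described above can be checked with a few lines of Python. This is a minimal sketch: the function name split_address and its parameters are illustrative, not part of the assignment, and the defaults match cache (1) only (2 unused low bits, 5 index bits, 25 tag bits).

```python
def split_address(addr, index_bits=5, offset_bits=2):
    """Split a 32-bit address into (tag, index, byte_offset).

    Defaults model cache (1): bits 0-1 unused, bits 2-6 index, the rest tag.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Address of the first instruction in code segment 1 (addi $s6, $0, 1001).
addr = int("1001 1000 1111 0000 0000 1100 0101 1100".replace(" ", ""), 2)
tag, index, offset = split_address(addr)
print(f"tag={tag:025b} index={index} offset={offset}")
```

For this address the index field is 10111 in binary, so the instruction maps to line 23 of cache (1).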

Assume that code segment 1 has k = 32 instructions in the loop. The last three instructions in the loop would then have the following addresses:

1001 1000 1111 0000 0000 1100 1101 1000 <instruction 30>

1001 1000 1111 0000 0000 1100 1101 1100 addi $s5, $s5, 1

1001 1000 1111 0000 0000 1100 1110 0000 bne $s5, $s6, loop

(1) Use the diagram of the direct mapped one-word-per-line cache to show the contents of the cache and the values of the tag fields after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations. Remember to include the instructions preceding the loop in your calculation. (Note: do not forget the compulsory cache misses the first time around the loop.)
(2) Consider what happens when one instruction is added to the body of the loop for code segment 1 so that k = 33. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations. As the size of the loop is increased one instruction at a time, how does the execution time for the loop increase?
(3) Do part (1), but using cache (2), direct mapped with four words per line. Remember, on a cache miss the whole cache line is replaced!
(4) Do part (2), but using cache (2).
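
To check hand calculations of this kind, a small direct mapped cache model can help. The sketch below is not the method the assignment asks you to hand in; the names simulate_direct_mapped and loop_stream are hypothetical helpers, and the model assumes that a miss costs miss_cycles in total. It is shown on a deliberately tiny 4-line cache rather than on the assignment's caches.

```python
def simulate_direct_mapped(addresses, num_lines, words_per_line,
                           hit_cycles=1, miss_cycles=5):
    """Return (hits, misses, cycles) for a stream of word-aligned byte addresses."""
    tags = [None] * num_lines              # stored tag per line; None means invalid
    hits = misses = 0
    for addr in addresses:
        block = addr // (4 * words_per_line)   # number of the line-sized memory block
        index = block % num_lines
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag              # the whole line is replaced on a miss
    return hits, misses, hits * hit_cycles + misses * miss_cycles


def loop_stream(start, prologue, body, iterations):
    """Addresses fetched by `prologue` straight-line instructions followed by
    a loop of `body` instructions executed `iterations` times."""
    stream = [start + 4 * i for i in range(prologue)]
    loop_start = start + 4 * prologue
    for _ in range(iterations):
        stream.extend(loop_start + 4 * i for i in range(body))
    return stream


# Toy case: the code fits in a 4-line cache, so only compulsory misses occur.
print(simulate_direct_mapped(loop_stream(0, 1, 3, 10), num_lines=4, words_per_line=1))
# Toy case: a 5-instruction loop wraps around the 4-line cache and conflicts with itself.
print(simulate_direct_mapped(loop_stream(0, 0, 5, 10), num_lines=4, words_per_line=1))
```

In the second toy case the first and last loop instructions share a line and evict each other on every iteration, which is the effect part (2) asks you to reason about.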

Problem 2: Comparing a Direct Mapped Cache to a Set Associative Cache

Code segment 2 is used for this problem and contains a loop and a subroutine call.

(1) Use cache (2), direct mapped with four words per line, and show the contents of the cache after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations.
(2) Use cache (3), the two-way set associative cache, and show the contents of the cache after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations.
(3) Recall that the two-way set associative cache used for part (2) needs a clock cycle which is 10% longer than the clock cycle for the direct mapped cache. Taking this into consideration, how much faster or slower is this code with cache (3), set associative, than with cache (2), direct mapped?
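
The comparison in this problem can be sketched with a small set associative model. The handout does not state the replacement policy of cache (3), so LRU is an assumption here, as is the reading that a miss costs miss_cycles in total. The names simulate_set_associative and relative_time are illustrative, and the example uses a toy address pattern rather than code segment 2.

```python
from collections import OrderedDict


def simulate_set_associative(addresses, num_sets, ways, words_per_line,
                             hit_cycles=1, miss_cycles=7):
    """Return (hits, misses, cycles) for an LRU set associative cache (LRU assumed)."""
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: tag -> True, oldest first
    hits = misses = 0
    for addr in addresses:
        block = addr // (4 * words_per_line)
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return hits, misses, hits * hit_cycles + misses * miss_cycles


def relative_time(cycles_a, cycles_b, clock_ratio_b=1.10):
    """Execution time of B divided by that of A when B's clock period is
    clock_ratio_b times A's. A value below 1 means B is faster."""
    return cycles_b * clock_ratio_b / cycles_a


# Two blocks that map to the same set, fetched alternately.
pattern = [0, 8, 0, 8]
direct = simulate_set_associative(pattern, num_sets=2, ways=1, words_per_line=1)
two_way = simulate_set_associative(pattern, num_sets=2, ways=2, words_per_line=1)
print(direct, two_way, relative_time(direct[2], two_way[2]))
```

Setting ways=1 reduces the model to a direct mapped cache, so one function covers both sides of the comparison; the slower clock is applied only when the cycle counts are converted to time.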

Code Segment 1

Address	Instruction	Comment
1001 1000 1111 0000 0000 1100 0101 1100	addi $s6, $0, 1001	# initialize number of iterations
1001 1000 1111 0000 0000 1100 0110 0000	add $s5, $0, $0	# initialize loop counter
	loop:
1001 1000 1111 0000 0000 1100 0110 0100	<instruction 1>	# beginning of loop body,
1001 1000 1111 0000 0000 1100 0110 1000	<instruction 2>	# which has a total
1001 1000 1111 0000 0000 1100 0110 1100	<instruction 3>	# of k instructions
...	...
1001 1000 1111 0000 0000 1100 0101 1100+4k	addi $s5, $s5, 1	# instruction k-1
1001 1000 1111 0000 0000 1100 0110 0000+4k	bne $s5, $s6, loop	# instruction k
		# end of loop

Code Segment 2

Address	Instruction	Comment
1001 1000 1111 0000 0000 1100 0101 0100	add $s4, $0, $0	# initialize total
1001 1000 1111 0000 0000 1100 0101 1000	addi $s6, $0, 1001	# initialize number of iterations
1001 1000 1111 0000 0000 1100 0101 1100	add $s5, $0, $0	# initialize loop counter
	loop:
1001 1000 1111 0000 0000 1100 0110 0000	add $a0, $s0, $0	# first parameter
1001 1000 1111 0000 0000 1100 0110 0100	add $a1, $s1, $0	# second parameter
1001 1000 1111 0000 0000 1100 0110 1000	add $a2, $s2, $0	# third parameter
1001 1000 1111 0000 0000 1100 0110 1100	add $a3, $s3, $0	# fourth parameter
1001 1000 1111 0000 0000 1100 0111 0000	jal function	# function call
1001 1000 1111 0000 0000 1100 0111 0100	add $s4, $s4, $v0	# add result to total
1001 1000 1111 0000 0000 1100 0111 1000	<instruction 7>	# remainder of loop
1001 1000 1111 0000 0000 1100 0111 1100	<instruction 8>	# which has a total
1001 1000 1111 0000 0000 1100 1000 0000	<instruction 9>	# of 16 instructions
...	...
1001 1000 1111 0000 0000 1100 1001 0100	<instruction 14>
1001 1000 1111 0000 0000 1100 1001 1000	addi $s5, $s5, 1	# instruction 15
1001 1000 1111 0000 0000 1100 1001 1100	bne $s5, $s6, loop	# instruction 16
		# end of loop
...	...
	function:
1001 1000 1111 0000 0000 1111 0110 1000	addi $sp, $sp, -16	# save state
1001 1000 1111 0000 0000 1111 0110 1100	sw $s0, 0($sp)	# instruction B
1001 1000 1111 0000 0000 1111 0111 0000	sw $s1, 4($sp)	# instruction C
1001 1000 1111 0000 0000 1111 0111 0100	sw $s2, 8($sp)	# instruction D
1001 1000 1111 0000 0000 1111 0111 1000	sw $ra, 12($sp)	# instruction E
1001 1000 1111 0000 0000 1111 0111 1100	<instruction F>	# body of subroutine
1001 1000 1111 0000 0000 1111 1000 0000	<instruction G>
1001 1000 1111 0000 0000 1111 1000 0100	<instruction H>
1001 1000 1111 0000 0000 1111 1000 1000	<instruction I>
1001 1000 1111 0000 0000 1111 1000 1100	lw $s0, 0($sp)	# restore state
1001 1000 1111 0000 0000 1111 1001 0000	lw $s1, 4($sp)	# instruction K
1001 1000 1111 0000 0000 1111 1001 0100	lw $s2, 8($sp)	# instruction L
1001 1000 1111 0000 0000 1111 1001 1000	lw $ra, 12($sp)	# instruction M
1001 1000 1111 0000 0000 1111 1001 1100	addi $sp, $sp, 16	# instruction N
1001 1000 1111 0000 0000 1111 1010 0000	jr $ra	# return

Problem 3: Simulation to Assess Cache Behavior

In this problem, you will be introduced to the sim-cache simulator. You will use this simulator to perform cache simulations with various configurations.

The sim-cache simulator performs a functional simulation of an executable program coupled with an emulation of the memory system supporting the program. The emulated memory system is capable of supporting multiple levels of instruction and data caches, each of which can be configured for different sizes and organization. This allows us to measure the actual hit/miss rate of the given program for the emulated cache organization.

The sim-cache simulator (along with other simulators) is available on each of the machines in the Olin 219 lab. The executables for these simulators are all in the /usr/local/cs281/simplesim-3.0 directory. To avoid typing the full path on every run, add /usr/local/cs281/simplesim-3.0 to your search path by adding the following line to the .bash_profile file in your home directory and then logging out and back in:

export PATH=/usr/local/cs281/simplesim-3.0/:$PATH

If you are in a hurry, you can simply execute the above command in a Terminal window, and the effect will last in that shell for as long as the Terminal window remains open.

Start by running sim-cache with the -h option to get the help screen listing all the options and arguments available for configuration of a simulation run:

<command-prompt>$ sim-cache -h

Notice that for an execution run, you can use the -config option to specify a configuration file, and you must always specify an executable for the simulator to "run" and gather memory statistics on. The following configuration files can help:

cache_1a.cfg: Example configuration file for an L1 instruction cache, but no L1 data cache or any L2 caches

cache_2a.cfg: Example configuration file for an L1 data cache, but no L1 instruction cache or any L2 caches

The current versions of all the simulators are configured for PISA (Portable Instruction Set Architecture), an instruction set architecture quite similar to the MIPS architecture that we have been working with. Also included with the simulator distribution is a set of PISA executables (both little-endian and big-endian versions). For this problem, we will use the executable: /usr/local/cs281/simplesim-3.0/tests-pisa/bin.little/test-math

You may wish to create a testing directory and copy this executable into it along with the configuration files. Then, as you run the simulator and accumulate your results, you can keep the simulator's output files in the same directory.

Your goal is to use single runs of the sim-cache simulator to determine the miss ratio when we "execute" the test-math program under different conditions:

least-recently-used (LRU) replacement policy
32 to 512 sets
1-way to 8-way associativity
16-byte cache lines (block size)
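
The sweep over these conditions covers 5 set counts and 4 associativities, so 20 runs per cache type. The sketch below only generates option strings; it does not run the simulator. It assumes the SimpleScalar cache specification format name:nsets:bsize:assoc:repl, with "l" selecting LRU and il1/dl1 naming the level-1 instruction and data caches. Confirm the exact option names against the output of sim-cache -h and the provided configuration files before using them. The function names cache_option and sweep are hypothetical.

```python
from itertools import product

SETS = [32, 64, 128, 256, 512]
WAYS = [1, 2, 4, 8]
BLOCK_SIZE = 16      # bytes per cache line
LRU = "l"            # assumed replacement-policy code for LRU


def cache_option(level_name, nsets, assoc, bsize=BLOCK_SIZE, repl=LRU):
    """Build one cache option string, e.g. for level_name 'il1' or 'dl1'."""
    return f"-cache:{level_name} {level_name}:{nsets}:{bsize}:{assoc}:{repl}"


def sweep(level_name):
    """All 20 option strings for one cache type, ordered by sets then ways."""
    return [cache_option(level_name, s, w) for s, w in product(SETS, WAYS)]


for option in sweep("il1")[:3]:
    print(option)
```

Each generated line replaces the corresponding cache line in a copy of the example configuration file, leaving the settings that disable the other caches untouched.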

Run these experiments twice: once for a data-only cache and once for an instruction-only cache. Fill in the two tables below:

Miss Ratio (I-Cache)	1-way	2-way	4-way	8-way
32 sets
64 sets
128 sets
256 sets
512 sets

Miss Ratio (D-Cache)	1-way	2-way	4-way	8-way
32 sets
64 sets
128 sets
256 sets
512 sets

Once you have collected the experimental data, use Excel to plot the results of the simulations. For each of the simulations (data, instruction), plot the miss ratio versus associativity for each number of sets. Using markers, show the points on the curves which correspond to total cache sizes of 1 Kbytes, 2 Kbytes, 4 Kbytes and 8 Kbytes (total cache size = sets * block size * associativity). For each simulation, you should produce something that resembles the plot below.
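
To find which points to mark, the cache size formula above can be applied to every cell of the tables. This is a small sketch; the name configs_by_size is illustrative, and 1 Kbyte is taken to be 1024 bytes.

```python
BLOCK_SIZE = 16                      # bytes, fixed for all experiments
SETS = [32, 64, 128, 256, 512]
WAYS = [1, 2, 4, 8]


def configs_by_size(target_kbytes):
    """Map each target size in Kbytes to the (sets, ways) pairs that produce it,
    using total cache size = sets * block size * associativity."""
    return {
        kb: [(s, w) for s in SETS for w in WAYS
             if s * BLOCK_SIZE * w == kb * 1024]
        for kb in target_kbytes
    }


for kb, configs in configs_by_size([1, 2, 4, 8]).items():
    print(f"{kb} Kbytes: {configs}")
```

For example, the 4 Kbyte points are (32 sets, 8-way), (64 sets, 4-way), (128 sets, 2-way) and (256 sets, 1-way), which are the organizations to compare in questions 4 and 5.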

Now answer the following questions based on the above results.

Q 1) For a given number of sets, what effect does increasing associativity have on the miss ratio?

Q 2) For a given associativity, what is the effect of increasing the number of sets?

Q 3) For a given cache size, how does the miss ratio change when going from an associativity of one to two to four? Explain.

Q 4) If you were to design an instruction cache, limited to a total cache size of 4 Kbytes, which cache organization would you choose, based solely on performance?

Q 5) If you were to design a data cache, limited to a total cache size of 4 Kbytes, which cache organization would you choose, based solely on performance?
