Computer Organization

Lab: Cache Performance

Purpose: Learn how caches affect program performance.
Method: Work in teams of 2 to analyze the effects of different cache organizations using two instruction streams.
Preparation: Read chapter 7 in the textbook.
Files to Use: None.
What to Hand In: Each team must turn in a written report with diagrams.

The attached diagrams show three types of cache: (1) a direct mapped cache with one word per line, (2) a direct mapped cache with four words per line, and (3) a two-way set associative cache with four words per line. The code segments below give two possible instruction streams: code segment 1 contains a loop whose length k can vary, and code segment 2 contains a fixed 16-instruction loop that calls a subroutine. We will analyze how these caches handle the instruction streams, as described in the problems below, to determine how cache organization affects the performance of the code.

The caches are all the same size: 128 bytes, or 32 words. Because we consider only instruction caches and every instruction is one word, the two least significant bits of each address are always zero. They are not used to select a byte within a word, as they might be in a data cache.
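The address fields of all three caches follow from these sizes. The Python sketch below derives the field widths; it assumes 32-bit addresses and treats cache (3) as 8 lines grouped into 4 sets of 2 ways (32 words / 4 words per line / 2 ways). The name `field_widths` is hypothetical.

```python
import math

ADDRESS_BITS = 32
CACHE_WORDS = 32       # every cache holds 128 bytes = 32 words
BYTE_OFFSET_BITS = 2   # bits 0-1, always 00 for word-aligned instructions


def field_widths(words_per_line, ways):
    """Return (word_offset_bits, index_bits, tag_bits) for a 32-word instruction cache."""
    lines = CACHE_WORDS // words_per_line
    sets = lines // ways
    word_offset_bits = int(math.log2(words_per_line))
    index_bits = int(math.log2(sets))
    tag_bits = ADDRESS_BITS - BYTE_OFFSET_BITS - word_offset_bits - index_bits
    return word_offset_bits, index_bits, tag_bits


for name, words_per_line, ways in (("cache (1)", 1, 1), ("cache (2)", 4, 1), ("cache (3)", 4, 2)):
    print(name, field_widths(words_per_line, ways))
# cache (1) (0, 5, 25)
# cache (2) (2, 3, 25)
# cache (3) (2, 2, 26)
```

Cache (3) has one fewer index bit than cache (2), so its tag is one bit longer.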

The caches have the following additional characteristics:

Each cache takes one clock cycle for a hit.
Cache (1), the direct mapped cache with one word per line, takes 5 clock cycles for a cache miss.
Cache (2), the direct mapped cache with four words per line, takes 7 clock cycles for a cache miss.
Cache (3), the two-way set associative cache, takes 7 clock cycles for a cache miss.
The clock cycle for the CPU with the set associative cache is 10% longer than the clock cycle for the two direct mapped caches.
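Under the parameters above, the execution time of an instruction stream can be written as below. This is a sketch that assumes the miss figure is the total time of a fetch that misses; if it is meant as a penalty added to the 1-cycle hit, replace P with 1 + P. The symbols N_hit (fetches that hit), N_miss (fetches that miss), P and t_0 (the direct mapped clock period) are introduced here for illustration.

```latex
T = \bigl(N_{\mathrm{hit}} \cdot 1 + N_{\mathrm{miss}} \cdot P\bigr)\, t_{\mathrm{clk}},
\qquad
P = \begin{cases} 5 & \text{cache (1)} \\ 7 & \text{caches (2) and (3)} \end{cases},
\qquad
t_{\mathrm{clk}} = \begin{cases} t_0 & \text{caches (1) and (2)} \\ 1.1\, t_0 & \text{cache (3)} \end{cases}
```

Here N_hit + N_miss equals the total number of instructions fetched, including those before the loop.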

Problem 1: Comparing Direct Mapped Caches

Consider cache (1), which is direct mapped and contains 32 lines of one word each. Five bits of the instruction address (bits 2-6) index the cache line, and the remaining 25 bits (bits 7-31) are stored in the tag field of the cache.
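As a check on this split, the sketch below extracts the line index and tag of the instruction-30 address given for k = 32. It is an illustrative Python sketch; `split_cache1` is a hypothetical helper name.

```python
def split_cache1(addr):
    """Split a 32-bit instruction address into (tag, index) for cache (1).

    bits 0-1  : byte offset (always 00 for instructions, not used)
    bits 2-6  : 5-bit line index (32 lines)
    bits 7-31 : 25-bit tag
    """
    index = (addr >> 2) & 0b11111
    tag = addr >> 7
    return tag, index


addr = 0b1001_1000_1111_0000_0000_1100_1101_1000   # <instruction 30> when k = 32
tag, index = split_cache1(addr)
print(f"line {index}, tag {tag:025b}")
# line 22, tag 1001100011110000000011001
```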

Assume that code segment 1 has k = 32 instructions in the loop. The last three instructions in the loop would then have the following addresses:

1001 1000 1111 0000 0000 1100 1101 1000	<instruction 30>
1001 1000 1111 0000 0000 1100 1101 1100	addi $s5, $s5, 1	# instruction 31
1001 1000 1111 0000 0000 1100 1110 0000	bne $s5, $s6, loop	# instruction 32
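These three addresses follow the pattern in the code segment 1 table: the setup instructions sit at ...0101 1100 and ...0110 0000, and loop instruction i sits 4(i - 1) bytes after ...0110 0100. The sketch below regenerates the addresses for any k. `segment1_addresses` and `nibbles` are hypothetical helper names, and the base address is taken from the code segment 1 table.

```python
SETUP_START = 0x98F00C5C   # "addi $s6, $0, 1001", first instruction of code segment 1


def segment1_addresses(k):
    """Addresses of code segment 1 in program order: 2 setup instructions, then k loop instructions."""
    setup = [SETUP_START, SETUP_START + 4]
    loop = [SETUP_START + 8 + 4 * i for i in range(k)]   # loop instruction i + 1
    return setup + loop


def nibbles(addr):
    """Format an address as 8 groups of 4 bits, like the tables in this lab."""
    bits = f"{addr:032b}"
    return " ".join(bits[i:i + 4] for i in range(0, 32, 4))


for addr in segment1_addresses(32)[-3:]:
    print(nibbles(addr))
# 1001 1000 1111 0000 0000 1100 1101 1000
# 1001 1000 1111 0000 0000 1100 1101 1100
# 1001 1000 1111 0000 0000 1100 1110 0000
```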

1. Use the diagram of the direct mapped, one-word-per-line cache to show the contents of the cache and the values of the tag fields after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations. Remember to include the instructions preceding the loop in your calculation. (Note: do not forget the compulsory cache misses the first time around the loop.)
2. Consider what happens when one instruction is added to the body of the loop in code segment 1, so that k = 33. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations. As the loop grows one instruction at a time, how does its execution time increase?
3. Repeat part 1 using cache (2), direct mapped with four words per line. Remember that on a cache miss the whole cache line is replaced!
4. Repeat part 2 using cache (2).
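A small simulation can be used to check hand-computed hit and miss counts for the four parts of Problem 1. It is a Python sketch under two assumptions not stated in the lab: the only cache traffic is the instruction fetch sequence of code segment 1, and a miss costs its stated cycle count in total rather than in addition to the 1-cycle hit. All function names are hypothetical.

```python
SETUP_START = 0x98F00C5C   # first instruction of code segment 1
ITERATIONS = 1001


def segment1_trace(k, iterations=ITERATIONS):
    """Instruction fetch addresses: 2 setup instructions, then the k-instruction loop repeated."""
    loop = [SETUP_START + 8 + 4 * i for i in range(k)]
    return [SETUP_START, SETUP_START + 4] + loop * iterations


def simulate_direct_mapped(trace, words_per_line, total_words=32):
    """Return (hits, misses) for a direct mapped instruction cache.

    A miss loads the whole line containing the address, replacing whatever was there.
    """
    num_lines = total_words // words_per_line
    tags = [None] * num_lines                    # None marks an empty (invalid) line
    hits = misses = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line    # line-sized block number
        index, tag = block % num_lines, block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag
    return hits, misses


def cycles(hits, misses, miss_cycles, hit_cycles=1):
    # Assumption: a miss costs miss_cycles in total, not miss_cycles on top of the hit.
    return hits * hit_cycles + misses * miss_cycles


for k in (32, 33):
    trace = segment1_trace(k)
    for name, words_per_line, miss_cycles in (("cache (1)", 1, 5), ("cache (2)", 4, 7)):
        hits, misses = simulate_direct_mapped(trace, words_per_line)
        print(f"k = {k}, {name}: {hits} hits, {misses} misses, "
              f"{cycles(hits, misses, miss_cycles)} cycles")
```

Running it for k = 34, 35, ... is one way to explore how execution time grows as the loop gets longer.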

Problem 2: Comparing a Direct Mapped Cache to a Set Associative Cache

Code segment 2 is used for this problem and contains a loop and a subroutine call.

1. Use cache (2), direct mapped with four words per line, and show the contents of the cache after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations.
2. Use cache (3), the two-way set associative cache, and show the contents of the cache after the first iteration of the loop. Calculate the time (in clock cycles) for the loop to complete 1,001 iterations.
3. Recall that the two-way set associative cache used in part 2 needs a clock cycle that is 10% longer than the clock cycle for the direct mapped cache. Taking this into account, how much faster or slower is this code with cache (3), set associative, than with cache (2), direct mapped?
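The comparison in Problem 2 can be checked the same way. The Python sketch below simulates a set associative cache, with a direct mapped cache as the one-way case. It assumes LRU replacement within a set (the lab does not name a replacement policy), assumes a miss costs 7 cycles in total, and expresses the set associative time in units of the direct mapped clock by multiplying by 1.1. The fetch order follows the jal into the function and the jr back to instruction 6; all names are hypothetical.

```python
from collections import OrderedDict

SETUP = [0x98F00C54, 0x98F00C58, 0x98F00C5C]      # instructions before "loop:"
LOOP = [0x98F00C60 + 4 * i for i in range(16)]    # loop instructions 1..16
FUNC = [0x98F00F68 + 4 * i for i in range(15)]    # "function:" instructions A..O
ITERATIONS = 1001


def segment2_trace(iterations=ITERATIONS):
    """Fetch order: setup, then per iteration loop 1-5 (ending with jal), function A-O, loop 6-16."""
    one_iteration = LOOP[:5] + FUNC + LOOP[5:]
    return SETUP + one_iteration * iterations


def simulate(trace, words_per_line, ways, total_words=32):
    """Return (hits, misses) for a set associative cache with LRU replacement.

    ways=1 gives a direct mapped cache.
    """
    num_sets = total_words // (words_per_line * ways)
    sets = [OrderedDict() for _ in range(num_sets)]   # each set: tag -> None, least recent first
    hits = misses = 0
    for addr in trace:
        block = (addr >> 2) // words_per_line
        lines, tag = sets[block % num_sets], block // num_sets
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)                    # now the most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)             # evict the least recently used line
            lines[tag] = None
    return hits, misses


trace = segment2_trace()
h2, m2 = simulate(trace, words_per_line=4, ways=1)   # cache (2)
h3, m3 = simulate(trace, words_per_line=4, ways=2)   # cache (3)
t2 = h2 + 7 * m2                 # direct mapped clock cycles
t3 = (h3 + 7 * m3) * 1.1         # set associative cycles, in direct mapped clock units
print(f"cache (2): {h2} hits, {m2} misses, {t2} cycles")
print(f"cache (3): {h3} hits, {m3} misses, {t3:.1f} direct mapped cycle equivalents")
print(f"cache (2) time / cache (3) time = {t2 / t3:.3f}")
```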

Code Segment 1

Address	Instruction	Comment
1001 1000 1111 0000 0000 1100 0101 1100	addi $s6, $0, 1001	# initialize number of iterations
1001 1000 1111 0000 0000 1100 0110 0000	add $s5, $0, $0	# initialize loop counter
	loop:
1001 1000 1111 0000 0000 1100 0110 0100	<instruction 1>	# beginning of loop body,
1001 1000 1111 0000 0000 1100 0110 1000	<instruction 2>	# which has a total
1001 1000 1111 0000 0000 1100 0110 1100	<instruction 3>	# of k instructions
...	...
1001 1000 1111 0000 0000 1100 0101 1100+4k	addi $s5, $s5, 1	# instruction k-1
1001 1000 1111 0000 0000 1100 0110 0000+4k	bne $s5, $s6, loop	# instruction k
		# end of loop

Code Segment 2

Address	Instruction	Comment
1001 1000 1111 0000 0000 1100 0101 0100	add $s4, $0, $0	# initialize total
1001 1000 1111 0000 0000 1100 0101 1000	addi $s6, $0, 1001	# initialize number of iterations
1001 1000 1111 0000 0000 1100 0101 1100	add $s5, $0, $0	# initialize loop counter
	loop:
1001 1000 1111 0000 0000 1100 0110 0000	add $a0, $s0, $0	# first parameter
1001 1000 1111 0000 0000 1100 0110 0100	add $a1, $s1, $0	# second parameter
1001 1000 1111 0000 0000 1100 0110 1000	add $a2, $s2, $0	# third parameter
1001 1000 1111 0000 0000 1100 0110 1100	add $a3, $s3, $0	# fourth parameter
1001 1000 1111 0000 0000 1100 0111 0000	jal function	# function call
1001 1000 1111 0000 0000 1100 0111 0100	add $s4, $s4, $v0	# add result to total
1001 1000 1111 0000 0000 1100 0111 1000	<instruction 7>	# remainder of loop
1001 1000 1111 0000 0000 1100 0111 1100	<instruction 8>	# which has a total
1001 1000 1111 0000 0000 1100 1000 0000	<instruction 9>	# of 16 instructions
...	...
1001 1000 1111 0000 0000 1100 1001 0100	<instruction 14>
1001 1000 1111 0000 0000 1100 1001 1000	addi $s5, $s5, 1	# instruction 15
1001 1000 1111 0000 0000 1100 1001 1100	bne $s5, $s6, loop	# instruction 16
		# end of loop
...	...
	function:
1001 1000 1111 0000 0000 1111 0110 1000	addi $sp, $sp, -16	# save state (instruction A)
1001 1000 1111 0000 0000 1111 0110 1100	sw $s0, 0($sp)	# instruction B
1001 1000 1111 0000 0000 1111 0111 0000	sw $s1, 4($sp)	# instruction C
1001 1000 1111 0000 0000 1111 0111 0100	sw $s2, 8($sp)	# instruction D
1001 1000 1111 0000 0000 1111 0111 1000	sw $ra, 12($sp)	# instruction E
1001 1000 1111 0000 0000 1111 0111 1100	<instruction F>	# body of subroutine
1001 1000 1111 0000 0000 1111 1000 0000	<instruction G>
1001 1000 1111 0000 0000 1111 1000 0100	<instruction H>
1001 1000 1111 0000 0000 1111 1000 1000	<instruction I>
1001 1000 1111 0000 0000 1111 1000 1100	lw $s0, 0($sp)	# restore state (instruction J)
1001 1000 1111 0000 0000 1111 1001 0000	lw $s1, 4($sp)	# instruction K
1001 1000 1111 0000 0000 1111 1001 0100	lw $s2, 8($sp)	# instruction L
1001 1000 1111 0000 0000 1111 1001 1000	lw $ra, 12($sp)	# instruction M
1001 1000 1111 0000 0000 1111 1001 1100	addi $sp, $sp, 16	# instruction N
1001 1000 1111 0000 0000 1111 1010 0000	jr $ra	# return (instruction O)
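When drawing the cache contents for Problem 2, it helps to see which cache (2) line and which cache (3) set each four-word block of code segment 2 maps to. The Python sketch below lists them. The helper name `line_map` is hypothetical, and the sketch relies on the cache (3) set index being the cache (2) line index mod 4, which holds because the 4 sets evenly divide the 8 lines.

```python
SETUP = [0x98F00C54, 0x98F00C58, 0x98F00C5C]      # instructions before "loop:"
LOOP = [0x98F00C60 + 4 * i for i in range(16)]    # loop instructions 1..16
FUNC = [0x98F00F68 + 4 * i for i in range(15)]    # "function:" instructions A..O


def line_map(addresses, words_per_line=4, num_lines=8):
    """Map each line-sized block touched by `addresses` to its direct mapped line index."""
    mapping = {}
    for addr in addresses:
        block_addr = addr & ~(4 * words_per_line - 1)    # clear word-offset and byte-offset bits
        mapping[block_addr] = (addr >> 2) // words_per_line % num_lines
    return mapping


for label, addrs in (("setup", SETUP), ("loop", LOOP), ("function", FUNC)):
    for block_addr, line in line_map(addrs).items():
        # cache (3) has 4 sets, so its set index is the cache (2) line index mod 4
        print(f"{label:8} block {block_addr:#010x}: cache (2) line {line}, cache (3) set {line % 4}")
```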
