When more code is less code

Last time, we crafted a function that would turn on an LED for a very short and precise amount of time. At the heart of this function was the code to figure out how many loop iterations we needed. Since each loop  takes 3 cycles, we needed to divide the total delay by 3 to get the number of iterations. Then we needed to find the remainder of that division to find out how many cycles we left over after the loops. We’d use the remainder to select one of thee versions of the looping code, where each version has the appropriate number of cycles added at the end. The pseudo code looks something like this…

count = cycles/3   // integer math will round down to nearest whole
remainder = cycles - (count*3)

switch (remainder) {
    case 0: // no remainder
        loop count times
        done

    case 1: // we need to waste one extra cycle
        loop count times
        waste 1 cycle
        done

    case 2:  // we need to waste 2 cycles
        loop count times
        waste 2 cycles
        done

This looks very simple and it is, but there is a lot of work hidden in there. It turns out that computers as small as the ATTINY used in the Ognite do not know how to multiplication and division.

There are various ways to do multiplication and division by breaking them down to a series of additions and subtractions or using very fancy binary math operations. If we look at the instructions actually generated by the compiler for our divide and multiply, there is a lot of complicated stuff going on here…

				byte loopcounter = cycles/ (byte) 3;		// TODO: either do faster bit compute here, or store dividends and remainder in lookup
  ae:	82 2f       	mov	r24, r18
  b0:	90 e0       	ldi	r25, 0x00	; 0
  b2:	6b ea       	ldi	r22, 0xAB	; 171
  b4:	70 e0       	ldi	r23, 0x00	; 0
  b6:	4d d0       	rcall	.+154    	; 0x152 <__mulhi3>
  b8:	96 95       	lsr	r25

				byte remainder = cycles - (loopcounter*3);			// THis is how many cycles we need to pick up the slack to make up for the granularity of the loop
  ba:	89 2f       	mov	r24, r25
  bc:	88 0f       	add	r24, r24
  be:	88 0f       	add	r24, r24
  c0:	59 2f       	mov	r21, r25
  c2:	58 1b       	sub	r21, r24
  c4:	85 2f       	mov	r24, r21
  c6:	82 0f       	add	r24, r18

00000152 <__mulhi3>:
 152:	55 27       	eor	r21, r21
 154:	00 24       	eor	r0, r0

00000156 <__mulhi3_loop>:
 156:	80 ff       	sbrs	r24, 0
 158:	02 c0       	rjmp	.+4      	; 0x15e <__mulhi3_skip1>
 15a:	06 0e       	add	r0, r22
 15c:	57 1f       	adc	r21, r23

0000015e <__mulhi3_skip1>:
 15e:	66 0f       	add	r22, r22
 160:	77 1f       	adc	r23, r23
 162:	61 15       	cp	r22, r1
 164:	71 05       	cpc	r23, r1
 166:	21 f0       	breq	.+8      	; 0x170 <__mulhi3_exit>
 168:	96 95       	lsr	r25
 16a:	87 95       	ror	r24
 16c:	00 97       	sbiw	r24, 0x00	; 0
 16e:	99 f7       	brne	.-26     	; 0x156 <__mulhi3_loop>

00000170 <__mulhi3_exit>:
 170:	95 2f       	mov	r25, r21
 172:	80 2d       	mov	r24, r0
 174:	08 95       	ret

Wow, dividing and multiplying by 3 is hard for a binary computer.

Turns out that multiplying and dividing by powers of 2 is very easy for a computer because in binary math these are just shifting bits up and down. Too bad we need to divide and multiply by 3. If only our loops took 4 cycles each pass…

Wait! What if we added an extra wasted cycle into each loop to make the total 4? This would mean that now we would need 4 cases for remainders and an extra instruction in each case’s loop.  Normally we would never want to *add* more code, but in this case the little bit of code we add will eliminate a bunch of code elsewhere.

Here is the new generated code for dividing by 4 to find the loop count and remainder….

				byte loopcounter = cycles/ (byte) 4;		// Compiler should turn this into a shift operation
  68:	98 2f       	mov	r25, r24
  6a:	96 95       	lsr	r25
  6c:	96 95       	lsr	r25

				byte remainder = cycles & 0x03 ;			// AND with 0x03 will efficiently return the remainder of divide by 4
  6e:	83 70       	andi	r24, 0x03	; 3

Wow, that is so much shorter faster and and simpler! But we did have to add a bunch of other source code to pad out the loops and add new cases. Was it a win overall?

Divide by 3 Divide by 4
Size of source program 195 lines 236 bytes
Size of generated code 254 bytes 244 bytes
Execution test 8.6ms 5.0ms

Yeay! We had to add a lot of lines to the program, but the resulting compiled code is both smaller and faster!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s