# When more code is less code

Last time, we crafted a function that would turn on an LED for a very short and precise amount of time. At the heart of this function was the code to figure out how many loop iterations we needed. Since each loop  takes 3 cycles, we needed to divide the total delay by 3 to get the number of iterations. Then we needed to find the remainder of that division to find out how many cycles we left over after the loops. We’d use the remainder to select one of thee versions of the looping code, where each version has the appropriate number of cycles added at the end. The pseudo code looks something like this…

```count = cycles/3   // integer math will round down to nearest whole
remainder = cycles - (count*3)

switch (remainder) {
case 0: // no remainder
loop count times
done

case 1: // we need to waste one extra cycle
loop count times
waste 1 cycle
done

case 2:  // we need to waste 2 cycles
loop count times
waste 2 cycles
done```

This looks very simple and it is, but there is a lot of work hidden in there. It turns out that computers as small as the ATTINY used in the Ognite do not know how to multiplication and division.

There are various ways to do multiplication and division by breaking them down to a series of additions and subtractions or using very fancy binary math operations. If we look at the instructions actually generated by the compiler for our divide and multiply, there is a lot of complicated stuff going on here…

```				byte loopcounter = cycles/ (byte) 3;		// TODO: either do faster bit compute here, or store dividends and remainder in lookup
ae:	82 2f       	mov	r24, r18
b0:	90 e0       	ldi	r25, 0x00	; 0
b2:	6b ea       	ldi	r22, 0xAB	; 171
b4:	70 e0       	ldi	r23, 0x00	; 0
b6:	4d d0       	rcall	.+154    	; 0x152 <__mulhi3>
b8:	96 95       	lsr	r25

byte remainder = cycles - (loopcounter*3);			// THis is how many cycles we need to pick up the slack to make up for the granularity of the loop
ba:	89 2f       	mov	r24, r25
bc:	88 0f       	add	r24, r24
be:	88 0f       	add	r24, r24
c0:	59 2f       	mov	r21, r25
c2:	58 1b       	sub	r21, r24
c4:	85 2f       	mov	r24, r21
c6:	82 0f       	add	r24, r18

00000152 <__mulhi3>:
152:	55 27       	eor	r21, r21
154:	00 24       	eor	r0, r0

00000156 <__mulhi3_loop>:
156:	80 ff       	sbrs	r24, 0
158:	02 c0       	rjmp	.+4      	; 0x15e <__mulhi3_skip1>
15a:	06 0e       	add	r0, r22
15c:	57 1f       	adc	r21, r23

0000015e <__mulhi3_skip1>:
15e:	66 0f       	add	r22, r22
160:	77 1f       	adc	r23, r23
162:	61 15       	cp	r22, r1
164:	71 05       	cpc	r23, r1
166:	21 f0       	breq	.+8      	; 0x170 <__mulhi3_exit>
168:	96 95       	lsr	r25
16a:	87 95       	ror	r24
16c:	00 97       	sbiw	r24, 0x00	; 0
16e:	99 f7       	brne	.-26     	; 0x156 <__mulhi3_loop>

00000170 <__mulhi3_exit>:
170:	95 2f       	mov	r25, r21
172:	80 2d       	mov	r24, r0
174:	08 95       	ret```

Wow, dividing and multiplying by 3 is hard for a binary computer.

Turns out that multiplying and dividing by powers of 2 is very easy for a computer because in binary math these are just shifting bits up and down. Too bad we need to divide and multiply by 3. If only our loops took 4 cycles each pass…

Wait! What if we added an extra wasted cycle into each loop to make the total 4? This would mean that now we would need 4 cases for remainders and an extra instruction in each case’s loop.  Normally we would never want to *add* more code, but in this case the little bit of code we add will eliminate a bunch of code elsewhere.

Here is the new generated code for dividing by 4 to find the loop count and remainder….

```				byte loopcounter = cycles/ (byte) 4;		// Compiler should turn this into a shift operation
68:	98 2f       	mov	r25, r24
6a:	96 95       	lsr	r25
6c:	96 95       	lsr	r25

byte remainder = cycles & 0x03 ;			// AND with 0x03 will efficiently return the remainder of divide by 4
6e:	83 70       	andi	r24, 0x03	; 3```

Wow, that is so much shorter faster and and simpler! But we did have to add a bunch of other source code to pad out the loops and add new cases. Was it a win overall?

 Divide by 3 Divide by 4 Size of source program 195 lines 236 bytes Size of generated code 254 bytes 244 bytes Execution test 8.6ms 5.0ms

Yeay! We had to add a lot of lines to the program, but the resulting compiled code is both smaller and faster!