Last time, we crafted a function that would turn on an LED for a very short and precise amount of time. At the heart of this function was the code to figure out how many loop iterations we needed. Since each loop takes 3 cycles, we needed to divide the total delay by 3 to get the number of iterations. Then we needed to find the remainder of that division to find out how many cycles we left over after the loops. We’d use the remainder to select one of thee versions of the looping code, where each version has the appropriate number of cycles added at the end. The pseudo code looks something like this…
count = cycles/3 // integer math will round down to nearest whole remainder = cycles - (count*3) switch (remainder) { case 0: // no remainder loop count times done case 1: // we need to waste one extra cycle loop count times waste 1 cycle done case 2: // we need to waste 2 cycles loop count times waste 2 cycles done
This looks very simple and it is, but there is a lot of work hidden in there. It turns out that computers as small as the ATTINY used in the Ognite do not know how to multiplication and division.
There are various ways to do multiplication and division by breaking them down to a series of additions and subtractions or using very fancy binary math operations. If we look at the instructions actually generated by the compiler for our divide and multiply, there is a lot of complicated stuff going on here…
byte loopcounter = cycles/ (byte) 3; // TODO: either do faster bit compute here, or store dividends and remainder in lookup ae: 82 2f mov r24, r18 b0: 90 e0 ldi r25, 0x00 ; 0 b2: 6b ea ldi r22, 0xAB ; 171 b4: 70 e0 ldi r23, 0x00 ; 0 b6: 4d d0 rcall .+154 ; 0x152 <__mulhi3> b8: 96 95 lsr r25 byte remainder = cycles - (loopcounter*3); // THis is how many cycles we need to pick up the slack to make up for the granularity of the loop ba: 89 2f mov r24, r25 bc: 88 0f add r24, r24 be: 88 0f add r24, r24 c0: 59 2f mov r21, r25 c2: 58 1b sub r21, r24 c4: 85 2f mov r24, r21 c6: 82 0f add r24, r18 00000152 <__mulhi3>: 152: 55 27 eor r21, r21 154: 00 24 eor r0, r0 00000156 <__mulhi3_loop>: 156: 80 ff sbrs r24, 0 158: 02 c0 rjmp .+4 ; 0x15e <__mulhi3_skip1> 15a: 06 0e add r0, r22 15c: 57 1f adc r21, r23 0000015e <__mulhi3_skip1>: 15e: 66 0f add r22, r22 160: 77 1f adc r23, r23 162: 61 15 cp r22, r1 164: 71 05 cpc r23, r1 166: 21 f0 breq .+8 ; 0x170 <__mulhi3_exit> 168: 96 95 lsr r25 16a: 87 95 ror r24 16c: 00 97 sbiw r24, 0x00 ; 0 16e: 99 f7 brne .-26 ; 0x156 <__mulhi3_loop> 00000170 <__mulhi3_exit>: 170: 95 2f mov r25, r21 172: 80 2d mov r24, r0 174: 08 95 ret
Wow, dividing and multiplying by 3 is hard for a binary computer.
Turns out that multiplying and dividing by powers of 2 is very easy for a computer because in binary math these are just shifting bits up and down. Too bad we need to divide and multiply by 3. If only our loops took 4 cycles each pass…
Wait! What if we added an extra wasted cycle into each loop to make the total 4? This would mean that now we would need 4 cases for remainders and an extra instruction in each case’s loop. Normally we would never want to *add* more code, but in this case the little bit of code we add will eliminate a bunch of code elsewhere.
Here is the new generated code for dividing by 4 to find the loop count and remainder….
byte loopcounter = cycles/ (byte) 4; // Compiler should turn this into a shift operation 68: 98 2f mov r25, r24 6a: 96 95 lsr r25 6c: 96 95 lsr r25 byte remainder = cycles & 0x03 ; // AND with 0x03 will efficiently return the remainder of divide by 4 6e: 83 70 andi r24, 0x03 ; 3
Wow, that is so much shorter faster and and simpler! But we did have to add a bunch of other source code to pad out the loops and add new cases. Was it a win overall?
Divide by 3 | Divide by 4 | |
Size of source program | 195 lines | 236 bytes |
Size of generated code | 254 bytes | 244 bytes |
Execution test | 8.6ms | 5.0ms |
Yeay! We had to add a lot of lines to the program, but the resulting compiled code is both smaller and faster!