# When more code is less code

Last time, we crafted a function that would turn on an LED for a very short and precise amount of time. At the heart of this function was the code to figure out how many loop iterations we needed. Since each loop  takes 3 cycles, we needed to divide the total delay by 3 to get the number of iterations. Then we needed to find the remainder of that division to find out how many cycles we left over after the loops. We’d use the remainder to select one of thee versions of the looping code, where each version has the appropriate number of cycles added at the end. The pseudo code looks something like this…

```count = cycles/3   // integer math will round down to nearest whole
remainder = cycles - (count*3)

switch (remainder) {
case 0: // no remainder
loop count times
done

case 1: // we need to waste one extra cycle
loop count times
waste 1 cycle
done

case 2:  // we need to waste 2 cycles
loop count times
waste 2 cycles
done```

This looks very simple and it is, but there is a lot of work hidden in there. It turns out that computers as small as the ATTINY used in the Ognite do not know how to multiplication and division.

There are various ways to do multiplication and division by breaking them down to a series of additions and subtractions or using very fancy binary math operations. If we look at the instructions actually generated by the compiler for our divide and multiply, there is a lot of complicated stuff going on here…

```				byte loopcounter = cycles/ (byte) 3;		// TODO: either do faster bit compute here, or store dividends and remainder in lookup
ae:	82 2f       	mov	r24, r18
b0:	90 e0       	ldi	r25, 0x00	; 0
b2:	6b ea       	ldi	r22, 0xAB	; 171
b4:	70 e0       	ldi	r23, 0x00	; 0
b6:	4d d0       	rcall	.+154    	; 0x152 <__mulhi3>
b8:	96 95       	lsr	r25

byte remainder = cycles - (loopcounter*3);			// THis is how many cycles we need to pick up the slack to make up for the granularity of the loop
ba:	89 2f       	mov	r24, r25
bc:	88 0f       	add	r24, r24
be:	88 0f       	add	r24, r24
c0:	59 2f       	mov	r21, r25
c2:	58 1b       	sub	r21, r24
c4:	85 2f       	mov	r24, r21
c6:	82 0f       	add	r24, r18

00000152 <__mulhi3>:
152:	55 27       	eor	r21, r21
154:	00 24       	eor	r0, r0

00000156 <__mulhi3_loop>:
156:	80 ff       	sbrs	r24, 0
158:	02 c0       	rjmp	.+4      	; 0x15e <__mulhi3_skip1>
15a:	06 0e       	add	r0, r22
15c:	57 1f       	adc	r21, r23

0000015e <__mulhi3_skip1>:
15e:	66 0f       	add	r22, r22
160:	77 1f       	adc	r23, r23
162:	61 15       	cp	r22, r1
164:	71 05       	cpc	r23, r1
166:	21 f0       	breq	.+8      	; 0x170 <__mulhi3_exit>
168:	96 95       	lsr	r25
16a:	87 95       	ror	r24
16c:	00 97       	sbiw	r24, 0x00	; 0
16e:	99 f7       	brne	.-26     	; 0x156 <__mulhi3_loop>

00000170 <__mulhi3_exit>:
170:	95 2f       	mov	r25, r21
172:	80 2d       	mov	r24, r0
174:	08 95       	ret```

Wow, dividing and multiplying by 3 is hard for a binary computer.

Turns out that multiplying and dividing by powers of 2 is very easy for a computer because in binary math these are just shifting bits up and down. Too bad we need to divide and multiply by 3. If only our loops took 4 cycles each pass…

Wait! What if we added an extra wasted cycle into each loop to make the total 4? This would mean that now we would need 4 cases for remainders and an extra instruction in each case’s loop.  Normally we would never want to *add* more code, but in this case the little bit of code we add will eliminate a bunch of code elsewhere.

Here is the new generated code for dividing by 4 to find the loop count and remainder….

```				byte loopcounter = cycles/ (byte) 4;		// Compiler should turn this into a shift operation
68:	98 2f       	mov	r25, r24
6a:	96 95       	lsr	r25
6c:	96 95       	lsr	r25

byte remainder = cycles & 0x03 ;			// AND with 0x03 will efficiently return the remainder of divide by 4
6e:	83 70       	andi	r24, 0x03	; 3```

Wow, that is so much shorter faster and and simpler! But we did have to add a bunch of other source code to pad out the loops and add new cases. Was it a win overall?

 Divide by 3 Divide by 4 Size of source program 195 lines 236 bytes Size of generated code 254 bytes 244 bytes Execution test 8.6ms 5.0ms

Yeay! We had to add a lot of lines to the program, but the resulting compiled code is both smaller and faster!

# Lighting Dim LEDs – Making Bit Banging PWM Precise to a Single Cycle

The Ognite does not use any current limiting resistors, instead it turns on the LEDs for *very* brief moments of time.  The longer the LED is left on each time it is flicked, the brighter it looks. This is called PWM.

Normally this is done by dedicated hardware, but because of the way the Ognite LEDs must be multiplexed, the chip has to do PWM the hard way- by actually turning the power on, waiting, then turning it off.

It is important to do this very accurately and very quickly. A dim LED on an Ognite only gets turned on for 0.000000125 seconds at a time! At this time scale, you have to take into account how much time it takes to think when doing anything. Just deciding if it is time to turn off the LED can take longer than it should be on for.

If the LED needs to be on for 1 or 2 cycles, there literally is not time to decide or check anything while the LED is on, so instead the program gets everything ready and then turns the LED on and then off in one swift and single motion.

The code that does this needs to take into account how long each and every instruction takes and balance them out to make sure that the LED is always on for exactly the right amount of time.

Below is the code that will turn on an LED for exactly the specified number of clock cycles (there are 8,000,000 clock cycles in one second on the Ognite).

It is this level of fine brightness control that gives the flame its lifelike motion. Your eye translates the smooth fades in and out into a continuous moving image, rather than a bunch of blinking lights.

```
// Generating high resolution delays for PWM via Bit Banging on AVR

typedef unsigned char byte;

// Set (ledonbits) to (LED_DUTY_CYCLE_PORT) for (cycles) CPU cycles, then send a zero to the port

#define LED_DUTY_CYCLE_PORT (PORTA)

static inline void ledDutyCycle(unsigned char cycles , byte ledonbits )
{

switch (cycles) {

// First, special case out the 0,1, and 2 cycle cases becuase these are so short that we can't do it any other way then hard code

case 0:

break;

case 1:	{

__asm__ __volatile__ (
"OUT %0,%1 \n\t"				// DO OUTPORT first because in current config it will never actually have both pins so LED can't turn on (not turn of DDRB)
"OUT %0,__zero_reg__ \n\t"

: :  "I" (_SFR_IO_ADDR(LED_DUTY_CYCLE_PORT)) , "r" (ledonbits)

);
}
break;

case 2: {

__asm__ __volatile__ (
"OUT %0,%1 \n\t"				// DO OUTPORT first because in current config it will never actually have both pins so LED can't turn on (not turn of DDRB)
"NOP \n\t"		// waste one cycle that will (hopefully) not get optimized out
"OUT %0,__zero_reg__ \n\t"

: :  "I" (_SFR_IO_ADDR(LED_DUTY_CYCLE_PORT)) , "r" (ledonbits)

);
}
break;

default: {

// for any delay greater than 2, we have enough cycles to go though a general loop construct

// Loop delay for loop counter c:

// --When c==1---
//	loop:   DEC c		// 1 cycle  (c=0)
//	BRNE loop			// 1 cycle
//                      // =============
//                      // 2 cycles total

// --When c==2---
//	loop:   DEC c		// 1 cycle	(c=1)
//	BRNE loop			// 2 cycles (branch taken)
//	loop:   DEC c		// 1 cycle	(c=0)
//	BRNE loop			// 1 cycles (branch not taken)
//                      // =============
//                      // 5 cycles total

// if c==1 then delay=2 cycles (branch not taken at all)
// if c==2 then delay=5 (2+3) cycles
// if c==3 then delay=8 (2+3+3)

// ...so loop overhead is always 2+(c-1)*3 cycles

// Include the single cycle overhead for the trailing OUT after the loop and we get...

// delay = 3+(c-1)*3
// delay = 3+(3*c)-3
// delay = (3*c)
// delay/3 = c

byte loopcounter = cycles/ (byte) 3;		// TODO: either do faster bit compute here, or store dividends and remainder in lookup

byte remainder = cycles - (loopcounter*3);			// THis is how many cycles we need to pick up the slack to make up for the granularity of the loop

switch (remainder) {

case 0:		{// No remainder, so just loop and we will get right delay

__asm__ __volatile__ (
"OUT %[port],%[bits] \n\t"			// DO OUTPORT first because in current config it will never actually have both pins so LED can't turn on (not turn of DDRB)
"L_%=:dec %[loop]\n\t"			// 1 cycle
"BRNE L_%= \n\t"			// 1 on false, 2 on true cycles
"OUT %[port],__zero_reg__ \n\t"

: [loop] "=r" (loopcounter) : [port] "I" (_SFR_IO_ADDR(LED_DUTY_CYCLE_PORT)) , [bits] "r" (ledonbits) , "0" (loopcounter)

);
}
break;

case 1:  {// We need 1 extra cycle to come out with the right delay

__asm__ __volatile__ (
"OUT %[port],%[bits] \n\t"			// DO OUTPORT first because in current config it will never actually have both pins so LED can't turn on (not turn of DDRB)
"NOP \n\t"		// waste one cycle that will (hopefully) not get optimized out
"L_%=:dec %[loop] \n\t"			// 1 cycle
"BRNE L_%= \n\t"			// 1 on false, 2 on true cycles
"OUT %[port],__zero_reg__ \n\t"

: [loop] "=r" (loopcounter) : [port] "I" (_SFR_IO_ADDR(LED_DUTY_CYCLE_PORT)) , [bits] "r" (ledonbits) , "0" (loopcounter)

);
}
break;

case 2:  { // We need 2 extra cycles to come out with the right delay

__asm__ __volatile__ (

"OUT %[port],%[bits] \n\t"			// DO OUTPORT first because in current config it will never actually have both pins so LED can't turn on (not turn of DDRB)
"RJMP L_%=\n\t"					// Waste 2 cycles using half as much space as 2 NOPs
"L_%=:dec %[loop] \n\t"			// 1 cycle
"BRNE L_%= \n\t"			// 1 on false, 2 on true cycles
"OUT %[port],__zero_reg__ \n\t"

: [loop] "=r" (loopcounter) : [port] "I" (_SFR_IO_ADDR(LED_DUTY_CYCLE_PORT)) , [bits] "r" (ledonbits) , "0" (loopcounter)

);
}

break;

} // switch (remainder)

}	// default case where b>2

break;

}	// switch (b)

}

```

# Extreeme Alorithim Optimization

We spend most of our time inside the display refresh, so let’s see if there is any more we can squeeze out there.

The current refresh code is very simple and fast. It has nested loops that scan the X and Y axis. For each pixel it loads the brightness value and then looks up which pins drive to light up the appropriate LED.

```Clean Start
Program Memory Usage  : 3048 bytes   74.4 % Full
Data Memory Usage   : 79 bytes   30.9 % Full
Time in Refresh (min/max): 153us/174us```

Let’s see if we can make it faster and/or smaller…

First we can move the lookup tables from program memory into SRAM. This should save 1 cycle on each lookup, and might be even faster since there are more registers available than can be used for SRAM lookups….

```Lookups in SRAM
Program Memory Usage : 3040 bytes 74.2 % Full
Data Memory Usage : 105 bytes 41.0 % Full
Time in Refresh (min/max): 147us/164us```

There are two costs to this change. First, we are using up 26 bytes more of SRAM, but this is ok because we have plenty of SRAM to spare.  We are also using up a bit of time when the values initially get copied from program memory into SRAM, but this only happens once each time we get turned on so that’s ok too.

Next we can cache the value for PORTB and PORTD locally since we use them twice…

```Cache PORTD and PORTD assignments
Program Memory Usage 	:	3036 bytes   74.1 % Full
Data Memory Usage 		:	105 bytes   41.0 % Full
Time in Refresh (min/max): 144us/161us```

Very small win, but no sacrifice at all so we’ll take it. Note that this only work becasue the compiler does not treat the PORT registers as normal registers – it could. Using the temporary cache variables lands these values in proper registers that the compiler feels comfortable with.

Next we can also cache the value for the ColBits  lookups….

```Cache and Move ColBits Lookups
Program Memory Usage 	:	3006 bytes   73.4 % Full
Data Memory Usage 		:	97 bytes   37.9 % Full
Time in Refresh (min/max): 130us/155us```

This one is a sad victory. We really have not changed the algorithm or flow at all, just moved around some code. Apparently this helped kick the compiler into seeing something. It is arbitrary and capricious and really reinforces the idea that we are going to have to do bare-metal assembler if we really want to get to the best possible code here.

All told, we traded about an hour of effort for 42 bytes of program space and about 17us of time on each screen refresh (about 13% savings).

We could get maybe another 20% reduction by hard coding in some knowledge about where the pins land in the current design and thus skipping some lookups. Until the battery holder problem is solved, I want to keep all options open.

# Beta 2 Firmware Released!

I just shipped a batch of kits with the new Beta 2 Firmware, which is also available here…

https://github.com/bigjosh/Ognite-firmware/releases/tag/v0.2

## Noticeable Features of this Release

• Greatly reduced power usage. Thanks to the new WatchDog reset driven sleep modes, Ognite now draws only about 0.75mA while running. This should translate to months of operation on a couple of AA batteries.
• Diagnostic Startup Screen. This really helps find connection mistakes. I build a lot of Ognites, and I’ve even found a few bad connections that I’ve made thanks to this screen.

## Still left to do

• Really work on optimizing driving LEDs for brightness vs. power. The LEDs currently use up 7x more power than the chip, so this is where real battery life gains will come from. This is all empirical work, trying different power levels and driving modes to see the trade offs on perceived brightness. The relationship is completely non-linear, so this takes a lot of time to work out.
• Re-encode the flame from better source material. I have some video that I took using a better camera than I need to process to get a higher quality flame both ion terms of pixels and flame action.

There is still also lots of room for improvement in tightening up the firmware. Ultimately, I am going to translate everything into straight assembly language and that will likely give at least 25% reduction in size and time – although admittedly these gains are almost purely symbolic since at this point they will not make a noticeable difference in the final product.

# Sleep Deeply, Wake Briefly

Microcontrollers typically use on the order of milliamps when they are running, so if you want to make a battery last more than a few days, then you need to make sure that the processor is running as little as possible. This means you should almost always be sleeping.

On the AVR microcontrollers, this typically means setting the WatchDog timer to wake you up, and then going to sleep. You can set the WatchDog to wake up anywhere from 60 times per second, to only once every 8 seconds depending on what you need to do.

Once the WatchDog triggers, you need to wake up as fast as you can, then do what you need to do, then go back to sleep.

It turns out that correctly setting the SUT fuse bits can have a huge impact on how long it takes you to wake up each time, and this can have a huge impact on how much power you use overall.

Correctly setting the SUT fuses is critical for minimizing power when using sleep modes.

The default setting adds a delay of 62 milliseconds to every wake up, and the processor uses about 5mA during the delay.

Lets take a look at a possibly the smallest possible wake up task to highlight this. All this program does when it wakes up is turn on a pin (so we know it is working), get everything ready to go back to sleep, turn off the pin, and sleep.

Here is what that looks like on the scope…

The little “T” is spot where we turned on the pin (and triggered the scope). The pin is on for such a sort amount of that we can not even see the yellow trace go up at this scale (10ms/div).

The trace is showing the chip’s current usage. See how it the chip starts using power more than 67ms before it actually wakes and runs our code? This is terrible! What is going on?

It turns out that the chip has a feature that delays the chip from running any code for a while right after it wakes up. The idea behind this is to let everything settle down and smooth out before it starts. By default, it waits ~62 milliseconds. This is a good idea when, say, you are running off a power supply that takes some time to stabilize. It is a terrible idea when you are just waking up for a tiny moment to do something and then go back to bed.

According to the math, we are wasting about 0.25 milliampere seconds every time we wake up. That adds up fast. If we are waking up 4 times per second, the delay will be wasting an average of 1mA all the time!

Since this delay happens before the chip even starts up, the only place to configure it is by setting the SUT (Start Up Delay) fuse bits using a hardware programmer.  Here is the default setting…

The Ognite runs off a nice stable battery , so we don’t need any delay. Even if we did, it would be much better to maybe use the Brown Out Detector to delay our startup since we could then disable it once we were up and running and then not need to endure the delay everytime we woke up from a WatchDog.

If you’ve read this far, I assume you are very interested in power saving strategies on the AVR microcontrollers.  In that case, here is some code that demonstrates how you can use only WatchDog resets to wake up periodically with very high power efficiently  …

```/*
* This is a framework for very low power AVR applications that spend most time
* sleeping and do ongoing work only when woken by a WatchDog reset.
*
* For best power efficiency, program the SUT fuses for 0ms startup delay
*
* Created: 11/18/2013 10:45:43 PM
* Author: josh
* Website; https://ognite.wordpress.com
*/
#include <avr/io.h>

#include <avr/power.h>

#include <avr/sleep.h>

#include <avr/wdt.h>

#include <avr/wdt.h>                                // Watchdog Functions

// Set these according to your project,
// but make sure SUT1 and SUT0 are set or you
// waste a lot of time waiting for start up delay

FUSES = {

.low = (FUSE_SUT1 & FUSE_SUT0 & FUSE_CKSEL3 & FUSE_CKSEL2 & FUSE_CKSEL0),                        // Startup with clock/8, no startup delay, 8Mhz internal RC
.high = HFUSE_DEFAULT,
.extended = EFUSE_DEFAULT

};

void init0 (void) __attribute__ ((naked)) __attribute__ ((section (".init0")));

// This code will be run immedeately on reset, before any initilization or main()

void init0(void) {

asm( "in __tmp_reg__ , %[mcusr] " : : [mcusr] "I" (_SFR_IO_ADDR(MCUSR)) ); // Get the value of the MCUSR register into the temp register
asm( "sbrc __tmp_reg__ ,%[wdf] " : : [wdf] "I" (WDRF) ); // Test the WatchDog bit and skip the next instruction if the bit is not set
asm( "rjmp warmstart" : : ); // If we get to here, the the WDT bit was set so we are waking up warm, so jump to warm code

// On power up, This code will fall though to the normal .init seconds and set up all the variables and get ready for main() to start
}

// Put any code you want to run once on power-up here....
// Any global variables set in this routine will persist to the WatchDog routine
// Note that I/O registers are initialized on every WatchDog reset the need to be updated inside userWakeRountine()

// This is "static inline" so The code will just be inserted directly into the main() code avoiding overhead of a call/ret

static inline void userStartup(void) {

}

// Main() only gets run once, when we first power up

int main(void)
{
userStartup();   // Do this first, because if we turned on WDT first it might reset on us
wdt_enable(WDTO_15MS); // Could do this slightly more efficiently in ASM, but we only do it one time- when we first power up
// The delay set here only matters for the first time we reset to get things started, so we set it to the shortest available so we don't wait to wait too long...
// In warmstart we will set it recurring timeout.

// Note that we could save a tiny ammount of time if we RJMPed right into the warmstart() here, but that
// could introduce a little jitter since the timing would be different on the first pass. Better to
// always enter warmstart from exactly the same reset state for consistent timing.

MCUCR = _BV( SE ) | _BV(SM1 ) | _BV(SM0); // Sleep enable (makes sleep instruction work), and sets sleep mode to "Power Down" (the deepest sleep)

asm("sleep");

// we should never get here

}

// Put any code you want to run on every wake up here....
// Any global variables used in this routine will persist
// All I/O Registers are reset to their initial values (per the datasheet) everytime we wake up
// If you plan to do work for longer than the WatchDog timer, then you need to do a WatchDogReset (WDR) occasionally to keep the timer from expiring on you. Note that this can add jitter if consistant timing is important.

// This is "static inline" so The code will just be inserted directly into the warmstart code avoiding overhead of a call/ret

static inline void userWakeRoutine(void) {

// Just for testing, lets twiddle PORTD0 bit on and off with a little delay between...

DDRD |= _BV(0);   // Note that we can't do this in startup because IO registers get cleared everytime we reset
PORTD |= _BV(0);

for( unsigned x=0; x<100;x++ ) {asm("nop");}

PORTD &= ~_BV(0);

}

void __attribute__ ((naked)) warmstart(void) {

// Set the timeout to the desired value. Do this first because by default right now it will be at the initial value of 16ms
// which might not be long enough for us to do what we need to do.

WDTCR = _BV( WDP1) | _BV( WDP0) ; // This sets a 125ms timeout. See the table of timeout values in the datasheet. Note that you can delete this line completely if you want the default 16ms timeout.

// Now do whatever the user wants...

userWakeRoutine();

// Now it is time to get ready for bed. We should have gotten here because there was just a WDT reset, so...
// By default after any reset, the watchdog timeout will be 16ms since the WDP bits in WDTCSR are set to zero on reset. We'd need to set the WDTCSR if we want a different timeout
// After a WDT reset, the WatchDog should also still be on by default because the WDRF will be set after a WDT reset, and "WDE is overridden by WDRF in MCUSR. See “MCUSR – MCU Status Register” on page 45for description of WDRF. This means thatWDE is always set when WDRF is set."

MCUCR = _BV( SE ) | _BV(SM1 ) | _BV(SM0); // Sleep enable (makes sleep instruction work), and sets sleep mode to "Power Down" (the deepest sleep)

asm("sleep");

// we should never get here

}

```

First look at the main(). This is pretty standard C code for setting the WatchDog up to wake us, then enabling sleep, and then actually sleeping. The main() only gets called once at power-up.

Now look above it at the .init0 routine. This little bit of assembly code gets run right after a reset before anything else happens.  It checks to see if we just woke up (as opposed to just powering up)  and if we did, it jumps straight into our warmstart() code. It skips eveything that normally happens on a reset like setting up registers and initialing variables.  This stuff is slow and would delay us from doing the work that we woke up to do, and worse it messes up variables and registers that it might be nice to keep intact between wake ups so we don’t have to keep figuring them out over and over again.

You could also do this by having the WatchDog timeout trigger an Interrupt routine, but since we are just going to be sleeping between WatchDogs, this is wasteful. Interrupt routines assume that they will be “interrupting” other running code so they do a lot of work saving and restoring stuff that we just don’t need. We want a quick wake-do-sleep!

Once into warmstart() we are ready to immediately do our work. All global variables are preserved between between wake ups. The only overhead in warmstart() is getting ready to sleep again, which is a single assignment to MCUCR and setting the timeout again in WDTCR. You might wonder how we can get away with not needing to enable to the WatchDog again (which can be a complicated affair). It turns out that if we leave the WatchDog Reset Flag alone, it will override the WatchDog enable bit and let the WatchDog continue to repeat fire automatically! Handy!

What about setting the WatchDog timeout? In this case, I am using the 125ms timeout, but you could change this to anything you want. If you happened to want the 62ms timeout, you can actually delete this line altogether since that corresponds to all zero bits, and this will be the default value in this register after reset.

Note that anytime we reset (on power-up, or waking with a WatchDog) all I/O registers are set to their initial values. For most things (like DDR and PORT registers), this is zero, but you can look on the data sheet for the chip to see what the initial value for any individual register is. In Ognite this is great since it saves us a little work zeroing out the DDR registers.

# Programming in the Small – Part #5

Bringing it all together!

So, now that we know what all the extra stuff is- and we know that we don’t need it- let’s get rid of it!

Here is the final program..

```/*
* TinyestProject.c
*
* Created: 11/11/2013 2:21:26 PM
*  Author: josh
*/

// This tiny program it will toggle bit A0 at 1 mhz
// It is a very short program, using only 6 bytes total

// Note that for this to really compile down to 6 bytes,
// you need to give the option -nostartfiles to the linker

#include <avr/io.h>

// The "section" attribute tells the linker to put our
// function at the very start of program memory where the
// vectors typically go. The location of the "vectors"
// section is set inside the linker script.

void start (void)  __attribute__ ((section (".vectors")));

void start(void)
{

DDRA |= 0x01;

while(1) {
PINA |=0x01;
}

}
```

It does not look very different from the original program.

```void start (void) __attribute__ ((section (".vectors")));
```

This tells the compiler to tell the linker to put our code at the very start of program memory at location 0. Now the first byte of our program is sitting right where the Reset vector would normally go, so our program will start running as soon as the chip powers up. We will be running “bare metal”.

The next change is just as important, but not part of the code. We must tell the compiler not to include all the various start-up stuff that it normally does include. We do this with the -nostartfiles option. In AVR Studio, you can set that in the Project Options here…

Keep in mind that even though the changes we made to the code were very small, it took a long way to get here. If you just make these changes without understanding each thing we got rid of, you can get into trouble fast. If we decided to expand this tiny program and tried to add an interrupt to it, things would fail hard when the chip jumped into the non-existent interrupt entry in the vector table.  If we did something as small as adding a statically initialized global variable, it would not work because we are not running the normal C start-up code that does the initialization.  Heck, we would not get far trying to write any real program without the code that normally zeros out R1 since lots of compiled C code takes that for granted.

So was this just a silly exercise? No. The production Ognite code very carefully works around all these issues while at the same staying significantly smaller than if it used the normal C start-up stuff.

Is the space saved worth it? Maybe not for most programs but if you are designing a product that you expect to make 1,000,000 copies of, then using these techniques may let you fit into the next size smaller chip which could have a huge total impact.

# Programming in the Small – Part #4

## Reclaiming the Interrupt Vector Table!

If we look at the very top of our program output, we see a huge table taking up 42 bytes of prime real-estate memory…

```00000000 <__vectors>:
0:	14 c0 rjmp	.+40     	; 0x2a <__ctors_end>
2:	1b c0 rjmp	.+54     	; 0x3a <__bad_interrupt>
4:	1a c0 rjmp	.+52     	; 0x3a <__bad_interrupt>
6:	19 c0 rjmp	.+50     	; 0x3a <__bad_interrupt>
8:	18 c0 rjmp	.+48     	; 0x3a <__bad_interrupt>
a:	17 c0 rjmp	.+46     	; 0x3a <__bad_interrupt>
c:	16 c0 rjmp	.+44     	; 0x3a <__bad_interrupt>
e:	15 c0 rjmp	.+42     	; 0x3a <__bad_interrupt>
10:	14 c0 rjmp	.+40     	; 0x3a <__bad_interrupt>
12:	13 c0 rjmp	.+38     	; 0x3a <__bad_interrupt>
14:	12 c0 rjmp	.+36     	; 0x3a <__bad_interrupt>
16:	11 c0 rjmp	.+34     	; 0x3a <__bad_interrupt>
18:	10 c0 rjmp	.+32     	; 0x3a <__bad_interrupt>
1a:	0f c0 rjmp	.+30     	; 0x3a <__bad_interrupt>
1c:	0e c0 rjmp	.+28     	; 0x3a <__bad_interrupt>
1e:	0d c0 rjmp	.+26     	; 0x3a <__bad_interrupt>
20:	0c c0 rjmp	.+24     	; 0x3a <__bad_interrupt>
22:	0b c0 rjmp	.+22     	; 0x3a <__bad_interrupt>
24:	0a c0 rjmp	.+20     	; 0x3a <__bad_interrupt>
26:	09 c0 rjmp	.+18     	; 0x3a <__bad_interrupt>
28:	08 c0 rjmp	.+16     	; 0x3a <__bad_interrupt>```

This is the interrupt vector table. It tells the processor where to go when asynchronous events happy – stuff like getting reset or having a timer expire.

Each slot corresponds to a single type of event. The top slot is where the chip goes when it gets reset. The 2nd one  happens when there is an External Interrupt #1 requested. The 10th one is for when the serial port is finished sending the last byte. All things that happen spontaneously when other code might be running.

When, say, the Universal Serial INterface overflows, the chip will load the program counter with the address of vector #17 (address 0x10) and then start executing. Typically this will be a “RJMP” instruction telling the chip where to goto to find the actual program code that needs to get executed.

In our case, we don’t use any interrupts or timers or counters or anything else fancy, so all of the vectors except for reset just point _bad_interrupt, which itself just points back the the reset vector. So why do we need all of these extra vectors if the chip is never going to jump to any of them? We don’t!

So we are left with just Vector #0, the RESET vector. RESET is basically like reboot for a normal computer and gets used anytime the chip powers up or the reset pin is toggled or stuff like that. Right now the reset vector points to __c_tors_end which is the beginning of the startup C code that we already figured out that we don’t need anyway. So, we could just point this vector directly to the beginning of our code and be done… but that vector would be using up 2 bytes just to tell the chip to jump 2 bytes forward. We can not condone this sort of thing.

Can we somehow get rid of the reset vector itself? It turns out that since the chip just does a simple jump to vector #0 on reset, the RJMP in that vector does not actually need to be an RJMP. It can be any instruction we want. It can even be the first instruction of our program! The trade off is that our code will now be overwriting the vectors that would be at location #2 and #4, but we know that these vectors will never happen so that is ok.

If we start our program at location zero, it will automatically start running when the chip gets reset with not a single byte of vector table in sight!

Ok, so now we’ve figured out that we can, in theory, get rid of each and every extra byte surrounding our code… but how do we actually make this happen in practice?

Tune in next week to find out how!

# Programming in the Small – Part #3

## Picking up bytes littered around main

If you’ve been following my quest for the smallest program from the beginning, you know that we still have 54 extra bytes to get rid of.

Next let us look at the code bracketing the top and bottom our main() function.

Before our code, we have…

`36:	02 d0 rcall	.+4 ; 0x3c  38:	04 c0 rjmp	.+8 ; 0x42 <_exit>`

…which does a call into our main(), and then jumps to the exit handler when our main() returns.

The code after our main() looks like this…

`00000042 <_exit>: 42:	f8 94 cli  00000044 <__stop_program>: 44:	ff cf rjmp	.-2 ; 0x44 <__stop_program>`

Why does the compiler want to call into our code, only to jump to the end when our code returns? Wouldn’t it be easier just put the exit routine dangling after our code so that the execution would just naturally flow from the end of our program into the exit routine?

One good explanation: It is important to call into main() so that can have a return at the end. The return makes it so that we can (theoretically) call main() from other parts of our program just as if it was a normal function. This makes perfect sense, except that the compiler didn’t put a return at the end of the main() function! Hmmm.. I am getting very mixed messages here. Either put a return at the end of the function or don’t call into the function – but not both.

In fact, our code can never return because it has an infinite loop. I think the compiler knows this at some level and that is why it didn’t put in the return, but it forgot to tell the part of the compiler that generated the call. In any case, we don’t need the call, or the return.

The exit code turns off interrupts and goes into an infinite loop. I can understand the loop – in general you don’t want a program to finish and then run off the end of the earth. I don’t understand why you’d want to turn off interrupts – to me intuitively interrupts should still keep working even after main() returns, but this is a matter of taste. The point is moot here because our little program has its own little infinite loop so it never finishes and never touches any of this exit code.

We’ve been able to prune away everything now except for the interrupt vectors – and those are bigger than everything so far put together. A job for next time…

# Programming in the Small – Part #2

## Superfluous Stack and Status Setting

Last time we looked at every one of the whopping 70 bytes of extra code that had accreted around our tiny little 6 byte program.  Now we are going to start figuring out what we can get rid of.

We saw this in the block of initialization code added at the beginning of our program…

`2c:	1f be out	0x3f, r1	; 63`

…which is setting the Status Register (0x3f) to zero. I wasn’t sure why you would need that, and it turns out that you don’t. The data sheet for this chip explicitly states that this register will always be initialized to zero on reset (page 10).  I looked at a few others (including the ATMEGA328 used in the Ardunio), and these chips also make the same guarantee.

As long as the chip is coming out of a rest, the Status Register will already be zero when we hit this line, and the only way I can see that we could end up here is via a reset, so this code is redundant. (The one exception is the Bad Interrupt Vector we saw last time, but that needs to go also!)

Next lets look at the code that initializes the Stack Pointer again…

```  2e:	cf e5 ldi	r28, 0x5F	; 95
30:	d1 e0 ldi	r29, 0x01	; 1
32:	de bf out	0x3e, r29	; 62
34:	cd bf out	0x3d, r28	; 61```

Again, according to the chip specifications this code also appears to be redundant. The Stack Pointer (0x3e) is guaranteed to be initialized to point to the top of SRAM after any reset. I even tested this on an actual chip (harder than it sounds!), and it really does get automatically set no matter how I reset the chip (power,  WatchDog, reset pin).

The code that sets the other register (0x3d) is just plain wrong for this processor. The spec shows this location as “reserved” (page 255) and “reserved I/O memory addresses should never be written” (page 256).

So here are 10 bytes of unnecessary code that we can remove safely remove from any and every ATTINY4313 program (any many others) complied by avr-gcc. Think of the millions and millions of AVRs all around the world saddled with these extra 5 cycles of effort every time they reset!

How does something like this happen? The avg-gcc is a massive software project that supports dozens of processors. It is a huge amount of work to create and maintain something like this and the people who work on it do an amazing job.  Besides taking up a tiny bit of extra space and time, these superfluous bytes do not make any working programs crash, so finding and fixing stuff like this is very, very low on the the priority list.

But to those of us (me?) obsessed with maximal minimalism, even 1 extra byte is too much to swallow.

Tune in next time to see how these bytes – and more -are  banished in search of the smallest program. We still have 54 more bytes to loose!

# Programming in the small

## How small can a C program be?

(For this post, I am using an Atmel ATTINY4313a AVR processor, but most of this stuff should apply to C code complied for any 8-bit AVR chip with the avr-gcc compiler. This  includes the Ardunio.)

Here is the smallest useful C program I could come up with…

```int main(void) {
DDRA|= 0x01;      // Set PORTA0 bit to output.

while(1) {        // Repeat forever
PINA |= 0x01; // Toggle the bit
};
}```

First it sets pin PA0 to output mode, then it toggles it on and off as fast as it can forever.

Here is the assembly code that compiles down to…

```
3c: d0 9a      sbi   0x1a, 0;  // DDRA |= 0x01
3e: c8 9a      sbi   0x19, 0;  // PINA |= 0x01
40: fe cf      rjmp  .-4;      // Jump back and do it again...```

Which is pretty sort and sweet – only 6 bytes long! Note that I ORed the values into the registers so the compiler could use the set bit (SBI) instruction which is only 1 word long. It really doesn’t get any smaller than this.

We can check to make sure the program actually works by connecting an oscilloscope to pin 5 of the chip, and we see this…

Processor speed at bootup= 1MHz
Time for each cycle= 1/1MHz = 1us/cycle
Cycles for SBI instruction=2 cycles
Cycles for RJMP instruction=2  cycles
Total cycles to toggle bit on then off=2*(2 cycles+2 cycles)=8 cycles
Total period=8 cycles * 1 us/cycle=8us
Frequency=1/8us=125kHz

It looks like this is in fact our code talking to us, all 6 bytes of it.

Unfortunately, when we look at what actually downloaded into the chip, we see that it used up 70 bytes of our precious program memory! Who invited the other 64 bytes to this party? Let’s take a look at the compiler output and see…

```00000000 <__vectors>:
0:	14 c0 rjmp	.+40     	; 0x2a <__ctors_end>
2:	1b c0 rjmp	.+54     	; 0x3a <__bad_interrupt>
4:	1a c0 rjmp	.+52     	; 0x3a <__bad_interrupt>
6:	19 c0 rjmp	.+50     	; 0x3a <__bad_interrupt>
8:	18 c0 rjmp	.+48     	; 0x3a <__bad_interrupt>
a:	17 c0 rjmp	.+46     	; 0x3a <__bad_interrupt>
c:	16 c0 rjmp	.+44     	; 0x3a <__bad_interrupt>
e:	15 c0 rjmp	.+42     	; 0x3a <__bad_interrupt>
10:	14 c0 rjmp	.+40     	; 0x3a <__bad_interrupt>
12:	13 c0 rjmp	.+38     	; 0x3a <__bad_interrupt>
14:	12 c0 rjmp	.+36     	; 0x3a <__bad_interrupt>
16:	11 c0 rjmp	.+34     	; 0x3a <__bad_interrupt>
18:	10 c0 rjmp	.+32     	; 0x3a <__bad_interrupt>
1a:	0f c0 rjmp	.+30     	; 0x3a <__bad_interrupt>
1c:	0e c0 rjmp	.+28     	; 0x3a <__bad_interrupt>
1e:	0d c0 rjmp	.+26     	; 0x3a <__bad_interrupt>
20:	0c c0 rjmp	.+24     	; 0x3a <__bad_interrupt>
22:	0b c0 rjmp	.+22     	; 0x3a <__bad_interrupt>
24:	0a c0 rjmp	.+20     	; 0x3a <__bad_interrupt>
26:	09 c0 rjmp	.+18     	; 0x3a <__bad_interrupt>
28:	08 c0 rjmp	.+16     	; 0x3a <__bad_interrupt>

0000002a <__ctors_end>:
2a:	11 24 eor	r1, r1
2c:	1f be out	0x3f, r1	; 63
2e:	cf e5 ldi	r28, 0x5F	; 95
30:	d1 e0 ldi	r29, 0x01	; 1
32:	de bf out	0x3e, r29	; 62
34:	cd bf out	0x3d, r28	; 61
36:	02 d0 rcall	.+4      	; 0x3c
38:	04 c0 rjmp	.+8      	; 0x42 <_exit>

3a:	e2 cf rjmp	.-60     	; 0x0 <__vectors>

0000003c <main>:

3c:	d0 9a sbi	0x1a, 0	; 26
3e:	c8 9a sbi	0x19, 0	; 25
40:	fe cf rjmp	.-4      	; 0x3e <__SP_H__>

00000042 <_exit>:
42:	f8 94 cli

00000044 <__stop_program>:
44:	ff cf rjmp	.-2      	; 0x44 <__stop_program>```

You can spot our little routine just after main(), but it is drowning in a sea of other code.

It turns out that the C compiler throws lots of extra stuff in that, under normal circumstances, makes C programmers’ (and compiler writers’) lives easier. Here is the breakdown of the 70 bytes…

 Interrupt vector table 42 Initialization code 16 Bad Interrupt Vector 2 main (our program) 6 Exit routine 4 Total 70

Lets take each of these and see what they do and what we can do about them.

## Interrupt Vector Table

When ever the processor gets interrupted from running normal step-by-step code, it will jump to one of these addresses based on what interrupted it. This is part of the defined behavior of the chip. If Timer 1 overflows, it jumps to vector #6. If the Analog Compare triggers, it jumps to vector #12. There are 21 vectors in all, each for a different source of interrupts. Each vector is really just an instruction to jump to someplace else, so each vector takes up 2 bytes. 21 vectors * 2 bytes/vector = 42 bytes.

Note that the first vector (at address 0) is particularly important because this is the Reset vector. This where the processor starts up after a reset – including when it gets turned on.

Initialization Code

Here is the initialization code….

```  2a:	11 24       	eor	r1, r1
2c:	1f be       	out	0x3f, r1	; 63
2e:	cf e5       	ldi	r28, 0x5F	; 95
30:	d1 e0       	ldi	r29, 0x01	; 1
32:	de bf       	out	0x3e, r29	; 62
34:	cd bf       	out	0x3d, r28	; 61
36:	02 d0       	rcall	.+4      	; 0x3c```

The first line clears register R1 to equal 0 (anything XORed with itself is zero). The compiler often needs a zero handy, so it dedicates register 1 to always and forever have a zero in it. This is how that original zero gets there.

The next line clears out location the Status Register (0x3f) by loading it with zero (using the handy zero that was put in R1 on the line before). I’m not really sure why they do this…

The next 4 lines set up the Stack Pointer (0x3C) to point to the top of RAM (0x5F). It also is putting an 0x01 in location 0x3E, which according to the chip’s documentation should be a reserved location an not used. Maybe this is a benign copy-paste error from another chip that supported a 2 byte Stack Pointer?

The last line calls into the main() function of our C code.