Programming in the Small – Part #4

Reclaiming the Interrupt Vector Table!

If we look at the very top of our program output, we see a huge table taking up 42 bytes of prime real-estate memory…

00000000 <__vectors>:
   0:	14 c0 rjmp	.+40     	; 0x2a <__ctors_end>
   2:	1b c0 rjmp	.+54     	; 0x3a <__bad_interrupt>
   4:	1a c0 rjmp	.+52     	; 0x3a <__bad_interrupt>
   6:	19 c0 rjmp	.+50     	; 0x3a <__bad_interrupt>
   8:	18 c0 rjmp	.+48     	; 0x3a <__bad_interrupt>
   a:	17 c0 rjmp	.+46     	; 0x3a <__bad_interrupt>
   c:	16 c0 rjmp	.+44     	; 0x3a <__bad_interrupt>
   e:	15 c0 rjmp	.+42     	; 0x3a <__bad_interrupt>
  10:	14 c0 rjmp	.+40     	; 0x3a <__bad_interrupt>
  12:	13 c0 rjmp	.+38     	; 0x3a <__bad_interrupt>
  14:	12 c0 rjmp	.+36     	; 0x3a <__bad_interrupt>
  16:	11 c0 rjmp	.+34     	; 0x3a <__bad_interrupt>
  18:	10 c0 rjmp	.+32     	; 0x3a <__bad_interrupt>
  1a:	0f c0 rjmp	.+30     	; 0x3a <__bad_interrupt>
  1c:	0e c0 rjmp	.+28     	; 0x3a <__bad_interrupt>
  1e:	0d c0 rjmp	.+26     	; 0x3a <__bad_interrupt>
  20:	0c c0 rjmp	.+24     	; 0x3a <__bad_interrupt>
  22:	0b c0 rjmp	.+22     	; 0x3a <__bad_interrupt>
  24:	0a c0 rjmp	.+20     	; 0x3a <__bad_interrupt>
  26:	09 c0 rjmp	.+18     	; 0x3a <__bad_interrupt>
  28:	08 c0 rjmp	.+16     	; 0x3a <__bad_interrupt>

This is the interrupt vector table. It tells the processor where to go when asynchronous events happen – stuff like getting reset or having a timer expire.

Each slot corresponds to a single type of event. The top slot is where the chip goes when it gets reset. The 2nd one is used when External Interrupt 0 (INT0) is requested. The 10th one is for when the serial port has finished sending the last byte. These are all things that happen spontaneously when other code might be running.

When, say, the Universal Serial Interface overflows, the chip will load the program counter with the address of vector #17 (word address 0x10, which is the line at byte address 0x20 in the listing above) and then start executing. Typically this will be an “RJMP” instruction telling the chip where to go to find the actual program code that needs to get executed.

In our case, we don’t use any interrupts or timers or counters or anything else fancy, so all of the vectors except for reset just point to __bad_interrupt, which itself just jumps back to the reset vector. So why do we need all of these extra vectors if the chip is never going to jump to any of them? We don’t!

So we are left with just Vector #0, the RESET vector. RESET is basically like reboot for a normal computer and gets used anytime the chip powers up or the reset pin is toggled or stuff like that. Right now the reset vector points to __ctors_end, which is the beginning of the startup C code that we already figured out that we don’t need anyway. So, we could just point this vector directly to the beginning of our code and be done… but that vector would be using up 2 bytes just to tell the chip to jump 2 bytes forward. We cannot condone this sort of thing.

Can we somehow get rid of the reset vector itself? It turns out that since the chip just does a simple jump to vector #0 on reset, the RJMP in that vector does not actually need to be an RJMP. It can be any instruction we want. It can even be the first instruction of our program! The trade-off is that our code will now be overwriting the vectors that would be at byte addresses 2 and 4, but we know that these vectors will never happen so that is ok.

If we start our program at location zero, it will automatically start running when the chip gets reset with not a single byte of vector table in sight!

Ok, so now we’ve figured out that we can, in theory, get rid of each and every extra byte surrounding our code… but how do we actually make this happen in practice?

Tune in next week to find out how!

Programming in the Small – Part #3

Picking up bytes littered around main

If you’ve been following my quest for the smallest program from the beginning, you know that we still have 54 extra bytes to get rid of.

Next let us look at the code bracketing the top and bottom of our main() function.

Before our code, we have…

  36:	02 d0 rcall	.+4 ; 0x3c
  38:	04 c0 rjmp	.+8 ; 0x42 <_exit>

…which does a call into our main(), and then jumps to the exit handler when our main() returns.

The code after our main() looks like this…

00000042 <_exit>:
  42:	f8 94 cli

00000044 <__stop_program>:
  44:	ff cf rjmp	.-2 ; 0x44 <__stop_program>

Why does the compiler want to call into our code, only to jump to the end when our code returns? Wouldn’t it be easier to just put the exit routine dangling after our code so that the execution would just naturally flow from the end of our program into the exit routine?

One good explanation: It is important to call into main() so that it can have a return at the end. The return makes it so that we can (theoretically) call main() from other parts of our program just as if it was a normal function. This makes perfect sense, except that the compiler didn’t put a return at the end of the main() function! Hmmm.. I am getting very mixed messages here. Either put a return at the end of the function or don’t call into the function – but not both.

In fact, our code can never return because it has an infinite loop. I think the compiler knows this at some level and that is why it didn’t put in the return, but it forgot to tell the part of the compiler that generated the call. In any case, we don’t need the call, or the return.

The exit code turns off interrupts and goes into an infinite loop. I can understand the loop – in general you don’t want a program to finish and then run off the end of the earth. I don’t understand why you’d want to turn off interrupts – to me intuitively interrupts should still keep working even after main() returns, but this is a matter of taste. The point is moot here because our little program has its own little infinite loop so it never finishes and never touches any of this exit code.

We’ve been able to prune away everything now except for the interrupt vectors – and those are bigger than everything so far put together. A job for next time…

Programming in the Small – Part #2

Superfluous Stack and Status Setting

Last time we looked at every one of the whopping 64 bytes of extra code that had accreted around our tiny little 6-byte program, for 70 bytes in all. Now we are going to start figuring out what we can get rid of.

We saw this in the block of initialization code added at the beginning of our program…

2c:	1f be out	0x3f, r1	; 63

…which is setting the Status Register (0x3f) to zero. I wasn’t sure why you would need that, and it turns out that you don’t. The data sheet for this chip explicitly states that this register will always be initialized to zero on reset (page 10). I looked at a few others (including the ATMEGA328 used in the Arduino), and these chips also make the same guarantee.

As long as the chip is coming out of a reset, the Status Register will already be zero when we hit this line, and the only way I can see that we could end up here is via a reset, so this code is redundant. (The one exception is the Bad Interrupt Vector we saw last time, but that needs to go also!)

Next let’s look at the code that initializes the Stack Pointer again…

  2e:	cf e5 ldi	r28, 0x5F	; 95
  30:	d1 e0 ldi	r29, 0x01	; 1
  32:	de bf out	0x3e, r29	; 62
  34:	cd bf out	0x3d, r28	; 61

Again, according to the chip specifications this code also appears to be redundant. The Stack Pointer (0x3d) is guaranteed to be initialized to point to the top of SRAM after any reset. I even tested this on an actual chip (harder than it sounds!), and it really does get automatically set no matter how I reset the chip (power, watchdog, reset pin).

The code that sets the other register (0x3e) is just plain wrong for this processor. The spec shows this location as “reserved” (page 255) and “reserved I/O memory addresses should never be written” (page 256).

So here are 10 bytes of unnecessary code that we can safely remove from any and every ATTINY4313 program (and many others) compiled by avr-gcc. Think of the millions and millions of AVRs all around the world saddled with these extra 5 cycles of effort every time they reset!

How does something like this happen? The avr-gcc compiler is a massive software project that supports dozens of processors. It is a huge amount of work to create and maintain something like this and the people who work on it do an amazing job. Besides taking up a tiny bit of extra space and time, these superfluous bytes do not make any working programs crash, so finding and fixing stuff like this is very, very low on the priority list.

But to those of us (me?) obsessed with maximal minimalism, even 1 extra byte is too much to swallow.

Tune in next time to see how these bytes – and more – are banished in search of the smallest program. We still have 54 more bytes to lose!

Programming in the small

How small can a C program be?

(For this post, I am using an Atmel ATTINY4313a AVR processor, but most of this stuff should apply to C code compiled for any 8-bit AVR chip with the avr-gcc compiler. This includes the Arduino.)

Here is the smallest useful C program I could come up with…

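#include <avr/io.h>   // Register definitions (DDRA, PINA) for the selected chip

int main(void) {
    DDRA |= 0x01;     // Set PORTA0 bit to output.

    while (1) {       // Repeat forever
        PINA |= 0x01; // Toggle the bit (writing a 1 to a PINA bit flips the output)
    }
}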

First it sets pin PA0 to output mode, then it toggles it on and off as fast as it can forever.

Here is the assembly code that it compiles down to…

  
3c: d0 9a      sbi   0x1a, 0;  // DDRA |= 0x01
3e: c8 9a      sbi   0x19, 0;  // PINA |= 0x01
40: fe cf      rjmp  .-4;      // Jump back and do it again...

Which is pretty short and sweet – only 6 bytes long! Note that I ORed the values into the registers so the compiler could use the set bit (SBI) instruction which is only 1 word long. It really doesn’t get any smaller than this.

We can check to make sure the program actually works by connecting an oscilloscope to pin 5 of the chip, and we see this…

[Oscilloscope screenshot: tinywav]

Processor speed at bootup= 1MHz
Time for each cycle= 1/1MHz = 1us/cycle
Cycles for SBI instruction=2 cycles
Cycles for RJMP instruction=2  cycles
Total cycles to toggle bit on then off=2*(2 cycles+2 cycles)=8 cycles
Total period=8 cycles * 1 us/cycle=8us
Frequency=1/8us=125kHz

It looks like this is in fact our code talking to us, all 6 bytes of it.

Unfortunately, when we look at what actually downloaded into the chip, we see that it used up 70 bytes of our precious program memory! Who invited the other 64 bytes to this party? Let’s take a look at the compiler output and see…

00000000 <__vectors>:
   0:	14 c0 rjmp	.+40     	; 0x2a <__ctors_end>
   2:	1b c0 rjmp	.+54     	; 0x3a <__bad_interrupt>
   4:	1a c0 rjmp	.+52     	; 0x3a <__bad_interrupt>
   6:	19 c0 rjmp	.+50     	; 0x3a <__bad_interrupt>
   8:	18 c0 rjmp	.+48     	; 0x3a <__bad_interrupt>
   a:	17 c0 rjmp	.+46     	; 0x3a <__bad_interrupt>
   c:	16 c0 rjmp	.+44     	; 0x3a <__bad_interrupt>
   e:	15 c0 rjmp	.+42     	; 0x3a <__bad_interrupt>
  10:	14 c0 rjmp	.+40     	; 0x3a <__bad_interrupt>
  12:	13 c0 rjmp	.+38     	; 0x3a <__bad_interrupt>
  14:	12 c0 rjmp	.+36     	; 0x3a <__bad_interrupt>
  16:	11 c0 rjmp	.+34     	; 0x3a <__bad_interrupt>
  18:	10 c0 rjmp	.+32     	; 0x3a <__bad_interrupt>
  1a:	0f c0 rjmp	.+30     	; 0x3a <__bad_interrupt>
  1c:	0e c0 rjmp	.+28     	; 0x3a <__bad_interrupt>
  1e:	0d c0 rjmp	.+26     	; 0x3a <__bad_interrupt>
  20:	0c c0 rjmp	.+24     	; 0x3a <__bad_interrupt>
  22:	0b c0 rjmp	.+22     	; 0x3a <__bad_interrupt>
  24:	0a c0 rjmp	.+20     	; 0x3a <__bad_interrupt>
  26:	09 c0 rjmp	.+18     	; 0x3a <__bad_interrupt>
  28:	08 c0 rjmp	.+16     	; 0x3a <__bad_interrupt>

0000002a <__ctors_end>:
  2a:	11 24 eor	r1, r1
  2c:	1f be out	0x3f, r1	; 63
  2e:	cf e5 ldi	r28, 0x5F	; 95
  30:	d1 e0 ldi	r29, 0x01	; 1
  32:	de bf out	0x3e, r29	; 62
  34:	cd bf out	0x3d, r28	; 61
  36:	02 d0 rcall	.+4      	; 0x3c <main>
  38:	04 c0 rjmp	.+8      	; 0x42 <_exit>

0000003a <__bad_interrupt>:
  3a:	e2 cf rjmp	.-60     	; 0x0 <__vectors>

0000003c <main>:

  3c:	d0 9a sbi	0x1a, 0	; 26
  3e:	c8 9a sbi	0x19, 0	; 25
  40:	fe cf rjmp	.-4      	; 0x3e <__SP_H__>

00000042 <_exit>:
  42:	f8 94 cli

00000044 <__stop_program>:
  44:	ff cf rjmp	.-2      	; 0x44 <__stop_program>

You can spot our little routine just after main(), but it is drowning in a sea of other code.

It turns out that the C compiler throws lots of extra stuff in that, under normal circumstances, makes C programmers’ (and compiler writers’) lives easier. Here is the breakdown of the 70 bytes…

Section                   Bytes
Interrupt vector table       42
Initialization code          16
Bad Interrupt Vector          2
main (our program)            6
Exit routine                  4
Total                        70

Let’s take each of these and see what they do and what we can do about them.

Interrupt Vector Table

Whenever the processor gets interrupted from running normal step-by-step code, it will jump to one of these addresses based on what interrupted it. This is part of the defined behavior of the chip. If Timer 1 overflows, it jumps to vector #6. If the Analog Compare triggers, it jumps to vector #11. There are 21 vectors in all, each for a different source of interrupts. Each vector is really just an instruction to jump to someplace else, so each vector takes up 2 bytes. 21 vectors * 2 bytes/vector = 42 bytes.

Note that the first vector (at address 0) is particularly important because this is the Reset vector. This is where the processor starts up after a reset – including when it gets turned on.

Initialization Code

Here is the initialization code….

  2a:	11 24       	eor	r1, r1
  2c:	1f be       	out	0x3f, r1	; 63
  2e:	cf e5       	ldi	r28, 0x5F	; 95
  30:	d1 e0       	ldi	r29, 0x01	; 1
  32:	de bf       	out	0x3e, r29	; 62
  34:	cd bf       	out	0x3d, r28	; 61
  36:	02 d0       	rcall	.+4      	; 0x3c

The first line clears register R1 to equal 0 (anything XORed with itself is zero). The compiler often needs a zero handy, so it dedicates register 1 to always and forever have a zero in it. This is how that original zero gets there.

The next line clears out the Status Register (location 0x3f) by loading it with zero (using the handy zero that was put in R1 on the line before). I’m not really sure why they do this…

The next 4 lines set up the Stack Pointer (0x3D) to point to the top of RAM (0x5F). They also put an 0x01 in location 0x3E, which according to the chip’s documentation should be a reserved location and not used. Maybe this is a benign copy-paste error from another chip that supported a 2-byte Stack Pointer?

The last line calls into the main() function of our C code.

Bad Interrupt Vector

This looks like just a vector for other vectors to point to, and all of the unassigned interrupt vectors point to it. The Bad Interrupt Vector itself just jumps back to the reset vector. This means that if, say, you get a timer interrupt and you have not set up a vector for it, then the processor will first jump to the Timer Interrupt Vector, which will send it to the Bad Interrupt Vector, which will send it to the Reset Vector, which will send it to the Initialization code. Seems like it would be much easier and more efficient just to have all the unassigned vectors point directly to wherever you want them to go (currently the Reset Vector).

Exit Routine

This is code that executes when the main() function returns (ours loops forever so this never happens in our case). All the Exit Routine does is turn off all interrupts and then loop forever. Again, I’m not sure why you’d do this – I can imagine writing a program that was completely interrupt driven and just sets up all the interrupts in main() and then returns. Because this Exit Routine turns off all interrupts, this won’t work and I must add an extra while(1) in my main().

At least there is nothing magic going on here – we can see exactly where all of the extra bytes are coming from, and figure out why they are there (or don’t need to be!).

Tune in next time for some drastic cutting….

Cutting power the easy way

Turned off all unused units like the analog comparator, Timer0, USI, and  USART.

Saved about 0.2mA. Not much, but I guess worth the tiny amount of effort.

ACSR |= _BV(ACD);        // Turn off analog compare unit. We don't use it, so save power. Saves about 0.1mA (3.6mA drops to 3.5mA). ACD is a bit number, so _BV() turns it into a mask.

PRR = _BV(PRTIM0) | _BV(PRUSI) | _BV(PRUSART);        // Turn off Timer/Counter0, USI, USART since we don't need them. Saves about 0.1mA.

Brownout detector already off via fuses. Won’t really worry about setting IO pins to output since the input is disabled during sleep and we will be asleep almost all the time except when updating display.

I guess it is now all about maximizing time in SLEEP!

Commit here…

https://github.com/bigjosh/Ognite-firmware/commit/284bfb23cc6c535012a5a8e22c8acf92a21c0073