Extreeme Alorithim Optimization

We spend most of our time inside the display refresh, so let’s see if there is any more we can squeeze out there.

The current refresh code is very simple and fast. It has nested loops that scan the X and Y axis. For each pixel it loads the brightness value and then looks up which pins drive to light up the appropriate LED.

Clean Start
Program Memory Usage  : 3048 bytes   74.4 % Full
Data Memory Usage   : 79 bytes   30.9 % Full
Time in Refresh (min/max): 153us/174us

Let’s see if we can make it faster and/or smaller…

First we can move the lookup tables from program memory into SRAM. This should save 1 cycle on each lookup, and might be even faster since there are more registers available than can be used for SRAM lookups….

Lookups in SRAM
Program Memory Usage : 3040 bytes 74.2 % Full
Data Memory Usage : 105 bytes 41.0 % Full
Time in Refresh (min/max): 147us/164us

There are two costs to this change. First, we are using up 26 bytes more of SRAM, but this is ok because we have plenty of SRAM to spare.  We are also using up a bit of time when the values initially get copied from program memory into SRAM, but this only happens once each time we get turned on so that’s ok too.

Next we can cache the value for PORTB and PORTD locally since we use them twice…

Cache PORTD and PORTD assignments
Program Memory Usage 	:	3036 bytes   74.1 % Full
Data Memory Usage 		:	105 bytes   41.0 % Full
Time in Refresh (min/max): 144us/161us

Very small win, but no sacrifice at all so we’ll take it. Note that this only work becasue the compiler does not treat the PORT registers as normal registers – it could. Using the temporary cache variables lands these values in proper registers that the compiler feels comfortable with.

Next we can also cache the value for the ColBits  lookups….

Cache and Move ColBits Lookups 
Program Memory Usage 	:	3006 bytes   73.4 % Full
Data Memory Usage 		:	97 bytes   37.9 % Full
Time in Refresh (min/max): 130us/155us

This one is a sad victory. We really have not changed the algorithm or flow at all, just moved around some code. Apparently this helped kick the compiler into seeing something. It is arbitrary and capricious and really reinforces the idea that we are going to have to do bare-metal assembler if we really want to get to the best possible code here.

All told, we traded about an hour of effort for 42 bytes of program space and about 17us of time on each screen refresh (about 13% savings).

We could get maybe another 20% reduction by hard coding in some knowledge about where the pins land in the current design and thus skipping some lookups. Until the battery holder problem is solved, I want to keep all options open. 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s