We spend most of our time inside the display refresh, so let’s see if there is any more we can squeeze out there.
The current refresh code is very simple and fast. It has nested loops that scan the X and Y axis. For each pixel it loads the brightness value and then looks up which pins drive to light up the appropriate LED.
Clean Start Program Memory Usage : 3048 bytes 74.4 % Full Data Memory Usage : 79 bytes 30.9 % Full Time in Refresh (min/max): 153us/174us
Let’s see if we can make it faster and/or smaller…
First we can move the lookup tables from program memory into SRAM. This should save 1 cycle on each lookup, and might be even faster since there are more registers available than can be used for SRAM lookups….
Lookups in SRAM Program Memory Usage : 3040 bytes 74.2 % Full Data Memory Usage : 105 bytes 41.0 % Full Time in Refresh (min/max): 147us/164us
There are two costs to this change. First, we are using up 26 bytes more of SRAM, but this is ok because we have plenty of SRAM to spare. We are also using up a bit of time when the values initially get copied from program memory into SRAM, but this only happens once each time we get turned on so that’s ok too.
Next we can cache the value for PORTB and PORTD locally since we use them twice…
Cache PORTD and PORTD assignments Program Memory Usage : 3036 bytes 74.1 % Full Data Memory Usage : 105 bytes 41.0 % Full Time in Refresh (min/max): 144us/161us
Very small win, but no sacrifice at all so we’ll take it. Note that this only work becasue the compiler does not treat the PORT registers as normal registers – it could. Using the temporary cache variables lands these values in proper registers that the compiler feels comfortable with.
Next we can also cache the value for the ColBits lookups….
Cache and Move ColBits Lookups Program Memory Usage : 3006 bytes 73.4 % Full Data Memory Usage : 97 bytes 37.9 % Full Time in Refresh (min/max): 130us/155us
This one is a sad victory. We really have not changed the algorithm or flow at all, just moved around some code. Apparently this helped kick the compiler into seeing something. It is arbitrary and capricious and really reinforces the idea that we are going to have to do bare-metal assembler if we really want to get to the best possible code here.
All told, we traded about an hour of effort for 42 bytes of program space and about 17us of time on each screen refresh (about 13% savings).
We could get maybe another 20% reduction by hard coding in some knowledge about where the pins land in the current design and thus skipping some lookups. Until the battery holder problem is solved, I want to keep all options open.