Subroutines
The F8 has no internal program counter stack, so you must be careful when calling subroutines. PI saves the return address in a single register (PC1), so PI/POP only works for one level of subroutines: each later PI overwrites the return address saved by the one before it. Here's a single-level example:
prog:
    ; ...do something...
    pi   sub        ; save return address in PC1, jump to sub
    ; ...do more...
sub:
    ; ...do something...
    pop             ; return: PC0 <- PC1
To have 2 levels of subroutines, you can use the K register to save the first return address:
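prog:
    ; ...do something...
    pi   sub1
    ; ...do more...
sub1:
    lr   k,p        ; copy the return address from PC1 into K
    ; ...do something...
    pi   sub2       ; overwrites PC1, but K still holds the way back
    ; ...do more...
    pk              ; return to prog through K
sub2:
    ; ...do something...
    pop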
That's as deep as the processor allows you to go without writing additional code to save return addresses. In the Channel F BIOS, there are routines which create a simulated stack for the K register. The routine at $0107 (known as PUSHK or CALL) can push K to the stack and the routine at $011E (known as POPK or RTRN) can pop K from the stack. For example:
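prog:
    ; ...do something...
    pi   sub1
    ; ...do more...
sub1:
    lr   k,p        ; return address: PC1 -> K
    pi   PUSHK      ; K -> simulated stack
    ; ...do something...
    pi   sub2
    ; ...do more...
    pi   POPK       ; simulated stack -> K
    pk              ; return through K
sub2:
    lr   k,p
    pi   PUSHK
    ; ...do something...
    pi   sub3
    ; ...do more...
    pi   POPK
    pk
sub3:
    ; ...do something...
    pop             ; leaf routine: plain PI/POP is enough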
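For the listing above to assemble, the PUSHK and POPK symbols have to be defined somewhere. A minimal sketch, assuming DASM syntax and the BIOS addresses given earlier:

```asm
PUSHK   =   $0107       ; BIOS routine: push K onto the simulated stack
POPK    =   $011E       ; BIOS routine: pop K from the simulated stack
```

If your project already includes a header that defines these names, use that instead of redefining them.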
By using PUSHK/POPK, you can have more than 2 levels of subroutine calls. However, manipulating the stack adds a lot of overhead to the code. Inside a PUSHK/POPK subroutine you can still use plain PI/POP, since that pair only affects PC0 and PC1. For a call that is only one level deep, it's best to use the PI/POP combination; for two levels of subroutines, it's best to use the second example above. If a subroutine uses the K register for its own purposes, it can only be called with plain PI/POP, so that the contents of K are not destroyed.
Also consider using macros: you have a lot more code space than the original Channel F programmers, so you might as well use it. A macro is expanded inline, so there is no call or return to pay for, and the time you save can be considerable.
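As a rough sketch of the macro idea, assuming DASM's MAC/ENDM directives: the macro name CLEAR_PAIR and its body are hypothetical, chosen only to show a short routine expanded inline instead of being called.

```asm
    MAC CLEAR_PAIR      ; hypothetical macro: zero scratchpad registers 0 and 1
    clr                 ; A = 0
    lr   0,a
    lr   1,a
    ENDM

prog:
    ; ...do something...
    CLEAR_PAIR          ; expands inline: PC1 and K are left untouched
    ; ...do more...
```

Each use of the macro costs its full size in ROM, so this trades code space for speed and for not having to manage return addresses.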
Blackbird is writing more efficient versions of PUSHK/POPK (Snippet:KStack). Another idea is to write a version that uses the Schach RAM at $2800 that MESS emulates. That would free up more scratchpad registers and possibly also be quicker.
Here's a trick from the Guide: if a subroutine will be called frequently, it's quicker to load its address into the K register and call it using PK than to use PI multiple times. You'll use 4.5 cycles instead of 6.5 cycles to do the same thing.
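The trick could look something like this. It is only a sketch: the labels loop and fast are hypothetical, and it assumes DASM's unary > and < operators for the high and low byte of an address.

```asm
    li   >fast      ; high byte of the routine's address
    lr   ku,a
    li   <fast      ; low byte of the routine's address
    lr   kl,a
loop:
    ; ...do something...
    pk              ; PC1 <- return address, PC0 <- K
    ; ...do more...
    br   loop
fast:
    ; ...do something, without changing K or using PI...
    pop             ; return: PC0 <- PC1
```

Loading K takes a few instructions up front, so this only pays off when the routine is called many times, and K is unavailable for saving return addresses while it holds the routine's address.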