Optimizing FastForth: Optimizing in a BSR/JSR Threaded Forth
Forth Dimensions - March / April 1993 - pages 21-26
The author helps intermediate Forth programmers learn how to optimize their applications. These highly portable techniques require only a certain amount of bravado, an analytical approach, and knowledge of your CPU's instruction set and your Forth. Once it is built and fine tuned, your optimizer should help you produce faster, more efficient code.
The purpose of this paper is to describe a code optimizer for a 68000-based JSR/BSR threaded Forth interpreter/compiler. The code operates in the traditional Forth single-pass compiler, optimizing on the fly. The result is compiled words which execute in fewer instructions than the words called out in the source code.
The Forth used for the code described herein is FastForth, a full 32-bit JSR/BSR threaded Forth for the 68000, described in unmitigated detail in "Life in the FastForth Lane" in an earlier issue of Forth Dimensions. It is a direct modification of an indirect-threaded Forth, Real-Forth. This is, in turn, a direct descendant of fig-Forth. (Remember fig-Forth?) The fig-Forth vocabulary, word names, and other features have been retained.
For those not familiar with 32-bit Forths, the memory operators with the prefix W operate on word, or 16-bit, memory locations. FastForth uses the operators F@ and F! for 32-bit memory operations where the address is known to be an even address. To avoid odd-address faults, the regular Forth operators @ and ! use byte operations.
The assembler used for illustration is descended from the fig 68000 assembler by Dr. Kenneth Mantei. It is a typical Forth reverse-polish-notation assembler, with the typical syntax: source, destination, opcode. The addressing modes relevant to the paper are as follows:
There is nothing particularly new conceptually here. Chuck Moore's CMForth includes an optimizer for the Novix NC 4000. The present paper describes an optimizer for a more traditional CISC instruction set, the Motorola 68000.
The compiler used in FastForth looks very much like a traditional indirect threaded Forth. However, it lays down opcodes which call (via BSR or JSR instructions) lower level words, rather than a list of addresses for NEXT to interpret.
For example, the traditional word L is defined as follows:
In an indirect-threaded 32-bit Forth, the compiler would build the header for L. This would be followed by a four byte address for the code to be executed to interpret the word. The code field address is followed by a four byte address for each of the three words called out in the source. This would be followed by the address of the exit code, laid down by the compiler directive ;.
In a BSR/JSR threaded Forth, the compiler lays down BSRs ("branch to subroutine") or JSRs ("jump to subroutine"), as appropriate, to the words called out. The return code consists of an RTS instruction. The result may or may not be smaller than the indirect threaded version, but it certainly will be faster. Whether the result is smaller or not depends on the mix of short BSRs (two bytes), long BSRs (four bytes) and JSRs (six bytes) laid down at compile time.
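That size tradeoff can be made concrete. The following is a minimal sketch in Python (not the article's Forth; the function name is invented for illustration) of how a subroutine-threading compiler might pick the smallest call encoding from the displacement between the call site and the target word:

```python
# On the 68000, a short BSR carries an 8-bit displacement in the opcode
# word (2 bytes), a long BSR adds a 16-bit extension word (4 bytes), and
# a call farther away needs an absolute-long JSR (6 bytes). An 8-bit
# displacement field of zero is reserved to mean "16-bit form follows".

def call_encoding(displacement: int) -> tuple[str, int]:
    """Pick the smallest 68000 subroutine call for a given displacement."""
    if -128 <= displacement <= 127 and displacement != 0:
        return ("BSR.S", 2)
    if -32768 <= displacement <= 32767:
        return ("BSR.W", 4)
    return ("JSR", 6)
```

Whether a compiled word beats the firm four bytes per reference of indirect threading thus depends entirely on how far away its callees happen to live.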
One optimization discussed in Life in the FastForth Lane is to examine the last instruction of a word. If it is a BSR or JSR, that instruction can be twiddled to produce a BRA or JMP instruction.
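That twiddle is a single-bit change to the opcode. Here is a hedged Python sketch (illustrative, not the article's code) of the tail-call rewrite, using the standard 68000 encodings:

```python
# BSR and BRA differ by one bit (0x61xx vs 0x60xx); JSR <ea> and JMP <ea>
# likewise (0x4E80|ea vs 0x4EC0|ea). Turning the final call of a word into
# a jump eliminates the RTS and one level of return-stack traffic.

def twiddle_tail_call(opcode: int) -> int:
    """Rewrite a trailing BSR as BRA, or JSR as JMP; otherwise leave it."""
    if (opcode & 0xFF00) == 0x6100:      # BSR (short or word form)
        return opcode & ~0x0100          # -> BRA
    if (opcode & 0xFFC0) == 0x4E80:      # JSR <ea>
        return opcode | 0x0040           # -> JMP <ea>
    return opcode
```

The rewritten word returns directly from its last callee's RTS, so the behavior is unchanged.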
Another optimization is to lay down inline code instead of calls. This is particularly beneficial when calling short words (e.g. F@) from a distance which would require a JSR instruction. Not only does the technique save time at run time (by eliminating a call and an RTS instruction), but it may reduce the size of words. One circumstance where this technique does not save space is where a four-byte word is copied inline to a location which would have required a short, two-byte BSR.
Variables, constants, and user variables in FastForth are immediate words which compile inline code, often a six-byte reference.
With these optimizations, the compiler produces the following for the sample word shown above:
The total is sixteen bytes at most, compared to a firm twenty bytes for the indirect threaded version. So the JSR/BSR version may be smaller, but certainly will be much faster!
Two more typical Forth optimizations are common and won't be discussed very much.
If a phrase shows up a lot in a Forth program, it is common practice for the programmer to consolidate that phrase in a word, with a meaningful name. This is optimizing for readability and dictionary size, rather than speed.
The second is to reduce words from high level to assembler. This requires the active intervention of the programmer, and the results are well worth it in terms of speed, and often worthwhile in terms of space. Alas, such optimizations only improve readability for those who know the relevant assembly language (and, sometimes, the relevant assembler), leaving the code more opaque to those without those skills.
The Optimizer Design
An optimizer for inline code should be a single-pass optimizer, to be consistent with Forth's traditional single-pass compiler. This makes some optimizations difficult (for example, replacing long forward branches with short ones on the fly), but results in much simpler code in the compiler. The optimizer must, then, operate at compile time, and so must consist of immediate words.
The optimizer should be an add-on, so that the user can extend it if he wishes. However, it should also be carried over to the cross-compiler so as to produce a very efficient nucleus.
The optimizer should work silently, as far as the user is concerned. That is, it should require no changes in source code to be useful. This requirement separates out the optimizing compiler from the two common optimizing methods described above.
Initially, the optimizer word could be developed as a series of discrete words, but these would be replaced by one or more defining words and their daughter words.
One key to single pass optimization during compilation is to have immediate words which examine previously compiled opcodes, and twiddle certain ones to produce tighter code.
Another key is to look for certain phrases which can easily be detected and easily twiddled. We could look for the phrase BLK F@, which shows up all over the nucleus, and optimize that. Instead, let us generalize, and look for phrases of the general type <USER VARIABLE> F@.
To detail this example: a user variable is an immediate word which lays down a two-instruction phrase. The first instruction moves the address of the user variable, indicated by an offset from the user area register (U), into address register 0. The second pushes it out onto the stack. The result, examined in memory as word data, looks like: 41EE <offset> 2708.
If the next word in the source code is F@, and it is immediate, it can look at HERE-less-6 for the opcode 41EE. Finding it, it can twiddle the dictionary, changing the final opcode from 2708 to 2710.
The first instruction still moves the contents of the user variable into the address register, but the second instruction now reads data from the location pointed to by the register, and pushes it onto the data stack.
The phrase <USER VARIABLE> F@ is now executed in two instructions instead of the previous four, and occupies six bytes instead of the previous ten. And the optimizer works for all user variables, even ones not defined at the time the optimizer is compiled.
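The mechanics of this peephole can be sketched outside Forth. The following Python model (helper names invented for illustration; the opcodes 41EE, 2708, and 2710 are the ones from the article and its screens) treats the dictionary as a byte array, with HERE as its length:

```python
# A user variable compiles LEA offset(A6),A0 (opcode 41EE plus a 16-bit
# offset) followed by MOVE.L A0,-(A3) (2708), which pushes the address.
# The immediate F@ looks back six bytes for 41EE and, finding it, twiddles
# the push into MOVE.L (A0),-(A3) (2710), which pushes the contents.

dictionary = bytearray()

def w_comma(value: int) -> None:            # W, : lay down one 16-bit word
    dictionary.extend(value.to_bytes(2, "big"))

def compile_user_variable(offset: int) -> None:
    w_comma(0x41EE)                         # LEA offset(A6),A0
    w_comma(offset)
    w_comma(0x2708)                         # MOVE.L A0,-(A3)

def compile_fetch() -> None:                # the immediate, optimizing F@
    here = len(dictionary)
    if (here >= 6 and dictionary[here - 6:here - 4] == b"\x41\xee"
            and dictionary[here - 2:here] == b"\x27\x08"):
        dictionary[here - 2:here] = b"\x27\x10"   # MOVE.L (A0),-(A3)
    else:
        pass  # otherwise compile a BSR/JSR to F@ as usual (omitted here)
```

Compiling a user variable at offset hex 10 and then F@ leaves the six bytes 41EE 0010 2710 in the dictionary: two instructions, with no call to F@ at all.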
Other two word phrases were similarly identified and optimized, and some three word phrases were also identified and optimized. As each phrase was identified, a defining word was built up, consisting of nested IF ... ELSE ... THEN clauses. The resultant words are monsters, and must be thoroughly understood by the programmer who seeks to modify them. In these two respects, they are unForthish, but the gain obtained by using them is worth the price.
These words must all be state smart. As they will run either at run time or at compile time, they must examine STATE and act accordingly. The action at run time is, of course, to execute their namesakes. Hence in the run time portion of the defining words, the phrase STATE F@ IF ... ELSE @EXECUTE THEN.
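In a language without STATE, the same state-smart dispatch can be modeled as a higher-order function. A sketch in Python (illustrative names only, mirroring the phrase STATE F@ IF ... ELSE @EXECUTE THEN):

```python
# STATE selects between the optimizing compile-time action and plain
# execution of the namesake word.

STATE = [False]   # false = interpreting, true = compiling

def state_smart(namesake, compile_action):
    """Build a word that optimizes at compile time, executes otherwise."""
    def word():
        if STATE[0]:
            return compile_action()   # IF ... : examine and twiddle code
        return namesake()             # ELSE @EXECUTE THEN: run the original
    return word
```

The daughter words produced by UNARY and BINARY behave exactly this way: outside a colon definition they are indistinguishable from their namesakes.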
In order for that phrase to work correctly, we must have the runtime address of the namesake in the dictionary. We require the namesake to be explicitly stated: the defining word ticks (') it and commas the address into memory. This is accomplished by the phrase -FIND IF DROP , ELSE 0 ERROR THEN on screens 586 and 587.
(It is possible to dispense with the necessity of naming the namesake word by playing with the contents of the user variable IN (>IN to neo- and meso-Forthwrights). The implementation will be left as an exercise for the student. It was not implemented in order to save space in the dictionary, not because the author was lazy.)
Another general caveat is that the optimizer must not optimize across branch terminations. While it might be acceptable to optimize the phrase FOO F@, the phrase FOO THEN F@ is not readily optimized. As THEN is an immediate word, and leaves nothing in the dictionary where the optimizer can detect its passage, we must redefine it to leave a flag. This is done on screen 585. This is why the run time portions of our optimizers examine the variable OPT immediately after they examine STATE.
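A sketch of the flag mechanism (Python, illustrative only; in FastForth it is the redefined THEN and BEGIN on screen 585 storing HERE into the variable OPT):

```python
# The redefined THEN (and BEGIN) record the current dictionary pointer in
# OPT. Every optimizer first checks that HERE has moved since the last
# branch resolution, and falls back to plain compilation when it has not.

dictionary = bytearray()
OPT = [None]                        # 0 VARIABLE OPT

def then_():                        # : THEN  HERE OPT F! [COMPILE] THEN ;
    OPT[0] = len(dictionary)        # record HERE at the branch resolution

def safe_to_look_back():            # HERE OPT F@ -  (nonzero means safe)
    return len(dictionary) != OPT[0]
```

Code laid down after the THEN moves HERE past the mark, and the optimizers are free to look backward again.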
Two defining words have been produced. UNARY is used to optimize words which are unary operators. That is, they take one item from the stack and operate on it, leaving one or zero items on the stack. BINARY is for words which take two items on the stack, and leave one. For examples of daughter words, see screen 589.
With the basic concepts laid down, we can expand our optimizer in three ways. We can add new defining words, for new classes of optimizers. We can add new daughter words to the existing defining words. We can add new capabilities and, if needed, new parameters, to the existing defining words and their daughter words.
The last method of extension is how the optimizer words were produced in the first place. The programmer started out with a default action (compile the namesake, as usual), and one test and one action for a desired condition. As new phrases were considered for optimization, the nesting of IF ... ELSE ... THEN clauses continued apace.
This methodology allowed for incremental testing of the words under development. Screen 590 shows a test for the binary operator. The test is done by compiling two words. One is a code definition, consisting of the desired output for the compiler. The other is a test high level word which exercises the optimizer. Screens 591 and 592, not shown, contain the target defining word and daughter words.
The last two lines of the screen compare the two words and disassemble/decompile them both automatically as part of the compilation process. These two tests almost instantly indicate problem areas with words under development. Automated testing of compiler output in this manner allowed very fast, reliable development of the optimizers, and was essential to the success of the project.
Once the basics of the optimizing code have been worked out, it remains only to incrementally add functions to analyze the code and handle the phrases where optimization is desired.
Selecting Phrases for Optimization
If you have your own target compiler and nucleus source, the best way to optimize all possible applications is to improve the nucleus. Anything that improves BLOCK will improve words that call BLOCK. So as FastForth was developed, optimizers were added to the target compiler as well as to the FastForth environment. The choice of phrases to optimize reflects an effort to improve the nucleus first, with improvements elsewhere secondary.
As noted, the phrase <USER VARIABLE> F@ shows up all over the nucleus. So do <USER VARIABLE> F!, <USER VARIABLE> OFF, and <USER VARIABLE> 1+!. The optimizations of F@ and F! were primary, with the others secondary. These are the phrases to be optimized by the optimizer defining word UNARY on screens 586 and 587.
These words also operate with variables, and often with constants. Both variables and constants compile to inline literals, either in the form <value> @#L S -[ MOV, or, for literals in the range of (hex) -80 to 7F, in the form <value> # DR7 MOVQ, DR7 S -[ MOV,. However, since most variables and constants used as variables will be long values, it is essential to detect long literals, with short ones a possible addition for the student.
The long literal form compiles into opcode 273C followed by the four-byte value: a push of the literal onto the data stack.
After manipulation by F@, the code should look like opcode 2739 followed by the address: a push of the contents of that address.
After manipulation by F!, the code should look like opcode 23DB followed by the address: a pop of the data stack into that address.
This means that the code in UNARY will twiddle the literal's opcode to change its destination, and lay down a new instruction. Since the instruction will vary with the word being compiled, this must be provided as an operand to each optimizer as it is compiled. This instance is handled on screen 587, lines three and four.
With nuclear optimization in mind, the phrase <USER VARIABLE> F@ F@ is handled as well. This phrase shows up in places that affect compiler speed, such as in FIND and LATEST. Any application which uses double indirection will benefit.
The next defining word for optimizers is for the family of binary words. These are words which, prior to optimization, take two operands from the stack and return one: +, -, etc., as indicated on screen 589. In code they take the form of a pop of the top of the data stack into data register 7, followed by an operation combining DR7 with the new top of stack.
If we can detect literals and user variables, and see to it that their contents are left in DR7, we can then compile the appropriate opcode to complete the operation, saving a push to and a pop from the data stack.
For example, adding a byte literal to the top of the stack becomes a two-instruction sequence: a MOVQ of the literal into DR7, followed by an ADD of DR7 to the top of the stack. Similarly, adding the contents of a user variable to the top of the stack becomes a move of the variable's contents into DR7, followed by the same ADD.
This optimization gets rid of three instructions, and produces code with fewer instructions than there were words in the original source. Not bad for not being an example of Moorish architecture.
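The user-variable case can be modeled the same way as the F@ peephole earlier. A Python sketch (illustrative names; the opcodes 272E and DF93 come from screens 588 and 589, and the toggled result 2E2E follows from the 9 TOGGLE on screen 588):

```python
# Opcode 272E is MOVE.L d(A6),-(A3), the already-optimized form of
# <USER VARIABLE> F@. Toggling hex 9 in its high byte gives 2E2E,
# MOVE.L d(A6),D7, retargeting the move to the scratch register; DF93
# is ADD.L D7,(A3), which adds DR7 straight into the top of the stack.

dictionary = bytearray.fromhex("272e0010")    # <USER VARIABLE> F@, offset 10

def compile_plus() -> None:                   # the immediate, optimizing +
    here = len(dictionary)
    if here >= 4 and dictionary[here - 4:here - 2] == b"\x27\x2e":
        dictionary[here - 4] ^= 0x09          # retarget the move: -(A3) -> D7
        dictionary.extend(b"\xdf\x93")        # ADD.L D7,(A3)
    else:
        pass  # otherwise compile a BSR/JSR to + as usual (omitted here)

compile_plus()
```

The result is two instructions in six bytes, with no stack traffic between them.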
To return to the original example, an updated table taking into account the optimizer is as follows:
Comparison: Traditional Compilers
A conceptually simple but very powerful Forth code optimizer can be had in five screens, less than two pages. One is hard pressed to imagine a traditional compiler whose optimizer occupies so small a source-code space. One also has a hard time imagining the likes of AT&T or Microsoft releasing source for their compilers. And you don't have to call a 900 number to get support.
Furthermore, the optimizer presented here is a complete unit, and can be removed from the FastForth environment without any changes except, of course, in the size and speed of the generated code. It is dependent only upon the nature of the target processor.
Additional phrases may be selected for optimization by the user, who need only add them to the compiler, in the traditional Forth manner. Eventually, a diminishing return of better speed and code size must be offset against development time and costs. Unlike the traditional compiler, this tradeoff may be made by the end user, the application programmer, if he wishes. In fine Forth tradition, the application programmer may modify the compiler to suit his application, rather than the traditional methodology of modifying the application to fit the compiler's procrustean bed.
Indeed, the very notion of an application programmer having the ability to modify his compiler is a heresy to the ayatollahs of traditional computing.
The FastForth code optimizer produces fast, efficient code. It is easy to understand, and readily modified by the end user. It is very powerful, and conceptually very simple. Indeed, anyone reasonably familiar with the instruction set of his target processor and the inner workings of his Forth can write one. Like Forth itself, it makes an abattoir of the sacred cows of computing.
In the best Forth tradition, the code is released to the public domain. Enjoy it in good health.
FastForth for the Atari ST, including the above code, may be had in alpha release from the author. Please consult the author for the current state of documentation, etc.
Charles Curley is a long-time Forth nuclear guru. When not working on computers, he teaches firearms safety and personal self-defense. His forthcoming book, Polite Society, covers federal and state firearms legislation in layman's terms.
Scr # 585
 0 ( optimizers for : defs                 ( 22 3 92 CRC 11:05 )
 1 BASE F@ HEX
 2 0 VARIABLE OPT ( not particularly re-entrant! )
 3
 4 : THEN   HERE OPT F! [COMPILE] THEN ; IMMEDIATE
 5
 6 : BEGIN  HERE OPT F! [COMPILE] BEGIN ; IMMEDIATE
 7
 8 : OPGET ( addr ct --- | get operand ct bytes from addr )
 9    + W@ ;
10
11
12 -->
13
14
15

Scr # 586
 0 ( optimizers: unary                     ( 15 4 92 CRC 8:37 )
 1 : UNARY CREATE SMUDGE -FIND IF DROP , ELSE 0 ERROR THEN
 2   SMUDGE W, W, W, IMMEDIATE
 3   DOES> STATE F@ ( only if compiling... )
 4   IF HERE OPT F@ - ( not following a begin )
 5    IF HERE 6 - W@ 273C = ( following a literal? )
 6     IF 4 OPGET HERE 6 - W! ( yyy ** @#l xxx, )
 7     ELSE HERE 2- W@ 2708 = ( ar0 s -[ mov, eg user )
 8      IF -2 ALLOT HERE 4- W@ 41EE = ( user variable? )
 9       IF 6 OPGET HERE 4- W! ( yyy u ** &[ xxx, )
10       ELSE 8 OPGET W, THEN [ ( yyy ar0 [ xxx, )
11 -->
12
13
14
15

Scr # 587
 0 ( optimizers: unary                     ( 21 4 92 CRC 8:15 )
 1 ] ELSE HERE 4- W@ 272E = ( user f@ optimize )
 2  IF 206E HERE 4- W! 8 OPGET W,
 3  ELSE HERE 6 - W@ 2739 = ( literal f@ optimize )
 4   IF 2079 HERE 6 - W! 8 OPGET W,
 5   ELSE F@ <COMP> THEN THEN THEN THEN
 6  ELSE F@ <COMP> THEN ( following br resolution )
 7 ELSE @EXECUTE THEN ; ( not compiling )
 8
 9
10 : BINARY CREATE SMUDGE -FIND IF DROP , ELSE 0 ERROR THEN
11   SMUDGE W, W, IMMEDIATE
12   DOES> STATE F@ [ ( only if compiling... )
13 -->
14
15

fastForth on Atari ST (c) 1985-92 by Charles Curley   Tuesday 6/10/92 11:20:08

Scr # 588
 0 ( binary defining word                  ( 21 4 92 CRC 8:15 )
 1 ] IF HERE OPT F@ - ( not following a begin )
 2  IF HERE 4- C@ 70 = ( byte literal? )
 3   IF HERE 4- E TOGGLE ( xx # dr7 moveq, )
 4    -2 ALLOT 4 OPGET W, ( dr7 s [ xxx, )
 5   ELSE HERE 6 - W@ 273C = ( large literal? )
 6    IF 6 OPGET HERE 6 - W! ( yy #l s [ xxx, )
 7    ELSE HERE 4- W@ 272E = ( user f@ ?? )
 8     IF HERE 4- 9 TOGGLE 4 OPGET W, ( ofuser s [ add )
 9     ELSE HERE 6 - W@ 2739 = ( literal f@ ?? )
10      IF HERE 6 - 9 TOGGLE 4 OPGET W, ( lit dr7 mov, )
11      ELSE F@ <COMP> THEN THEN THEN THEN
12  ELSE F@ <COMP> THEN ( following br resolution )
13 ELSE @EXECUTE THEN ; ( not compiling )
14 -->
15

Scr # 589
 0 ( daughter words                        ( 20 4 92 CRC 13:22 )
 1 ( opget   8    6    4 )
 2    5290 52AE 52B9 UNARY 1+! 1+!
 3    4290 42AE 42B9 UNARY OFF OFF
 4    209B 2D5B 23DB UNARY F!  F!
 5    2710 272E 2739 UNARY F@  F@
 6
 7 ( l.w. lit  byte lit )
 8     693 DF93 BINARY +   +
 9     493 9F93 BINARY -   -
10      93 8F93 BINARY OR  OR
11     293 CF93 BINARY AND AND
12     A93 BF93 BINARY XOR XOR
13 BASE F!
14
15

Scr # 590
 0 \ test area for macro mods              ( 21 4 92 CRC 7:08 )
 1 DEBUG FORTH DEFINITIONS FORGET TASK
 2 : TASK ; BASE F@ >R HEX
 3 : ?LEN: [COMPILE] ' 2- W? ;
 4 0 VARIABLE SNARK
 5 CODE FOO ofuser fld dr7 mov, dr7 s [ and,
 6   snark @#l dr7 mov, dr7 s [ and,
 7   7f # dr7 movq, dr7 s [ and,
 8   ffff #l s [ and,
 9   NEXT ;C
10
11 1 2 +THRU
12 : BAR fld f@ and snark f@ and 7f and ffff and ;
13
14 R> BASE F! EDITOR FLUSH ?CR ?LEN: FOO ?LEN: BAR
15 ' FOO DUP 2- W@ ' BAR EDITOR -CITEXT . UN: BAR UN: FOO ;S