4tH

HOMEPAGE: http://www.xs4all.nl/~thebeez/4tH/

1. Introduction

Like Forth, 4tH is a compiler and an interpreter. Unlike Forth, you cannot switch between the two. Like Forth, 4tH runs Forth-programs. Not all of them, but some, and in a quite different way.

Most things have already been written. There have been Forths written in a high level language. There have been portable Forths. There have been Forths that could interface with C. Different architectures have been used to implement Forth. There have been Forths that were 16 kB or even less.

Well, all of that has been done. But here is a compiler/interpreter that's all of the above. And none of them either. It sounds like an ancient Greek riddle, but it isn't. It's 4tH.

2. History

To understand 4tH you have to know how it came to be. Like most things in life, 4tH developed slowly. Its predecessor is a C-function called strcalc(). This function is an implementation of an RPN calculator in one very compact function (about 6 kB source). It works with signed 32-bit integers and has about 20 commands and 20 variables. The C-programmer can add additional variables.

Using it in a C-program is very easy too. Just pass the source as a string and add any variables you need. It will return the result of that calculation.

Well, although primitive, it can still be very useful. You can implement an interactive RPN calculator in fewer than 5 lines of C. It can also be used to make calculations from sources stored elsewhere, like in a file or an environment-variable. If you can store a string there, you can store strcalc() source.

But we were not satisfied. We wanted to create some successor to strcalc() that could be used to create applets, small applications that can be embedded in an application. Like strcalc() it had to be fast and compact and easy to use. All these requirements and 'Reverse Polish Notation'. What language comes to mind first? Forth.

There were a few advantages and disadvantages to that approach. First, if it looked like Forth, it had to be compatible with Forth up to a certain point. Second, if it looked like Forth, we wouldn't have to write thick manuals and explain how to use the language. Third, if it looked like Forth, could we make it crash-proof?

A user can easily crash a Forth-system. Store something at a wrong address and your system hangs. We don't like that, even when the user is at fault. So we had to make a few concessions somewhere, since adding checks means the program will be less compact and slower.

For a very long time we just didn't get the right idea. Then on a dark night in October 1994, it happened. The baby was called 4tH and could do everything strcalc() did.

It took quite a while before 4tH had successfully got away from its strcalc() roots. The very first version was very buggy and little more than an RPN calculator with (incompatible) flow control and some string facilities. It required two passes to compile a source and the resulting bytecode could not be saved. The I/O was C-based and very primitive. There was no Character Segment.

The second version got string and file facilities. The I/O and flow control were completely rewritten, so they were now fully Forth-compatible. The second pass was discarded and H-code could finally be saved. The first move to ANS-Forth was made.

The third version came to be when the H-code eXecutable was created. This file format made it possible to port bytecode across platforms. At the same time, 4tH moved more and more toward ANS-Forth. Exception-handling and assertions were introduced. And in the spring of 1997, version 3.1c was released to the general public.

Of course, 4tH didn't stop there. Since then, conditional compilation, forward declarations and inline-macros have been added. The compatibility with ANS-Forth has been significantly improved. Neither the compactness nor the speed of 4tH has been compromised. It uses less memory than previous versions and is just as fast.

3. Applications

4tH is an excellent platform to learn Forth. It looks and behaves like a conventional compiler, but is essentially Forth: a Forth that detects virtually every error and reports what went wrong and where, yet is still quite fast and compact.

But like any good teacher, 4tH is quite strict. Forth allows constructions that should be avoided. 4tH, on the other hand, either does not implement these words or restricts their usage.

Other Forth concepts are hard to handle, like the different wordsets for different kinds of numbers. 4tH only uses signed 32-bit integers, which enables the programmer to write a wide range of applications without being bothered by overflow. Pointers, integers and characters are transparently converted.

That doesn't mean that 4tH cannot be used as a scripting language anymore. There are still excellent facilities in 4tH to do just that. They are just modified in order to allow programmers to use 4tH as a stand-alone language. If you wonder how we did all that, here is the answer.

4. The 4tH language

Most Forths use four different datatypes: signed 16-bit numbers, unsigned 16-bit numbers, signed 32-bit numbers and unsigned 32-bit numbers. The latter two are usually called "double numbers". Unlike in C, they all have their own operators. On top of that there are mixed operators too. Highly confusing!

We never liked that in the first place. Application programmers want to make an application. They don't want to worry whether any intermediate result could possibly be larger than 32767. So 4tH gets rid of most data-types and operators. It uses signed 32-bit numbers. That's it. No mixed, double or unsigned operators.
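
To make the point concrete, here is a minimal C sketch (not taken from the 4tH sources) of a calculation whose intermediate result does not fit in a signed 16-bit cell while the final result does. The function names scale() and fits_in_16_bits() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a * b fits in a signed 16-bit cell, 0 otherwise.
   (Hypothetical helper, for illustration only.) */
int fits_in_16_bits(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;

    return product >= INT16_MIN && product <= INT16_MAX;
}

/* Computes (n * mul) / div with a 32-bit intermediate result,
   which is what a system with only 32-bit cells does anyway. */
int32_t scale(int16_t n, int16_t mul, int16_t div)
{
    int32_t product;

    assert(div != 0);                  /* division by zero is the caller's bug */
    product = (int32_t)n * (int32_t)mul;
    return product / div;
}
```

With n = 200, mul = 200 and div = 100 the intermediate product is 40000, too large for a 16-bit cell, while the result 400 is not. With 32-bit cells throughout, the programmer never has to notice.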

Second, a Forth programmer has to know how many address-units a cell takes. Since every data-type in 4tH has its own segment, the address-unit of a segment is always one, regardless of the data-type. Consequently, ANS-Forth words like 'CELLS' and 'CHARS' are 'NOOP's, which fits 4tH nicely.

Although 4tH has different words for storing and fetching different data-types, most of its vocabulary is still compatible with Forth. E.g. the word "C!" takes an address in the Character Segment and "!" takes an address in the Integer Segment.

Since the Code Segment and String Segment do not allow any writing, there is no need for such operators.
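
The segment model can be pictured with the following C sketch. It is an illustration under assumptions, not 4tH's actual implementation: the segment sizes, the Segments structure, the function names and the -1 error return are all made up here. It shows two separately addressed segments in which every element, whatever its type, occupies exactly one address, so that converting a count to address-units changes nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT_SEG_SIZE   8   /* hypothetical segment sizes */
#define CHAR_SEG_SIZE 16

typedef struct {
    int32_t       ints[INT_SEG_SIZE];    /* Integer Segment: one address per cell   */
    unsigned char chars[CHAR_SEG_SIZE];  /* Character Segment: one address per char */
} Segments;

/* With an address-unit of one, CELLS and CHARS have nothing to do. */
int32_t cells(int32_t n)       { return n; }
int32_t chars_units(int32_t n) { return n; }

/* Models "!": store a value at an address in the Integer Segment.
   Returns 0 on success, -1 if the address is outside the segment. */
int store_int(Segments *s, int32_t value, int32_t addr)
{
    assert(s != NULL);
    if (addr < 0 || addr >= INT_SEG_SIZE)
        return -1;
    s->ints[addr] = value;
    return 0;
}

/* Models "C!": store a character at an address in the Character Segment.
   Returns 0 on success, -1 if the address is outside the segment. */
int store_char(Segments *s, int32_t value, int32_t addr)
{
    assert(s != NULL);
    if (addr < 0 || addr >= CHAR_SEG_SIZE)
        return -1;
    s->chars[addr] = (unsigned char)value;
    return 0;
}
```

Address 3 means "the fourth integer" to store_int() and "the fourth character" to store_char(). Because every store is checked against the size of its own segment, a wrong address yields an error instead of a corrupted system.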

Each segment has its own allocation operators too. 'VARIABLE', 'ARRAY' and 'VALUE' allocate space in the Integer Segment. 'STRING' allocates space in the Character Segment. Other words like ''' and 'CREATE' have restricted functionality and compatibility with Forth.

4tH was originally loosely based on the Forth-79 standard, but now it supports most of the CORE wordset of ANS-Forth. Note that compatibility never had the highest priority. 4tH was designed to write applets, not to be the next "fully ANS-Forth compatible compiler with a little difference". If that is what you want, 4tH is not for you.

5. H-code

Long before the dawn of the original IBM-XT there was a language called UCSD Pascal. Like Forth, it was a compiler and an interpreter. In fact, it didn't compile source into object-code for some silicon-based processor. Instead it made P-code. So if you wanted to execute it, you needed a P-code interpreter for your system.

Such an interpreter can run faster than an ordinary interpreter since it doesn't interpret source-statements with all of its symbolic labels intact, but optimized P-code. It seems to have been discovered again, since Java and previous versions of Visual Basic work the same way. Visual Basic hides the interpreter in a DLL, but basically it doesn't work any differently.

4tH uses the same basic architecture. First the source is compiled into H-code. Then the H-code interpreter is run. H-code consists of tokens, and a token is a very simple structure: a single-byte instruction and an argument. Here's a sample of disassembled H-code:

      [62]   CR          (0)
      [63]   VARIABLE    (2)
      [64]   @           (0)
      [65]   1-          (0)
      [66]   DUP         (0)
      [67]   VARIABLE    (2)
      [68]   !           (0)
      [69]   0BRANCH     (62)

BTW, building a decompiler for tokenized code is quite simple. There is one for Visual Basic and it seems like one emerged for Java too. The H-code was the result after compiling this little piece of source code:

      cr
      begin
             times @ 1- dup times !
      until

You can clearly see that everything is actually compiled. Flow-statements are compiled into BRANCH and 0BRANCH instructions pointing to addresses in the Code Segment.
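
The way such tokens are executed can be sketched in C. This is a toy interpreter written for this text, not the 4tH interpreter: the opcode values, the Token layout, the stack size and run_tokens() are assumptions, and addresses are rebased so that [62] becomes index 0. It also assumes that the instruction pointer is incremented after every token, including a taken 0BRANCH, which would explain why the listing branches to 62 although the loop body starts at 63.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode names mirror the disassembly; the numeric values are invented. */
enum { OP_CR, OP_VARIABLE, OP_FETCH, OP_ONE_MINUS, OP_DUP, OP_STORE, OP_0BRANCH };

/* A token: a single-byte instruction plus an argument. */
typedef struct {
    uint8_t op;
    int32_t arg;
} Token;

#define STACK_SIZE 16   /* hypothetical data stack depth */

/* The listing [62]..[69], rebased so that address 62 is index 0. */
const Token loop_demo[] = {
    { OP_CR,        0 },  /* [62] CR            */
    { OP_VARIABLE,  2 },  /* [63] VARIABLE (2)  */
    { OP_FETCH,     0 },  /* [64] @             */
    { OP_ONE_MINUS, 0 },  /* [65] 1-            */
    { OP_DUP,       0 },  /* [66] DUP           */
    { OP_VARIABLE,  2 },  /* [67] VARIABLE (2)  */
    { OP_STORE,     0 },  /* [68] !             */
    { OP_0BRANCH,   0 },  /* [69] 0BRANCH (62)  */
};
#define LOOP_DEMO_LEN (sizeof loop_demo / sizeof loop_demo[0])

static int in_range(int32_t addr, size_t size)
{
    return addr >= 0 && (size_t)addr < size;
}

/* Executes tokens until the instruction pointer runs past the end.
   Returns 0 on success, -1 on a stack, address or opcode error. */
int run_tokens(const Token *code, size_t ncode, int32_t *ints, size_t nints)
{
    int32_t stack[STACK_SIZE];
    size_t sp = 0;                     /* number of items on the stack */

    assert(code != NULL && ints != NULL);

    for (size_t ip = 0; ip < ncode; ip++) {
        const Token t = code[ip];

        switch (t.op) {
        case OP_CR:                    /* would print a newline; omitted here */
            break;
        case OP_VARIABLE:              /* push an Integer Segment address */
            if (sp >= STACK_SIZE) return -1;
            stack[sp++] = t.arg;
            break;
        case OP_FETCH:                 /* ( addr -- n ) */
            if (sp < 1 || !in_range(stack[sp - 1], nints)) return -1;
            stack[sp - 1] = ints[stack[sp - 1]];
            break;
        case OP_ONE_MINUS:             /* ( n -- n-1 ) */
            if (sp < 1 || stack[sp - 1] == INT32_MIN) return -1;
            stack[sp - 1] -= 1;
            break;
        case OP_DUP:                   /* ( n -- n n ) */
            if (sp < 1 || sp >= STACK_SIZE) return -1;
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_STORE:                 /* ( n addr -- ) */
            if (sp < 2 || !in_range(stack[sp - 1], nints)) return -1;
            ints[stack[sp - 1]] = stack[sp - 2];
            sp -= 2;
            break;
        case OP_0BRANCH:               /* ( flag -- ) jump if flag is zero */
            if (sp < 1 || !in_range(t.arg, ncode)) return -1;
            if (stack[--sp] == 0)
                ip = (size_t)t.arg;    /* the loop's ip++ then moves to arg + 1 */
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

Running loop_demo with the variable at address 2 set to 5 executes the body once and leaves 4 there, since 0BRANCH only jumps back when the value it pops is zero.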

Compiled H-code can be used on its own. It can be kept in memory, loaded, saved, decompiled and executed. H-code is a combination of the String Segment, the Code Segment and a header. The header contains all the information to set up the runtime environment and some information on the String and Code Segments. The Integer Segment and the Character Segment are created at runtime.