Finding and Preventing Bugs in Geos Applications

Marcus Groeber 21 Nov 1996

Originally published in Handheld Systems, Vol 5.1

Testing and debugging applications is not exactly what developers would call their favorite pastime. On the other hand, most will agree that finding problems is an integral part of any development process. For this reason, it is important to know about any help the operating system and its tools can give you to identify errors with the least possible effort.

Catching bugs before they hit the user can be even more important on portable or “consumer” devices - I have heard people saying that they wouldn’t use a PDA which crashed even just once in normal use (and obviously this would extend to any additional software running on the device). Most users will have adopted a more cynical attitude towards “error-free” computing products, but this statement points to something that should be kept in mind: a crashed PDA will often leave you with few options to recover your data or even to restore normal operations - especially when you’re “out on the road”.

Bug Types

In an operating system as complex as Geos, where you will often try to delegate as much work as possible to the system itself (or where the system is doing “behind the scenes” work that you don’t normally notice), there are a number of reasons that can result in improper behavior of a program:

1. Problems whose cause is located entirely in the logic of your own code, like reading from uninitialized memory or running into an infinite loop when traversing a data structure.
2. Interactions with the rest of the system which you are not aware of. Typical examples of this would be blocks of memory moving when you would not expect them to, or forgetting to call the superclass to do its own processing when subclassing a pre-defined message to an object.
3. System functions not performing as described, for example, a C API call which trashes the SI register used for storing a variable, or a message which is documented, but not implemented.

Finding Bugs Automatically

Of course, the best bugs are those that never find their way into your program in the first place. It is hard to do anything about “type 3” errors, apart from not ruling them out completely when thinking about why your code failed. On the other hand, Geos offers a number of tools for catching “type 1” bugs by helping to pin down any side effects they cause.

All of these methods are usually called “the EC mechanism” (EC for error-checking). The basic idea is that there are two versions of any geode in the system, one which will be given to the end user and another which is full of extra checks and testing aids. These two versions can be created from one source code by conditional compilation. In particular, there is an EC version of the operating system itself which performs extensive checking on any arguments passed to system functions, and which can also do things to increase the likelihood of some bugs appearing during a test.

The EC version is normally used together with the SWAT debugger (which is in my opinion a great asset of the Geos system, even though it is as non-visual as can be), because without it a failed error check will not yield much more than a KR-something abort of the system, while SWAT will allow you to do a full post mortem analysis of the session that caused the problem.

The most immediate advantage of using the EC version is that it will catch a large number of problems with almost no effort from your side required, simply by checking most data that is passed through system calls.

If your target (the machine on which the code to be debugged is run) is fast enough (a 386/40 being the absolute minimum), you can also enable additional levels of global error checking by using the ec command of SWAT. You can either turn on certain areas of testing selectively (Geoworks recommends using at least ec +normal +segment +high), or you can use the commands ec all or ec ALL (note the difference in case) to enable the entire set of common tests. As said, you will need a relatively fast target machine for the application to remain usable interactively with high EC levels.

Running your application on an EC target will certainly not catch all the bugs automatically, but it is probably the type of testing with the best effort/result ratio.

In rare cases, testing with full EC features may mean that you have to clean up your code even more than necessary: There are situations where testing with ec +segment enabled (testing the contents of the ES and DS registers for proper values whenever possible) will fail on correct code, simply because the outdated address of a segment which had been moved in the meantime was still hanging around in the ES register without being used. Fixing things like this will usually take only a fraction of the time saved by successful detection of bugs.

You can also beef up error detection in your own code by adding further calls to EC routines verifying pointers, handles or memory regions. If you are using more complex algorithms which cannot be tested by the standard checks of the system alone, you should also consider adding your own error checking routines doing internal cross-checks and asserting that assumptions made in the code are actually true. Conditional compilation makes sure that these won’t affect the performance of your final program. (Chapter 6.5 of the Concepts book in the SDK describes how all of these features can be used.)
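
As a rough illustration of such a self-written cross-check, the following plain C sketch verifies that a list is still sorted after every insertion, but only if a build-time symbol is set. MY_ERROR_CHECK, MY_EC_ASSERT, SortedList and the two functions are hypothetical names made up for this example, not SDK identifiers; the SDK's own macros are the ones described in the Concepts book.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: set to 1 for the error-checking build only.
   It is set here so that the sketch exercises the checking path. */
#define MY_ERROR_CHECK 1

#if MY_ERROR_CHECK
#define MY_EC_ASSERT(cond) assert(cond)   /* checked in the EC build */
#else
#define MY_EC_ASSERT(cond) ((void)0)      /* costs nothing otherwise */
#endif

#define LIST_MAX 16

typedef struct {
    int    items[LIST_MAX];
    size_t count;
} SortedList;

/* Internal cross-check: returns 1 if the list is in ascending order. */
int ListIsSorted(const SortedList *list)
{
    size_t i;

    for (i = 1; i < list->count; i++) {
        if (list->items[i - 1] > list->items[i]) {
            return 0;
        }
    }
    return 1;
}

/* Inserts value at its sorted position; returns 0 if the list is full. */
int ListInsert(SortedList *list, int value)
{
    size_t pos;

    if (list->count >= LIST_MAX) {
        return 0;
    }
    pos = list->count;
    while (pos > 0 && list->items[pos - 1] > value) {
        list->items[pos] = list->items[pos - 1];
        pos--;
    }
    list->items[pos] = value;
    list->count++;

    /* Assert the assumption the rest of the code relies on. */
    MY_EC_ASSERT(ListIsSorted(list));
    return 1;
}
```

With MY_ERROR_CHECK set to 0, the check disappears from the compiled code, which is the same trade-off the EC and non-EC versions of a geode make.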

Another line of defense can be created by increasing the warning level used by the compiler. For Borland C, Geoworks is offering an updated version of the COMPILER.MK makefile which turns on additional warnings (available from their Web site) that may point toward potential problems. Meanwhile, you can achieve the same effect by adding the line

XCCOMFLAGS = -w -g255 -w-amp -w-pin -w-cln -w-sig -w-sus

to the LOCAL.MK file in your project directory. If you are creating a new file, make sure also to include the following line:

#include <$(SYSMAKEFILE)>

If you are using the ConstructOptr() macro or any of the LMem*Handles() functions, you should add the following definitions to the start of your source code to avoid frequent unnecessary warnings caused by the system header files themselves:

#undef ConstructOptr
#define ConstructOptr(han,ch) ((((optr) (han)) << 16) | ((ChunkHandle) (ch)))
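
For orientation, the sketch below shows what this replacement macro computes: the memory handle ends up in the upper 16 bits of the optr and the chunk handle in the lower 16 bits. The typedefs are stand-ins chosen so that the fragment compiles outside the SDK (they are assumptions, not the SDK definitions), and OptrHigh() and OptrLow() are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in types for a build outside the SDK (assumptions). */
typedef uint32_t optr;
typedef uint16_t MemHandle;
typedef uint16_t ChunkHandle;

/* The macro from the text: widen the handle first, then shift it. */
#define ConstructOptr(han,ch) ((((optr) (han)) << 16) | ((ChunkHandle) (ch)))

/* Hypothetical helpers that take an optr apart again. */
MemHandle OptrHigh(optr o)
{
    return (MemHandle)(o >> 16);
}

ChunkHandle OptrLow(optr o)
{
    return (ChunkHandle)(o & 0xFFFFu);
}
```

The cast to optr before the shift is the important part: shifting a value that is only 16 bits wide by 16 positions would not preserve the handle.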

Don’t Let it Happen

There are a number of potential problems resulting from peculiarities of the system, and most Geos developers will probably come across them sooner or later. A good source for information on this subject (actually, one of the few) was the “Developer Relations” area of www.geoworks.com, which among other things featured a “Frequently Asked Questions” (FAQ) list and a “knowledge base” with common questions and answers. If it has happened to you, it is likely that it has happened to someone else as well...

Following is a random selection from some of my “favorite” Geos programming errors:

Always keep in mind that Geos is a real mode, virtual memory system. This means that any pointers to global memory blocks may become invalid whenever that block is unlocked. (See next item for an exception.) Such problems are often hard to track down because a block doesn’t have to move any time you unlock it. You can use the EC version of Geos to make finding these bugs easier by using the +lmemMove and +unlockMove error-checking flags with the ec command of SWAT (see above).
“Real mode” also implies that access to an improper memory address (for example, dereferencing a NULL pointer) will not necessarily be detected by the system, unless it causes any visible effect. This is the reason why Geos will always force you to shut down whenever it detects even a single bad handle passed to a routine - it cannot tell what other damage has already been done to memory areas that cannot be verified automatically.
You can use the EC_BOUNDS() macro to perform a “sanity check” on a pointer at any time. Note that this check will only be included into the error checking version of your program.
Even though “virtual memory” may sound like “memory in abundance”, keep in mind that available memory on an OmniGo is only a few hundred KBytes maximum, so any code asking for additional memory may potentially fail and should at least try to resolve the situation gracefully. Watch those return codes.
There is an exception to the rule that “a locked block won’t move”: while you are within the method handler code for an object, you can normally assume that object’s block to be locked, and pself to point to the instance data of the object. However, Geos is free to move the block whenever your code sends a message (or uses @callsuper()). Because object blocks are really LMem heaps, the rule that any LMem operation on that block may invalidate pointers applies as well.
This means that you will have to “refresh” the value of pself after a message or an LMem operation using a line like:

pself = ObjDerefVis(oself);

(or ObjDerefGen(), depending on the type of object you’re using.)
If an object is behaving strangely after you have subclassed a pre-defined message (as opposed to defining a new one in the class definition), you may have forgotten to call the message handler for the superclass (using @callsuper()) which also has to do some processing of its own. If in doubt, do it. Of course, there are cases where calling the superclass is not appropriate, for example, if you are intercepting certain keystrokes by handling MSG_META_KBD_CHAR yourself.
Another “classic” pitfall comes in the way messages are delivered to objects: when using @send, there is no way of knowing when the message will be dispatched to the object. (It will usually be delivered immediately if the object is running on the same thread, but it is not a good idea to rely on this.) This is important for messages passing “volatile” parameters, like pointers or memory handles which are freed by the sender of the message. To make sure that your program will only continue after the message has been handled, use @call. This is also the method of choice if you want to process the return value of a message.
It should be noted that Geos does not perform any checking of its own for improper messages being sent to an object. Because message symbols (MSG_*) are internally converted to two-byte message IDs, it is very likely that two classes share the same message numbers for totally different messages. On the other hand, if a message with a certain number is not understood by a certain class, it is simply passed back “up” the class hierarchy until some class either deals with it or it is dropped at the “root” of the class tree (MetaClass). When using @call, the statement may even return a nonsense value without giving a warning if the message was not handled anywhere.
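
The two effects described in the last item (shared message numbers, and unhandled messages travelling up to the root) can be modelled in a few lines of plain C. This is only a toy model under simplifying assumptions: each class handles exactly one message, and all names (Class, Dispatch, the MSG_* symbols and the handlers) are invented for the example and do not exist in the SDK.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short MessageId;        /* two-byte message number */
typedef int (*Handler)(int arg);

typedef struct Class {
    const struct Class *super;           /* NULL at the root */
    MessageId           msg;             /* the one message handled here */
    Handler             handler;         /* NULL if nothing is handled */
} Class;

/* Two unrelated messages that happen to share the same number. */
#define MSG_COUNTER_ADD_TEN ((MessageId)0x4000)
#define MSG_LAMP_SWITCH_OFF ((MessageId)0x4000)
#define MSG_LAMP_DIM        ((MessageId)0x4001)

int CounterAddTen(int arg) { return arg + 10; }
int LampSwitchOff(int arg) { (void)arg; return 0; }

const Class rootClass    = { NULL,       0,                   NULL };
const Class counterClass = { &rootClass, MSG_COUNTER_ADD_TEN, CounterAddTen };
const Class lampClass    = { &rootClass, MSG_LAMP_SWITCH_OFF, LampSwitchOff };

/* Walks up the hierarchy. Returns 1 and stores the handler's result if
   some class dealt with the message, 0 if it was dropped at the root
   (in which case *result is left untouched). */
int Dispatch(const Class *cls, MessageId msg, int arg, int *result)
{
    for (; cls != NULL; cls = cls->super) {
        if (cls->handler != NULL && cls->msg == msg) {
            *result = cls->handler(arg);
            return 1;
        }
    }
    return 0;
}
```

Sending MSG_LAMP_SWITCH_OFF to a counter object is happily accepted here, because only the number is compared. The sketch reports a dropped message through its return value for the sake of illustration; as described above, @call gives you no such hint.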

In other words, any checking of whether it is “sensible” to send a message to a certain object must be done by yourself. The only exception to this are “classed events” which are ensured to be delivered only to objects of a given class (or one of its descendants).
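
Coming back to the rule about refreshing pself from the list above: the following plain C model shows why a pointer obtained from a handle has to be looked up again after anything that may move the block. It uses malloc() in place of the Geos heap, and every name in it (Handle, BlockMove, BlockDeref, ClickOnce and so on) is a hypothetical stand-in, not an SDK routine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* The handle stays valid for the lifetime of the block; the address
   behind it may change whenever the "system" gets control. */
typedef struct {
    void  *addr;
    size_t size;
} Handle;

/* Allocates a zeroed block; returns 0 on failure. */
int BlockAlloc(Handle *h, size_t size)
{
    h->addr = calloc(1, size);
    h->size = size;
    return h->addr != NULL;
}

/* Stand-in for anything that may move the block (sending a message,
   an LMem operation). Copies the data to a new address and releases
   the old one. Returns 0 if no memory was available. */
int BlockMove(Handle *h)
{
    void *fresh = malloc(h->size);

    if (fresh == NULL) {
        return 0;
    }
    memcpy(fresh, h->addr, h->size);
    free(h->addr);
    h->addr = fresh;
    return 1;
}

/* Same idea as the ObjDeref line in the text: look the address up again. */
void *BlockDeref(const Handle *h)
{
    return h->addr;
}

void BlockFree(Handle *h)
{
    free(h->addr);
    h->addr = NULL;
}

typedef struct {
    int clicks;
} Instance;

/* Returns the click count after one more click, or -1 on failure. */
int ClickOnce(Handle *self)
{
    Instance *pself = BlockDeref(self);

    pself->clicks++;
    if (!BlockMove(self)) {          /* e.g. a message was sent here */
        return -1;
    }
    pself = BlockDeref(self);        /* refresh before using it again */
    return pself->clicks;
}
```

Without the second BlockDeref() call, ClickOnce() would read from memory that has already been released, which is exactly the kind of bug that may go unnoticed for a long time on the real system.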

Libraries

When moving parts of your GOC code to libraries, you may find yourself in the situation that code you have trusted for years suddenly refuses to run. In this case, you should make sure that you have observed all the specifics of the library situation:

The most important difference with library code is that the contents of the DS (data segment) register become an issue, because whenever an application calls a library function, two different DGROUPs (default data segments, this is where the compiler stores global variables and constants) are involved: that of the caller (which is what DS points to at the entry of the routine) and that of the library (which is required to access constants and globals).

Another difference in a library is that the segment the stack is located in (which is provided by the caller) is not identical with the data segment of the library (sometimes called “SS!=DS” for short) - this rarely causes problems, because Geos programs are using the “large” memory model where all pointers carry their segment with them anyway.

To tell the compiler of this situation, two things must be done (these are somewhat specific to Borland C, but as there is no other supported C compiler for Geos development, so be it...):

Add the _export keyword to each of your exported functions. For example, the prototype for a function could look something like this:

void _pascal _export Foo(int bar);

(It is a kind of tradition to use Pascal calling convention for library functions, unless they take a variable number of arguments. You can also omit the _pascal keyword.)

Add the line

XCCOMFLAGS = -WDE

to the local.mk file in your project directory, or add the -WDE switch to an existing flag list.
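
If your project already uses the warning flags suggested earlier in this article, a combined LOCAL.MK might look like the sketch below. It does nothing more than join the two flag lists and the include line given above; whether your project needs further settings is not covered here.

```
XCCOMFLAGS = -w -g255 -w-amp -w-pin -w-cln -w-sig -w-sus -WDE

#include <$(SYSMAKEFILE)>
```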

Problems Caused by Testing

When going through a high number of edit/compile/debug cycles, you may occasionally find that some problems are actually caused by the way you are performing your testing. These can hold you up for some time for no apparent reason, so I have listed a few of them here...

Did you ever wonder why an apparently foolproof change didn’t fix the problem you expected it to? Perhaps you have simply forgotten to update the version of the application you just compiled in the target directory. If you are using a two-machine setup and testing your application with SWAT running, there is a simple way of avoiding this: create a revision control file in your project directory, using a command like this:

grev new <projectname>.rev

(If you’re using the OmniGo SDK, you may have to get a fixed version of GREV available from www.geoworks.com or on CompuServe.) Whenever you call PMAKE, this will add one to a version counter for your Geode, giving SWAT a chance to check whether the symbol files (which are taken directly from your project directory and updated at every compile) match the version of the application that was just started. If they don’t, you probably have not yet uploaded the latest version.
When testing on a single machine, you should note that an application hasn’t shut down completely by the moment its window disappears. This can cause all sorts of strange problems in multitasking environments where Geos is often “frozen” when put into the background. When recompiling and switching back to the old session, you may actually be able to start the “new” version before the old one has fully terminated.
When testing your application on the actual OmniGo, it may be a good idea to reboot (Shift-On-Next) your device after uploading a new version, because even termination with Fn+F3 doesn’t always seem to clean up all the remains of the previous version.

The Real Thing

When developing an OmniGo application, you will probably do most of your SWAT testing with the OG emulation on a second machine, simply because it is faster and more convenient. However, you may sometimes have to debug your application on the actual device to find very specific errors.

The OmniGo contains all the files necessary for SWAT debugging in ROM, so it is basically ready to use as a debugging target “out of the box”. However, there are still some preparations which are either necessary or convenient to make before starting your first debugging session:

To get the symbol files for all the libraries on the OmniGo in their proper version, you will have to install the OmniGo SDK. Two small changes to the symbol files are necessary for “attaching” to work smoothly (these changes accommodate slight version differences between the retail OG libraries and the versions that come with the SDK):

In the OMNIGO\DRIVER\POWER\JEDI directory, copy the JPWR.SYM file over to JPWR.GYM.
In the OMNIGO\LIBRARY\JEDI\NOTES directory, copy the NOTES.SYM file over to NOTES.GYM.

To make starting the SWAT stub easier, you should install some kind of text mode driver in your OmniGo’s AUTOEXEC.BAT (it is located on drive B - get it using a file transfer application and modify it on your PC). Also comment out the last line starting GEOS to make your device fall into DOS mode at reboot. (If you don’t use a text mode driver, you will have to enter starting commands without seeing them on the screen.)

The final thing is to make sure that the power driver of the OG does not turn off the serial port during startup. Geoworks recommends stopping the startup process at the right moment, but I have written a little TCL script which does the necessary fix for me.

It is called JPWRFIX.TCL (see listing) and should be placed into the TCL subdirectory of your Geos SDK installation.

#
# JPWRFIX
#
# Keeps the OmniGo from powering down the serial port
# during startup by catching the launch of the jpwr
# driver and forcing the serial power status to "on".
#
# The line [load jpwrfix] should be
# added to autoload.tcl
#

[defsubr _jpwr_catch {args}
{
    if {[string c [patient name [index $args 0]] jpwr]==0} {
      assign serialPowerStatus 3
    }
    return EVENT_HANDLED
}]

event handle START _jpwr_catch

To load it automatically at startup, append the line

[load jpwrfix]

at the end of the AUTOLOAD.TCL file (also in the TCL directory). If there is an AUTOLOAD.TLC file, it should be moved to another place because it contains a compiled version of the unmodified autoload file and would take precedence over the new one.

Now you should be ready for the first attach: After a reboot, you can either type GEOS to launch the device as usual, or enter SS1 to start it for debugging over the built-in serial port. You will have to configure your host to use a 9600 baud rate for this command to work. After starting SWAT on the host end of the connection, you can watch the libraries being loaded and the system starting up. (Because the power fix must be applied remotely, there is no way of launching the device and then attaching to it.)

The price paid for disabling the serial power driver during debugging is that batteries are drained very quickly while the device is attached. You should keep a few sets of batteries handy if you are planning a longer testing session. Leaving the device in DOS mode for an extended period after a reboot will also shorten battery life because of the higher CPU load (really!) and the device’s inability to turn itself off.

When testing on the OmniGo, remember that it doesn’t have EC capabilities, so it is mostly useful for final testing, and that you cannot set any breakpoints into ROM routines (which is where most of Geos resides), limiting the use of some SWAT functions.