Optimizing Geos Applications

Originally published in Handheld Systems, Vol 4.6

One of the reasons why Geos has been quite successful as an operating system for low-cost handheld devices is that its appetite for resources (CPU horsepower as well as memory) is relatively modest. On the other hand, this also means that users will expect applications written for it to perform well on hardware that would be considered "low-end" by the standards of today’s desktop PCs. In this article, I will try to give a few examples of things you should consider when tuning your applications for optimum performance.

In the "old days" when operating systems only offered a limited set of basic system functions, like memory management, access to mass storage and sometimes CPU time allocation, the most important aspect of optimization was "tweaking" those parts of an application’s code that used up the most CPU time and/or memory. This could consist of using more "intelligent" algorithms, rearranging the code to minimize the number or complexity of instructions or even coding critical parts in assembly language. Discussions about new compilers often centered around how cleverly they used CPU registers or how much invariant code they could remove from the body of a benchmark loop.

Under the Hood

While the choice of good algorithms is still important, the situation has otherwise changed quite dramatically with today’s operating systems and graphical environments. A complex system like Geos offers a large number of services and utility functions to the application. For example, the kernel and user interface modules of Geos alone export more than 2000 API functions and class definitions (of course, some of them are for internal use only or present different ways of accessing the same functionality). This has two consequences:

  1. A lot of standard tasks can be left to the system, taking advantage of optimizations and testing already performed for its code (no need to "reinvent the wheel").
  2. When it comes to optimizing a program, the performance of the application’s code may become much less important than its efficient use of system functionality.

This means that optimizing may often involve learning more about how the system does things internally than working on your own code. I will give a few examples from my own work where taking a different perspective on what happens "between two lines of GOC code" has yielded surprising performance gains.

Mean Maths

Some time ago, I tried to explore the possibilities of writing a simple "virtual reality" game for the OmniGo. The central part of this should be a routine for displaying a maze of texture-mapped walls, which I wanted to adapt from Mac source code I had found in a German computing magazine. (This is what would eventually be my "Raycast" demo.)

When I had the first working port of the rendering routine, it would require about 7 seconds to compute and draw a single image on my machine (a P75). The original C routine made heavy use of float and double variables, which are converted to 8087 floating point opcodes by the Borland C compiler. On machines without such a coprocessor, these would eventually be emulated by the borlandc library of Geos, which in turn uses the math library.

After a day or so spent working around bugs in the borlandc library (which apparently have been fixed on the OmniGo), I decided to make the big switch and remove all the float arithmetic in favor of WWFixed integer math. This was easier than I had thought, and the result was unexpectedly striking: the redraw time was suddenly down to about 200 milliseconds per frame, a factor-35 improvement over the first version! As an added benefit, I could once again remove most of the variables holding intermediate results that I had introduced to avoid emulator bugs...

All further changes, like using double-buffering techniques for drawing, converting the timing-critical texture scaling routine to assembly language and using a new "divide-free" algorithm for scaling, only yielded another factor of 4 or so in total. (The program was still too slow for the OmniGo, where it would only do about one frame per second, which would certainly not make for a very exciting game - so the project was discontinued...)

Divide and Conquer

The following snippets of code show a simple "divide" operation for the four different data types offered by Geos for storing non-integer numbers:

  • C float (double and long double analogous):
float f1=1.0, f2=3.0, f3;
/*...*/
f3 = f1/f2;

  • Geos80:
FloatNum g1, g2, g3;
FloatDwordToFloat(1); FloatPopNumber(&g1);
FloatDwordToFloat(3); FloatPopNumber(&g2);
/*...*/
FloatPushNumber(&g1);
FloatPushNumber(&g2);
FloatDivide();
FloatPopNumber(&g3);
  • WWFixed:
WWFixedAsDWord w1,w2,w3;
w1 = MakeWWFixed(1);
w2 = MakeWWFixed(3);
/*...*/
w3 = GrSDivWWFixed(w1,w2);

The table lists the number of operations per second performed by each of these variants:

op/s            P75       OmniGo
C long double   6000      400
Geos80          15000     860
WWFixed         120000    7000

This comparison shows that the "obvious" way of using the built-in types of C carries quite a speed penalty, especially when taking into account that Geos80 numbers use 80-bit precision (while float and double only use 32 and 64 bits, respectively), and, according to the SDK documentation, are "equivalent to a long double".

Of course, the comparison to WWFixed is not completely fair, because it only covers a very limited range of numbers (0 to 65535 or signed equivalent) and has a very limited accuracy, but the precision may still be sufficient for many applications.

There is another small benefit involved in not using float and double types: the version of the GLUE linker that comes with the OmniGo SDK will not link the borlandc library to the application if no floating point emulation is required, which will slightly improve loading time and memory requirements of the program.

Drawing GStrings

I came across another surprise lesson in speed during the development of FontMagick, a program written for the Geoworks Ensemble application suite that helps in the design of ornamental text elements.

The program makes heavy use of the concept of GStrings. These are series of drawing commands stored in memory blocks or VM files which can be played back to an actual viewing device (screen or printer) later - this is important because many of the effects created by FontMagick are based on drawing the same object a number of times, each with different attributes and at a different place.

The application uses a fairly complex way of manipulating GStrings describing the outline of each letter. This allows it to work around a number of bugs in the drawing engine and printer drivers of Geos and to be able to apply non-linear distortions (transformations that cannot be described using a simple matrix) to an object.

I assumed that these computations were relatively time-consuming compared to the time required for actually drawing the object, so I decided to "cache" all the drawing commands necessary to reproduce the effect in a VM-based GString.  Therefore, redrawing the contents of the main window would only require playing back the GString without any further computation. In addition, I used a simple call to GrGetGStringBounds() to get the dimensions of the resulting object.

As beta testers kept complaining about the time needed to re-compute the shape of an object any time one of the settings had been changed, I went back to the drawing board and had a look at the "real" timing using Swat and the Profiling Kernel (see below). After some timing of the various components, I found out what the real time-killers were:

  • GrGetGStringBounds() alone took about as much time as all the rest of the computation, preparation and drawing of the object together. Looking back, it was probably naive to assume that the accurate size of an object with a large number of splines could be computed really fast, but I was surprised when I saw how much time it took.
  • Drawing into a GString in a VM file actually took longer than drawing the same object to the screen directly - even though the screen version involved filling complex, spline-bounded areas, while I had expected creating a GString to do little more than put one command record after the other.
  • The profiler (see below) revealed that a single routine, LMemInsertAt(), accounted for a considerable amount of total CPU time spent. It seems that the LMem operations involved in creating the internal structure of a GString are not implemented in an optimum fashion for this type of use...
  • To make things worse, the object would still have to be drawn on screen anyway, regardless of how much time had been spent for just putting together the command sequence.
  • The actual computations, which I had always blamed for the time delay when formulating my excuses to the beta testers, used up about a tenth of the time needed for a single screen refresh.

Once I found this out, the solution was fairly obvious: I reduced caching to a minimum, using it only to store one copy of the outline to which all the maths had been applied. Replication, scaling and attribute manipulation of the outline is now done completely "on the fly" whenever the view is exposed. As a side effect, I could remove the entire handling of the VM cache file, because the "basic" outline would be small enough to fit into an LMem GString, which is much faster.

To avoid having to call GrGetGStringBounds() on the (non-existent) GString for the complete object, I now use it only on the "fundamental" outline and then infer the total object size based on "rules of thumb" and "play-it-safe" calculations. This may give me a narrow white margin around complex objects (or it may, in rare cases, lead to parts of the object slightly extending beyond the bounds because inaccuracies in GrGetGStringBounds() are multiplied), but it means that almost no additional time is required for size calculations.

The only area which still suffers from the slowness of VM GStrings now is the "copy to clipboard" routine, which has no other choice than to create the transfer format.

A Lost Gem: the Profiling Kernel

As shown, it can often be useful to check how much time a certain routine really takes to execute, instead of just guessing... One method of finding out is using the built-in "tick" timer of Geos which has an accuracy of 1/60s:

dword time1,time2;
char buf[80];

time1 = TimerGetCount();
/* ...do something time consuming... */
time2 = TimerGetCount();
/* Format and display the result: */
sprintf(buf,"%ld milliseconds",(time2-time1)*1000/60);
UserStandardDialog(
  NULL, NULL, NULL, NULL, buf,
  (CDT_NOTIFICATION << CDBF_DIALOG_TYPE_OFFSET) |
  (GIT_NOTIFICATION << CDBF_INTERACTION_TYPE_OFFSET));

A more sophisticated tool for doing this has been included with the Geos 2.01 SDK, which is intended mainly for Ensemble and Zoomer development. However, if your application does not use any OmniGo-specific libraries or classes, chances are good that it can still be tested with this method.

It requires a special version of the system kernel by the name of geos_tp.geo. This version is included with the non-error-checking (NC) version of the Geos 2.01 target, and it can be activated by using the Options/Kernel... command of the Debug application. After selecting the "Timer" profiling kernel and restarting the system, you can attach the Swat debugger to the session and use the timer-profile command to collect performance data (you can get a brief description of the command by typing help timer-profile at the Swat prompt).

The idea of the Timer profiler is that whenever the system encounters a timer tick (once every 1/60 of a second), the kernel records where execution was interrupted. It is reasonable to assume that the number of hits within each routine is proportional to its share of total execution time...

To use the profiling kernel, follow these steps:

  1. Launch your application on the Target (with Swat attached) and bring it to the situation where performance should be analyzed.
  2. Stop the program using Ctrl-C at the Host (the machine running Swat).
  3. Enter timer-profile start.
  4. Restart the program using the go command.
  5. You can now work with the program for up to 34 seconds (if you let it run longer, you will not get any information about the time after that). To make most efficient use of this time, you should have completed any user interactions before stopping the program.
  6. Stop the program using Ctrl-C at the Host.
  7. Enter timer-profile to display the results of the analysis. If the list contains a large number of different routines, it is good to know that Ctrl-U will take you back one page in Swat’s output...

The obvious disadvantage of this method is that it can only analyze a fairly small segment of the program. In addition, you will only get information about the parts of the code where interrupts actually took place, but this will not tell you anything about the "history" of a given call, that is, the reason why a certain system routine was called this often - this will sometimes require additional guesswork to interpret the results.

For some reason, Geoworks chose not to include a profiling version of the new kernel with the emulator of the OmniGo SDK (even though the timer-profile command is still there), so the only way of getting it is through the 2.01 SDK.