About GeoDump

This is the documentation for the GeoDump 0.5 tool - it should probably go into the FreeGEOS repository at some point, but as I found myself looking up its content sometimes, I am just copying it here for now.

In short, if you know the EXEHDR utility that usually comes with the Microsoft compilers, you will know what GeoDump is for... although it is much more now, as it can do the same thing for data and font files, and it can also create fairly sophisticated disassemblies of Geos programs.

GeoDump is probably my oldest PC/Geos tool. I started writing it back in 1991, when I found out that there was a lot more in common between the various Geos files than what is obvious at first glance. When decoding a complex file format, I usually make my notes about structures not in the form of text, but by writing a program that can reproduce and comment my findings. Therefore, GeoDump is a very "evolutionary" thing that grows whenever I find a new piece of the puzzle.

One of the first things I realized after getting the SDK was that GeoDump would not die, even though a lot of the formerly "undocumented" stuff is now public. Most of the things shown by GeoDump are still not fully documented (e.g. the DOS level format in which VM files are stored), and looking into files at this level may still be interesting for debugging "mysterious" problems, and also for clearing up problems in the documentation.

GeoDump 0.5 is the fifth public version of this program.

PC/Geos file formats and GeoDump

Looking at the "file info" that is displayed for Geos files by the File Manager, one can easily see that there are two fundamentally different file types in the Geos system:
Applications (programs you can run) - these are commonly called "geodes" in Geos-speak.

VM Files (documents and generally all kinds of data) - the abbreviation "VM" stands for Virtual Memory and indicates that Memory and File management are very closely related in Geos. It is also possible to store data in "raw binary" format, i.e. your program is totally on its own for organizing its data.

Actually, since the 2.0 version, there is a third type of file, which appears as the @dirinfo.000 file in Geos directories with a long name or with symbolic links (cross-references to files which appear to exist in multiple places but are actually stored once; these cannot be created using the retail 2.0 version).

From the SDK developer's point of view, these files are regarded as abstract concepts whose implementation under DOS is nothing that can be relied on. So if you had a Geos version that didn't require DOS, these files could look completely different, but for an application, access to them would remain the same.

In contrast to this, GeoDump is designed to analyze the way these files are stored under a DOS file system. It takes the DOS files created by Geos and displays their content (for which the specific application is responsible) using the structure that is contained in the Geos standard file formats.

GeoDump should be regarded mostly as a developer's tool which can be used to identify unknown file formats (e.g. analyzing a GeoWrite file without knowing about VM structure first is next to hopeless) or to "take apart" executable files or libraries to find out how a certain routine is used. For geodes (programs), GeoDump contains a disassembler which is aware of Geos file structure (although it doesn't know about objects yet) and should therefore produce much better results than any other disassembler. It should create even more readable dumps than the disassembler that comes in SWAT, as the disassembly engine in GeoDump will (to a certain degree) trace the control flow of the program to tell code from data, identify jump targets and structure the code visually to show blocks of self-contained modules.

Since version 0.5, GeoDump also knows about the structure of PC/Geos vector font files (FNT files). Their basic structure is completely different from that used in VM files.

The header
All "true" Geos files start with a specific Geos header containing the extended file name, the name of the icon, version information etc. From Geos' point of view, this header is not really a part of the file, but of the environment in which the file is stored (just as file name and file date of normal DOS files are stored in the directory and not in the file itself), but if you take a Geos file for what it is under DOS, this is what the first bytes mean...
Executable programs (geodes)
If you have a "geode" program, the structure of the "real" file following the header is reminiscent of the structure used by OS/2 and Windows for their "segmented executables":
Additional header with program-specific data
List of libraries used by the program (all those files in SYSTEM...)
List of exported functions (points where the program can be called from outside - this is especially important for libraries which are nothing but a lot of subroutines that can be called by someone else)
Table of segments (mostly called "resources" in the SDK)
Resource 1
- Code and/or data
- Fixup table (which contains references to all those places in the code where the actual address of a system function or of another segment, which may move around, has to be corrected when the program is loaded)
Resource 2
...

You will see a lot of these things when you look at the output that Glue creates when linking a file (if you have the SDK :-)).

Resources may also contain "local heaps" (see next section); this kind of store is used for program objects and for "visual monikers" (names and icons).

VM files

In contrast to DOS, OS/2 and Windows (apparently even in the newest versions), Geos offers applications far-reaching support for handling complex file formats. While for DOS (and most other operating systems) a file is only a long string of characters which can (and must) be modified completely by the application itself, almost any Geos data file is structured in blocks of variable size that are constantly moved in and out of memory.

An application can tell Geos to put a data block into a file and doesn't have to worry about finding space for it etc. - it only has to remember the number Geos gives to that block so it can recall it later. This is very similar to the way most programs already manage their data in memory ("dynamic heap"), so that way of thinking can be easily transferred to file access under Geos.

This is part of the reason why Geos files tend to grow bigger than their plain DOS counterparts, but it is also responsible for the speed at which the Geos autosave feature works and for things like storing the application state to disk when shutting down; even the Save/Revert mechanism of the applications is implemented in the file system itself.

One of the blocks inside every VM file contains a directory, which maps the block numbers (handles, written as [...] in GeoDump output) to the place where the blocks actually reside in the file (which may change frequently while the file is opened). The directory also keeps track of free space in the file.
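The idea of such a directory can be illustrated with a small toy model. This is only a sketch of the concept described above, not the actual on-disk layout of a VM file: the class name, the first-fit placement strategy and the handle numbering are all assumptions made for the example.

```python
class BlockDirectory:
    """Toy model of a block directory: handle -> (file offset, size).

    Hypothetical illustration only; the real VM directory format differs.
    """

    def __init__(self):
        self.blocks = {}       # handle -> (offset, size)
        self.free = []         # list of (offset, size) holes in the file
        self.end = 0           # current end of the file
        self.next_handle = 1

    def allocate(self, size):
        """Place a block, reusing the first free hole that is big enough."""
        for i, (off, hole) in enumerate(self.free):
            if hole >= size:
                if hole == size:
                    del self.free[i]                    # hole used up entirely
                else:
                    self.free[i] = (off + size, hole - size)
                break
        else:
            off = self.end                              # no hole fits: grow file
            self.end += size
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = (off, size)
        return handle

    def release(self, handle):
        """Forget a block and remember its space as free."""
        self.free.append(self.blocks.pop(handle))

    def locate(self, handle):
        """Return (offset, size) for a handle, as the application never has to."""
        return self.blocks[handle]
```

The application only ever keeps the handle; where the block lives in the file is the directory's business and may change without the application noticing.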

There are one (or two) special blocks called "map blocks" whose numbers are stored separately in special places of the file, so an application knows where to start when it wants to load some of the blocks of the file. These blocks (whose numbers are displayed by GeoDump together with the header of the VM file) are usually the place to start when trying to understand a file format, simply because they are the first thing a program looks at when getting data from a file.

Each of the VM blocks can itself contain a "local heap" (it is called an "LMem block" then), which is something like a small version of a complete VM file, in that it contains again a number of small blocks ("chunks") together with a directory of those blocks and possible free spaces between them.

Fonts

Starting from version 0.5, GeoDump can also dump out the font file format used by Geos. For vector fonts rendered using the Nimbus Q rasterizer, the "/D" switch gives a summary account of all the characters in the font, while "/L" displays the complete set of outline data and kerning pairs.

If a font contains bitmapped images for certain (usually small) pointsizes, these are currently not dumped in a structured manner, but rather just listed as hex dumps in the /L mode.

Note: Kerning pairs are a feature not currently found in any of the Geos fonts I've seen, including the URW fonts that come with the operating system. They are supported by Nimbus Q, though...

The symbolic disassembler

Since version 4, GeoDump contains a Geode disassembler which allows converting Geode binaries into semi-symbolic assembler source codes. This is to say that the program is attempting to separate code from data in mixed segments and also marks any labels that are used as jump targets. The code is not suitable for immediate reassembly but rather intended for the analysis of specific parts of a program within their context.

In the current version, object definitions contained in the code are only listed as chunks of "raw" data, without listing object types or instance variables.

The strategy of the disassembler is to follow strictly any possible code flow and to view only those parts of the program as "code" that can be reached from a known entry point by a series of jumps and branches. Therefore, the program might need some help to get across jump tables and indirect references which seem to appear rather frequently in Geos code (probably as a result of the internal object model).
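The strategy can be sketched with a toy instruction table. This is not GeoDump's engine and does no real x86 decoding: the instruction kinds ("op", "jcc", "jmp", "call", "ret") and the table layout are hypothetical, chosen only to show how following the control flow separates code from data.

```python
def trace_code(instructions, entry_points):
    """Return (code addresses, jump/call targets) reachable from entry points.

    `instructions` maps an address to (kind, length, target). Everything not
    reached this way would be left as data.
    """
    code = set()
    labels = set()
    work = list(entry_points)
    while work:
        addr = work.pop()
        # Follow straight-line code until it ends or joins known code.
        while addr in instructions and addr not in code:
            kind, length, target = instructions[addr]
            code.add(addr)
            if kind in ("jmp", "jcc", "call"):
                labels.add(target)      # target gets a label in the listing
                work.append(target)     # and is traced later
            if kind in ("jmp", "ret"):
                break                   # control never falls through
            addr += length
    return code, labels


# Hypothetical program: the bytes at 0x06 are never reached.
program = {
    0x00: ("op", 2, None),
    0x02: ("jcc", 2, 0x08),
    0x04: ("op", 1, None),
    0x05: ("ret", 1, None),
    0x06: ("op", 2, None),
    0x08: ("call", 3, 0x0C),
    0x0B: ("ret", 1, None),
    0x0C: ("op", 1, None),
    0x0D: ("ret", 1, None),
}
code, labels = trace_code(program, [0x00])
```

In this toy program the location 0x06 stays "data" unless something tells the tracer that it is an entry point, which is exactly the kind of help a jump table or an explicit code entry would provide.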

To disassemble a program without any additional knowledge, enter

GEODUMP /L filename >output_file

The disassembly process may take some time to converge, as it is done on a multi-pass basis (the number of passes is displayed after the header when the analysis is completed), but it is terminated when the number of passes exceeds 10, so you shouldn't abort the process too quickly if you think that it is "hung" (although I have no rigorous proof yet that it cannot hang under any circumstances...).

For large files (like the kernel), disassembly may sometimes take a couple of hours, but you will see the segments processed even if you redirect the output to a file, so you know that the program is still alive.

For applications, the initial disassembly will not look very useful, as it consists almost entirely of data items - libraries and drivers will yield better results, because they contain more external entry points which help the disassembler find its way through the code.

As most of the code in a Geos application seems to be reached using addresses from a jump table, you will almost certainly have to add jump tables or code entry points using a disassembly information file. This is described in one of the following sections.

Assigning symbolic names to library references

Normally, when a Geode refers to routines in another Geode, these are only referenced by an ordinal number, rather than a name, to improve size and loading time of applications. This means that, without additional information, a disassembler can only write out numbered labels (like geos_123 for a kernel entry point) for such references.

However, the number-to-name-mapping can be added by creating a definition file for each library used, which should contain lines in the form

lib_name number symbolic_name

where lib_name is the name of the library (usually the same for each line in the file), number is the ordinal number (decimal) of the entry, and symbolic_name is the textual name to be used in the disassembled code. Lines starting with ";" are regarded as comments and will be ignored.
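A minimal sketch of how such a file could be read and applied, assuming whitespace-separated fields. The function names and the sample entries are hypothetical (they are not taken from a real library), and skipping blank lines is an assumption as well; only the line format and the ";" comment rule come from the description above.

```python
def parse_sls(text):
    """Parse 'lib_name number symbolic_name' lines into {(lib, ordinal): name}."""
    names = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue                       # blank line (assumed) or comment
        # A line with fewer than three fields raises ValueError here.
        lib, number, symbol = line.split()[:3]
        names[(lib, int(number))] = symbol  # ordinal numbers are decimal
    return names


def label_for(names, lib, ordinal):
    """Use the symbolic name if known, else a numbered label like geos_123."""
    return names.get((lib, ordinal), f"{lib}_{ordinal}")


sample = """\
; hypothetical entries for illustration only
geos 1 ExampleRoutineOne
geos 2 ExampleRoutineTwo
"""
names = parse_sls(sample)
```

With this table, a reference to ordinal 2 of "geos" would be written as ExampleRoutineTwo, while an unknown ordinal such as 123 falls back to geos_123.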

The name info file can be written with any ASCII text editor. The file should have an ".sls" extension and the same name as the library it refers to (here, "name of the library" means its permanent name, rather than the filename, which means that it's the same name for both the error-checking and the non-EC version). It must be stored in the directory pointed to by the GEODUMP environment variable (default: current directory).
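The lookup rule can be written down in a few lines. This is a sketch of the rule as described, not of GeoDump's actual code: the function name is hypothetical, and it assumes the permanent name is used verbatim as the file name.

```python
import os


def info_file_path(permanent_name, extension, environ=os.environ):
    """Build the expected path of an info file such as '<name>.sls'.

    The directory comes from the GEODUMP environment variable and defaults
    to the current directory.
    """
    directory = environ.get("GEODUMP", ".")
    return os.path.join(directory, permanent_name + extension)
```

Passing the environment as a parameter makes the rule easy to try out without changing the real environment, e.g. info_file_path("geos", ".sls", {"GEODUMP": "C:\\GEODUMP"}).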

If the LDF (linker definition file) of a library is available, as is the case for most system libraries if you have the SDK, this file can be converted into an SLS file automatically using the LDF2SLS program. Type the command

PRINTOBJ ldfname.ldf | LDF2SLS >slsname.sls

replacing ldfname with the path and the name of the LDF file belonging to the library and slsname with the path and the name of the symbol file to be created. The LDF files for the standard libraries are included in the INCLUDE\LDF subdirectory of the SDK. PRINTOBJ is a very helpful tool from the SDK which creates detailed listings of the information in OBJ, LDF and SYM files.

The batch file MAKESLS.BAT included with this package automatically creates SLS files for all the libraries included in the SDK.

Disassembling using an info file

A disassembly info file contains entries which explicitly describe the contents of certain regions in the code. These files are read using a very simple parser, so not all errors may be detected. Any line may only contain one single letter command at the very beginning of the line, together with additional arguments depending on the command. Blank lines and lines starting with a ";" are regarded as comments, as well as any characters after the last argument in a line.

There are currently two different types of commands (arguments are given in hexadecimal notation, except for the segment number, which is decimal):

Code and data entry points. They consist of a line in the format

C ssss oooo [name]

or

D ssss oooo [name]

and cause the region starting at segment "ssss" (decimal), offset "oooo" (hexadecimal) to be disassembled as code or data, respectively. In case of doubt, data takes precedence over code. If a name is specified, it is used as a label for that location in the code. Note that data references appearing in the original code (e.g. mov ds:[saveAX],ax) cannot be safely identified with data labels, so they are left in the disassembly with absolute offsets and must be assigned manually.

Jumplists. These are very important in Geos. Usually they can be recognized by a large number of segment fixups in a row - jumplists consisting of offsets only ("near") will take a certain intuition to find; usually it is a good idea to start from regions known to be code entry points and look if these addresses appear somewhere. A jumplist is defined using the format

J ssss oooo llll c

and causes the region starting at ssss:oooo (with a length of "llll" bytes) to be used as a jumplist. Note that ssss has to be in decimal, while oooo is interpreted as hex. The character "c" describes the kind of jumplist.

2	"offset only" jumplist - only offsets in the current segment, which are normally not fixed up.
4	jumplist containing far pointers (each in the form "offset first, segment last"). Usually, every second word of this list is fixed up as a segment.
S / O	split jumplists. These are slightly unusual, but seem to be generated by some Geos tool. They consist of two separate tables, one containing only offsets and one containing only segments. These tables are immediately adjacent in memory. The letter "S" describes a list where the segment table comes first; "O" means the offsets are listed first. I have not found any "O" type jumplists yet.
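The four layouts can be sketched as a small decoder. It assumes 16-bit little-endian words (as on x86) and, for split lists, that both tables have the same number of entries; the function name is hypothetical, and the segment words are returned as stored, without applying any fixups.

```python
import struct


def decode_jumplist(data, kind, segment=None):
    """Decode raw jumplist bytes into a list of (segment, offset) pairs.

    kind "2": near offsets; `segment` is the segment containing the list.
    kind "4": far pointers, offset word first, segment word last.
    kind "S": segment table first, offset table directly after it.
    kind "O": offset table first, segment table directly after it.
    """
    count = len(data) // 2
    words = struct.unpack(f"<{count}H", data[:count * 2])
    if kind == "2":
        return [(segment, off) for off in words]
    if kind == "4":
        return [(words[i + 1], words[i]) for i in range(0, count - 1, 2)]
    if kind in ("S", "O"):
        half = count // 2
        first, second = words[:half], words[half:half * 2]
        segs, offs = (first, second) if kind == "S" else (second, first)
        return list(zip(segs, offs))
    raise ValueError(f"unknown jumplist kind: {kind!r}")


# Four arbitrary words, read under each of the interpretations.
raw = struct.pack("<4H", 0x0010, 0x0003, 0x0020, 0x0004)
far = decode_jumplist(raw, "4")
split = decode_jumplist(raw, "S")
```

The same eight bytes give two far pointers in mode "4" but two differently paired entries in mode "S", which is why the kind has to be stated explicitly in the info file.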

The disassembly info file can be written with any ASCII text editor. The file should have a ".dis" extension and the same name as the geode it refers to. It must be stored in the directory pointed to by the GEODUMP environment variable (default: current directory).
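A minimal sketch of a reader for this format, under the rules given above: one single-letter command at the start of the line, segment in decimal, offset in hexadecimal, ";" lines and blank lines ignored, and anything after the last argument treated as a comment. Reading the jumplist length as hexadecimal is an assumption, and the type names and sample entries are hypothetical.

```python
from typing import NamedTuple, Optional


class EntryPoint(NamedTuple):
    kind: str                 # "C" for code, "D" for data
    segment: int
    offset: int
    name: Optional[str]


class JumpList(NamedTuple):
    segment: int
    offset: int
    length: int               # in bytes; parsed as hex (assumption)
    kind: str                 # "2", "4", "S" or "O"


def parse_dis(text):
    """Parse the lines of a disassembly info file into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue
        fields = line.split()
        cmd = fields[0]
        if cmd in ("C", "D"):
            name = fields[3] if len(fields) > 3 else None
            entries.append(
                EntryPoint(cmd, int(fields[1]), int(fields[2], 16), name))
        elif cmd == "J":
            # Fields after the kind character are treated as a comment.
            entries.append(
                JumpList(int(fields[1]), int(fields[2], 16),
                         int(fields[3], 16), fields[4]))
        else:
            raise ValueError(f"unknown command in line: {line!r}")
    return entries


sample = """\
; hypothetical entries for illustration only
C 2 01A0 StartUp
D 2 0200
J 3 0040 0010 4  table of far pointers
"""
entries = parse_dis(sample)
```

Note one consequence of the optional name: on a C or D line, the first word after the offset is always taken as the label, so a trailing comment there would have to follow a name.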

Here, "name of the geode" means its permanent name, rather than the filename, which means that it's the same name for both the error-checking and the non-EC version. This is important as locations may move between the EC and non-EC versions if conditional code fragments are included; therefore, using a disassembly info file for a version it was not created for may create incorrect results.

If the SYM (symbolic debugging info file) of a geode is available, as is the case for most system applications, libraries and drivers if you have the SDK, this file can be converted into a DIS file automatically using the SYM2DIS program. Type the command

PRINTOBJ symname.sym | SYM2DIS >disname.dis

replacing symname with the path and the name of the SYM file belonging to the geode and disname with the path and the name of the info file to be created. The SYM files for the standard system components are included in the subdirectories DRIVERS, LIBRARY and APPL of the SDK.

Special switches

The "/R" switch, immediately followed by a number, only dumps/lists the resource with the given number when used with the name of a Geode. This can be used for quickly checking the changes in the output caused by a modification in the disassembly info file. It should be noted that unnamed entry points from other segments into that resource will not be used in this case, possibly causing code to be mistaken for data.

The "/1" switch causes a disassembly listing to be produced during the first pass over the file, rather than re-running the disassembly until no further code entry points can be found. This may speed up disassembly when used with detailed debug info files, and it could be used as a last resort in case disassembly "hangs" due to incorrect disassembly info before any output is produced.

Disclaimer (well, sort of)

Regarding the disassembler: Keep in mind that most license agreements do not permit disassembly of the code covered, so you should be especially careful when using the results of this program for a purpose the author of the original code may not agree with... Apart from that, using a disassembler to steal someone else's code is in most cases neither fair nor worth the effort...

The disassembly engine

Most of the disassembly engine is not my own code - please look into the header of the disassembler source for more details. I only modified it to make it a truly symbolic disassembler; hopefully this will be made available as a separate library later, and it might also find its way into a similar kind of disassembler for Windows and OS/2 - at that level, it is not actually that far-fetched, but it will probably have to be optimized for 32 bit code and large segments before that...