REPORT FOR THE SORTIE STRUCTURED TOOL DEMONSTRATION
 
Holger Kienle and Xiaomin Wu
 
 
1. Introduction
 
 
1.1. Tool
 
Rigi is an interactive, visual tool designed to help developers better
understand and redocument their software. Rigi includes parsers to
read the source code of the subject software and produce a graph of
extracted artifacts such as procedures, variables, calls, and data
accesses. To manage the complexity of the graph, an editor allows the
software engineer to automatically or manually collapse related
artifacts into subsystems. These subsystems typically represent
concepts such as abstract data types or personnel assignments. The
created hierarchy can be navigated, analyzed, and presented using
various automatic or user-guided graphical layouts.
 
The discovered structural information is useful for making informed
development and management decisions. The information serves as
documentation that is up-to-date and accurate because it is derived
from the actual source code. Thus, Rigi helps to understand legacy
software systems where the existing documentation may be missing or
lacking. Rigi aids reengineering tasks that need to discover design
information in existing software.
 
Rigi also includes a tool to view the subject software's source code
using a web browser. The tool produces a hypertext version of the
source code that includes hypertext links for related source artifacts
(for example, each function call has a link to the definition of the
function that was called).
 
 
1.2. Team
 
The team consists of two graduate students (Holger Kienle and Xiaomin
Wu). Both had little prior knowledge of the Rigi tool. One team member
(Holger Kienle) had previous experience with the Bauhaus reverse
engineering tool, which is based on the Rigi visualization engine. He
was involved in a previous tool demonstration that consisted of
reengineering the xfig application [CEK00].
 
Another graduate student (Johannes Martin) who is intimately familiar
with Rigi was consulted various times on technical issues. He is the
maintainer of the Rigi C and C++ parsers and familiar with parsing
programs for use with Rigi [MWW00]. He did not, however, participate
in the SORTIE reengineering effort.
 
 
2. Experience Report
 
The following steps were taken to analyze SORTIE with Rigi:
1. Familiarization with Rigi
2. Familiarization with SORTIE
3. Generation of RSF
  - with Rigi C++ parser.
  - with TkSee C++ parser.
  The first approach involved the writing of "stubs" for the Borland
  VCL library. The second approach involved the development of a
  converter from GXL to RSF.
4. Postprocessing of RSF
5. Structural redocumentation with Rigi
 
 
2.1. Familiarization with Rigi
 
Since both team members did not have prior experience with the Rigi
tool, some time was spent upfront to gain a basic understanding of its
capabilities. Familiarization with Rigi included reading the "Rigi's
User Manual" (which was found to be a valuable resource, albeit
sometimes not detailed enough on topics such as RSF and RCL), and
running the semi-automatic demos that are part of the Rigi
distribution.
 
During this first exploration of Rigi we encountered several
problems. Rigi is extremely slow on Linux with modern window managers
(such as Gnome and KDE1/2). Invocation of a Web browser from Rigi to
view the hypertext version of the source code did not work. Lastly, we
got confused with the various RSF formats (3-tuple unstructured RSF,
partly structured RSF, structured RSF, and 4-tuple unstructured
RSF). Different tools expect different RSF formats as input and can
produce different RSF formats.
 
As a reaction to the encountered difficulties, we set up a Rigi Wiki
[RigiWiki] (with the TWiki software [TWiki]) to document solutions for
the problems that we encountered. The Rigi Wiki is a central
information source with links to already existing (but previously
scattered) information and new information derived from our
experiences. For example, we started a Rigi FAQ (which as of this
writing contains ten questions and answers) to help new Rigi users.
 
Familiarization with Rigi took us about 16 hours, but these were
spread over about two weeks. We frequently got stuck and had to ask
questions to Johannes Martin via email. It would have been valuable to
have a knowledgeable person as part of the team or in close physical
proximity.
 
 
2.2. Familiarization with SORTIE
 
Initial information about the SORTIE code stated that it was written
in C++. Based on this assumption, we planned to briefly inspect the
SORTIE code to pinpoint possible compilation problems prior to running
the code through the Rigi C++ parser.
 
This inspection revealed that the "C++" source is a proprietary C++
dialect developed by Borland. Furthermore, the code is tightly coupled
with a Borland specific GUI library, the Visual Component Library
(VCL):
 
  Borland C++ "is an unusual product in that it is a C++ tool built on
  top of an Object Pascal object framework called the Visual Component
  Library, or VCL. In fact, all of the VCL, and almost all of the IDE,
  is written not in C++, but in Object Pascal. [VCL]"
 
  "VCL is based on components, properties, and events, while OWL and
  MFC have none of these features. In particular, events support
  something called the delegation model, which is an alternative to
  simple inheritance. The VCL fully supports all standard OOP concepts
  such as inheritance, polymorphism, and encapsulation. What it brings
  to the party are components, properties, and events. (Events are
  also known as closures.) [VCL]"
 
It seems that VCL was introduced with Borland C++ 5.0. We had access
to an older Borland C++ 4.5 compiler, but VCL was not part of it.
 
Soon, it became obvious to us that in order to compile and understand
the SORTIE code, knowledge about Borland C++ is necessary. Since we
neither had a Borland compiler nor the accompanying documentation, we
snooped around the Web. The most valuable source proved to be an
online book [BB] and the Borland Community Web site [BC]. We also
downloaded a free version of the Borland C++ compiler, but this test
version (not too surprisingly) did not contain the VCL.
 
 
2.2.1. Borland IDE
 
Part of Borland's C++ IDE (dubbed C++ Builder 5 [BBuilder]) is a GUI
builder, which automatically generates code. This makes it hard to
tell in SORTIE which part of the code has been generated and which
part is hand-written.
 
Since we do not have access to the Borland IDE, we can only make
intelligent guesses about its capabilities.
 
The GUI has the capability to assemble a Form interactively.
Information about these Forms are stored in a file with the suffix
dfm:
 
  "This file contains the details of the objects contained in a form.
  It can be view as text by right clicking on the form and selecting
  view as text from the pop-up menu, or it can be converted to text
  and back using the convert.exe found in the bin directory. Caution
  should be used in altering this file as changes to it could prevent
  the IDE from being able to load the form. This file is critical to
  moving or rebuilding the project. [BDSS]"
 
The SORTIE code comes with the (binary) .dfm files. In order to
inspect the contents of these files, we asked Sachen (the SORTIE
domain expert) to convert the .dfm files to text by running the
convert.exe utility.
 
Each Form is associated with a C++ class, which resides in its own
.h/.cpp file. The skeleton of the automatically generated .h file
looks as follows:
 
  class TMyForm : public TForm 
  { 
  __published: // IDE-managed Components 
      ...
  private: // User declarations 
 
  public: // User declarations 
 
  virtual __fastcall TMyForm(TComponent* Owner); 
  };
 
  extern TMyForm *MyForm;
 
After the __published keyword, the components of the Form and the
components' methods are listed. The components, along with their
properties and events, are described in the corresponding .dfm file.
 
The skeleton of the automatically generated .cpp file looks as
follows:
 
  #pragma resource "*.dfm"
  TMyForm *MyForm;
 
  __fastcall TMyForm::TMyForm(TComponent* Owner)
    : TForm(Owner)
  {
  }
 
The pragma directive indicates that a corresponding .dfm file exists.
Every Form has a corresponding global variable. This variable is
typically initialized as follows:
 
  Application->CreateForm(__classid(TMyForm), &MyForm);
 
SORTIE does this initialization in the file SORTIEBC.cpp (see below).
 
The SORTIE source has 48 .dfm files out of 63 .cpp files. This
indicates that a large part of the code has been written with the help
of the GUI builder. The 63 .cpp files can be grouped as follows:
 
48    GUI-builder based (corresponding .dfm file)
 1    SORTIEBC.cpp, main application file (automatically generated)
 1    globals.cpp, empty
13    User created classes (not subclassed from VCL)
 
(Except for SORTIEBC.cpp, every .cpp files has a corresponding .h
file.)
 
Most of the contents of the .dfm files deal with properties that do
not help program understanding, such as form width, font color, and
title bar name. However, at least two properties are of interest: (1)
events and (2) pull-down menus. Both are (partially) lost in the
generated code.
 
The VCL introduces the concept of events, which can be specified with
the IDE:
 
  "Events provide a means for a component to notify the user of some
  pre-defined occurrence within the component. Such an occurrence
  might be a button click or the pressing of a key on a keyboard. [BS]"
 
 "...C++Builder generates the function declaration, and saves the
 relationship between the event and event handler into the .dfm
 file. The .dfm file is compiled into the target application's
 resource. When an application built with C++Builder starts up, VCL
 reads the form from the resource and assign properties and events of
 the components on the form. [BS]"
 
By looking at the textual representation of the .dfm files, we
identified the following events:
 
OnActivate     4
OnChange    77
OnClick           246
OnClose         5
OnCreate     2
OnDblClick    17
OnDragDrop     2
OnEnter         4
OnExit         2
OnKeyPress     1
OnMouseDown     3
OnMouseMove     2
OnMouseUp    19 
OnPaint         7
OnResize     4
OnScroll     2
OnShow        23
 
There is a convention that for every event "OnE" associated with
component "C", C's owner (i.e., the Form) contains a method
  void __fastcall CE(TObject *Sender);
 
Unfortunately, this convention is not always obeyed in SORTIE. This
makes it hard--or even impossible without the .dfm files--to identify
a component's events. For example, the component PointGLIMenu has an
OnClick event called GLITestClick.
 
Another useful information that is contained in the .dfm files is the
pull-down menu structure along with the menu items' callbacks (all
OnClick events).
 
Here is the pull-down menu structure of the main application window
(extracted from sortiapp.dfm). The text in round brackets are
bubble-help messages.
 
File
  Load SEM file...
  -
  New parameter file... (Create a parameter file from scratch)
  Load parameter file... (Load an existing parameter file)
  Save parameter file... (Save current parameters to disk)
  -
  Open playback file...
  -
  Open text editor... (Open text editor)
  -
  Create map file...
  Load map file...
  Save tree map...
  -
  Load Harvest Regime...
  -
  Create light location file...
  -
  Exit (Quit SORTIE W)
Edit
  Location... (Plot and sky parameters dialog)
  Species... (Species parameters dialog)
  Initial densities... (Set initial population characteristics)
  Substrate...
  Disturbance (Create or review a harvest regime)
    Harvest...
    Wind...
View
  Legend
  Plot map
  --Graphs--------
  Rel. basal area
  Abs. basal area
  Rel. density
  Abs. density
  Substrate
  --Histograms-----
  Dbh distribution
  Height distribution
  --Tables----------
  Data Table
  Harvest Table
  --Other-------
  GLI Map
  Speed Bar
Output
  Setup Output Files...
  Write data files... (Select data to save)
  Write playback file...
  -
  Setup harvest output...
  -
  Calculate GLI
    Entire Plot...
    From Point File...
Model
  Run (Run SORTIE W)
  Pause (Pause run)
  Reset (Reset run)
  -
  Load run... (Open previously saved model run)
  Save run... (Save current run)
Window
  Cascade
  Tile
  Arrange Icons
Help
  About...
 
The .dfm files also reveal the type of files that can be saved and
loaded by the SORTIE application. This information is kept in the
Filter setting of the TSaveDialog and TOpenDialog components:
 
*.sgli    GLI Test Results
*.sll    Light location files
    GLI location file
*.hvr    Harvest Results Files
        Harvest Regime Files
*.sem    Sortie Elevation Map Files
*.par    Parameter files
*.spb    SORTIE playback files
*.stmf    Tree Map Files
*.stm    Alternate version (of Tree Map Files)
 
The IDE also automatically generates the main application file. For
SORTIE, the file's name is SORTIEBC.cpp. The file lists all entities
that represent the application and the main function.
 
 
2.2.2. Borland C++ Dialect
 
An article available at the Borland Community Web site discusses
several C++ enhancements [Intersimone]:
 
- Borland-specific Keywords: The following table lists the keywords
along with the number of their occurrence in the SORTIE code.
 
__classid         47 (all except one in SORTIEBC.cpp and sortiapp.cpp)
__closure          0
__fastcall      1501
__property         0
__published       49
 
(In this table and elsewhere, the number of occurrence has been
determined with grep, followed by a brief inspection of the
results. Hence, the actual number might be less!)
 
The keywords __fastcall and __published can be safely eliminated for
the compilation process. The __classid keyword (which is rather a
builtin function) takes the name of a class. It is the VCL's
introspection mechanism. Luckily, here are few occurrences and they are
concentrated in three SORTIE files.
 
- Exception handling: Borland introduces a finalization mechanism with
the __finally keyword. Luckily, SORTIE does not make use of this
feature.
 
The following types are caught in catch statements:
 
Exception                1      (all in SORTIEBC.cpp)
EAccessViolation         4      (all in srtmdiio.cpp)
std::bad_alloc          28
char*                    1      (all in srtmdiio.cpp)
 
Exception is the root class of the VCL exception
hierarchy. EAccessViolation is a subclass of Exception. The occurrences
of std::bad_alloc are used to guard sequences of new statements.
 
The SORTIE application throws only 10 exceptions. All of them appear
in a single file (srtmdiio.cpp) and throw a char* type.
 
In summary, exception handling can be safely ignored for the analysis
of SORTIE, since it is not part of the "normal" control flow. This is
fortunate since most tools do not support exception. For example,
neither the Rigi C++ schema nor the Dagstuhl Middle Model (DMM) can
currently express exception code.
 
- Data types: VCL introduces several specific data types (for
compatibility with the underlying ObjectPascal). Of these, SORTIE
seems to make only use of AnsiString (185 occurrences), which is a
specialized string class similar to STL's ANSI string class. The
following table lists several of AnsiString's methods:
 
c_str           572     (returns a char* version of the string)
Length()          2
ToInt()           1
ToDouble()        1
 
 
2.2.3. Discussion
 
All findings up to this point have been done by manual inspection of
the code and searching/filtering with grep, sed, and sort. At this
stage, we did not try to understand the control and data flow within
SORTIE. Our objective was to understand the compilation model of the
Borland C++ Builder, to get to know the Borland C++ dialect, and to
get a rough idea of the code base. Our findings in this phase gave
valuable information for the parsing process.
 
An important part of the tool collaboration is to share results with
other researchers. We published the online resources that we found as
well as the information described in this section on the first
author's Web site [Kienle]. The link to the Web site was posted to the
sortie-demo mailing list [SD].
 
As pointed out to us by Jens Jahnke, it is questionable whether
neither having a Borland C++ compiler, nor having the VCL's API is a
realistic reverse engineering scenario. (Granted, the source code of
the VCL might be lost or not accessible, but the published API to
compile against should be available.)
 
Furthermore, while it is common for reverse engineering tasks to
ignore library code (such as the C standard library) in order to not
clutter the analysis results, it is questionable whether this should
be done in the case of object-oriented framework code. For
object-oriented code it might be important to know the full
inheritance hierarchy along with redefined methods. This is especially
important for object-oriented frameworks, such as the VCL, to detect
the framework's "hooks."
 
Finally, it would have been helpful to have Sachen's "A Summary of the
User Requirements for the SORTIE Reengineering Project" [Gendron]
available earlier. Reverse engineering program code without even
knowing the domain is hard (and probably not realistic).
 
This step was accoplished in roughly 20 hours spread over a couple of
weeks.
 
2.3. Generation of RSF
 
Our code inspection revealed that we had to anticipate parsing
problems. Because neither of us had experiences in parsing C++ code
for RSF generation, we decided to pursuit two separate parsing
approaches.
 
The first approach is using the Rigi C++ parser to parse the source
code. The second approach is using the TkSee C/C++ parser, which
generates GXL output. Both parsers' function is the extraction of
artifacts from the subject system.
 
2.3.1 Rigi C++ Parser
 
The Rigi C++ Parser is built on top of the IBM Visual Age C++
compiler. In order to generate RSF, the Rigi C++ parser expects code
that will pass the IBM Visual Age C++ compiler. Xiaomin describes the
difficulties in her final report [Wu01]:
 
  "When using the Rigi C++ parser to parse the program, the source
  code was loaded into the Visual Age C++ compiler first. However, it
  is very difficult to get the code compiled. Thousand lines of error
  messages came out. By analyzing the error messages, we found that
  the problems can be classified to two categories. The first one is
  syntax errors. It is caused by the syntax differences between
  Borland C++ and Visual Age C++. One example is that some keywords
  that are unique to Borland C++ can not be identified by Visual Age
  C++. Some other errors are about the operators, variable types,
  identifiers, etc. To solve this problem, some modifications have
  been made to the source code. Another kind of problem is about the
  class library of C++ compilers. The subject system was developed
  using the Borland C++ builder, and referenced the Borland C++ class
  library a lot, but these classes are not available in Visual Age C++
  library. To solve this problem, many supplements to the Visual
  Age C++ library have been made. For example, 37 dummy header files
  that should be in the VCL (Visual Component Library) of Borland C++
  have been written and put into the class library of Visual Age C++
  (under the ./include/ subdirectory of VA), and a header file that
  contains all the definitions of the missing super-classes has been
  written and put into the class library of Visual Age C++. One thing
  I need to mention here is that the definitions of these
  super-classes might not be the same as what they are in the Borland
  class library, since our only purpose is to get the source code
  compiled and I wrote the classes according to the information
  provided by the error messages, so only the variables and functions
  that were referenced by SORTIE were reconstructed in the
  super-class."
 
The generated output has to be postprocessed in order clean up the
RSF. Since the Rigi C++ parser generates output by querying Visual
Age's repository (called CodeStore), it also reports artifacts in
CodeStore that are not present in the original source (such as
standard operators and constructors inserted during the compilation
process). Luckily, these can be easily filtered because of the
employed naming scheme. All names look as follows:
 
name^file^line_number^column
 
The following example shows C++ code along with the parser generated
names:
 
class X {
  char c;                       // X::c^X.cpp^2^3
  void setChar (char c) {...}   // c^X.cpp^3^17
}
 
Artifacts that do not have a corresponding source location show up as
name^^0^0. Artifacts that do not have a source code name show up as
@number^file^line_number^column. (In a few circumstance, the parser
also emits nodes with the name <unnamed> (10 occurrences, all in
library code) and <unnamed-enum> (76 occurrences; 28 of these in
library code)).
 
 
2.3.2 TkSee C++ Parser (University of Ottawa)
 
The TkSee C++ Parser, which is still in an early development stage, is
based on Source-Navigator [SN]. Source-Navigator seems to have only a
partial parser. This means on one hand that the parser is more robust
(i.e., can deal with C++ flavors and variations), but on the other
hand, the parse is not as precise. In fact, the parser does not
necessarily report if it gets out of sync and produces wrong
results. For example, the TkSee C++ Parser's home page states
[Marchen]:
 
  "Access specifier __published [...] is treated as class member
  variable of type int."
 
In order to obtain a meaningful parse, we applied the following
preprocessing steps to the code: (1) removal of Borland specific
keywords, pragmas, and directives and (2) fixing of file names
("sortiefile" sed script provided with the SORTIE source).
 
We finally ran the parser (version 1.39b) on the source to obtain GXL
output. The GXL output adheres to the Dagstuhl Middle Model (DMM)
[DMM01]. This fact was our primary motivation for using this
parser. We iterated through several parser releases and submitted
several bug reports.
 
2.3.2.1. Schema Conversion
 
The second part of the parsing process consisted of the transformation
from GXL to RSF. This is not a mere translation of exchange formats,
since the underlying schema on both sides of the translation process
needs to be preserved. Hence, we had to find a mapping from the DMM
into the Rigi C++ schema. Furthermore, in order to compare the RSF
obtained with this approach to the RSF obtained with the Rigi C++
parser, we tried to generate RSF that is close to the Rigi C++ parser.
 
The RSF Rigi C++ schema looks as follows:
 
- Riginode:
 
Unknown
Collapse
Function
Prototype
Variable
Constant
Datatype
Subsystem
Class
Namespace
 
 
- Rigiarc:
 
extends        Class      Class
isInNamespace    Unknown      Namespace
isMemberOf    Unknown      Class
calls           Function  Unknown
accesses        Function  Unknown
isTheSameAs     Datatype  Unknown
returns         Unknown   Unknown
references      Function  Unknown
hasType         Unknown   Unknown
isDefinedIn    Unknown      Function
isCastTo    Unknown      Unknown
level
composite
multiarc
 
 
- Rigiattr:
 
attr Node file
attr Node lineno
attr Node position
attr Node annotate
attr Node nodeurl
attr Node tagged
attr Node classKind
attr Node accessKind
 
attr Arc name
attr Arc annotate
attr Arc arcurl
 
Both schemas are middle-model schemas (i.e., they have coarse-grained
artifacts). Still, there is often not a direct one-to-one mapping.
 
The following tables show the (preliminary) mapping of nodes, edges,
and attributes between the two schemas. cppparse refers to the Rigi
C++ domain as specified in the Rigi HTML documentation (author:
Johannes Martin).
 
- Nodes
 
cppparse    DMM
 
Function    Routine who is pointed to by a Defines arc:
                SourcePart --Defines--> Routine
 
        Method who is pointed to by a Defines arc:
        SourcePart --Defines--> Method --IsMethodOf--> Class
 
Prototype    Routine who is pointed to by a Declares arc:
                SourcePart --Declares--> Routine
 
        Method who is pointed to by a Declares arc:
        SourcePart --Declares--> Method --IsMethodOf--> Class
 
Variable    Variable (who is pointed to by Defines/Declares arc)
        SourcePart --Defines/Declares--> Variable
 
        Formal Parameter
        SourcePart --Defines--> Variable --IsParameterOf--> Routine
        SourcePart --Defines--> Variable --IsParameterOf--> Method
 
        PARSER BUG: FormalParameter should be used instead of
        Variable.
 
        Field
        SourcePart --Defines--> Field --IsFieldOf--> Class
 
Constant    Constant (who is pointed to by Defines arc):
        SourcePart --Defines--> Constant
        SourcePart --Defines--> EnumerationLiteral
 
        (supported since parser version 1.37b)
 
Datatype    Type who is pointed to by a Defines arc:
        SourcePart --Defines--> Type
 
        SourcePart --Defines--> EnumeratedType
 
        NOTE: EnumeratedTypes (in Classes) are missing a
        IsPartOf/IsEnumeratedTypeOf arc.
 
        NOTE: Predefined types (int, bool) are not
        captured. These have only Declares arcs:
        SourcePart --Declares--> Type
 
        PARSER BUG: typedefs are not handled correctly -- they
        get lost.
 
Class        Class (who is pointed to by a Defines arc):
        SourcePart --Defines--> Class
 
Namespace    -
        NOTE: NOT (YET) PART OF DMM
 
 
- Relationships
 
cppparse    DMM
 
extends        Class --InheritsFrom--> Class
        SourcePart --Defines--> Class --InheritsFrom--> Class
 
isInNamespace    -
        NOTE: NOT (YET) PART OF DMM
 
isMemberOf    Method --IsMethodOf--> Class
        Field --IsFieldOf--> Class
 
        Q: Are there other members as well (e.g, enums and typedefs
           in classes).
 
isTheSameAs    Type --IsDefinedInTermsOf--> Type
 
        PARSER BUG: Not handled yet (version 1.39b)
 
accesses    SetsComponent/UsesComponent
 
        PARSER BUG: Not handled yet (version 1.39b). Not too
        useful for SORTIE anyway: Since SORTIE
        directly accesses public instance variables (instead
        of accessor functions), lots of spurious references
        would be generated.
 
calls        Routine/Method --Invokes--> Routine/Method
 
returns        Routine --IsReturnValueOf--> Type
        Method --IsReturnValueOf--> Type
 
references    Method --Uses--> Variable
 
        Note: Formal parameters (BUG: represented as
        Variables) show up as Uses. To detect them:
        Variable --IsParameterOf--> Method
 
hasType        Variable --IsOfType--> Type
        Constant --IsOfType--> Type
        EnumerationLiteral --IsEmunerationLiteralOf--> EnumeratedType
 
isDefinedIn    SourcePart --Contains--> SourcePart --Defines--> X
        X = { Variable, Constant, Routine, Type, EnumeratedType }
 
 
- Attributes
 
cppparse    DMM
 
classKind    n/a
 
accessKind    "access" attribute (of Method/Field)
 
 
Translation from one schema to the other proved to be not
straightforward. Even though there is documentation for DMM, we had to
closely analyze the GXL to understand the semantic of the parser's
output. Furthermore, the parser slightly deviates from the DMM (e.g.,
instead of FormalParameter the more general type Variable is emitted
for formal parameters), generates wrong output (e.g., an earlier
version of the parser produced incorrect method containment and access
attributes for methods) and does not yet emit all DMM constructs
(e.g., an earlier version of the parser did not emit Constants). Our
findings resulted in several bug reports and suggested improvements of
the parser.
 
On the Rigi side, we made the following observations: The frequent use
of "Unkown" (see Rigi C++ schema above) for source and target nodes
severely limits understanding of the schema specification -- it is
less self-documenting than other schemas. Similarly, Rigi's attribute
model is too imprecise. The C++ schema is not sufficiently documented
(e.g., missing documentation of classKind and accessKind
attributes). Furthermore, the documentation was partially wrong in
explaining the naming scheme and very vague on which artifacts
actually get extracted from the code. The latter can be partially
explained by the fact that the Rigi parser selectively dumps the
CodeStore without necessarily knowing CodeStore's working. Finally,
the Rigi C++ schema has no representation for enumeration
literals. Since SORTIE uses enums, this would be desirable. On the
positive side, our findings improved documentation and the discovery
of a significant parser bug.
 
 
2.3.2.2. XSL Implementation
 
Originally, we were hoping to use Jeff Michaud's GXL <-> RSF converter
written in Java, but is still is in early beta stage and currently not
further developed [GXLRSF].
 
Our converter that performs the mapping from GXL to RSF has been
written in XSL. XSL is reasonably well suited for a prototype
implementation. The XSL code is written by hand. There is a single
script for each artifact that is extracted. This approach yields
redundant code, but the code is easier to read and debug. This is
important since XSL is not a language that is human-readable in a
reasonable way (IMHO).
 
As an example, the following XSL code matches subclass relationships
(i.e., SourcePart --Defines--> Class --InheritsFrom--> Class) and
outputs a corresponding RSF extends tuple:
 
<?xml version='1.0' encoding='utf-8' ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xlink="www.w3.org/1999/xlink">
 
<xsl:include href="./utils.xsl"/>
<xsl:output method="text" />
 
<!--
  GXL to RSF converter: extends arcs
  (Holger M. Kienle, kienle@csr.uvic.ca)
-->
 
<xsl:template match="/">
  <xsl:text># Generated from gxl2rsf_extends.xsl&#10;</xsl:text>
 
  <xsl:apply-templates select="/gxl/graph/edge/type"/>
</xsl:template>
 
 
<xsl:template match="type[@xlink:href = 'dmmschema.xsd#InheritsFrom']">
 
  <xsl:variable name="FROM_ID" select="../@from"/>
  <xsl:variable name="TO_ID" select="../@to"/>
 
  <!-- ************************* -->
  <!-- *** Node's SourcePart *** -->
  <!-- ************************* -->
 
  <xsl:variable name="SP_FROM_ID"
    select="/gxl/graph/edge[@to = $FROM_ID]/type
                           [@xlink:href = 'dmmschema.xsd#Defines']/
                           ../@from"
  />
 
  <xsl:variable name="SP_TO_ID"
    select="/gxl/graph/edge[@to = $TO_ID]/type
                           [@xlink:href = 'dmmschema.xsd#Defines']/
                           ../@from"
  />
 
  <!-- ********************************************** -->
  <!-- *** Print: "extends" ClassName ClassName   *** -->
  <!-- ********************************************** -->
 
  <xsl:text>extends </xsl:text>
 
  <xsl:call-template name="print_node">
    <xsl:with-param name="NODE_ID" select="$FROM_ID"/>
    <xsl:with-param name="SOURCE_PART_ID" select="$SP_FROM_ID"/>
  </xsl:call-template>
 
  <xsl:text> </xsl:text>
 
  <xsl:call-template name="print_node">
    <xsl:with-param name="NODE_ID" select="$TO_ID"/>
    <xsl:with-param name="SOURCE_PART_ID" select="$SP_TO_ID"/>
  </xsl:call-template>
 
  <xsl:text> </xsl:text>
 
  <xsl:call-template name="print_SourcePart">
    <xsl:with-param name="SOURCE_PART_ID" select="$SP_FROM_ID"/>
    <xsl:with-param name="DELIMITER" select="','"/>
  </xsl:call-template>
 
  <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>
 
Not show is the utility code (in util.xsl) that handles the printing
of names. A significant part of the code deals with generating name
that are similar to the ones generated by the Rigi C++ parser.
 
XSL was an interesting experiment for us. Since we had no prior XSL
knowledge, the code was rewritten several times as our understanding
increased. The resulting code has not been optimized for speed, but
readability. We had trouble finding a good XSL processor until Sergei
Marchen pointed us to Xalan. The Xalan for Java implementation (ran
uncompiled on a JVM) was stable enough to run our scripts, but
execution speed proved to be a problem, especially for large GXL
files. Translation of all files took around 9 days and bigger files
(such as srticode.cpp) took more than a day.
 
The parser translates every source file to a corresponding GXL
file. (The Ottawa parser can also generate a single GXL file capturing
all source files, but this file is too big to process with an XSL
script.) Unfortunately, this means that names of artifacts that do not
have a definition cannot be resolved. These names have to be resolved
by a primitive "linker" in a separate postprocessing step. For
example, for the class TABAGraphForm is represented with two nodes:
TABAGraphForm^abagraph.h^21^6 and TABAGraphForm^^^. The first, full
name is generated when the class's definition is parsed (in
abagraph.h). The latter is generated whenever there is a reference to
the class in a different file than abagraph.h. We are currently in the
process of writing a linker that translates abbreviated names to full
names. Even full names can be ambiguous because the naming scheme
(taken from the original Rigi parser) is not unambiguous.
 
Here is Sergei's reply to the question "Will I get the same result
when I process every file separatley compared to using the -l/-s
option with all files?":
 
  "They will be still valid but completely independent from each
  other. The thing is that IDs of the "same" nodes in a different
  files will be different and vice versa - "different" nodes in
  different files can have the same ID because in this case in every
  file the numbering of nodes will be started from "1" every time you
  run the parser. In case of list of files (specified by "-l" switch
  or autogenerated by "-s (-f)" switch(es)) parser keeps track of
  elements IDs so node, representing type "int" for example will have
  the same ID in every GXL file containing an "edge(s)" to 'type
  "int"'. So some or all of those files can be kind of "linked"
  together and they still will correctly represent relationships
  between nodes. If you specify "-A" switch parser just outputs all
  the information into one file and this file will contain only one
  node for type "int" and it will most likely have same ID as in case
  of multiple files output... Same about all other things. However
  locally defined (and referenced) variables will have different IDs
  from the globally defined variables with the same name so we can
  easily distinguish one from another."
 
Development of the converter took about 25 hours, spread over several
weeks.
 
2.3.2.3. Comparison of RSF and GXL
 
It is interesting to do a comparison of the file sizes between GXL and
RSF for all of SORTIE:
 
                   GXL        RSF
lines        1.495.211     36.929 
characters  35.143.782  2.921.113
 
Note that the conversion from GXL to RSF looses some information
(e.g., the path attribute and the Contains relationships of
SourceParts). Still, even when this is taken into account, GXL's file
size is still considerably larger.
 
2.3.3. Comparison
 
We would like to compare the RSF generated by both
approaches. Identifying their differences (especially inconsistencies)
should help us to detect parser bugs/characteristics and
strengths/weaknesses of the respective approaches.
 
So far we did not have time to do this.
 
2.4. Postprocessing of RSF
 
Xiaoming started to analyze SORTIE based on the RSF obtained with the
Rigi parser. Her findings are summarized in her report [Wu01].
 
We did not have the time to analyze SORTIE based on the RSF obtained
with the Ottawa C++ parser.
 
3. Collaboration Partners
 
3.1. Peer Tools
 
As described above, we used the TkSee C++ Parser (University of
Ottawa), which is a parser similar in functionality to the Rigi C++
parser. The TkSee C++ parser output is GXL (and TA++), which forced us
to write a converter from GXL to RSF. Writing of the converter took a
considerable amount of time -- time that could have been more
profitably spent. We hope that the research community will converge
towards a common exchange format (preferably GXL) to ease tool
interoperability. But even if a common exchange format exists, the
problem of schema conversion remains. In fact, the schema conversion
from DMM to Rigi's C++ schema was the most difficult part of the
translation process.
 
3.2. Complementary Tools
 
In the future, we would like to use the clustering results of the
TkSee participants, but this will mean, of course, yet another
converter.
 
Besides Rigi we used a variety of Unix tools, mainly grep, sed, and
awk. We also used emacs to browse and search in files (etags to
search in emacs over multiple files).
 
In general, there were few offers of participants of output that they
produced. Thus, we currently do not even know if there was potential
to use other's output. (Maybe participants talked to each other with
private email instead of posting results; this was the case to a
certain extent in our communication with the TkSee participants.)
 
We offered output, but did not receive response of participants. It
appears that Andrew Malton used the textual output of the dfm files
that we posted, but his lead him to a dead end.
 
3.3 General Observations
 
Communication among the participants was far less than we expected. As
of this writing, only 22 messages were sent over the email
list. Personally, I did not like the eGroups list. After the list was
switched over to Yahoo, I did not register again, because I do not
want to give personal information away. It is not very reassuring for
that "Yahoo! will try to provide more relevant content and advertising
based on the information collected on this page and on the Yahoo! 
products and services you use." Having a mailing list in addition to a
central, open, non-commercial bulletin-board such as a Wiki might be a
better way (but this is a matter of personal taste I guess).
 
On way to increase collaboration might be to encourage participants to
post results in stages. Every participant could have a status bar
showing their current stage. For example, participants should post
their obtained output (and experiences) as soon as they succeed
parsing the code. Thus, other participants learn early on from first
successful parsing experiences or can skip the parsing stage by using
other participant's output. Since parsing proved to be a difficult
part of the assignment (by far more difficult than xfig), this could
have save redundant (and frustrating) work for other teams.
 
 
References
 
[BB] Borland C++ Builder, http://read.cnread.net/dnwl/cxsj/c/bcb/.
 
[BC] Borland Community Web site, http://community.borland.com/.
 
[BBuilder] Borland C++ Builder home page,
  http://www.borland.com/bcppbuilder/.
 
[BDSS] Borland Developer Support Staff, "Delphi 3 file types with
  descriptions", Borland Community Web site,
  http://community.borland.com/article/0,1410,16552,00.html.
 
[BS] Borland Staff, "Borland C++Builder White Paper: C++Builder and
  VCL", Borland Community Web site,
  http://community.borland.com/article/0,1410,10095,00.html.
 
[CEK00] Jorg Czeranski, Thomas Eisenbarth, Holger Kienle, Rainer
  Koschke, and Daniel Simon, "Analyzing xfig Using the Bauhaus Tool",
  7th Working Conference on Reverse Engineering (WCRE '00), pages
  197-199, Brisbane, November 2000.
 
[DMM01] "The Dagstuhl Middle Model (DMM)",
  http://scgwiki.iam.unibe.ch:8080/Exchange/2.
 
[Gendron] Sachen Gendron, "A Summary of the User Requirements for the
  SORTIE Reengineering Project",
  http://www.csr.UVic.CA/chisel/collab/casestudy.html
 
[GXLRSF] GXL to RSF converter,
  http://crown.csc.UVic.CA/cgi-bin/twiki/view/Main/RigiReleases.
 
[Intersimone] David Intersimone, "VCL++: Applying the Power of the VCL
  to C++ by Randy Haben", Borland Community Web site,
  http://community.borland.com/article/0,1410,20565,00.html.
 
[Kienle] Holger Kienle, "SORTIE Reverse Engineering Tools Collaboration
  Page", http://cedar.csc.uvic.ca/kienle/view/Main/SORTIE.
 
[Marchen] Sergei Marchen, "C/C++ Parser with TA++ and GXL output",
  University of Ottawa, http://137.122.24.96/dmm/.
 
[MWW00] Johannes Martin, Kenny Wong, Bruce Winter, and Hausi Muller,
  "Analyzing xfig Using the Rigi Tool Suite", 7th Working Conference
  on Reverse Engineering (WCRE '00), pages 207-209, Brisbane, November
  2000.
 
[RigiWiki] http://crown.csc.UVic.CA/cgi-bin/twiki/view/Main/WebHome.
 
[SD] Sortie-demo mailing list and Yahoo! Groups,
  http://sg.groups.yahoo.com/group/sortie-demo.
 
[SN] "The Source-Navigator IDE", http://sources.redhat.com/sourcenav/.
 
[TWiki] TWiki home page, http://www.twiki.org/.
 
[VCL] http://read.cnread.net/dnwl/cxsj/c/bcb/001.htm.
 
[Wu01] Xiaomin Wu, "Final Report of Directed study -- Topic: Software
  Reverse Engineering", http://www.csc.uvic.ca/~xwu/final-report.htm.