Strategy for reverse engineering legacy C++ code (C++ code)

This topic provides a roadmap for reverse engineering code with the C++ Code Reverser, and is aimed at first-time users of the Reverser. It describes the steps required for the most common usage patterns and highlights possible pitfalls. It assumes that you are familiar with UML, C++ and compiler concepts.

There are three common scenarios (use cases) for reverse engineering code:

• To reverse engineer existing handwritten code (also known as 'legacy' code) so that you can visualize, understand and document it better than is possible with the code alone. This also allows further software development to be done in the more productive environment of Modeler.

The Reverser does its work in two stages:

• It parses the code files selected for reverse engineering and reports any errors. It is important that you correct parsing errors before continuing.

The subsections that follow detail the issues - the solutions are in the final section.

The Reverser uses a one-pass pre-processor, which means that the order of the source files to reverse can be significant, in particular when reversing .h files (.cpp files do not generally include other .cpp files, just .h files, which have to be in the correct order anyway to compile correctly). The order is important because the .h files contain the class definitions, and classes may reference each other, so forward references must be avoided. This is exactly the same problem that compilers have when they follow the #include statements in your source files. If you are using a precompiled header file, the code file from which the precompiled header file is built must be first in the list.
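As a minimal illustration of the ordering issue, the following sketch shows the contents of two hypothetical header files, point.h and segment.h, laid out in the order in which a one-pass pre-processor would need to see them. The file names and classes are invented for this example. Segment stores Point objects by value, so the definition of Point has to come first.

```cpp
// --- contents of a hypothetical point.h: must be seen first ---
struct Point {
    int x;
    int y;
};

// --- contents of a hypothetical segment.h: must be seen second ---
// Segment holds Point members by value, so it needs the complete
// definition of Point; a forward declaration would not be enough.
struct Segment {
    Point start;
    Point end;

    // Horizontal and vertical extent of the segment.
    constexpr int dx() const { return end.x - start.x; }
    constexpr int dy() const { return end.y - start.y; }
};
```

If the two parts were presented in the opposite order, a compiler would report Point as an undefined type at the member declarations in Segment. Selecting the source files in a compilable order avoids the equivalent problem when reversing.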

The Reverser is primarily looking for the class declarations in the source files. These are (normally) found in '.h' files; however, the easiest way in many projects to include your '.h' files in an acceptable order is to point the Reverser at the '.cpp' files instead (see above). The Reverser will then follow the #include statements in the '.cpp' files in order. The other reason it is best to pick the '.cpp' files is that there is often additional information in them that is not in the header, such as comments and also named parameters (rather than just data-types). The reversal of such information will add useful information to your model.

Your source code will #include headers from the compiler, and other libraries. Do you want these libraries in your model? These libraries are potentially very large - often very much larger than the bespoke part of your application. Normally you don't want these class definitions in your model, as you will not be changing them, merely using them. Importing such definitions will just increase the size and complexity of your model and greatly increase the time taken for the Reverser to run. You have three options for reverse engineering each library:

• Do not reverse engineer the library. References to the library are captured as text. This approach minimizes the Model size and the time taken to reverse engineer code.

• Reverse engineer used elements of the library only. This approach allows you to see the Library files and used elements in your Model, whilst not wasting time reversing unused library elements.

• Reverse engineer the complete library. This approach allows you to see the complete library file in your Model, which can be useful if you want to use currently unused library elements in the future.

Most non-trivial C++ projects involve #ifdefs that result in conditional compilations. For example there may be multiple definitions of a 'Timer' class to interface to different hardware or OS combinations - only one is actually compiled based on which of 'OS1' or 'OS2' is defined. Just as a compiler needs to know the specific definitions to compile correctly, so too does the Reverser. The Reverser reverse engineers only one configuration of your code.
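The 'Timer' situation described above might look like the following sketch. The symbols OS1 and OS2 come from the example in the text; the class contents and tick rates are invented purely for illustration. So that the fragment compiles on its own, it falls back to OS1 when neither symbol is defined, which stands in for the choice you would make through the pre-processor definitions.

```cpp
// Fallback for this standalone sketch only: choose the OS1 configuration
// when the build has not defined either symbol.
#if !defined(OS1) && !defined(OS2)
#define OS1
#endif

#if defined(OS1)
// Timer variant for the first (hypothetical) platform.
class Timer {
public:
    static constexpr int ticksPerSecond() { return 1000; }
};
#elif defined(OS2)
// Timer variant for the second (hypothetical) platform.
class Timer {
public:
    static constexpr int ticksPerSecond() { return 60; }
};
#endif
```

Only one of the two Timer definitions survives pre-processing, so only that one can appear in the model for a given set of definitions.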

Even once you have excluded standard libraries from your reversing, some projects are still very large. As the Reverser checks each item it is reversing against the existing model (or what it has already reversed in the case of legacy reversal), the speed at which reversing proceeds decreases as model size increases. As a result of this, for very large projects it is often a good idea to split the project up along natural interfaces and place each smaller project in a separate directory.

Each smaller project can then be reverse engineered into a separate package, with the directory mappings for each project specifying what is within the scope of that project and what is not. If you have co-dependencies, you may need to reverse engineer the first package again after the subsequent packages have been reverse engineered, so that all references appear in the model as links.

We recommend that you create a separate Model Settings File for each smaller project in the model.

As the question 'How large is very large (when it comes to reversing)' is affected by so many factors (number of classes, size of classes, connectivity between classes, code style, power of computer, length of wait that is acceptable) it is best determined by experiment and experience - try it on your model and see. As a real 'finger in the air' estimate, though, the upper limit for a manageable reversal is of the order of a few hundred classes or a hundred thousand lines of code.

Through the 'Reverse Engineer Code Bodies' check box on the Reverse Engineering Options 1 page, you can choose to either reverse engineer function bodies or not.

When updating Operation Body properties, the Reverser attempts to preserve model object references.

Model Settings Files record ACS settings used for generating and reverse engineering code. By default, the Model Settings File for a model is saved locally to your Modeler installation folder. You can save a Model Settings File to a shared directory, so that everyone uses the same settings for generating and reverse engineering code. In addition, you can use Model Settings Files to maintain different configurations, such as code with and without debug constructs.

Note that there is considerably more information (including many screenshots) for the tool operations that follow in the help. The subsections that follow cover all of the steps required for a 'first reversal'.

Decide whether to reverse into one or a number of destination projects. If you decide on several then you need to decide how to split up the source code between the projects. After you have decided on the split you will probably need to move files around, to make sure that each project has its own distinct directory or directories, from which it reverses into its Class Model. If you don't do this (that is, if some directories contain class definitions meant to go to different projects) you will end up with classes defined in more than one project, which will cause problems at integration time.

The root directory to package mapping defines what code the Reverser is treating as application code and what code is treated as library and external code.

If you have a dsp, vcproj or batch file (created through your make utility, not the make file) for your project, on the Select Model page, click the Project File button, and then select the dsp, vcproj or batch file. The dsp or vcproj file is easier to use than the batch file, because you have to specify additional information when using the batch file.

To include the source files in the required order, use one of the following mechanisms. They become more involved as you go down the list, so use the first one that you can apply:

◦ If the code is based on Visual Studio, then reference the dsp or vcproj project file through the Project File button on the Select Model page. The list of files to reverse engineer will be populated in the correct order from information in the dsp or vcproj file. It will also set up the Reverser pre-processor define and include definitions.

◦ If you have a batch file created through your make utility for your project (not the make file), then reference the batch file through the Project File button on the Select Model page. The list of files to reverse engineer will be populated in the correct order from information in the batch file. It will also set up the Reverser pre-processor define and include definitions. By its nature, the make file format is difficult to analyze, so while this approach is well worth trying, the Reverser will not always succeed in finding the correct information. You should cross-check the results it produces.

◦ If you have no dsp, vcproj or batch file (created through your make utility) for your project, you will need to add .cpp source files (if available) or .h files (if .cpp files are not available, as when you are reversing the interface to a precompiled library).

◦ If the Root Directory owns all of the code files selected for reverse engineering (either directly or through any of its subfolders), all the code files will be reverse engineered because of the default Root Directory to Root Object mapping.

◦ If any files that you have selected for reverse engineering (or files that they #include) are not owned by the Root Directory or any of its subfolders, you must map folders to Packages in the Model so that each code file selected for reverse engineering is reverse engineered.

• If you are using a precompiled header file, the code file from which the precompiled header is built must be first in the list.

• Each code file you select for reverse engineering is reverse engineered only if its owning folder or one of the owning folder's parent folders is mapped to a Package (either the Model itself or a Package in the Model).

On the Reverse Engineering Options 2 page, you must specify the pre-processor variables (equivalent to the #define statements in the code) needed to resolve how to reverse code segments within #ifdef and similar directives, and how to resolve macro substitutions.

If you selected a dsp, vcproj or batch file (created through your make utility) for your project on the Select Model page, the #defines list will be populated from information in the selected file.
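Macro substitution matters most where a macro appears inside a declaration. The sketch below uses a hypothetical export macro, MYLIB_API, placed between the class keyword and the class name, a pattern that is common in Windows library code. The macro name and the Widget class are assumptions made for this example and do not come from any particular library. Without a definition for the macro, a parser sees an unknown token in the class declaration; here the macro is given an empty expansion, which is the kind of definition you would expect to supply through the pre-processor variables.

```cpp
// Hypothetical export macro. In a real build it is typically defined by
// the project settings; here it expands to nothing so that the
// declaration below parses as a plain class.
#ifndef MYLIB_API
#define MYLIB_API
#endif

// After macro substitution this reads: class Widget { ... };
class MYLIB_API Widget {
public:
    static constexpr int version() { return 2; }
};
```

The same reasoning applies to any macro that expands to part of a declaration: the parser needs to know its expansion, even if that expansion is empty.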

If you have code that causes Reverser parsing errors but is valid for your compiler, you can hide that code from the Reverser parser through the RTS_SYNC_INVOKED #define. RTS_SYNC_INVOKED is automatically defined when the Reverser is running, so you can make the code that causes parsing errors conditional using #ifndef RTS_SYNC_INVOKED so that the Reverser ignores it.
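A minimal sketch of this guard follows. RTS_SYNC_INVOKED is the symbol named above; the Sensor class and the countArgs template are hypothetical, and the template is only a stand-in for whatever construct causes a parsing error in your own code (it is not a statement about which constructs the Reverser accepts).

```cpp
// Ordinary code that both the compiler and the Reverser should see.
class Sensor {
public:
    constexpr explicit Sensor(int id) : id_(id) {}
    constexpr int id() const { return id_; }
private:
    int id_;
};

// A normal compiler build does not define RTS_SYNC_INVOKED, so the
// compiler still sees this block. When the Reverser runs, the symbol is
// defined and the block is skipped by the pre-processor.
#ifndef RTS_SYNC_INVOKED
// Stand-in for code that is valid for your compiler but that causes
// a parsing error when reversing.
template <typename... Ts>
constexpr int countArgs(Ts...) { return static_cast<int>(sizeof...(Ts)); }
#endif
```

Keep the guarded region as small as possible, because anything inside it will not be reverse engineered into the model.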

The code files you have selected for reverse engineering may be dependent on other code files, such as the MFC library. The Reverser parser needs to know the paths in which dependent code files reside to correctly parse the code files selected for reverse engineering. You determine how libraries and other code files that are referenced through #includes are dealt with on the Reverse Engineering Options 3 page:

◦ If you do not want to reverse engineer a used library, list the #include path but do not map the path to a Package in the Model. Code files that use the library will parse successfully and references to the library will be captured as text.

◦ If you want to reverse engineer only used elements of the used library, list the #include path and map the path to a Package in the Model. The Reverser will reverse engineer only the used elements of the library.

◦ If you want to reverse engineer the complete used library, you must reverse engineer the library as a selected file before reverse engineering your code files. You then reverse engineer your code files, list the #include path and map the path to the Package in which the library resides. For more information about reverse engineering libraries, see the Reverse Engineering Libraries section that follows.

◦ If you do not set a #include path, #include statements will fail, and the failures are likely to cause further parsing errors that may prevent your own code from reverse engineering correctly. However, if you are experiencing memory problems when reverse engineering large quantities of code, not setting a #include path will reduce the amount of memory required by the reverse engineering process.

If you selected a dsp, vcproj or batch file (created through your make utility) for your project on the Select Model page, the #include paths list will be populated from information in the selected file.

If you are working with C++ code, you must ensure that the #INCLUDE path check box is selected for each path. If the #INCLUDE path check box is cleared and the path is not mapped to a package, the path is ignored.

If the library you want to reverse engineer has been previously stored in an integrated configuration management tool (CM tool), you can add the appropriate CM tool package to your Model, rather than reverse engineering the library to your Model.

If the library you want to reverse engineer has been previously reverse engineered to another Model, you can export the Package to a directory and then import that Package to your Model, rather than reverse engineering the library to your Model.

Typically you will reverse engineer only the library header files, in which case the order they are reverse engineered may be important. To ensure they are reverse engineered in the correct order, create an 'all.cpp' file that lists the header files in a compilable order through #includes. Reverse engineer the all.cpp file to the required Package in the Model.

On the Parsing Complete page, resolve #include errors first by adding search paths, and then repeat the parsing process. Expect some #include errors even if you loaded your include paths from a project file or a batch file created through your make utility: some compiler manufacturers hard-code the paths to core libraries, so those paths do not appear in the project or batch file, but the Reverser still needs them.

After there are no #include errors, eliminate the pre-processor definition errors by adding or changing the #define definitions, and repeat the parsing process until there are no errors.