Data Cleansing: About the Tools Available
A time-consuming part of the data loading process is reviewing and cleansing the data and ensuring that it meets the data format requirements specified by the appropriate RBInfo files and DTDs.
|
Data files must be in XML format and must comply with the DTD supplied with the Windchill installation (available from the standardX20 directory). For information about creating and validating XML data files, see Validating the XML Format.
|
This section provides the pros and cons of different tool options for cleansing the data.
Text Editors: Pros and Cons
Text editors, such as WordPad, Notepad, and TextPad, are simple to use and readily available. You might choose to use one of these if additional data conversion is not necessary and the size of the data file is small. They provide a quick view of what is about to be loaded.
Pros
• Simple to use
• Available on most operating systems
• Provide a quick review of data
• Minimal learning curve
• Can be used for simple search and replace
• No additional coding or preparation is required to view data in this tool
• Works for XML, CSV, and other text files
Cons
• Data validation not available
• Additional data transformation is not available
• Not viable for large amounts of data
Microsoft Excel: Pros and Cons
Customers may have routines to produce data in flat-file format separated by a delimiter, such as a comma. Microsoft Excel is a good tool to use to view this data in neatly organized rows and columns.
Pros
• Simple to use
• Available on most Windows operating systems
• Provides a quick review of data
• Can be used for simple find-and-replace operations
• Can develop formulas to validate data
• Minimal learning curve
• Provides the ability to manually move columns, which limits additional data transformation
• Provides the ability to filter and sort data. This provides an easy approach to identifying duplicate parts and documents as well as producing a list of items to add to resource bundles.
• Data consistency
Cons
• Data must be parsed into columns manually (using the “Text To Columns” functionality)
• Limited to 65,536 rows of data per worksheet
• Not available when working on non-Windows platforms
• Requires that data be in row and column format
Custom Code: Pros and Cons
Custom developed code can greatly enhance the load process. Developing XML transforms, Java programs, or both allows the customer to continually deliver data in the same format regardless of the desired load file format. With the introduction of contexts, custom-built code can be used to separate data into groups of load files as is now required.
Custom code is typically used in conjunction with another tool, such as a text editor.
This approach is most likely to be used when the load file format is different from the format of the data that was provided.
Pros
• Can be developed with special rules for data validation
• Transforms data from neutral (customer-supplied) to load-ready format
• Allows data segregation
• Consistent and explainable outcomes, unlike manual manipulation involved with the use of a tool such as Excel where there is the potential for error
• Transportable and shareable
• Runs on multiple operating systems
Cons
• Tightly coupled to input and output. A change in the input or output requires a change to the custom code.
• No user interface to review the data
• Usually requires the use of an additional tool for reviewing data
• Multiple skill sets are required
• Combination of manual and programmatic efforts, after code is developed