Unicode Encoding

This section describes how the Unicode support used internally by Pro/ENGINEER from Wildfire 4.0 onward affects Creo TOOLKIT and its applications.

Introduction to Unicode Encoding

Unicode is a universal character encoding standard. It is a single character encoding scheme that allows characters from European, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Thai, Urdu, Hindi, and other world languages to be encoded in one character set. This enables applications to simultaneously support text in multiple languages in their data files. Unicode also covers most of the letters, punctuation marks, and technical symbols in common use that are not covered by the legacy encoding.

Unicode defines two mapping methods:

• UCS (Universal Character Set) encoding

• UTF (Unicode Transformation Format) encoding

For more information on Unicode Encoding, visit http://unicode.org.

From Pro/ENGINEER Wildfire 4.0 onward, all string data in Pro/ENGINEER (previously stored in the legacy encoding format) is stored in the Unicode encoding. Pro/ENGINEER Wildfire 4.0 uses the UCS-2 encoding on Windows platforms and the UCS-4 encoding in UNIX environments for widestring data. It reads and writes character data using the multibyte UTF-8 encoding on all platforms. UTF-8 is a variable-length character encoding format built from 8-bit units that uses one to four bytes per character.
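The variable-length nature of UTF-8 can be illustrated with a minimal sketch in standard C. The function name utf8_encode is hypothetical and is not part of Creo TOOLKIT; the sketch only shows how a single Unicode code point maps to one, two, three, or four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Creo TOOLKIT function.
 * Encodes one Unicode code point as UTF-8 into out (up to 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not a valid
 * Unicode scalar value (a surrogate or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* 7-bit ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For example, the letter A (U+0041) occupies one byte, while é (U+00E9) occupies two bytes (0xC3 0xA9). A buffer sized for a legacy single-byte string may therefore be too small for the same text in UTF-8.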

Some important terminology about string encoding related to Creo TOOLKIT that is used throughout this section is described as follows:

• “Unicode encoding” refers to the string and widestring encodings used by Pro/ENGINEER Wildfire 4.0 and later.

• “Legacy encoding” refers to the encoding used by Pro/ENGINEER Wildfire 3.0 and earlier. Depending on the language, this encoding is typically some version of an EUC encoding.

• “Native encoding” refers to the encoding used by the operating system in the language in which the system is running. This encoding is the same as legacy encoding in most cases.

• “Multibyte string” refers to a character array representing a string in the C language. Because of the limited size of the character (a single byte), combinations of multiple bytes are used to represent characters outside the ASCII range.

• “7-bit ASCII” refers to the character range 0 through 127 (0x00 through 0x7F). This range is shared between Unicode and non-Unicode encodings used by Creo Parametric. Thus, any data of this type is unchanged after transcoding.

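• “8-bit ASCII” refers to the character range 128 through 255 (0x80 through 0xFF). In many European native encodings, this range is used to represent European accented vowels and other letters. In the Unicode UTF-8 encoding, single bytes in this range do not represent characters directly; they appear only as part of multibyte sequences. Therefore, 8-bit ASCII native strings do not have the same byte representation in Unicode and must be transcoded.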

• “Byte Order Mark” (BOM) refers to the Unicode character U+FEFF, which is encoded in UTF-8 as the three bytes 0xEF 0xBB 0xBF (represented in C language strings by “\357\273\277”). It is placed at the top of a text file to indicate that the text is Unicode encoded. Unicode has designated the character U+FEFF as the BOM and reserved the byte-swapped value U+FFFE as an illegal character (a noncharacter). Most of the text files generated by Creo Parametric are written with the BOM and Unicode encoding. Creo Parametric accepts as input either a Unicode encoded text file with a BOM or a legacy encoded text file without a BOM.

• “Transcoding” refers to the act of changing a string or widestring encoding from one encoding to another, for example, from platform native to Unicode or vice-versa. For some transcoding operations, there is a possibility of data loss, since characters from one encoding may not be supported in the target encoding.
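The terms above can be made concrete with a short sketch in standard C. All three function names are hypothetical and are not Creo TOOLKIT functions. The sketch also assumes ISO-8859-1 as a stand-in for a European native encoding, because its byte values coincide with the first 256 Unicode code points; real legacy encodings (for example EUC variants) need a proper conversion library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, not Creo TOOLKIT functions. */

/* Returns 1 if buf starts with the UTF-8 BOM (EF BB BF), otherwise 0. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    assert(buf != NULL || len == 0);
    return len >= 3 && memcmp(buf, bom, 3) == 0;
}

/* Returns 1 if every byte is 7-bit ASCII (0x00..0x7F), otherwise 0.
 * Such data is identical before and after transcoding. */
int is_7bit_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return 0;
    }
    return 1;
}

/* Transcodes ISO-8859-1 bytes (an assumed native encoding) to UTF-8.
 * Bytes 0x80..0xFF expand to two bytes each, so the output can be up
 * to twice as long as the input. Stores the output length in *out_len
 * and returns 1 on success, or returns 0 if out_cap is too small. */
int latin1_to_utf8(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t out_cap, size_t *out_len)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && out_len != NULL);
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] <= 0x7F) {
            if (n + 1 > out_cap)
                return 0;
            out[n++] = in[i];               /* unchanged 7-bit ASCII */
        } else {
            if (n + 2 > out_cap)
                return 0;
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    *out_len = n;
    return 1;
}
```

Under these assumptions, an application reading a text file could first check for the BOM, treat the content as UTF-8 if it is present, and otherwise treat it as legacy encoded. Converting in this direction is lossless; the reverse direction can lose data because most Unicode characters have no ISO-8859-1 equivalent.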
