Character Sets and Encoding

Arbortext Editor supports the import and export of both 8-bit character encodings (for example, ISO-Latin) and multibyte character encodings (for example, Shift-JIS and GB_2312-80). These character encodings are required for input, display, and publishing of the following languages:

• Japanese

• Simplified Chinese

• Traditional Chinese

• Korean

• English

• European (European characters are represented as character entities within Asian-encoded files)
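The character-entity convention noted in the last item can be illustrated outside Arbortext Editor. The following is a minimal Python sketch, not Arbortext Command Language. It assumes Python's built-in shift_jis codec and uses numeric character references as a stand-in for whatever entity names a given document type declares.

```python
# Illustrative only: Python's standard codecs, not an Arbortext API.
text = "Café 日本語"

# "é" has no code in Shift-JIS, so a strict encode would raise an error.
# errors="xmlcharrefreplace" writes it as a numeric character reference
# instead, while the Japanese characters are encoded normally.
encoded = text.encode("shift_jis", errors="xmlcharrefreplace")

print(encoded)
```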

Arbortext Editor supports the following character sets and encoding methods:

• Adobe-Standard-Encoding

Adobe-Standard-Encoding allows a single font to be used for multiple operating systems.

• ISO-8859-1 through 9

ISO 8859 defines accented and non-Latin characters used in European languages. ISO 8859 defines the following character sets:

◦ Part 1: Latin Alphabet No. 1

◦ Part 2: Latin Alphabet No. 2

◦ Part 3: Latin Alphabet No. 3

◦ Part 4: Latin Alphabet No. 4

◦ Part 5: Latin/Cyrillic Alphabet

◦ Part 6: Latin/Arabic Alphabet

◦ Part 7: Latin/Greek Alphabet

◦ Part 8: Latin/Hebrew Alphabet

◦ Part 9: Latin Alphabet No. 5

◦ Part 10: Latin Alphabet No. 6
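Because every part of ISO 8859 is an 8-bit encoding, the same byte value stands for different characters depending on which part is in effect. The following Python sketch (Python's standard codecs, not an Arbortext API) decodes one byte under three of the parts listed above.

```python
# One byte, three ISO 8859 parts. Illustrative only.
raw = bytes([0xE9])

decoded = {}
for part in (1, 5, 7):
    # Python names these codecs iso8859_1, iso8859_5, iso8859_7.
    decoded[part] = raw.decode(f"iso8859_{part}")
    print(f"ISO-8859-{part}: {decoded[part]}")
```

The byte decodes to a Latin, a Cyrillic, and a Greek letter respectively, which is why a file's declared encoding must match the part it was written in.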

• ISO-10646-UCS-2

The International Organization for Standardization (ISO) encoding that is equivalent to Unicode.

ISO 10646 defines two encoding methods: the 2-byte form UCS-2 and a 4-byte form referred to as 10646.UCS-4. The 31-bit code space of 10646.UCS-4 can encode over 2 billion unique characters.

When writing an XML document in UCS-2, Arbortext Editor always writes out a byte order mark.
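To show what a byte order mark looks like in practice, the following Python sketch encodes a short string with Python's utf-16 codec. This is an approximation rather than Arbortext Editor's own writer: for characters that fit in 2 bytes, UTF-16 and UCS-2 produce the same bytes, and the sample text is chosen on that assumption.

```python
import codecs

text = "日本語"  # every character here fits in a single 2-byte unit

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = text.encode("utf-16")

# An explicit-endian codec writes the same code units with no BOM.
big_endian = text.encode("utf-16-be")

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A reader uses the BOM to select the byte order, then discards it.
round_trip = with_bom.decode("utf-16")
```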

• EUC-JP

Extended UNIX Code.

EUC was developed as a method of handling multiple character sets, Japanese or otherwise, within a single text stream. EUC-JP is an ISO-2022-compliant 8-bit encoding that initially designates ASCII to G0 and JIS X 0208-1983 to G1 without explicit announcement.
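The G0/G1 arrangement can be observed with Python's standard codecs (a sketch for illustration, not an Arbortext API). ASCII characters pass through as single bytes, while a JIS X 0208 character appears as its two JIS bytes with the high bit set. The same two bytes show up in 7-bit form in ISO-2022-JP, where an escape sequence announces the character set explicitly.

```python
text = "A漢"
euc = text.encode("euc_jp")  # b'A\xb4\xc1'

# G0 (ASCII): the first byte is unchanged.
ascii_part = euc[:1]

# G1 (JIS X 0208): clearing the high bit recovers the 7-bit JIS code.
g1 = bytes(b & 0x7F for b in euc[1:])

# ISO-2022-JP wraps the same JIS code in escape sequences:
# ESC $ B <two bytes> ESC ( B
iso = "漢".encode("iso2022_jp")
jis = iso[3:5]
```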

• Shift-JIS

The Japanese Industrial Standard multibyte encoding.

The encoding, developed by Microsoft, has been widely implemented on Japanese PCs. The codes are numerically shifted from the codes used by JIS standard X 0208. Shift-JIS is also referred to as MS Kanji and SJIS.
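The numeric shift can be written out as a short function. The following Python sketch uses the commonly published conversion arithmetic. The helper names jis_to_sjis and jis_code are hypothetical, and the JIS X 0208 code is obtained from Python's euc_jp codec for convenience (each EUC-JP byte is the JIS byte plus 0x80).

```python
def jis_to_sjis(j1, j2):
    """Shift a JIS X 0208 byte pair (each 0x21..0x7E) into Shift-JIS."""
    if j1 % 2 == 1:
        # Odd rows map to the lower half of the trail-byte range,
        # skipping 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even rows map to the upper half of the trail-byte range.
        s2 = j2 + 0x7E
    # Two JIS rows share one lead byte; lead bytes skip 0xA0..0xDF.
    s1 = (j1 + 1) // 2 + 0x70
    if s1 >= 0xA0:
        s1 += 0x40
    return bytes([s1, s2])


def jis_code(ch):
    """Return the 7-bit JIS X 0208 byte pair for one character."""
    e1, e2 = ch.encode("euc_jp")
    return e1 - 0x80, e2 - 0x80


# Compare the hand-written shift with Python's own shift_jis codec.
results = {ch: jis_to_sjis(*jis_code(ch)) for ch in "漢字あア"}
matches = all(results[ch] == ch.encode("shift_jis") for ch in results)
```

The gap in the lead-byte range leaves room for the single-byte half-width katakana of JIS X 0201, which is one reason the codes are shifted rather than simply offset.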

• Big5

The multibyte encoding standardized by Taiwan. Traditional Chinese is encoded in Big5.

• GB_2312-80

The multibyte encoding standardized by the People's Republic of China. Simplified Chinese is encoded using GB_2312-80.

• KSC_5601

The multibyte Wansung and Johab encoding standardized by Korea.
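The three multibyte encodings above for Chinese and Korean can be compared side by side with Python's standard codecs. This is a sketch under assumptions: the codec names are Python's, not Arbortext Editor's, and treating Python's euc_kr codec as the Wansung form of KSC_5601 is an assumption made for this example.

```python
samples = {
    "big5": "繁體中文",    # Traditional Chinese
    "gb2312": "简体中文",  # Simplified Chinese
    "euc_kr": "한국어",    # Korean, Wansung form (assumed mapping)
    "johab": "한국어",     # Korean, Johab form
}

bytes_per_char = {}
round_trips = {}
for codec, text in samples.items():
    data = text.encode(codec)
    bytes_per_char[codec] = len(data) / len(text)
    round_trips[codec] = data.decode(codec) == text
```

Each sample character occupies two bytes in its encoding, and the Wansung and Johab forms assign different byte pairs to the same Korean text.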

• UTF-8

Unicode character set Transformation Format, 8-bit form.

UTF-8 is a variable-length encoding of Unicode using 8-bit sequences. This transformation format was developed by X/Open. It is described in ISO/IEC 10646 AM1.
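The variable length is easy to see by encoding characters from different ranges. The following Python sketch (standard library only, illustrative) records how many 8-bit units each sample character needs.

```python
# Sample characters: ASCII, Latin-1 accent, CJK ideograph, emoji.
chars = ["A", "é", "漢", "😀"]

lengths = {ch: len(ch.encode("utf-8")) for ch in chars}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```

ASCII characters stay one byte each, so an ASCII-only file is already valid UTF-8.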

• US-ASCII

American Standard Code for Information Interchange.

ASCII specifies control codes, the SPACE character, and a set of 94 printable characters (letters, digits, and punctuation or mathematical symbols) suitable for the interchange of English-language documents.
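The count of 94 can be checked directly. This Python sketch (standard library only, illustrative) builds the set from the letter, digit, and punctuation constants in the string module and compares it with the code range 0x21 through 0x7E, which lies between SPACE (0x20) and DEL (0x7F).

```python
import string

# Letters, digits, and punctuation or mathematical symbols.
graphic = string.ascii_letters + string.digits + string.punctuation

# The same characters as a contiguous range of codes.
by_code = [chr(c) for c in range(0x21, 0x7F)]

count = len(graphic)
same_set = sorted(graphic) == by_code
```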