mbstoucs

mbstoucs (bytestr, len[, charset])

This function converts the byte string bytestr encoded in the character set charset to a Unicode string. len specifies the number of bytes in bytestr to examine, and is normally the return value from a call to mblen.

charset is one of the character sets listed in the description of mblen. If charset is not specified, the system character set is assumed.
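
For a byte string that is already complete in memory, the call pattern described above might look like the following. This is a minimal sketch, not a definitive usage: the variable names bytestr, nbytes, and ustr are hypothetical, and bytestr is assumed to already hold JEUC-encoded data. The mblen call form is the one used in the example on this page.

# bytestr is assumed to hold JEUC-encoded bytes (hypothetical input)
# count the bytes that form whole characters
nbytes = mblen(bytestr, "jeuc");
# convert only those bytes to a Unicode string
ustr = mbstoucs(bytestr, nbytes, "jeuc");

Passing the same charset to both calls keeps the byte count and the conversion consistent with each other.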

If the character set is a multi-byte character set, the last few bytes in bytestr may not form a whole character but only the prefix of a multi-byte character. In this case, keep those trailing bytes and join them with the rest of the character before a subsequent call to mbstoucs. See the following example.

Example

This example shows how to convert a Japanese file encoded in JEUC into SJIS. For brevity, no error checking is done. Converting from or to a 16-bit character set would also need additional code, for example, to handle byte-swapping or the Unicode text file signature (the byte order mark).

inf = open(jeucfile, "rb");
outf = open(sjisfile, "wb");
# no bytes are left over before the first read
remb = "";
while ((len = read(inf, buf, 512)) > 0)
{
  # append what we just read to any left over bytes
  # from previous read
  bstr = remb . buf;
  # compute how many bytes form whole characters
  mb_len = mblen(bstr, "jeuc");
  # and copy the remainder to REMB
  remb = substr(bstr, mb_len+1);
  # convert to Unicode
  ucsstr = mbstoucs(bstr, mb_len, "jeuc");
  # and out to SJIS
  mbstr = ucstombs(ucsstr, "sjis");
  write(outf, mbstr);
}
close(inf);
close(outf);