Wu, P.-C., "A Base62 Transformation Format of ISO 10646 for Multilingual Identifiers," Software: Practice and Experience, Vol. 31, No. 12, Oct. 2001, pp. 1125-1130. (SCI Expanded, EI)

Paper title: A Base62 Transformation Format of ISO 10646 for Multilingual Identifiers


ISO 10646 Universal Character Set (UCS) is a 31-bit coding architecture that covers symbols in most of the world's written languages. Identifiers in programming languages are usually defined by using alphanumeric characters of ASCII, which represent mainly English words. An approach for working around this deficiency is to encode multilingual identifiers into the alphanumeric range of ASCII. For case-sensitive languages, an encoding that utilizes [0-9][A-Z][a-z] can be more space-efficient for multilingual identifiers. This paper proposes a base62 transformation format of ISO 10646 called UTF-62. The resulting string of UTF-62 is within a [0-9][A-Z][a-z] range, a total of 62 base characters. UTF-62 also preserves the lexicographic sorting order of UCS-4.
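The abstract does not spell out the UTF-62 encoding rules, so the following is not the paper's format. It is only a minimal sketch of the underlying idea, under the assumption of a naive fixed-width scheme: each 31-bit code point becomes exactly six base62 digits (since 62^5 < 2^31 <= 62^6), and the digits are drawn from [0-9][A-Z][a-z] in ascending ASCII order so that comparing encoded strings gives the same result as comparing the code-point sequences. The names `ALPHABET`, `WIDTH`, `encode_identifier`, and `decode_identifier` are hypothetical and chosen for illustration.

```python
import string

# Assumed digit set: 0-9, A-Z, a-z in ascending ASCII order, so that
# digit order agrees with plain string comparison of the output.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 6  # 62**5 < 2**31 <= 62**6: six digits cover any 31-bit value


def encode_code_point(cp: int) -> str:
    """Encode one code point as a fixed-width, six-digit base62 string."""
    if not 0 <= cp < 2**31:
        raise ValueError("code point outside the 31-bit UCS range")
    digits = []
    for _ in range(WIDTH):
        cp, rem = divmod(cp, 62)
        digits.append(ALPHABET[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def encode_identifier(text: str) -> str:
    """Encode every character of an identifier (illustrative scheme only)."""
    return "".join(encode_code_point(ord(ch)) for ch in text)


def decode_identifier(encoded: str) -> str:
    """Invert encode_identifier for strings it produced."""
    if len(encoded) % WIDTH:
        raise ValueError("encoded length must be a multiple of WIDTH")
    chars = []
    for i in range(0, len(encoded), WIDTH):
        value = 0
        for digit in encoded[i:i + WIDTH]:
            value = value * 62 + ALPHABET.index(digit)
        # Python strings only hold code points up to 0x10FFFF.
        chars.append(chr(value))
    return "".join(chars)


names = ["b", "a", "變數", "ab", "é"]
encoded = [encode_identifier(n) for n in names]
# Sorting the encoded strings yields the same order as sorting by code point.
assert sorted(names) == [decode_identifier(e) for e in sorted(encoded)]
```

A fixed width keeps the sketch simple and makes the order-preserving property easy to verify, but it spends six characters on every code point, including plain ASCII letters. A practical format aimed at space efficiency, as the abstract describes, would presumably use shorter encodings for common characters.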

Key Words: Universal Character Set (UCS), programming languages, case sensitive, UTF-8, space efficiency, lexicographic sorting order.



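   The ISO 10646 Universal Character Set (UCS) is a 31-bit coding architecture that covers the symbols of most of the world's written languages. Identifiers in programming languages are usually defined with the alphanumeric characters of ASCII, which can mainly express only English words. One way around this problem is to encode identifiers into the ASCII alphanumeric range. In case-sensitive programming languages, using [0-9][A-Z][a-z] as the encoding range can give higher space efficiency for multilingual identifiers. This paper proposes a base62 transformation format of ISO 10646, called UTF-62. Strings produced by UTF-62 lie within the [0-9][A-Z][a-z] range, 62 characters in total. UTF-62 also preserves the lexicographic sorting order of UCS-4.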