Wu, P.-C., "A Page-Shift Transformation Format of ISO 10646," Software-Practice & Experience, Vol. 32, No. 1, Jan. 2002, pp. 73-82. (SCI Expanded, EI)

Paper title (Chinese): ISO 10646頁切換轉換格式

Abstract

The ISO 10646 Universal Character Set (UCS), or Unicode, covers the symbols of most of the world's written languages. There are various UCS transformation formats (UTFs). UTF-8 is compatible with systems that assume 8-bit characters, but one of its problems is space efficiency: for files consisting mostly of Asian characters such as Han ideographs, file sizes increase by about 50% under UTF-8. Although the Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale-specific character set, it is complicated and is not intended to serve as a general-purpose interchange format. This paper proposes a page-shift transformation format of ISO 10646, called UTF-S. There are four pages, numbered 0 to 3, holding 1-byte, 2-byte, 3-byte, and 4-byte codes, respectively. A shift to page 0 uses the special code (00)₁₆; shifts to pages 1, 2, and 3 use the ISO 2022 shift codes SO, SS2, and SS3, respectively. We test several text files and compare these UTFs with Big5, a locale-specific character set. The results show that the space efficiency of UTF-S is better than that of UTF-16 and UTF-8 and is close to that of SCSU. UTF-S is suitable for replacing locale-specific character sets with ISO 10646 in Internet applications such as the World Wide Web.
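The size argument in the abstract can be illustrated with a small Python sketch. The Big5, UTF-8, and UTF-16 figures come from Python's standard codecs. The page-shift figure is only a rough estimate under assumptions that are not taken from the paper: each character is placed on the narrowest page whose width can hold its code point, every shift code costs one byte, and the stream starts on page 0. The names `page_of`, `estimate_page_shift_size`, and `byte_sizes` are hypothetical, and the sketch does not reproduce the actual UTF-S byte layout, which the abstract does not specify.

```python
def page_of(code_point: int) -> int:
    """Assumed rule (not from the paper): the narrowest page, 0 to 3,
    whose (page + 1)-byte width can hold the code point."""
    if code_point < 0x100:
        return 0
    if code_point < 0x10000:
        return 1
    if code_point < 0x1000000:
        return 2
    return 3


def estimate_page_shift_size(text: str, initial_page: int = 0) -> int:
    """Estimate the byte count of a page-shift encoding.

    Assumptions: one byte per shift code, and a shift is emitted only
    when the page of the next character differs from the current page.
    """
    size = 0
    page = initial_page
    for ch in text:
        target = page_of(ord(ch))
        if target != page:
            size += 1          # one-byte shift code (assumed)
            page = target
        size += target + 1     # page k holds (k + 1)-byte codes
    return size


def byte_sizes(text: str) -> dict:
    """Byte counts of the same text under several encodings."""
    return {
        "big5": len(text.encode("big5")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte-order mark
        "page-shift (estimate)": estimate_page_shift_size(text),
    }


if __name__ == "__main__":
    # Four Han ideographs: 8 bytes in Big5 and 12 bytes in UTF-8,
    # which is the roughly 50% increase mentioned in the abstract.
    for name, size in byte_sizes("漢字編碼").items():
        print(f"{name}: {size} bytes")
```

Under these assumptions, a run of Han ideographs costs two bytes per character plus a single shift byte, which is why a shift-based format can stay close to the size of a locale-specific character set such as Big5.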

Key Words: Universal Character Set (UCS), ISO 2022, fixed-width encoding, space efficiency, Internet, World Wide Web.

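Abstract (translated from the Chinese)

The ISO 10646 Universal Character Set (UCS), or Unicode, covers the written symbols of almost all languages. UCS has many transformation formats (UTFs). UTF-8 is compatible with systems that support 8-bit characters. One of the problems of UTF-8 is its space efficiency. For files consisting mostly of Asian characters such as Han ideographs, the file size increases by about 50% with UTF-8. Although the Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale-specific character set, SCSU is relatively complex and is not designed as a general-purpose interchange format. This paper proposes a page-shift transformation format of ISO 10646, called UTF-S. The format is divided into four pages: 1-byte, 2-byte, 3-byte, and 4-byte. Shifting to page 0 uses the special code (00)₁₆; shifting to pages 1, 2, and 3 uses the ISO 2022 shift codes SO, SS2, and SS3, respectively. We test several text files and compare these UTF formats with the locale-specific character set Big5. The results show that UTF-S surpasses UTF-16 and UTF-8 in space efficiency and approaches SCSU. UTF-S is suitable for replacing locale-specific character sets with ISO 10646 in Internet applications such as the World Wide Web.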

 

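Key Words (translated from the Chinese): Universal Character Set (UCS), ISO 2022, fixed-width encoding, space efficiency, Internet, World Wide Web.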