Character Sets and Internal Codes

  1. Unicode Home Page. The Unicode Consortium's homepage for information on Unicode, the new international standard. The current version of the Unicode standard is 4.0.1, with 4.0 being the major revision since version 3.2. Version 4.0 of the Unicode Standard adds 1,226 new character assignments over and above what was in Unicode 3.2. These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use. (The earlier 3.2 version had a total of 95,156 encoded characters; its primary feature was the addition of 1,016 new encoded characters to the earlier Unicode 3.1 standard, which had 94,140 encoded characters, having nearly doubled in size from a total of 49,194 encoded characters in Unicode 3.0. Regarding font support, Wenlin 3.0 software supports Unicode 3.1. The Arial Unicode MS font (a Windows TrueType font, see below) supports the earlier Unicode 2.1 standard, with some 40,000 encoded characters. For more on font support, see, for example, BabelStone 1357's Unicode Fonts page, with information on the ranges covered by the following fonts: Arial Unicode MS, Bitstream CyberBit, Code2000, Code2001, SIL Yi, SimSun-18030, and TITUS CyberbitBasic.)
  2. Ideographic Rapporteur Group (IRG). This is the working group that processes CJKV characters proposed for inclusion in Unicode. (thanks to Eric Rasmussen)
  3. Arial Unicode MS Font. This TrueType Unicode font for Windows (<aruniupd.exe>, version 0.86, 13.7 MB) *was* downloadable from Microsoft's website, where the following description of the font was given: "The font Arial Unicode MS is a full Unicode font, containing all of the approximately 40,000 alphabetical characters, ideographic characters, and symbols defined in the Unicode 2.1 standard." The Arial Unicode MS font is bundled with Microsoft's software: MS Office 2000, MS Office 2000 Premium, MS Office XP, MS Access 2000, MS Outlook 2000, MS PowerPoint 2000, MS Publisher 2000/2002, MS FrontPage 2000, MS Internet Explorer 5.5 (and above?), etc. (Some of the software may have been bundled with version 0.84 of the font, which supports the earlier Unicode standard 2.0. A search on the web for "aruniupd.exe" might yield a downloadable copy, such as at the Orwell.ru and KCUA sites. It is also available from the CHILDES Project, on their CHILDES Tools webpage (the "Arial Unicode" font).)
    The Arial Unicode MS font is probably still the font that supports the broadest range of characters in the Unicode standard, containing at least all of the approximately 40,000 alphabetical characters, ideographic characters, and symbols that are defined in the Unicode 2.1 standard. Given its bulk -- 23.6 MB when installed, containing 51,180 glyphs -- Microsoft recommends that it be used only when one cannot use multiple fonts tuned for different writing systems. However, precisely because of its enormous multilingual scope -- far surpassing the earlier 13 MB, 26,218-glyph Bitstream Cyberbit font, for example -- the Arial Unicode MS font has opened up possibilities for library cataloguers, such as those at the Ohio State University, to display online library catalog entries in webpages with Unicode-encoded CJK and other non-Roman characters using just one Unicode font. (The font also displays fine under Windows 98.)
  4. Unihan Database. On-line, searchable database that is part of the Unicode Standard. The database contains image maps for the "unified Han" set of logographic characters for Chinese (including dialect characters), Japanese, Korean and historical Vietnamese; comparative encoding info on various schemes, i.e., mappings to major standards (GB 2312, GB 12345, CNS 11643, CCCII, Big5, JIS X 0208, JIS X 0212, KS C 5601, KS C 5657) and other mappings (PRC Telegraph, ROC Telegraph, EACC, Xerox); dictionary definitions; and indices to authoritative dictionaries (Kangxi, Morohashi, Dae Jaweon, Hanyu Da Zidian, Nelson, Matthews, Karlgren, Fenn, Cowles, Meyer-Wempe). For a useful reference, see A User's Guide to the Unihan Database prepared by John H. Jenkins and Richard Cook. The current Unihan Database 3.1 supports the Unicode Standard 3.1. To access the database via the Unicode numbers, go to the Unihan 3.1 Grid Index. For example, the Unihan 3.1 Index for U+4E00 through U+4EFF contains yi 'one' as the first entry. Note: Click on the individual characters in the Grid Chart to obtain full information on each character. To access the database using radical and stroke count, go to the Unihan 3.1 Radical-Stroke Index and select the radical, indexed by number of strokes in the radical. For example, the 'man' radical is composed of two strokes, and is located in the set of radicals with 2 as the number of strokes in the radical. Included under the character entries, besides the encodings and meanings, is a list of Chinese compounds that are drawn largely -- but not exclusively -- from the CEDICT dictionary file. (The CEDICT dictionary project, begun by Paul Denisowski, is now maintained at Erik Peterson's On-line Chinese Tools website.) The Unihan Database is maintained by John H. Jenkins, at Apple Computer, Inc. (Thanks to a tip from Richard S. Cook.)
    See also "Naming of the Kangxi Radicals," prepared by John Jenkins and Wang Xiaoming (1997-02-23), an MS DOC file (N449.doc) that is downloadable from the IRG Reports webpage, part of the ISO/IEC JTC1/SC2/WG2/IRG Ideographic Rapporteur Group website at the Chinese U. of Hong Kong. Note: The University at Albany's course, EAS205: East Asian Research and Bibliographic Methods, has a handy chart of The 214 Kangxi Radicals in PDF format, with pronunciation given in Pinyin romanization followed by a gloss and some examples.
  5. Microsoft: Font Properties Extension. Freely-downloadable extension to the properties dialog box for Win9x/2000/NT (current version 2.1) to enable right-clicking on a font file to see its basic properties displayed, including version, creation and modification dates, compiler, vendor, and copyright, as well as such handy information as number of glyphs, font-encoding type, supported Unicode ranges, code pages supported by extended character sets, etc.
  6. A Brief History of Character Codes in North America, Europe, and East Asia. Steven J. Searle (U. of Tokyo) gives an historical overview that includes CJK codes and Unicode, and provides some links to info on 8-bit character sets.
  7. A Tutorial on Character Code Issues. Jukka Korpela's website that "tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context." (Updated URL (09/16/01), thanks to Charles Benoit)
  8. Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications. Alan Wood's must-visit site for information on Unicode resources. Includes info on CJK, tips on using characters from Unicode fonts in Microsoft's Word 97/2000/2002 by picking them from the Symbol dialog box; list of fonts and information on them, including Unicode fonts for Chinese; info on Unicode and multilingual font and keyboard utilities; creating multilingual web pages: Unicode support in HTML, HTML editors and web browsers; links to other Unicode information websites; etc.
  9. Two New Coded Character Standards from China and their Implementation: HK SCS & GB 18030. Abstract of a presentation by Dirk Meyer (Adobe Systems, Inc.) for the Eighteenth International Unicode Conference (24-27 April 2001, Hong Kong). (See other abstracts in the Conference Program pertaining to CJK and Unicode at the conference.)
  10. Unicode Codepages. Codepages from Microsoft's website. (Thanks to tip from Dominic Beecher.)
  11. "How to See (CJK) UTF-8 in a Browser". Erik Peterson's instructions, including downloading sites for Unicode-based fonts. (See also my Chinese Language Software section.)
  12. Fonts for DOS/Windows/Mac (part of my ChinaLinks2 that includes freely-downloadable and commercially-available Unicode fonts)
  13. Hong Kong's Information Technology Services Dept. (thanks to Thomas Chan) (Big5)
  14. Cantonese Vernacular Characters and RichWin97 vs RichWin2000. My webpage exploring some issues pertaining to inputting and displaying of characters from the extended Big5 character set using different Chinese encoding/decoding software (Richwin, NJStar Communicator) and Unicode fonts.
  15. Non-Unicode Characters/Components in Wenlin. Wenlin Institute's list of characters and components included in Wenlin software (for language learners) that are not (yet) found in Unicode.
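
The Unihan Grid Index lookup described in item 4 can be illustrated with a few lines of Python. This is only a minimal sketch: it assumes that each grid page covers one 256-character block (as the U+4E00 through U+4EFF example suggests), and the function name grid_block is hypothetical, not part of the Unihan Database or of any library. It also prints the UTF-8 bytes of the character, which is the form a browser receives when a page is served as UTF-8 (compare item 11).

```python
import unicodedata


def grid_block(code_point):
    """Return (first, last) of the 256-character block holding code_point.

    Hypothetical helper: assumes one grid page per 256 code points.
    """
    first = code_point & ~0xFF  # clear the low 8 bits
    return first, first + 0xFF


ch = "\u4e00"  # yi 'one', the first entry of the U+4E00 grid page
first, last = grid_block(ord(ch))

print(f"U+{ord(ch):04X} is on grid page U+{first:04X}..U+{last:04X}")
print("Character name:", unicodedata.name(ch))
print("UTF-8 bytes:", ch.encode("utf-8").hex(" "))
```

Any other code point can be passed to grid_block in the same way, for instance 0x4EBA for the two-stroke 'man' character, which falls on the same grid page.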