GLYPHLISTS AND FONT ENCODING HEURISTICS IN FREETYPE/2


PART I - OVERVIEW OF GLYPHLISTS AND FONT ENCODING

The displayable character forms (known as 'glyphs') within any OpenType font
may be stored in an arbitrary physical order inside the file.  So how do you
find the character you want?  Every OpenType font includes one or more
'CMAPs' (character maps), which are essentially look-up tables based on
well-known encoding standards.  The main one is Unicode, which essentially
all OpenType fonts support.

An OS/2 font driver like FreeType/2 needs to know what CMAP(s) any given font
file supports, and be able to deal with it/them as needed.  However, the font
driver will be looking up characters based on what the Graphics Engine (GRE)
layer of OS/2 Presentation Manager asks it for... and GRE doesn't really know
or care what encoding the font uses.

Instead, GRE has its own way of indexing characters, known as a 'glyphlist'.  
It is therefore up to the font driver (i.e. FreeType/2) to translate an OS/2
glyphlist index into a CMAP index appropriate for the font in use.

The general logic for character look-up between the font and GRE therefore 
goes something like this:

   [font file] <------------->  [font driver]  <------------->  [GRE/GPI]
                glyph indexed                   glyph indexed
                   by CMAP                       by glyphlist

GRE actually supports several different glyphlists, each of which covers a
certain set of characters, organized in a certain way.  Some of these are
based on well-known character encodings, so it is possible that the font
may actually be using a CMAP which is identical to one of these glyphlists.
In theory, the font driver's job is easy in such cases - just use the index
as-is.  In practice, however, things aren't always so tidy (more on that in
a bit).

In general, the font driver needs to decide what glyphlist to use for each
font, then internally translate it (if necessary) to and from the font's own 
CMAP on the fly.  We want this conversion to be as simple and efficient as
possible, so generally we choose a glyphlist that is very close (or even
identical) to one of the font's CMAPs.  We must also keep in mind, though,
that the glyphlist is the ONLY way we have of passing characters from the
font back to Presentation Manager.  If the glyphlist we use doesn't
include a particular character, then it will be inaccessible to OS/2, even
if the font itself includes it.  Each glyphlist supports a somewhat different
set of characters, although there is a good deal of overlap.

Therefore, choosing the correct glyphlist is one of the most critical parts 
of the font driver.  It is very important that we choose the best glyphlist 
such that the maximum range of characters from the font will be accessible.
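
The translation step described above can be pictured as two table lookups.
The following Python fragment is only an illustrative sketch, not code from
FreeType/2: the function name, the table names and every index value in the
tables are hypothetical.

```python
# Hypothetical sketch: translate a glyphlist index to a font glyph ID.
# All table contents below are invented for illustration.

# Glyphlist index -> Unicode codepoint.  A real driver would hold a full
# table for the chosen glyphlist; only three entries are shown here.
GLYPHLIST_TO_UNICODE = {
    1: 0x263A,   # made-up index for WHITE SMILING FACE
    65: 0x0041,  # made-up index for LATIN CAPITAL LETTER A
    66: 0x0042,  # made-up index for LATIN CAPITAL LETTER B
}

# Unicode codepoint -> glyph ID, as it might be read from the font's CMAP.
FONT_CMAP = {0x0041: 36, 0x0042: 37}

MISSING_GLYPH = 0  # glyph ID 0 is conventionally the "missing glyph"


def glyph_for_index(glyphlist_index):
    """Return the font glyph ID for an index requested by GRE."""
    codepoint = GLYPHLIST_TO_UNICODE.get(glyphlist_index)
    if codepoint is None:
        # The glyphlist has no such character, so GRE cannot ask for it.
        return MISSING_GLYPH
    # The glyphlist covers the character, but the font may still lack it.
    return FONT_CMAP.get(codepoint, MISSING_GLYPH)
```

The sketch shows both ways a character can be lost: the glyphlist may not
cover it (first lookup fails), or the font may not contain it (second lookup
fails).  Only the first case depends on which glyphlist is chosen.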

The Graphics Engine supports the following glyphlists in OS/2 version 4.5: 

  "PM383"   - Native OS/2 font; glyphs are indexed by Extended UGL codepoint.

  "PMUGL"   - Either an alias for PM383, or possibly an updated version of it.

  "UNICODE" - Unicode font; glyphs are indexed by UCS codepoint.

  "PMJPN"   - Japanese font; glyphs are indexed by Extended UGL up to codepoint
              949, and Shift-JIS codepoint for all higher values.

  "PMKOR"   - Korean font; glyphs are indexed by Extended UGL up to codepoint
              949, and Wansung/KSC codepoint for all higher values.

  "PMCHT"   - Traditional Chinese font; glyphs are indexed by Extended UGL up 
              to codepoint 949, and Big-5 codepoint for all higher values.

  "PMPRC"   - Simplified Chinese font; glyphs are indexed by Extended UGL up 
              to codepoint 949, and GB-2312 codepoint for all higher values.

  "SYMBOL"  - Not documented, but apparently equivalent to ""

  ""        - Passthrough font; no indexing is done by GPI (the application
              is assumed to know the correct glyph index within the font file)
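
The split indexing used by the four CJK glyphlists is easier to see in code.
This is a rough Python sketch based only on the list above; the function and
table names are hypothetical, and it assumes that "up to codepoint 949" is
inclusive.

```python
# Sketch: which code space does a glyphlist index belong to?
# Assumption: index 949 itself still counts as Extended UGL.
UGL_LIMIT = 949

# DBCS encoding used above UGL_LIMIT, per glyphlist (from the list above).
DBCS_ENCODING = {
    "PMJPN": "Shift-JIS",
    "PMKOR": "Wansung/KSC",
    "PMCHT": "Big-5",
    "PMPRC": "GB-2312",
}


def index_space(glyphlist, index):
    """Name the code space in which `index` is interpreted."""
    if glyphlist in ("", "SYMBOL"):
        return "font glyph index (passthrough)"
    if glyphlist == "UNICODE":
        return "UCS"
    if glyphlist in ("PM383", "PMUGL"):
        return "Extended UGL"
    if glyphlist in DBCS_ENCODING:
        if index <= UGL_LIMIT:
            return "Extended UGL"
        return DBCS_ENCODING[glyphlist]
    raise ValueError("unknown glyphlist: %r" % glyphlist)
```

For example, index_space("PMJPN", 0x82A0) names Shift-JIS, while
index_space("PMJPN", 65) names Extended UGL.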

As you can see, Unicode (UCS-2) is actually a supported glyphlist.  This 
basically supports any character that a font could contain, so why not always
use the UNICODE glyphlist for fonts with Unicode CMAPs?

Well, as noted above, things aren't always that simple.  We can certainly
do this, and in fact it is a configurable option in the driver.  However,
use of the UNICODE glyphlist has some odd interactions with two flags that
can be set in the font metrics reported to GRE: the 'DBCS' and 'MBCS' flags.

When UNICODE is in use and the DBCS/MBCS flags are set:
 - There is a slightly higher processing overhead, which may result in a
   measurable slowdown when rendering large quantities of text.
 - DBCS glyph substitution (the substituting of unsupported DBCS glyphs from
   a different, designated font) is not available for text being displayed in
   a Unicode font.

Conversely, when UNICODE is in use and the DBCS/MBCS flags are NOT set:
 - Many characters outside the basic ASCII range (particularly graphic
   characters) will not display when using a Unicode font, regardless of the
   currently-active display codepage.

Some additional points should be made with regard to DBCS glyph substitution.
This feature is enabled through the "PM_SystemFonts" -> "PM_AssociateFont" 
key in OS2.INI, and applies throughout Presentation Manager.  (It is normally
set only on DBCS versions of OS/2, but it works on all versions.)
 - Only a font using a glyphlist of "PMJPN", "PMKOR", "PMCHT" or "PMPRC" can
   be defined as the substitution (association) font.
 - When this feature is enabled, only a font using the UNICODE glyphlist will
   be capable of rendering text under one of the Unicode codepages (1200, 
   1207 or 1208) via PM or GPI functions.  (For unknown reasons, this
   limitation does not exist when glyph substitution is disabled.)

All of this makes the job of choosing the best glyphlist rather tricky.



When reading a TrueType font, the FreeType/2 installable font driver (IFI), 
FREETYPE.DLL, uses some fairly complex logic to determine the best glyphlist 
to pass to GPI (generating it from the cmap via a conversion process if 
necessary).  A rough overview of this logic is provided below for each major
version.

First, some common definitions.  FreeType/2 defines the following encoding
schemes:

TRANSLATE_UGL:
  Font has a Unicode cmap; convert the cmap indices to UGL and set the GPI 
  glyphlist to "PM383".

TRANSLATE_UNICODE:
  Font has a Unicode cmap; do not perform any conversion and report the 
  glyphlist to GPI as "UNICODE".

TRANSLATE_UNI_SJIS:    
TRANSLATE_UNI_BIG5:    
TRANSLATE_UNI_GB2312:  
TRANSLATE_UNI_WANSUNG: 
  Font has a Unicode cmap, but contains support for one (and only one) of these
  specific CJK character sets; convert the cmap indices accordingly (using the
  appropriate codepage table), and set the GPI glyphlist to "PMJPN"/"PMCHT"/
  "PMPRC"/"PMKOR".

TRANSLATE_SYMBOL:
  Font is a symbol font; do not perform any conversion and report the glyphlist
  to GPI as "SYMBOL".

TRANSLATE_SJIS:
TRANSLATE_BIG5:
TRANSLATE_GB2312:
TRANSLATE_WANSUNG:
  Font has one of these CJK cmaps; do not perform any conversion and report 
  the glyphlist to GPI as "PMJPN"/"PMCHT"/"PMPRC"/"PMKOR".

Fallback logic (a.k.a. "none of the above"):
  Do not perform any conversion but set the GPI glyphlist to "PM383".  The font
  is therefore assumed to have a UGL-compatible cmap.  Obviously, this is not a
  very desirable option, and should only be used as the last resort.
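
The definitions above boil down to two facts per scheme: which glyphlist is
reported to GPI, and whether the cmap indices are converted first.  The
Python table below restates them as a sketch.  The table layout, the helper
names and the "FALLBACK" label are hypothetical; the scheme names and
glyphlists are taken from the definitions above.

```python
# Sketch: scheme -> (glyphlist reported to GPI, conversion applied or None).
# "FALLBACK" is a made-up label for the "none of the above" case.
SCHEMES = {
    "TRANSLATE_UGL":         ("PM383",   "Unicode -> UGL"),
    "TRANSLATE_UNICODE":     ("UNICODE", None),
    "TRANSLATE_UNI_SJIS":    ("PMJPN",   "Unicode -> Shift-JIS"),
    "TRANSLATE_UNI_BIG5":    ("PMCHT",   "Unicode -> Big-5"),
    "TRANSLATE_UNI_GB2312":  ("PMPRC",   "Unicode -> GB-2312"),
    "TRANSLATE_UNI_WANSUNG": ("PMKOR",   "Unicode -> Wansung"),
    "TRANSLATE_SYMBOL":      ("SYMBOL",  None),
    "TRANSLATE_SJIS":        ("PMJPN",   None),
    "TRANSLATE_BIG5":        ("PMCHT",   None),
    "TRANSLATE_GB2312":      ("PMPRC",   None),
    "TRANSLATE_WANSUNG":     ("PMKOR",   None),
    "FALLBACK":              ("PM383",   None),
}


def glyphlist_for(scheme):
    """Return the glyphlist name reported to GPI for a scheme."""
    return SCHEMES[scheme][0]


def needs_conversion(scheme):
    """Return True if cmap indices are converted before being reported."""
    return SCHEMES[scheme][1] is not None
```

Note that TRANSLATE_UGL and the fallback both report "PM383"; the only
difference is whether a conversion is actually performed.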


Version 1.0 and 1.10 (by Michal Necasek)

  In these versions, the general objective seems to be to avoid using the
  UNICODE glyphlist whenever possible, apparently due to concerns about the
  overhead.

  - DBCS_LIMIT is a number defined as 1024 (v1.0) or 2048 (v1.10).

  - If the font has a Unicode cmap AND more than DBCS_LIMIT characters, use 
    TRANSLATE_UNI_xxx if applicable, or TRANSLATE_UNICODE otherwise.

  - If the font doesn't have a Unicode cmap, or has a Unicode cmap but contains
    fewer than DBCS_LIMIT characters, then:
      - If the font has an Apple Roman cmap, use TRANSLATE_SYMBOL.
      - If the font has a CJK cmap, use the corresponding TRANSLATE_xxx.
      - If all else fails, use the fallback logic.
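
  The decision list above can be written as a small function.  This is a
  Python sketch of the overview as stated here, not a transcription of the
  driver source.  The FontInfo fields and the "FALLBACK" label are
  hypothetical, and a font with exactly DBCS_LIMIT characters is assumed to
  fall into the second case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FontInfo:
    """Hypothetical summary of what the driver has learned about a font."""
    has_unicode_cmap: bool
    glyph_count: int
    uni_cjk: Optional[str] = None   # e.g. "SJIS" if TRANSLATE_UNI_SJIS applies
    has_apple_roman_cmap: bool = False
    cjk_cmap: Optional[str] = None  # e.g. "BIG5" if the font has a Big-5 cmap


def choose_scheme_v1(font, dbcs_limit=2048):
    """Pick an encoding scheme following the v1.0/1.10 overview."""
    if font.has_unicode_cmap and font.glyph_count > dbcs_limit:
        if font.uni_cjk:
            return "TRANSLATE_UNI_" + font.uni_cjk
        return "TRANSLATE_UNICODE"
    # No Unicode cmap, or not more than dbcs_limit characters.
    if font.has_apple_roman_cmap:
        return "TRANSLATE_SYMBOL"
    if font.cjk_cmap:
        return "TRANSLATE_" + font.cjk_cmap
    return "FALLBACK"
```

  Passing dbcs_limit=1024 models v1.0; the default of 2048 models v1.10.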


Version 1.2x (by KO Myung-hun)

  This version provides a user-settable flag (Use_Unicode_Encoding; see
  Part IV) by which the user can force the UNICODE glyphlist to always be
  used, if possible.

  When the above flag is not set, some extra processing has been added to give 
  priority to use of the PMJPN/PMPRC/PMCHT/PMKOR glyphlists when the OS/2
  country code is set to one of these locales.

  The overall logic is:

  - AT_LEAST_DBCS_GLYPH is a number defined as 3072.

  - If the font has a Unicode cmap and the 'force Unicode' flag is set, use
    TRANSLATE_UNICODE.

  - Otherwise, if the font has a Unicode cmap AND more than AT_LEAST_DBCS_GLYPH
    characters:
     - If the system country code is set to a CJK locale, and the font contains
       a particular character for that CJK language, OR if the font has a CJK
       cmap, then use the corresponding TRANSLATE_UNI_xxx.
     - Otherwise, use TRANSLATE_UNICODE.

  - If the font doesn't have a Unicode cmap, or has a Unicode cmap but contains
    fewer than AT_LEAST_DBCS_GLYPH characters, then:
     - If the font has an Apple Roman cmap, use TRANSLATE_SYMBOL.
     - If the font has a CJK cmap, use the corresponding TRANSLATE_xxx.
     - If all else fails, use the fallback logic.
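
  As a rough illustration, the v1.2x rules above might be sketched in Python
  as follows.  This follows the overview only; it is not the driver's code.
  The FontInfo fields, the parameter names and the "FALLBACK" label are
  hypothetical, and a font with exactly AT_LEAST_DBCS_GLYPH characters is
  assumed to fall into the last case.

```python
from dataclasses import dataclass
from typing import Optional

AT_LEAST_DBCS_GLYPH = 3072


@dataclass
class FontInfo:
    """Hypothetical summary of what the driver has learned about a font."""
    has_unicode_cmap: bool
    glyph_count: int
    # CJK sets (e.g. "SJIS") whose characteristic character the font contains.
    probe_chars_found: frozenset = frozenset()
    cjk_cmap: Optional[str] = None  # e.g. "WANSUNG"
    has_apple_roman_cmap: bool = False


def choose_scheme_v12(font, force_unicode=False, system_cjk=None):
    """Pick a scheme; system_cjk is e.g. "SJIS" on a Japanese system."""
    if font.has_unicode_cmap and force_unicode:
        return "TRANSLATE_UNICODE"
    if font.has_unicode_cmap and font.glyph_count > AT_LEAST_DBCS_GLYPH:
        if system_cjk and system_cjk in font.probe_chars_found:
            return "TRANSLATE_UNI_" + system_cjk
        if font.cjk_cmap:
            return "TRANSLATE_UNI_" + font.cjk_cmap
        return "TRANSLATE_UNICODE"
    if font.has_apple_roman_cmap:
        return "TRANSLATE_SYMBOL"
    if font.cjk_cmap:
        return "TRANSLATE_" + font.cjk_cmap
    return "FALLBACK"
```

  The same large pan-CJK font can therefore receive a different glyphlist
  depending on the system locale, unless the 'force Unicode' flag is set.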

  There is one other major difference in this version: when a font uses the
  UNICODE glyphlist, it does not set the flags in the font metrics which
  indicate 'MBCS' or 'DBCS' support, unless the font contains at least
  AT_LEAST_DBCS_GLYPH glyphs.  This is contrary to the default behaviour in
  version 1.0/1.10, as well as in the original TRUETYPE.DLL from IBM.  (See
  the notes earlier in this file for the implications.)


PART II - NAME LOOKUP AND ENCODING

Every TrueType font has a 'name' table which contains strings that describe
the font.  The term 'name' is a bit misleading because the entries in this
table include not only the literal font names, but also things like the
copyright message, license text, and various other things.  

FreeType/2 searches the name table in order to determine the font family name
(e.g. "Arial") and the font face name (e.g. "Arial Bold Italic").

The contents of the name table may be encoded in a number of different ways,
and/or provided in several languages.  As with cmaps, it is incumbent on the
font driver to choose the best name out of those available to present to the
operating system.

FreeType/2 searches for font family and face names using the following logic.

Version 1.0 and 1.10 (by Michal Necasek)

  This version looks for a name matching any of these criteria:
    - Unicode-encoded, English language
    - Unicode-encoded, CJK language
    - Native-DBCS encoding, CJK language
  and takes the first one found.  Unfortunately, there is a major weakness in 
  how these names are passed to GPI: if the name is Unicode-encoded, FreeType/2
  simply strips out the high-order bytes and treats the remainder as ASCII.  
  This works acceptably well if the font name is pure ASCII, but it causes names
  containing non-ASCII characters to be mangled into unintelligibility.
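
  The effect of stripping the high-order bytes is easy to reproduce.  The
  Python sketch below is not the driver's code; it simply assumes that the
  Unicode name is stored as big-endian UTF-16 (as in the TrueType name
  table) and that every second byte is kept.  The function name is
  hypothetical.

```python
def strip_high_bytes(utf16be_name: bytes) -> str:
    """Keep only the low-order byte of each 16-bit code unit."""
    low_bytes = utf16be_name[1::2]
    # Latin-1 maps every byte value to one character, so nothing is lost
    # when displaying the result.
    return low_bytes.decode("latin-1")


ASCII_NAME = "Arial".encode("utf-16-be")
# U+660E U+671D is the Japanese word "Mincho", a common typeface name.
CJK_NAME = "\u660e\u671d".encode("utf-16-be")

# strip_high_bytes(ASCII_NAME) gives 'Arial'
# strip_high_bytes(CJK_NAME) gives the control characters '\x0e\x1d'
```

  An ASCII name survives because its high-order bytes are all zero.  The
  two-character Japanese name collapses to two unprintable control
  characters.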

Version 1.2x (by KO Myung-hun)

  The logic is somewhat more intelligent in this version.  However, it still
  does not accept names in languages other than English, Chinese, Japanese, or
  Korean.

  The logic is:
    - If a Microsoft-platform CJK Unicode name which matches the current system
      codepage is found, use that.
    - Otherwise, if an English Unicode name, or a native DBCS-encoded CJK name
      is found, use the first one found.  In the former case, assume that it is
      ASCII-encoded Unicode, and strip out all high-order bytes.  This logic is
      theoretically vulnerable to the same flaw as above, except that it is only
      applied to names that are reported as English-language (and English font
      names do not usually contain non-ASCII characters).
    - Otherwise, as a fallback, if there are any Microsoft-platform CJK Unicode
      names, simply take the first one (of any language) found, regardless of the
      current system codepage.

  Later updates in KO Myung-hun's code (not included in any public FreeType/2 
  v1.2x release) have replaced the above with much cleaner logic:
    - Search for a Microsoft-platform Unicode name, in order of preference:
        - A CJK language name that matches the current system codepage
        - A US-English name
        - The first name found
      For CJK names, convert to the appropriate predetermined DBCS codepage.  
      For any other language, convert it to the current process codepage.
    - If no Microsoft-platform Unicode name was found, search for a Microsoft-
      platform name under other encodings.  If a CJK language name that matches
      the current system codepage is found, OR if a symbol encoding was found,
      select it and return the string verbatim.
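
  The order of preference in the later logic can be sketched as a ranking
  function.  This Python fragment is illustrative only: the NameRecord
  fields and the language tags are hypothetical stand-ins for the platform,
  encoding and language IDs in the name table, and the codepage conversion
  step is left out.

```python
from dataclasses import dataclass


@dataclass
class NameRecord:
    """Hypothetical, simplified view of one name table entry."""
    platform: str  # "microsoft" or anything else
    encoding: str  # "unicode", "dbcs" or "symbol"
    language: str  # e.g. "ja", "ko", "en-US"
    text: str


def pick_name(records, system_cjk_language=None):
    """Return the preferred record, or None if nothing qualifies."""
    ms = [r for r in records if r.platform == "microsoft"]
    unicode_names = [r for r in ms if r.encoding == "unicode"]
    if unicode_names:
        def rank(record):
            if system_cjk_language and record.language == system_cjk_language:
                return 0  # CJK name matching the system codepage
            if record.language == "en-US":
                return 1  # US-English name
            return 2      # anything else; the first one found wins
        # min() returns the first record among those with the lowest rank.
        return min(unicode_names, key=rank)
    for record in ms:
        matches_system = bool(system_cjk_language) and (
            record.language == system_cjk_language)
        if matches_system or record.encoding == "symbol":
            return record  # this string would be returned verbatim
    return None
```

  Ranking the candidates, rather than chaining special cases, is what makes
  this version easier to follow than the released v1.2x logic.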


PART III - CHANGES FOR VERSION 1.3x

1. In order to avoid the problem of Unicode fonts not being usable for DBCS 
   font association, any font which is currently defined as PM_AssociateFont is
   automatically detected on DLL initialization.  Rather than being assigned a
   glyphlist of UNICODE, it is assigned the appropriate CJK DBCS glyphlist 
   instead (some heuristics are necessary to determine the correct one).

   It should be noted that turning off Unicode support for the font will (of
   course) render any characters outside that particular CJK character set 
   inaccessible.  This is unlikely to be a problem if the font is a dedicated
   CJK font to begin with, but the documentation should discourage the user
   from using pan-Unicode fonts (like Times New Roman MT 30/WT xx) for font
   association.

   This behaviour is configurable via the INI file (see below), although it
   defaults to enabled.

2. Given the myriad and highly subjective advantages/disadvantages of using the
   UNICODE glyphlist, the user should be given more control over the selection
   logic, via INI file settings.  (These should preferably be configurable 
   through some kind of GUI.)

   For the current release, these additional settings have been added under the
   "FreeType/2" application in OS2.INI:

   Set_Unicode_MBCS - INT (0 | 1)
     1:  Always report UNICODE fonts as DBCS- and MBCS-capable.  This allows
         support for SBCS graphics characters, but means that font association
         will not work with text rendered in this font.  This is the behaviour 
         of IBM's TRUETYPE.DLL, and earlier versions of FreeType/2.
     0:  Only report a Unicode font as DBCS-capable if it contains more
         than AT_LEAST_DBCS_GLYPH glyphs.  This allows font association 
         to work in conjunction with this font, but breaks the rendering
         of many SBCS graphic characters (apparently due to a bug in GPI).
         This is the behaviour of FreeType/2 as of version 1.2x.
   Default: 1
   This needs to be easily switchable because the best behaviour is likely to
   differ depending on the environment.  1 (report as MBCS) would almost
   certainly be the most useful setting on systems which are not likely to use
   font association (most SBCS systems).  However, on DBCS systems, 0 might be
   more useful as it allows font association to work more effectively.

   Force_DBCS_Association - INT (0 | 1)
     1:  Never use the UNICODE glyphlist for the font currently in use as the PM
         DBCS "association" (glyph substitution) font, but assign a suitable DBCS
         glyphlist instead (as outlined above).
     0:  Ignore the current association font settings when determining the 
         glyphlist for each font.  (This corresponds to the behaviour of all
         prior versions of FreeType/2.)
   Default: 1

   Force_DBCS_Combined - INT (0 | 1)       <-- Added in version 1.3.2
     1:  Never use the UNICODE glyphlist for any font which is listed in OS2.INI
         PM_ABRFiles (that is, forwarded to a "combined" font wrapper), but
         assign a suitable DBCS glyphlist instead (as outlined above).
     0:  Ignore the defined combined fonts when determining glyphlists.  (This
         corresponds to the behaviour of all prior versions of FreeType/2.)
   Default: 1

   Some additional settings are under consideration; see PART IV (FUTURE
   PROPOSALS) below.
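
   The way these three settings combine can be sketched as follows.  This
   Python fragment is an illustration under stated assumptions, not the
   driver's implementation: the function names are hypothetical, the
   settings are assumed to have already been read from OS2.INI into a
   dictionary, and the threshold test uses "more than" as in the
   Set_Unicode_MBCS description.

```python
AT_LEAST_DBCS_GLYPH = 3072

# Defaults as documented above.
DEFAULTS = {
    "Set_Unicode_MBCS": 1,
    "Force_DBCS_Association": 1,
    "Force_DBCS_Combined": 1,
}


def load_settings(ini_values):
    """Overlay values read from the INI file onto the defaults."""
    settings = dict(DEFAULTS)
    for key, value in ini_values.items():
        if key in settings:
            settings[key] = 1 if int(value) else 0
    return settings


def report_as_mbcs(settings, glyph_count):
    """Should a font using the UNICODE glyphlist be flagged DBCS/MBCS?"""
    if settings["Set_Unicode_MBCS"]:
        return True
    return glyph_count > AT_LEAST_DBCS_GLYPH


def must_use_dbcs_glyphlist(settings, is_association_font, is_combined_font):
    """Should the font get a CJK DBCS glyphlist instead of UNICODE?"""
    if settings["Force_DBCS_Association"] and is_association_font:
        return True
    if settings["Force_DBCS_Combined"] and is_combined_font:
        return True
    return False
```

   With an empty INI section, every Unicode font is flagged as MBCS-capable
   and both association and combined fonts are kept off the UNICODE
   glyphlist.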

3. The name-selection logic has been tweaked somewhat.  First, 932 and 942 are
   now recognized as Japanese codepages, and 1386 is recognized as a Simplified
   Chinese codepage.

   Second, a simple (and not-very-intelligent) sanity check was added to make 
   sure that a reported DBCS-encoded name string actually is in the reported
   DBCS encoding, and not in Unicode.  This is mainly a workaround for some of
   the Japanese system fonts included as part of OS/2 Warp-J (RF*.TTF), which
   inexplicably put Unicode-encoded strings under the Shift-JIS encoding flag.
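
   The exact check used by the driver is not described here.  One simple
   heuristic of this kind, shown below as a Python sketch, relies on the
   fact that a zero byte never occurs inside Shift-JIS, Big-5, GB-2312 or
   Wansung text, whereas UTF-16 text containing any ASCII character always
   has one.  The function and table names are hypothetical, and the
   codepage table lists only the codepages mentioned above.

```python
# Codepages newly recognized in this version (others are omitted here).
CODEPAGE_LANGUAGE = {
    932: "Japanese",
    942: "Japanese",
    1386: "Simplified Chinese",
}


def language_for_codepage(codepage):
    """Return the CJK language for a codepage, or None if not listed."""
    return CODEPAGE_LANGUAGE.get(codepage)


def looks_like_unicode(raw_name: bytes) -> bool:
    """Guess whether a supposedly DBCS-encoded name is really UTF-16.

    Assumed heuristic: any zero byte means the string cannot be valid
    DBCS text, so it is probably UTF-16 with ASCII characters in it.
    """
    return b"\x00" in raw_name
```

   This is deliberately crude.  A Unicode name made up entirely of CJK
   characters contains no zero bytes and would slip through.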

4. As of version 1.31, the FdQueryFullFaces API has been implemented.  This
   enables use of the GpiQueryFullFontNames function with TrueType fonts.

5. Various bugfixes and other tweaks have been made.


PART IV - FUTURE PROPOSALS 

1. In accordance with the goal of providing better control over how fonts are
   assigned the UNICODE glyphlist, some more INI file settings might be
   considered.

   Use_Unicode_Encoding (modified behaviour) - INT
     2:  Always use UNICODE for Unicode fonts.  This is the current behaviour 
         when Use_Unicode_Encoding==TRUE
     1:  Always use UNICODE, unless TRANSLATE_UNI_SJIS, TRANSLATE_UNI_BIG5, 
         TRANSLATE_UNI_GB2312, or TRANSLATE_UNI_WANSUNG is indicated (in which
         case, use the corresponding DBCS glyphlist).  This is basically the
         current behaviour when Use_Unicode_Encoding==FALSE (as per value 0) 
         but without the AT_LEAST_DBCS_GLYPH condition.
     0:  Only use Unicode if the font contains at least AT_LEAST_DBCS_GLYPH
         glyphs, and the font is not TRANSLATE_UNI_xxx (as per value 1).  This 
         is the current behaviour when Use_Unicode_Encoding==FALSE.
   Recommended default: 2
   Some day it may also be desirable to add another value (e.g. 3) which
   causes FreeType/2 to actually generate a UNICODE glyphlist from a
   non-Unicode cmap, essentially converting a non-Unicode font into a Unicode
   one via simulation.  But this is not a short-term objective.

   <font filename> - DWORD
     Allow flags to be set for specific fonts.  At present, one flag should be
     supported: cause UNICODE to be always used for this specific font,
     regardless of the current setting of Use_Unicode_Encoding.
     (In practice, it would only be meaningful if Use_Unicode_Encoding < 2.)
   Recommended default: none
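
   To make the proposal concrete, the three values and the per-font flag
   could combine as in the Python sketch below.  Nothing here exists in the
   driver yet: the function name, the flag constant and its bit value are
   all hypothetical, and the sketch assumes the font already has a Unicode
   cmap.

```python
AT_LEAST_DBCS_GLYPH = 3072

# Hypothetical bit in the proposed per-font DWORD.
FONT_FLAG_FORCE_UNICODE = 0x00000001


def use_unicode_glyphlist(setting, glyph_count, uni_cjk_indicated,
                          font_flags=0):
    """Decide whether a Unicode font gets the UNICODE glyphlist.

    setting            proposed Use_Unicode_Encoding value (0, 1 or 2)
    uni_cjk_indicated  True if one of TRANSLATE_UNI_xxx is indicated
    font_flags         proposed per-font DWORD (0 if not set)
    """
    if font_flags & FONT_FLAG_FORCE_UNICODE:
        return True
    if setting == 2:
        return True
    if setting == 1:
        return not uni_cjk_indicated
    if setting == 0:
        return glyph_count >= AT_LEAST_DBCS_GLYPH and not uni_cjk_indicated
    raise ValueError("Use_Unicode_Encoding must be 0, 1 or 2")
```

   As the sketch shows, the per-font flag changes the outcome only when the
   global setting is 0 or 1; with a setting of 2 the result is already True.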

2. The name-lookup logic could still be refined in some ways.  

   Other languages besides Chinese/Japanese/Korean or English could possibly 
   be accepted.

3. Allow use of the 'PostScript' name of the font as a last-ditch fallback for
   both the family and face names.  Obviously this is not ideal, as it would
   cause all variations (bold, italic, etc.) to appear as entirely separate
   fonts to the operating system.  But as fall-back logic it may be acceptable,
   although it should be subjected to as much testing and feedback as possible.
   (This feature is under consideration mainly to cope with these Japanese fonts
   from Epson:
   http://www.i-love-epson.co.jp/download2/printer/driver/win/page/ttf30.htm )
   Ideally, this should be a configurable per-font option, set via the DWORD
   INI flag described above.

4. Non-CJK fonts which are not Unicode compatible should be supported a bit more
   intelligently.  At present, they generally seem to be rejected; the proper
   behaviour should probably be to fall back to SYMBOL encoding (which basically
   maps each glyph index to a physical position in the font file and lets the 
   application worry about whether it's correct or not).

5. Other TODOs:
    - Implement UDC (User-Defined Character) support for DBCS systems.
    - Allow specific fonts to be renamed or at least aliased to work around
      name-resolution problems with particular fonts or font families.
