Monday, May 27, 2013

An English Translation of a TeX Wiki Page

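It looks like my recent post about LaTeX font settings got linked from the TeX Wiki. The wiki page is the more detailed one as far as system-independent material goes. (My earlier post was meant as a quick fix for current Ubuntu.)
That page said "someone please translate this into English," so I did. Creating a new page apparently requires user registration, so I'm posting it here instead. I tried to imitate PukiWiki markup, but I couldn't preview it, so I don't know whether it's formatted correctly.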

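Things that bothered me, or that I simply didn't understand, while translating:
  • What is the difference between \CJKfamily and \CJKencfamily?
  • What is NFSS? Is it right to call it a "convention"?
  • Is the discussion about using other languages inside the CJK environment, or about effects on text outside the CJK environment?
  • Is "the source file's encoding is taken to be the same as the font encoding" a mistake for "the font encoding is taken to be the same as the source file's encoding"?
  • The explanation of inputenc is hard to follow overall...
  • In the "Installation" section, the part about disguising TrueType fonts as OTF and so on didn't make sense to me even in Japanese, so the translation there is fairly literal.
  • I tried to compile the first sample without installing it, but couldn't figure out how. I thought that simply unpacking the archive would let the tools find the various files relative to the current directory, but apparently not. I also tried flattening the directory hierarchy completely and dumping every file into one directory, but dvipdfm fails with the following (latex succeeds):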
dvipdfm UTF8-noembed-CJK                                                                                             
UTF8-noembed-CJK.dvi -> UTF8-noembed-CJK.pdf
[1
kpathsea: Running mktexpk --mfmode / --bdpi 600 --mag 1+0/600 --dpi 600 sungu5b
mktexpk: don't know how to create bitmap font for sungu5b.
mktexpk: perhaps sungu5b is missing from the map file.
kpathsea: Appending font creation commands to missfont.log.

** WARNING ** Could not locate a virtual/physical font for TFM "sungu5b".
** WARNING ** >> There are no valid font mapping entry for this font.
** WARNING ** >> Font file name "sungu5b" was assumed but failed to locate that font.
** ERROR ** Cannot proceed without .vf or "physical" font for PDF output...

Output file removed.

Honestly I don't feel like taking this any further, so someone else please pick it up from here.
 
* [[The CJK package for LaTeX:http://cjk.ffii.org/]]

(This note was translated from an incomplete version as of 2013-05-26.)

ASCII Inc.'s pTeX is a TeX distribution for processing Japanese, but it
contains extensions to both the TeX typesetter and the DVI engine that make
certain tools like dvisvgm and dvipng incompatible.

This note explains how to use CJK LaTeX, which allows you to process Chinese,
Japanese, and Korean (CJK) text solely by macros, without modifications to TeX
proper.  CJK characters are unfortunately rendered as bitmap fonts, but as long
as you can live with that, this arrangement lets you keep using standard TeX
and related tools as you always have.  Other caveats are that the CJK
package (unlike pTeX) doesn't handle vertical text flow and line breaking rules
very well, but on the upside it can process Chinese and Korean text via UTF-8
encoding.  It seems to be used widely outside of Japan for mixing short
snippets of Asian text into English documents.

The CJK package, like inputenc, changes the category code of bytes that have
their 8th bit set, so that TeX sources containing multibyte characters can be
compiled by standard (8-bit enabled) LaTeX.  It is not compatible with pLaTeX,
which interprets multibyte characters as characters rather than as macros.  The
basic usage is

>\usepackage{CJK}~
...~
\begin{CJK}{encoding}{family}~
...~
\end{CJK}
<

The "encoding" part can be UTF-8, EUC-JP, Shift_JIS, GB2312, Big5, EUC-KR,
x-EUC-TW (CNS 11643), and various other encodings.  The following is a more
complete table of major supported encodings.

|CENTER:Encoding|CENTER:TeX Name|CENTER:TFM Encoding|h
|Big5|Bg5|c00|
|GB2312|GB|c10|
|EUC-JP|JIS|c40|
|Shift_JIS|SJIS|c40|
|JIS X 0212 (EUC-JP)|JIS2|c50|
|EUC-KR|KS|c60|
|UTF-8|UTF8|c70|

The TeX Name column shows the name that should be given as the encoding
argument of \begin{CJK} (the first one after the environment name) in the TeX
source.  TFM encoding is a parameter required to build an fd file.  As you may
have guessed from the explanation so far, TeX can handle source code that mixes
different encodings in a single file.  However, in practice it's probably
easier to edit if you separate different encodings into separate files and
assemble them by \input as necessary.  In particular, keeping Big5 and
Shift_JIS encodings in separate files is imperative because these encodings
contain characters whose trailing bytes coincide with special characters like
"\", "{", and "}".  Those bytes need to be preprocessed away to avoid confusing
TeX, like:

 \begin{CJK}{JIS}{}
 \input{euc-jp-text1}%
 \CJKenc{Bg5}%
 \ifx\VTeXversion\undefined%
 \immediate\write18{bg5conv < big5-text.raw > big5-text.tex}%
 \fi\input{big5-text}%
 \input{euc-jp-text2}
 \end{CJK}

The "family" part specifies the font family.  If you leave it blank, TeX
selects the "song" family by default.  This default can be changed with
\CJKfamily or \CJKencfamily.  Note the family does not completely specify the
font.  The font that TeX actually accesses during typesetting is determined by
the "(TFM encoding)(family).fd" files following the NFSS convention.  For
example, if the TeX file specifies the song family, TeX will select
cyberbXX.tfm specified in c70song.fd if the TeX source is UTF-8, or select
jsso12XX.tfm specified in c40song.fd if the TeX source is EUC-JP, and so on.
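
As a rough illustration of changing the family partway through, the sketch
below switches families inside one CJK environment with \CJKfamily.  It assumes
a UTF-8 source and that fonts for the families "min" and "goth" are installed;
those two family names are an assumption about your installation, not something
the CJK package guarantees.

```latex
\documentclass{article}
\usepackage{CJKutf8}

\begin{document}
\begin{CJK}{UTF8}{min}
% Typeset with the family given to \begin{CJK} (assumed here: min).
明朝体の文章。

% Switch to another family for the rest of the environment (assumed: goth).
\CJKfamily{goth}
ゴシック体の文章。
\end{CJK}
\end{document}
```

Which TFM files are actually used still depends on the matching fd files
(here c70min.fd and c70goth.fd, following the naming rule above).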


**Extensions

The CJK package distribution contains several extension packages and examples.
Here we explain the CJKutf8 package, which is probably the most important one.
Chinese and Japanese (and to some extent Korean) text has the unique property
that lines can be broken almost anywhere.  The CJK package implements this
liberal line breaking rule, which can cause inappropriate line breaks when
other languages are mixed into the CJK environment.  This is a bit like how
pTeX incorrectly hyphenates English documents written in full-width
alphanumeric characters.  The font encoding determines hyphenation rules,
kerning, and ligatures, so to correctly process non-CJ languages inside the CJK
environment, we have to arrange for the right font encoding to be loaded
outside (prior to?) the CJK environment.  CJKutf8 does just that.  To explain
how this works, I need to tell you about the inputenc package first.

***The inputenc Package

The big change that LaTeX2e made from LaTeX2.09 was the adoption of NFSS2.
This protocol made the font encoding an attribute that the user specifies
separately from all other aspects of the font.  As a result, the minimal
complete source code in LaTeX2e is
>\documentclass{...}~
\usepackage[...]{fontenc}~
\usepackage[...]{inputenc}~
\begin{document}~
...~
\end{document}
<
For backwards compatibility (with LaTeX2.09), OT1 is used if no fontenc is
given, and the source file's encoding is used as the font encoding.  A source
file that doesn't load these packages cannot be said to be fully compliant with
LaTeX2e's conventions, even if it suits your needs.

inputenc.sty itself only makes every character with the 8th bit set active and
arranges for an error to be raised whenever the source code uses one.  To use
these active characters in the TeX source, they have to be redefined as
macros that generate the right character in the right encoding.  The option to
inputenc specifies the file that does this redefinition.  The utf8 option is
available starting with the Feb 9, 2004 version of LaTeX2e.

 \usepackage[utf8]{inputenc}

But writing this in the preamble is not enough to enable all UTF-8 encoded
characters.  When inputenc is given the utf8 option, it goes through all font
encodings loaded in the preamble (scanning all the way up to the last line
preceding the document body) and for each encoding XXX reads in XXXenc.dfu and
enables the character definitions found in that file.  Characters that are
not defined in any of those files remain undefined, and attempts to use them
result in an error.
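
To make the mechanism concrete, here is a minimal sketch of a document that
loads one font encoding (T1) together with inputenc.  The option is spelled
utf8 in lowercase, which is the name inputenc itself uses.  The sample sentence
is made up for illustration and only uses characters that T1 provides.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}    % font encoding loaded in the preamble
\usepackage[utf8]{inputenc} % enables the UTF-8 characters known to T1 (t1enc.dfu)

\begin{document}
% Every accented letter below exists in T1, so it is defined.
Grüße aus Zürich, naïve café.
\end{document}
```

A character that none of the loaded encodings define (a kanji, say) would
still raise an error in this document, as described above.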

Currently, the standard distribution contains the following dfu files.
>lcyenc.dfu~
ly1enc.dfu~
omsenc.dfu~
ot1enc.dfu~
ot2enc.dfu~
t1enc.dfu~
t2aenc.dfu~
t2benc.dfu~
t2cenc.dfu~
ts1enc.dfu~
x2enc.dfu
<
utf8enc.dfu combines all of the files above.  Languages that can be written in
these font encodings are typeset from UTF-8 with exactly the same hyphenation,
kerning, and ligatures as when they are typeset from some other input encoding.

So, currently Unicode support in standard LaTeX, with no additional packages,
works as follows.

-LaTeX provides complete support for Unicode source files encoded in UTF-8 (not limited to the BMP).
-Theoretically, any language that satisfies the following conditions can be properly typeset from a UTF-8 source file as long as an appropriate font encoding and dfu file are prepared.
--A line is composed of graphemes (i.e. characters) laid out horizontally, and lines are stacked from top to bottom.
--Each line has enough "space" (a white space or similar entity) with a flexible width where the line can be broken.  (So for Chinese and Japanese, there is an implicit space in this sense between (almost) every pair of characters.)

***The CJKutf8 Package

This package does a lot of things under the hood, but its interface is
straightforward.  It reads in inputenc, tries to hijack everything inside the
CJK environment and process it using inputenc, and reverts to the CJK
environment whenever inputenc fails.  
 \documentclass{article}
 \usepackage[T1]{CJKutf8} % The font encoding can be specified in the option.
 
 \begin{document}
 \begin{CJK}{UTF8}{min}
 % Write something in UTF-8.  Hyphenation is properly handled if you specify
 % the right language with babel or the like.
 UTF-8で何か文章を書く。babel等でハイフネーションの言語を指定すれば、正しく組版される。
 \end{CJK}
 \end{document}
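
The comment in the example above mentions babel without showing it.  A possible
way to combine the two is sketched below.  The choice of ngerman, the "min"
family, and the sample sentence are all assumptions made for illustration; the
German hyphenation patterns and the fonts have to be installed for this to
compile.

```latex
\documentclass{article}
\usepackage[ngerman]{babel}  % hyphenation patterns for the non-CJK language
\usepackage[T1]{CJKutf8}     % T1 is requested as the font encoding

\begin{document}
\begin{CJK}{UTF8}{min}
% A long German word inside Japanese text; it should be hyphenated
% according to the German patterns if a line break falls inside it.
日本語の文章の途中に Donaudampfschifffahrtsgesellschaft のような長い単語を置く。
\end{CJK}
\end{document}
```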

*Installation
**TeX

You first need a working LaTeX installation.  Additionally, you need the macro
files from [[CTAN:languages/japanese/CJK/]] (under the directory named
cjk-4.x.x/; it may be archived in a zip or tarball) and the font metric (TFM)
files.  The default font settings that come with the CJK package are compatible
with dvips and pdflatex, but this also means they are suboptimal for use with
dvipdfmx.  This section explains how to write a custom font definition.  The
TFM format used in standard TeX (which doesn't include nonstandard extensions
like those of pTeX or Omega) can describe at most 256 glyphs per file.  This
is insufficient for Chinese characters or other large character sets, so
in CJK a single font is spread across multiple TFM files.  That might sound
scary, but you can easily generate those TFM files from any TTF font using
ttf2tfm.
> ttf2tfm [TTF] [TFM stem]@[SFD name]@

If you have a TTC file, which combines multiple TrueType faces into one font,
you can use the -f option to choose which face you want.  If you're generating
cyberbXX.tfm the [TFM stem] is cyberb, and for jsso12XX.tfm it's jsso12.  [SFD
name] determines how the font is split into subfonts.  Which SFD is needed
depends on the font's CMap encoding and the TeX source code's encoding, but for
recent TrueType fonts you should use one that starts with "U".  If you are
planning to use full-width characters exclusively, you can also just copy an
existing TFM file to a different name and use that instead.  (The TFM files
contained in the samples below were made this way, so if you typeset half-width
alphanumeric characters with those files you'll get pretty ugly results.)

For instance, suppose you have a document written in EUC-JP/Shift_JIS, you want
to use TFM files whose stem is foo, and you want to refer to that font as the
"bar" family.  With baz as the TTF file, you run
 ttf2tfm baz foo@UJIS@
to create foo01.tfm–foo35.tfm.  Then you have to write c40bar.fd, which should
contain at least
 \DeclareFontFamily{C40}{bar}{}
 \DeclareFontShape{C40}{bar}{m}{n}{<-> CJK * foo}{}
If you put those files somewhere LaTeX can find them, your LaTeX source should
compile as expected.
 \documentclass{article}
 \usepackage{CJK}
 
 \begin{document}
 \begin{CJK}{JIS}{bar}
 % Write your Japanese text here in EUC-JP.
 ここにEUC-JPで日本語の文章を書きます。
 \end{CJK}
 \begin{CJK}{SJIS}{bar}
 % Write your Japanese text here in Shift_JIS.
 % You may have to preprocess this block if you use certain characters.
 ここにShift_JISで日本語の文章を書きます。%
 しかし、もしかすると、このブロックだけ%
 プリプロセッサーを通さないと%
 \LaTeX のコンパイルが通らないかも知れません。
 \end{CJK}
 \end{document}

To process Shift_JIS or Big5, you'll also need to install the preprocessors
sjisconv and bg5conv.

**DVI Driver

pdflatex is slated to officially support the CJK package in the near future,
but for now the only ways to generate decent PDFs with CJK are VTeX
(commercial) and dvipdfmx.  Here we will focus on dvipdfmx.
>"Decent" here means a PDF whose fonts are not split into pieces to match the TFM files; the non-decent kind requires that split.

The only things you need to set up are the mappings between the DVI file's TFM
and PDF file's fonts.
> DVI files do not contain any information about glyph appearances.  They only specify the size and position of each character and which TFM that information comes from.  The job of a DVI driver is to attach glyph shapes extracted from the fonts.  This means it needs a mapping between TFMs and fonts.  Without this mapping, most DVI drivers try to generate a bitmap font on their own.  Currently, pTeX generates an error and dies at this point, which signals to the user that something is wrong with the installation.  But if the CJK package is fully installed, the driver often succeeds in generating bitmaps for the default fonts, so many people never notice the problem and keep using a half-broken installation.  The samples below use newly defined TFMs and show how to map them to real fonts.

dvipdfmx has many files (called map files) that map TFMs to fonts inside PDF
files, but most of them are shared with dvipdfm, so they can handle only those
8-bit fonts that dvipdfm can understand.  So this mapping has to be added to
the dvipdfmx-specific map file called cid-x.map.  (Details will be added
later.)

***When Using Non-existent (CFF, CID-keyed) OpenType Fonts
dvipdfmx knows about the following fonts.
|CENTER:|||c
|~Language|CENTER:Character Set|CENTER:Font Name|h
|~Japanese|Adobe-Japan1|Ryumin-Light|
|~|~|GothicBBB-Medium|
|~|~|HeiseiMin-W3|
|~|~|HeiseiKakuGo-W5|
|~|Adobe-Japan1-2|HeiseiMin-W3-Acro|
|~|~|HeiseiKakuGo-W5-Acro|
|~|Adobe-Japan1-4|KozMinPro-Regular-Acro|
|~|~|KozGoPro-Medium-Acro|
|~Simplified Chinese|Adobe-GB1|STSong-Light|
|~|Adobe-GB1-2|STSong-Light-Acro|
|~|Adobe-GB1-4|AdobeSongStd-Light-Acro|
|~Traditional Chinese|Adobe-CNS1|MSung-Light|
|~|~|MHei-Medium|
|~|Adobe-CNS1-0|MSung-Light-Acro|
|~|~|MHei-Medium-Acro|
|~|Adobe-CNS1-4|AdobeMingStd-Light-Acro|
|~Korean|Adobe-Korea1|HYSMyeongJo-Medium|
|~|~|HYGoThic-Medium|
|~|Adobe-Korea1-0|HYSMyeongJo-Medium-Acro|
|~|~|HYGoThic-Medium-Acro|
|~|Adobe-Korea1-2|AdobeMyungjoStd-Medium-Acro|

So if you specify these font names in a cid-x.map entry, which looks like
> [TFM stem]@[SFD name]@ [CMap name] [Font file name]

you'll get a PDF that doesn't embed those fonts, even if the fonts aren't on
dvipdfmx's search path.  [CMap name] is the mapping from the encoding that
results from applying SFD, to the ordering CID (translator note: I have no idea
what this is talking about; this goes for much of the subsequent discussion).
[SFD name] is usually the same as the SFD name you passed to ttf2tfm when
you created the TFM file.  But you can also do something like the following
instead.
 jsso12@UJIS@ UniJIS-UCS2-H HeiseiMin-W3-Acro
This entry above collects the characters in jsso12XX.tfm and decodes them to Unicode, and maps them to the glyphs in Adobe-Japan1.
 jsso12@SJIS@ RKSJ-H HeiseiMin-W3-Acro
This collects the characters in jsso12XX.tfm and decodes them into Shift_JIS, then maps them to the glyphs in Adobe-Japan1.
 jsso12@SJIS@ 90ms-RKSJ-H HeiseiMin-W3-Acro
This also goes through Shift_JIS, but uses the mapping from Windows-31J (Microsoft Windows Standard Japanese Character Set).  Some characters will come out as different glyph variants than with the entry above.
 jsso12@SJIS@ 78-RKSJ-H HeiseiMin-W3-Acro
This uses glyphs conforming to the example glyph shapes in JIS C 6226-1978 (JIS X 0208:1978).

PDFs that contain non-embedded fonts may be rendered with different substitute
fonts depending on the system.

***When Using Existing OpenType (CFF, CID-keyed) Fonts
This is almost the same as in the previous section, but the [Font file name]
has to specify a font that exists on dvipdfmx's search path.
By default the font will be embedded, but a "!" before the
[Font file name] prevents embedding.
Embedding is also suppressed when [Font file name] is followed by ",Bold",
",Italic", or ",BoldItalic".
***Using TrueType Fonts as OpenType (CFF, CID-keyed)
For TrueType fonts, there is no set order in which glyphs are listed, so 
accesses to glyphs must go through the font file's CMap table.  If
the character set of the CMap file specified in [CMap name] is one of Adobe's
standard character sets, the TrueType font can be embedded as if it's a CID font
using the standard mapping from Unicode.  However, glyphs that do not exist in
Unicode are usually not included in TrueType fonts, and even if they are, they
are inaccessible.  If the character set of the CMap file specified in
[CMap name] is not a standard character set from Adobe, you can emulate e.g. the
Adobe-Japan1 supplement 4 set by adding "/AJ14" after the font name.
Generally speaking, you should use this technique whenever you use a TrueType
font without embedding it.

***Using TrueType Fonts
To access a TrueType font using a CMap table, use the cid-x.map entry
> [TFM stem]@[SFD name]@ unicode [Font file name] [options]
The SFD should start with a "U", indicating that the TFM encoding should be
mapped to Unicode.
>
 -w option
Given when the TrueType font will be used for vertical text.
 -w 0
Horizontal text (default)
 -w 1
Vertical text

>
 -p option
Used to access characters that lie outside of Unicode's BMP (Basic Multilingual Plane)
 -p 0
Access the BMP (default)~
In other words, the code points 0x0000–0xFFFF are mapped to characters with those exact code points.
 -p 1
Access the SMP (Supplementary Multilingual Plane).
The characters needed in TeX are usually ancient scripts.
The code points 0x0000–0xFFFF are shifted up by 0x10000 and mapped to
characters in the code point range 0x10000–0x1FFFF.
 -p 2
Access the SIP (Supplementary Ideographic Plane).
This includes Chinese characters that did not fit in the BMP.
The code points 0x0000–0xFFFF are shifted up by 0x20000 and mapped to
characters in the code point range 0x20000–0x2FFFF.

If, for some reason, you must access a TrueType font's glyphs in the order they
are listed, without going through a CMap, you can do so by specifying a CMap
that has the encoding Adobe-Identity.  However, the CMap file must not be named
"Identity-H" or "Identity-V".

*Examples
The following examples require the files from [[CTAN:languages/japanese/CJK/]].
You should be able to install and use them as given, but to try them out without
installing, you should create an empty temporary directory (folder) and copy
everything there.  Then rename all the dvipdfm/config/cid-x.map-add.* in the
example to cid-x.map.  If you want to install, you should append the contents of
those files to the system's cid-x.map file.

+ &ref(http://oku.edu.mie-u.ac.jp/~okumura/texfaq/archive/CJK-LaTeX-UTF8-noembed.tar.bz2,Render different variants of the same kanji from a TeX file written in UTF-8);
--Known problems:
---There are some parts whose intentions are unclear on each page.
+ &ref(http://oku.edu.mie-u.ac.jp/~okumura/texfaq/archive/CJK-LaTeX-localEncoding-vertical.tar.bz2,Vertical text); also contains settings needed to use JIS X 0213 with Shift_JIS
--Known problems:
---The TFM files included in this archive can only handle full-width characters
---The archive has no fdx files for using horizontal-script fonts to render vertical script, so punctuation appears incorrect in the vertical text mode of CJKvert.sty (this shouldn't be a problem if you have a genuine vertical-script font).
+ &ref(http://oku.edu.mie-u.ac.jp/~okumura/texfaq/archive/CJK-LaTeX-SIP.tar.bz2,Using the Supplementary Ideographic Plane); dvipdfmx (20070409 or newer) is required to build a PDF file from this example.  It also uses proprietary fonts.  In case you don't have them, &ref(http://oku.edu.mie-u.ac.jp/~okumura/texfaq/archive/CJK-LaTeX-SIP.pdf,here's a pre-built PDF) for your reference.
--Known problems:
---An old dvipdfmx has a bug that prevents it from handling this example properly (fixed in dvipdfmx-20070409)
---Two fonts are defined in the c70usong.fd file. This should be split into c70usong.fd and c70usong2.fd, or otherwise we can't use SIP characters at the beginning of the document.
---It uses proprietary fonts.  cid-x.map should be rewritten to use [[HAN NOM FONTs:http://www.viethoc.org/article.php?sid=98&mode=threaded&order=0&thold=0]].  (But boy, does Han Nom's design look like SimSun!)
---Lots of other errors that you can spot by searching the Web about this example's usage of CJK's features.
---In CJK 4.6.0, you need to add the kind of code you see in this example's preamble in order to use non-BMP characters.  But on the other hand, the development version (translator note -- as of when?) of CJK clashes with this code.

Let us know if you want to see any other examples!  Of course, new examples and
corrections to this wiki page are welcome too.
