diff options
author | Jelmer Vernooij <jelmer@samba.org> | 2004-06-20 12:43:16 +0000 |
---|---|---|
committer | Gerald W. Carter <jerry@samba.org> | 2008-04-23 08:45:56 -0500 |
commit | 83a17815a7689f1f6f7ca57161a0e804277c75f9 (patch) | |
tree | e1cec10510da7038e843f71c9ba95a0e6bc5f494 /docs/Samba-HOWTO-Collection/Unicode.xml | |
parent | 9eb45e211cbc28bbd28837a17dcec3df29d6f455 (diff) | |
download | samba-83a17815a7689f1f6f7ca57161a0e804277c75f9.tar.gz samba-83a17815a7689f1f6f7ca57161a0e804277c75f9.tar.bz2 samba-83a17815a7689f1f6f7ca57161a0e804277c75f9.zip |
New structure for the docs:
- Same name for a doc everywhere (howto -> Samba-HOWTO-Collection, etc)
- Shorter and more clearly structured Makefile
- Make it possible to change the paths for the images
(This used to be commit 96f6c05f25acc8a9bb1977b8bd5cc97ce511b6b1)
Diffstat (limited to 'docs/Samba-HOWTO-Collection/Unicode.xml')
-rw-r--r-- | docs/Samba-HOWTO-Collection/Unicode.xml | 515 |
1 files changed, 515 insertions, 0 deletions
diff --git a/docs/Samba-HOWTO-Collection/Unicode.xml b/docs/Samba-HOWTO-Collection/Unicode.xml new file mode 100644 index 0000000000..396703a705 --- /dev/null +++ b/docs/Samba-HOWTO-Collection/Unicode.xml @@ -0,0 +1,515 @@ +<?xml version="1.0" encoding="iso-8859-1"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" + "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd" [ + <!ENTITY % global_entities SYSTEM '../entities/global.entities'> + %global_entities; +]> +<chapter id="unicode"> +<chapterinfo> + &author.jelmer; + &author.jht; + <author> + <firstname>TAKAHASHI</firstname><surname>Motonobu</surname> + <affiliation> + <address><email>monyo@home.monyo.com</email></address> + </affiliation> + <contrib>Japanese character support</contrib> + </author> + <pubdate>25 March 2003</pubdate> +</chapterinfo> + +<title>Unicode/Charsets</title> + +<sect1> +<title>Features and Benefits</title> + +<para> +Every industry eventually matures. One of the great areas of maturation is in +the focus that has been given over the past decade to make it possible for anyone +anywhere to use a computer. It has not always been that way, in fact, not so long +ago it was common for software to be written for exclusive use in the country of +origin. +</para> + +<para> +Of all the effort that has been brought to bear on providing native +language support for all computer users, the efforts of the +<ulink url="http://www.openi18n.org/">Openi18n organization</ulink> +is deserving of special mention. +</para> + +<para> +Samba-2.x supported a single locale through a mechanism called +<emphasis>codepages</emphasis>. Samba-3 is destined to become a truly trans-global +file and printer-sharing platform. +</para> + +</sect1> + +<sect1> +<title>What Are Charsets and Unicode?</title> + +<para> +Computers communicate in numbers. In texts, each number will be +translated to a corresponding letter. The meaning that will be assigned +to a certain number depends on the <emphasis>character set (charset) +</emphasis> that is used. +</para> + +<para> +A charset can be seen as a table that is used to translate numbers to +letters. Not all computers use the same charset (there are charsets +with German umlauts, Japanese characters, and so on). The American Standard Code +for Information Interchange (ASCII) encoding system has been the normative character +encoding scheme used by computers to date. This employs a charset that contains +256 characters. Using this mode of encoding each character takes exactly one byte. +</para> + +<para> +There are also charsets that support extended characters, but those need at least +twice as much storage space as does ASCII encoding. Such charsets can contain +<command>256 * 256 = 65536</command> characters, which is more than all possible +characters one could think of. They are called multi-byte charsets because they use +more then one byte to store one character. +</para> + +<para> +One standardized multi-byte charset encoding scheme is known as +<ulink url="http://www.unicode.org/">unicode</ulink>. A big advantage of using a +multi-byte charset is that you only need one. There is no need to make sure two +computers use the same charset when they are communicating. +</para> + +<para>Old Windows clients use single-byte charsets, named +<parameter>codepages</parameter>, by Microsoft. However, there is no support for +negotiating the charset to be used in the SMB/CIFS protocol. Thus, you +have to make sure you are using the same charset when talking to an older client. +Newer clients (Windows NT, 200x, XP) talk unicode over the wire. +</para> +</sect1> + +<sect1> +<title>Samba and Charsets</title> + +<para> +As of Samba-3, Samba can (and will) talk unicode over the wire. Internally, +Samba knows of three kinds of character sets: +</para> + +<variablelist> + <varlistentry> + <term><smbconfoption><name>unix charset</name></smbconfoption></term> + <listitem><para> + This is the charset used internally by your operating system. + The default is <constant>UTF-8</constant>, which is fine for most + systems, which covers all characters in all languages. The default + in previous Samba releases was <constant>ASCII</constant>. + </para></listitem> + </varlistentry> + + <varlistentry> + <term><smbconfoption><name>display charset</name></smbconfoption></term> + <listitem><para>This is the charset Samba will use to print messages + on your screen. It should generally be the same as the <parameter>unix charset</parameter>. + </para></listitem> + </varlistentry> + + <varlistentry> + <term><smbconfoption><name>dos charset</name></smbconfoption></term> + <listitem><para>This is the charset Samba uses when communicating with + DOS and Windows 9x/Me clients. It will talk unicode to all newer clients. + The default depends on the charsets you have installed on your system. + Run <command>testparm -v | grep <quote>dos charset</quote></command> to see + what the default is on your system. + </para></listitem> + </varlistentry> +</variablelist> + +</sect1> + +<sect1> +<title>Conversion from Old Names</title> + +<para>Because previous Samba versions did not do any charset conversion, +characters in filenames are usually not correct in the UNIX charset but only +for the local charset used by the DOS/Windows clients.</para> + +<para>Bjoern Jacke has written a utility named <ulink url="http://j3e.de/linux/convmv/">convmv</ulink> +that can convert whole directory structures to different charsets with one single command. +</para> + +</sect1> + +<sect1> +<title>Japanese Charsets</title> + +<para> +Setting up Japanese charsets is quite difficult. This is mainly because: +</para> + +<itemizedlist> + <listitem><para>The Windows character set is extended from the original legacy Japanese + standard (JIS X 0208) and is not standardized. This means that the strictly + standardized implementation cannot support the full Windows character set. + </para></listitem> + + <listitem><para> Mainly for historical reasons, there are several encoding methods in + Japanese, which are not fully compatible with each other. There are + two major encoding methods. One is the Shift_JIS series, it is used in Windows + and some UNIX's. The other is the EUC-JP series, used in most UNIX's + and Linux. Moreover, Samba previously also offered several unique encoding + methods, named CAP and HEX, to keep interoperability with CAP/NetAtalk and + UNIX's which can't use Japanese filenames. Some implementations of the + EUC-JP series can't support the full Windows character set. + </para></listitem> + + <listitem><para>There are some code conversion tables between Unicode and legacy + Japanese character sets. One is compatible with Windows, another one + is based on the reference of the Unicode consortium and others are + a mixed implementation. The Unicode consortium does not officially + define any conversion tables between Unicode and legacy character + sets so there cannot be standard one. + </para></listitem> + + <listitem><para>The character set and conversion tables available in iconv() depends + on the iconv library that is available. Next to that, the Japanese locale + names may be different on different systems. This means that the value of + the charset parameters depends on the implementation of iconv() you are using. + </para> + + <para>Though 2 byte fixed UCS-2 encoding is used in Windows internally, + Shift_JIS series encoding is usually used in Japanese environments + as ASCII encoding is in English environments. + </para></listitem> +</itemizedlist> + +<sect2><title>Basic Parameter Setting</title> + + <para> + <smbconfoption><name>dos charset</name></smbconfoption> and + <smbconfoption><name>display charset</name></smbconfoption> + should be set to the locale compatible with the character set + and encoding method used on Windows. This is usually CP932 + but sometimes has a different name. + </para> + + <para> + <smbconfoption><name>unix charset</name></smbconfoption> can be either Shift_JIS series, + EUC-JP series and UTF-8. UTF-8 is always available but the availability of other locales + and its name itself depends on the system. + </para> + + <para> + Additionally, you can consider to use the Shift_JIS series as the + value of the <smbconfoption><name>unix charset</name></smbconfoption> + parameter by using the vfs_cap module, which does the same thing as + setting <quote>coding system = CAP</quote> in the Samba 2.2 series. + </para> + + <para> + Where to set <smbconfoption><name>unix charset</name></smbconfoption> + to is a difficult question. Here is a list of details, advantages and + disadvantages of using a certain value. + </para> + + <variablelist> + <varlistentry><term>Shift_JIS series</term> + <listitem><para> + Shift_JIS series means a locale which is equivalent to <constant>Shift_JIS</constant>, + used as a standard on Japanese Windows. In the case of <constant>Shift_JIS</constant>, + for example if a Japanese file name consist of 0x8ba4 and 0x974c + (a 4 bytes Japanese character string meaning <quote>share</quote>) and <quote>.txt</quote> + is written from Windows on Samba, the file name on UNIX becomes + 0x8ba4, 0x974c, <quote>.txt</quote> (a 8 bytes BINARY string), same as Windows. + </para> + + <para>Since Shift_JIS series is usually used on some commercial based + UNIX's; hp-ux and AIX as Japanese locale (however, it is also possible + to use the EUC-JP series), To use Shift_JIS series on these platforms, + Japanese file names created from Windows can be referred to also on + UNIX.</para> + + <para> + If your UNIX is already working with Shift_JIS and there is a user + who needs to use Japanese file names written from Windows, the + Shift_JIS series is the best choice. However, broken file names + may be displayed and some commands which cannot handle non-ASCII + filenames may be aborted during parsing filenames. especially there + may be <quote>\ (0x5c)</quote> in file names, which need to be handled carefully. + So you had better not touch file names written from Windows on UNIX. + </para> + + <para> + Note that most Japanized free software actually works with EUC-JP + only. You had better verify if the Japanized free software can work + with Shift_JIS. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>EUC-JP series</term> + <listitem><para> + EUC-JP series means a locale which is equivalent to the industry + standard called EUC-JP, widely used in Japanese UNIX (although EUC + contains specifications for languages other than Japanese, such as + EUC-KR). In the case of EUC-JP series, for example if a Japanese + file name consist of 0x8ba4 and 0x974c and <quote>.txt</quote> is written from + Windows on Samba, the file name on UNIX becomes 0xb6a6, 0xcdad, + <quote>.txt</quote> (a 8 bytes BINARY string). + </para> + + <para> + Since EUC-JP is usually used on Open source UNIX, Linux and FreeBSD, + and on commercial based UNIX, Solaris, IRIX and Tru64 UNIX as + Japanese locale (however, it is also possible on Solaris to use + Shift_JIS and UTF-8, on Tru64 UNIX to use Shift_JIS). To use EUC-JP + series, most Japanese file names created from Windows can be + referred to also on UNIX. Also, most Japanized free software work + mainly with EUC-JP only. + </para> + + <para> + It is recommended to choose EUC-JP series when using Japanese file + names on these UNIX. + </para> + + <para> + Although there is no character which needs to be carefully treated + like <quote>\ (0x5c)</quote>, broken file names may be displayed and some + commands which cannot handle non-ASCII filenames may be aborted + during parsing filenames. + </para> + + <para> + Moreover, if you built Samba using differently installed libiconv, + eucJP-ms locale included in libiconv and EUC-JP series locale + included in OS may not be compatible. In this case, you may need to + avoid using incompatible characters for file names. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>UTF-8</term> + <listitem><para> + UTF-8 means a locale which is equivalent to UTF-8, the international + standard defined by Unicode consortium. In UTF-8, a <parameter>character</parameter> is + expressed using 1-3 bytes. In case of Japanese, most characters + are expressed using 3 bytes. Since on Windows Shift_JIS, where a + character is expressed with 1 or 2 bytes, is used to express + Japanese, basically a byte length of a UTF-8 string grows 1.5 times + the length of a original Shift_JIS string. In the case of UTF-8, + for example if a Japanese file name consist of 0x8ba4 and 0x974c and + <quote>.txt</quote> is written from Windows on Samba, the file name on UNIX + becomes 0xe585, 0xb1e6, 0x9c89, <quote>.txt</quote> (a 10 bytes BINARY string). + </para> + + <para> + For systems where iconv() is not available or where iconv()'s locales + are not compatible with Windows, UTF-8 is the only locale available. + </para> + + <para> + There are no systems that use UTF-8 as default locale for Japanese. + </para> + + <para> + Some broken file names may be displayed and some commands which + cannot handle non-ASCII filenames may be aborted during parsing + filenames. especially there may be <quote>\ (0x5c)</quote> in file names, which + need to be handled carefully. So you had better not touch file names + written from Windows on UNIX. + </para> + + <para> + In addition, although it is not directly concerned with Samba, since + there is a delicate difference between iconv() function, which is + generally used on UNIX and the functions used on other platforms, + such as Windows and Java about the conversion table between + Shift_JIS and Unicode, you should be carefully to handle UTF-8. + </para> + + <para> + Although Mac OS X uses UTF-8 as its encoding method for filenames, + it uses an extended UTF-8 specification that Samba cannot handle so + UTF-8 locale is not available for Mac OS X. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Shift_JIS series + vfs_cap (CAP encoding)</term> + <listitem><para> + CAP encoding means a specification using in CAP and NetAtalk, file + server software for Macintosh. In the case of CAP encoding, for + example if a Japanese file name consist of 0x8ba4 and 0x974c and + <quote>.txt</quote> is written from Windows on Samba, the file name on UNIX + becomes <quote>:8b:a4:97L.txt</quote> (a 14 bytes ASCII string). + </para> + + <para> + For CAP encoding a byte which cannot be expressed as an ASCII + character (0x80 or above) is encoded as <quote>:xx</quote> form. You need to take + care of containing a <quote>\(0x5c)</quote> in a filename but filenames are not + broken in a system which cannot handle non-ASCII filenames. + </para> + + <para> + The greatest merit of CAP encoding is the compatibility of encoding + filenames with CAP or NetAtalk, file server software of Macintosh. + Since they usually write a file name on UNIX with CAP encoding, if a + directory is shared with both Samba and NetAtalk, you need to use + CAP encoding to avoid non-ASCII filenames are broken. + </para> + + <para> + However, recently there are some systems where NetAtalk has been + patched to write filenames with EUC-JP (i.e. Japanese original Vine Linux). + Here you need to choose EUC-JP series instead of CAP encoding. + </para> + + <para> + vfs_cap itself is available for non Shift_JIS series locales for + systems which cannot handle non-ASCII characters or systems which + shares files with NetAtalk. + </para> + + <para> + To use CAP encoding on Samba-3, you should use the unix charset parameter and VFS + as follows: + </para> + +<smbconfexample><title>VFS CAP</title> +<smbconfsection>[global]</smbconfsection> +<smbconfoption><name>dos charset</name><value>CP932<footnote><para>the locale name "CP932" may be different name</para></footnote></value></smbconfoption> +<smbconfoption><name>unix charset</name><value>CP932</value></smbconfoption> + +<member><para>...</para></member> + +<smbconfsection>[cap-share]</smbconfsection> +<smbconfoption><name>vfs option</name><value>cap</value></smbconfoption> +</smbconfexample> + + <para> + You should set CP932 if using GNU libiconv for unix charset. Setting this, + filenames in the <quote>cap-share</quote> share are written with CAP encoding. + </para> + </listitem> + </varlistentry> + </variablelist> + +</sect2> + +<sect2><title>Individual Implementations</title> + +<para> +Here is some additional information regarding individual implementations: +</para> + + <variablelist> + <varlistentry><term>GNU libiconv</term> + <listitem><para> + To handle Japanese correctly, you should apply the patch + <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/libiconv-patch.html">libiconv-1.8-cp932-patch.diff.gz</ulink> + to libiconv-1.8. + </para> + + <para> + Using the patched libiconv-1.8, these settings are available: + </para> + + +<!-- FIXME: Convert to diagram ? --> +<programlisting> +dos charset = CP932 +unix charset = CP932 / eucJP-ms / UTF-8 + | | + | +-- EUC-JP series + +-- Shift_JIS series +display charset = CP932 +</programlisting> + + <para> + Other Japanese locales (for example Shift_JIS and EUC-JP) should not + be used for the lack of the compatibility with Windows. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>GNU glibc</term> + <listitem><para> + To handle Japanese correctly, you should apply a <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/glibc/">patch</ulink> + to glibc-2.2.5/2.3.1/2.3.2 or should use the patch-merged versions, glibc-2.3.3 or later. + </para> + + <para> + Using the above glibc, these setting are available: + </para> + +<smbconfblock> +<smbconfoption><name>dos charset</name><value>CP932</value></smbconfoption> +<smbconfoption><name>unix charset</name><value>CP932 / eucJP-ms / UTF-8</value></smbconfoption> +<smbconfoption><name>display charset</name><value>CP932</value></smbconfoption> +</smbconfblock> + + <para> + Other Japanese locales (for example Shift_JIS and EUC-JP) should not + be used for the lack of the compatibility with Windows. + </para> + </listitem> + </varlistentry> + </variablelist> + +</sect2> + +<sect2> + <title>Migration from Samba-2.2 Series</title> + +<para> +Prior to Samba-2.2 series <quote>coding system</quote> parameter is used as +<smbconfoption><name>unix charset</name></smbconfoption> parameter of the Samba-3 series. +<link linkend="japancharsets">Next table</link> shows the mapping table when migrating from the Samba-2.2 series to Samba-3. +</para> + + <table frame="all" id="japancharsets"> + <title>Japanese Character Sets in Samba-2.2 and Samba-3</title> + + <tgroup cols="2" align="center"> + <colspec align="center"/> + <colspec align="center"/> + <thead> + <row><entry>Samba-2.2 Coding System</entry><entry>Samba-3 unix charset</entry></row> + </thead> + <tbody> + <row><entry>SJIS</entry><entry>Shift_JIS series</entry></row> + <row><entry>EUC</entry><entry>EUC-JP series</entry></row> + <row><entry>EUC3<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>EUC-JP series</entry></row> + <row><entry>CAP</entry><entry>Shift_JIS series + VFS</entry></row> + <row><entry>HEX</entry><entry>currently none</entry></row> + <row><entry>UTF8</entry><entry>UTF-8</entry></row> + <row><entry>UTF8-Mac<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>currently none</entry></row> + <row><entry>others</entry><entry>none</entry></row> + </tbody> + </tgroup> + </table> + +</sect2> + +</sect1> + +<sect1> + <title>Common Errors</title> + + <sect2> + <title>CP850.so Can't Be Found</title> + + <para><quote>Samba is complaining about a missing <filename>CP850.so</filename> file.</quote></para> + + <para><emphasis>Answer:</emphasis> CP850 is the default <smbconfoption><name>dos charset</name></smbconfoption>. + The <smbconfoption><name>dos charset</name></smbconfoption> is used to convert data to the codepage used by your dos clients. + If you do not have any dos clients, you can safely ignore this message. </para> + + <para>CP850 should be supported by your local iconv implementation. Make sure you have all the required packages installed. + If you compiled Samba from source, make sure to configure found iconv.</para> + </sect2> +</sect1> + +</chapter> |