Diffstat (limited to 'docs/docbook/projdoc/unicode.xml')
-rw-r--r-- | docs/docbook/projdoc/unicode.xml | 142 |
1 files changed, 142 insertions, 0 deletions
diff --git a/docs/docbook/projdoc/unicode.xml b/docs/docbook/projdoc/unicode.xml
new file mode 100644
index 0000000000..2351668e56
--- /dev/null
+++ b/docs/docbook/projdoc/unicode.xml
@@ -0,0 +1,142 @@
+<chapter id="unicode">
+<chapterinfo>
+  &author.jelmer;
+  <author>
+    <firstname>TAKAHASHI</firstname><surname>Motonobu</surname>
+    <affiliation>
+      <address><email>monyo@home.monyo.com</email></address>
+    </affiliation>
+  </author>
+  <pubdate>25 March 2003</pubdate>
+</chapterinfo>
+
+<title>Unicode/Charsets</title>
+
+<sect1>
+<title>What are charsets and Unicode?</title>
+
+<para>
+Computers communicate in numbers. In text, each number is
+translated to a corresponding letter. The meaning that is assigned
+to a given number depends on the <emphasis>character set (charset)
+</emphasis> that is used.
+A charset can be seen as a table that is used to translate numbers to
+letters. Not all computers use the same charset (there are charsets
+with German umlauts, Japanese characters, etc.). Usually a charset contains
+256 characters, which means that storing a character with it takes
+exactly one byte. </para>
+
+<para>
+There are also charsets that support even more characters,
+but those need twice (or even more) as much storage space. With two bytes, such
+charsets can contain <command>256 * 256 = 65536</command> characters, which
+is enough for most writing systems in use. They are called
+multibyte charsets (because they use more than one byte to
+store one character).
+</para>
+
+<para>
+A standardised multibyte charset is Unicode; information is available at
+<ulink url="http://www.unicode.org/">www.unicode.org</ulink>.
+A big advantage of using a multibyte charset is that you only need one; there is no
+need to make sure two computers use the same charset when they are
+communicating.
+</para>
+
+<para>Old Windows clients used single-byte charsets, named
+'codepages' by Microsoft. 
However, the SMB protocol has no support for
+negotiating the charset to be used. Thus, you
+have to make sure you are using the same charset when talking to an old client.
+Newer clients (Windows NT, 2000, XP) talk Unicode over the wire.
+</para>
+</sect1>
+
+<sect1>
+<title>Samba and charsets</title>
+
+<para>
+As of Samba 3.0, Samba can (and will) talk Unicode over the wire. Internally,
+Samba knows three kinds of character sets:
+</para>
+
+<variablelist>
+  <varlistentry>
+    <term>unix charset</term>
+    <listitem><para>
+    This is the charset used internally by your operating system.
+    The default is <constant>ASCII</constant>, which is fine for most
+    systems.
+    </para></listitem>
+  </varlistentry>
+
+  <varlistentry>
+    <term>display charset</term>
+    <listitem><para>This is the charset Samba will use to print messages
+    on your screen. It should generally be the same as the <command>unix charset</command>.
+    </para></listitem>
+  </varlistentry>
+
+  <varlistentry>
+    <term>dos charset</term>
+    <listitem><para>This is the charset Samba uses when communicating with
+    DOS and Windows 9x clients. It will talk Unicode to all newer clients.
+    The default depends on the charsets you have installed on your system.
+    Run <command>testparm -v | grep "dos charset"</command> to see
+    what the default is on your system. 
+    </para></listitem>
+  </varlistentry>
+</variablelist>
+
+</sect1>
+
+<sect1>
+<title>Conversion from old names</title>
+
+<para>Because previous Samba versions did not do any charset conversion,
+characters in filenames are usually not encoded in the unix charset but in
+the local charset used by the DOS/Windows clients.</para>
+
+<para>The following script from Steve Langasek converts all
+filenames from CP850 to the iso8859-15 charset.</para>
+
+<para>
+<prompt>#</prompt><userinput>find <replaceable>/path/to/share</replaceable> -type f -exec bash -c 'CP="{}"; ISO=`echo -n "$CP" | iconv -f cp850 \
+ -t iso8859-15`; if [ "$CP" != "$ISO" ]; then mv "$CP" "$ISO"; fi' \;
+</userinput>
+</para>
+</sect1>
+
+<sect1>
+<title>Japanese charsets</title>
+
+<para>Samba doesn't work correctly with Japanese charsets yet. Here are
+some points to note when setting it up:</para>
+
+<itemizedlist>
+
+<listitem><para>You should set <command>mangling method =
+hash</command></para></listitem>
+
+<listitem><para>There are various iconv() implementations around and not
+all of them work equally well. glibc2's iconv() has a critical problem
+with CP932. libiconv-1.8 works with CP932 but still has some problems and
+does not work with EUC-JP.</para></listitem>
+
+<listitem><para>You should set <command>dos charset = CP932</command>, not
+Shift_JIS, SJIS...</para></listitem>
+
+<listitem><para>Currently only <command>unix charset = CP932</command>
+will work (but still has some problems...) because of iconv() issues. 
+<command>unix charset = EUC-JP</command> doesn't work well because of
+iconv() issues.</para></listitem>
+
+<listitem><para>Currently Samba 3.0 does not support <command>unix charset
+= UTF8-MAC/CAP/HEX/JIS*</command></para></listitem>
+
+</itemizedlist>
+
+<para>More information (in Japanese) is available at: <ulink url="http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html">http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html</ulink>.</para>
+
+</sect1>
+
+</chapter>
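The recoding step performed by the find/iconv one-liner in the "Conversion from old names" section can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the commit and not Samba code: the function names `recode_name` and `plan_renames` are hypothetical, the sketch assumes Python 3 with its standard `cp850` and `iso8859_15` codecs, and it only computes the new names without touching the filesystem.

```python
from typing import List, Optional, Tuple


def recode_name(raw: bytes, src: str = "cp850",
                dst: str = "iso8859_15") -> Optional[bytes]:
    """Interpret raw filename bytes as `src` and re-encode them in `dst`.

    Returns None when a character has no representation in `dst`.
    (Hypothetical helper, named for this illustration only.)
    """
    try:
        return raw.decode(src).encode(dst)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None


def plan_renames(names: List[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (old, new) pairs for names whose bytes change after recoding."""
    plan = []
    for old in names:
        new = recode_name(old)
        # Skip names that cannot be converted or that are already identical.
        if new is not None and new != old:
            plan.append((old, new))
    return plan


if __name__ == "__main__":
    # 0x81 is "ü" in CP850; the same letter is 0xFC in ISO-8859-15.
    for old, new in plan_renames([b"M\x81ller.txt", b"readme.txt"]):
        print(old, "->", new)
```

Pure ASCII names come back unchanged and are therefore left out of the plan, which mirrors the `"$CP" != "$ISO"` check in the shell version. Names containing characters that the target charset cannot represent are skipped rather than guessed at.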