1 files changed, 142 insertions, 0 deletions
diff --git a/docs/docbook/projdoc/unicode.xml b/docs/docbook/projdoc/unicode.xml
new file mode 100644
index 0000000000..2351668e56
--- /dev/null
+++ b/docs/docbook/projdoc/unicode.xml
@@ -0,0 +1,142 @@
+<chapter id="unicode">
+<chapterinfo>
+	&author.jelmer;
+	<author>
+		<firstname>TAKAHASHI</firstname><surname>Motonobu</surname>
+		<affiliation>
+		<address><email>monyo@home.monyo.com</email></address>
+		</affiliation>
+	</author>
+	<pubdate>25 March 2003</pubdate>
+</chapterinfo>
+
+<title>Unicode/Charsets</title>
+
+<sect1>
+<title>What are charsets and Unicode?</title>
+
+<para>
+Computers communicate in numbers. In text, each number is
+translated to a corresponding letter. The meaning assigned
+to a certain number depends on the <emphasis>character set (charset)
+</emphasis> that is used.
+A charset can be seen as a table that is used to translate numbers to
+letters. Not all computers use the same charset (there are charsets
+with German umlauts, Japanese characters, etc). A traditional charset contains
+at most 256 characters, which means that storing a character with it takes
+exactly one byte. </para>
+
+<para>
+There are also charsets that support many more characters,
+but those need twice (or even more) as much storage space. With two bytes
+per character, a charset can contain <command>256 * 256 = 65536</command> characters, which
+is enough to cover most characters in everyday use. They are called
+multibyte charsets (because they use more than one byte to
+store one character).
+</para>
+
+<para>
+A standardised multibyte charset is Unicode; information is available at
+<ulink url="http://www.unicode.org/">www.unicode.org</ulink>.
+A big advantage of using a multibyte charset is that you only need one: there is no
+need to make sure two computers use the same charset when they are
+communicating.
+</para>
+
+<para>Old Windows clients used single-byte charsets, named
+'codepages' by Microsoft. However, the SMB protocol has no support for
+negotiating the charset to be used. Thus, you
+have to make sure you are using the same charset when talking to an old client.
+Newer clients (Windows NT, 2000, XP) talk Unicode over the wire.
+</para>
+</sect1>
+
+<sect1>
+<title>Samba and charsets</title>
+
+<para>
+As of Samba 3.0, Samba can (and will) talk Unicode over the wire. Internally,
+Samba knows of three kinds of character sets:
+</para>
+
+<variablelist>
+	<varlistentry>
+		<term>unix charset</term>
+		<listitem><para>
+		This is the charset used internally by your operating system. 
+		The default is <constant>ASCII</constant>, which is fine for most 
+		systems.
+		</para></listitem>
+	</varlistentry>
+
+	<varlistentry>
+		<term>display charset</term>
+		<listitem><para>This is the charset Samba will use to print messages
+		on your screen. It should generally be the same as the <command>unix charset</command>.
+		</para></listitem>
+	</varlistentry>
+
+	<varlistentry>
+		<term>dos charset</term>
+		<listitem><para>This is the charset Samba uses when communicating with
+		DOS and Windows 9x clients. Samba will talk Unicode to all newer clients.
+		The default depends on the charsets you have installed on your system.
+		Run <command>testparm -v | grep "dos charset"</command> to see 
+		what the default is on your system. 
+		</para></listitem>
+	</varlistentry>
+</variablelist>
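+
+<para>
+As an illustrative sketch, the three parameters could be set together in the
+<constant>[global]</constant> section of <filename>smb.conf</filename> as shown below.
+The values are assumptions chosen for a Western European setup, not
+recommendations; which charset names are accepted depends on the iconv()
+implementation installed on your system.
+</para>
+
+<para><programlisting>
+[global]
+# Assumed example values; adjust them to the charsets your system provides.
+# Charset of filenames on the Unix side:
+unix charset = ISO8859-15
+# Charset for messages printed on screen, usually the same as unix charset:
+display charset = ISO8859-15
+# Charset used for DOS and Windows 9x clients only:
+dos charset = CP850
+</programlisting></para>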
+
+</sect1>
+
+<sect1>
+<title>Conversion from old names</title>
+
+<para>Because previous Samba versions did not do any charset conversion,
+filenames created by those versions are usually encoded in the local charset
+used by the DOS/Windows clients rather than in the unix charset.</para>
+
+<para>The following command, adapted from a script by Steve Langasek, converts
+the names of all regular files from CP850 to the ISO8859-15 charset.</para>
+
+<para>
+<prompt>#</prompt><userinput>find <replaceable>/path/to/share</replaceable> -type f -exec bash -c 'CP="$1"; ISO=$(printf "%s" "$CP" | iconv -f cp850 \
+  -t iso8859-15); if [ "$CP" != "$ISO" ]; then mv "$CP" "$ISO"; fi' _ {} \;
+</userinput>
+</para>
+</sect1>
+
+<sect1>
+<title>Japanese charsets</title>
+
+<para>Samba does not yet work correctly with Japanese charsets. Keep the
+following points in mind when setting it up:</para>
+
+<itemizedlist>
+
+<listitem><para>You should set <command>mangling method =
+hash</command>.</para></listitem>
+
+<listitem><para>There are various iconv() implementations around and not
+all of them work equally well. glibc2's iconv() has a critical problem
+with CP932. libiconv-1.8 works with CP932 but still has some problems and
+does not work with EUC-JP.</para></listitem>
+
+<listitem><para>You should set <command>dos charset = CP932</command>, not
+Shift_JIS, SJIS, or similar names.</para></listitem>
+
+<listitem><para>Currently only <command>unix charset = CP932</command>
+will work, and even that still has some problems.
+<command>unix charset = EUC-JP</command> does not work well. Both
+limitations are caused by iconv() issues.</para></listitem>
+
+<listitem><para>Currently Samba 3.0 does not support <command>unix charset
+= UTF8-MAC/CAP/HEX/JIS*</command>.</para></listitem>
+
+</itemizedlist>
+
+<para>More information (in Japanese) is available at: <ulink url="http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html">http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html</ulink>.</para>
+
+</sect1>
+
+</chapter>