From f62eaeb1a5add34ee7353d0d95db3c84a5c71c22 Mon Sep 17 00:00:00 2001
From: Jelmer Vernooij
Date: Wed, 13 Aug 2003 06:07:10 +0000
Subject: regenerate (This used to be commit 75a8a906e8031b50e6583f2e0354073a8aa7f5f3)

---
 docs/htmldocs/unicode.html | 75 ----------------------------------------------
 1 file changed, 75 deletions(-)
 delete mode 100644 docs/htmldocs/unicode.html

diff --git a/docs/htmldocs/unicode.html b/docs/htmldocs/unicode.html
deleted file mode 100644
index 58adb5c993..0000000000
--- a/docs/htmldocs/unicode.html
+++ /dev/null
@@ -1,75 +0,0 @@

Chapter 27. Unicode/Charsets

Jelmer R. Vernooij

The Samba Team

TAKAHASHI Motonobu

25 March 2003

Features and Benefits

Every industry eventually matures. One of the great areas of maturation is the focus given over the past decade to making it possible for anyone, anywhere, to use a computer. It has not always been that way: not so long ago it was common for software to be written for exclusive use in the country of origin.

Of all the effort that has been brought to bear on providing native language support for all computer users, the work of the Openi18n organisation deserves special mention. For more information about Openi18n, please refer to http://www.openi18n.org/.

Samba-2.x supported a single locale through a mechanism called codepages. Samba-3 is destined to become a truly trans-global file and printer sharing platform.

What are charsets and Unicode?

Computers communicate in numbers. In text, each number is translated to a corresponding letter. The meaning assigned to a given number depends on the character set (charset) that is used. A charset can be seen as a table that is used to translate numbers to letters. Not all computers use the same charset (there are charsets with German umlauts, Japanese characters, etc.). Usually a charset contains 256 characters, which means that storing a character with it takes exactly one byte.
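The table idea can be illustrated with a short Python sketch. Python is used here purely for illustration and is not part of Samba; ISO 8859-1 and CP850 are example charsets chosen for this sketch, not ones named in the paragraph above. It decodes the same byte value under both charsets and then encodes the same letter under both.

```python
# The same byte value means different letters in different charsets.
raw = bytes([0xE4])  # one byte, numeric value 228

as_latin1 = raw.decode("iso8859-1")  # a Western European charset
as_cp850 = raw.decode("cp850")       # DOS codepage 850

print(repr(as_latin1))
print(repr(as_cp850))

# Going the other way, one letter maps to different numbers.
a_umlaut = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS
latin1_byte = a_umlaut.encode("iso8859-1")
cp850_byte = a_umlaut.encode("cp850")

print(latin1_byte.hex(), cp850_byte.hex())
```

If two machines disagree about which table is in use, the bytes arrive intact but are shown as the wrong letters.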

There are also charsets that support more characters, but those need twice as much storage space (or even more). A two-byte charset can contain 256 * 256 = 65536 characters, which is enough for the characters of most scripts in everyday use. They are called multibyte charsets (because they use more than one byte to store one character).

A standardised multibyte charset is Unicode; information is available at www.unicode.org. A big advantage of using a multibyte charset is that you only need one: there is no need to make sure two computers use the same charset when they are communicating.
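As a rough illustration of the "you only need one" point, the Python sketch below tries to store German and Japanese letters together. UTF-16 (little-endian) is used as an example two-byte Unicode encoding; that choice is an assumption of this sketch.

```python
# German and Japanese letters in one string ("Gruesse" with umlaut/sharp s, then "Nihon").
text = "Gr\u00fc\u00dfe \u65e5\u672c"

# A single-byte charset can only store the letters in its own table.
try:
    text.encode("iso8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# One Unicode encoding holds all of them; each of these
# characters takes two bytes in UTF-16.
utf16 = text.encode("utf-16-le")

print(fits_latin1, len(text), len(utf16))
```

The price is the extra storage mentioned above: twice as many bytes as characters for this string.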

Older Windows clients used single-byte charsets, named 'codepages' by Microsoft. However, the SMB protocol has no support for negotiating the charset to be used. Thus, you have to make sure you are using the same charset when talking to an old client. Newer clients (Windows NT, 2K, XP) talk Unicode over the wire.

Samba and charsets

As of Samba 3.0, Samba can (and will) talk Unicode over the wire. Internally, Samba knows of three kinds of character sets:

unix charset

This is the charset used internally by your operating system. The default is ASCII, which is fine for most systems.

display charset

This is the charset Samba will use to print messages on your screen. It should generally be the same as the unix charset.

dos charset

This is the charset Samba uses when communicating with DOS and Windows 9x clients. It will talk Unicode to all newer clients. The default depends on the charsets you have installed on your system. Run testparm -v | grep "dos charset" to see what the default is on your system.
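A minimal smb.conf sketch of the three parameters described above. The charset values are illustrative assumptions for a Western European setup, not defaults; whether a given name is accepted depends on the iconv installed on your system.

```ini
[global]
    # charset of filenames on the Unix side (assumed example value)
    unix charset = ISO8859-15
    # charset for messages printed on screen; generally the same as unix charset
    display charset = ISO8859-15
    # codepage spoken by DOS and Windows 9x clients (assumed example value)
    dos charset = CP850
```

After editing, testparm -v can be used to confirm which values Samba actually picked up.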

Conversion from old names

Because previous Samba versions did not do any charset conversion, filenames were stored on disk exactly as the DOS/Windows clients sent them. Characters in those filenames are therefore usually correct only in the local charset used by the clients, not in the unix charset.

The following script from Steve Langasek converts all filenames from CP850 to the iso8859-15 charset.

# find /path/to/share -type f -exec bash -c 'CP="$1"; ISO=$(printf "%s" "$CP" | iconv -f cp850 \
    -t iso8859-15); if [ "$CP" != "$ISO" ]; then mv "$CP" "$ISO"; fi' _ {} \;
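The same conversion can be sketched in Python as a dry run, which may be a safer first step on a live share. This is an illustrative sketch, not part of Samba or of the original script: the function names convert_name and plan_renames are hypothetical, and unlike the shell command it re-encodes only the final filename component, leaving directory names alone.

```python
import os


def convert_name(name: bytes, src: str = "cp850", dst: str = "iso8859-15") -> bytes:
    """Re-encode a single filename component from src to dst."""
    return name.decode(src).encode(dst)


def plan_renames(root: bytes):
    """Yield (old_path, new_path) pairs for regular files whose names would change."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                new_name = convert_name(name)
            except UnicodeError:
                # The name has a character the target charset lacks; skip it.
                continue
            if new_name != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)


if __name__ == "__main__":
    # Dry run: only print what would be renamed under the placeholder path.
    for old, new in plan_renames(b"/path/to/share"):
        print(old, "->", new)
```

Paths are handled as bytes throughout so that the names are never reinterpreted in the current locale. Once the printed plan looks right, each pair could be passed to os.rename.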

Japanese charsets

Samba doesn't work correctly with Japanese charsets yet. Here are some points to note when setting it up:

  • You should set mangling method = hash

  • There are various iconv() implementations around and not all of them work equally well. glibc2's iconv() has a critical problem in CP932. libiconv-1.8 works with CP932 but still has some problems and does not work with EUC-JP.

  • You should set dos charset = CP932, not Shift_JIS, SJIS...

  • Currently only unix charset = CP932 will work (and even that still has some problems) because of iconv() issues. For the same reason, unix charset = EUC-JP doesn't work well.

  • Currently Samba 3.0 does not support unix charset = UTF8-MAC/CAP/HEX/JIS*
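The charset names in the list above are not interchangeable labels. The Python sketch below shows this using Python's own codec tables, which are an assumption of this example and may differ from those of any particular iconv() implementation. The source does not say why CP932 is preferred over Shift_JIS, so treat the differences shown here as one plausible reason rather than the stated one.

```python
# The same Japanese letter has different bytes in CP932 and EUC-JP.
hiragana_a = "\u3042"  # HIRAGANA LETTER A
cp932_a = hiragana_a.encode("cp932")
eucjp_a = hiragana_a.encode("euc_jp")
print(cp932_a.hex(), eucjp_a.hex())

# CP932 has vendor extensions that plain Shift_JIS lacks.
circled_one = "\u2460"  # CIRCLED DIGIT ONE
cp932_one = circled_one.encode("cp932")
try:
    circled_one.encode("shift_jis")
    in_shift_jis = True
except UnicodeEncodeError:
    in_shift_jis = False
print(cp932_one.hex(), in_shift_jis)

# Even for shared bytes, the two tables can map to different characters.
wave_cp932 = b"\x81\x60".decode("cp932")
wave_sjis = b"\x81\x60".decode("shift_jis")
print(hex(ord(wave_cp932)), hex(ord(wave_sjis)))
```

A filename containing such characters can therefore change, or fail to convert, depending on which of the two names is configured.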

More information (in Japanese) is available at: http://www.atmarkit.co.jp/flinux/special/samba3/samba3a.html.
