The big character set handling changeover!

This commit gets rid of all our old codepage handling and replaces it with iconv. All internal strings in Samba are now in "unix" charset, which may be multi-byte. See internals.doc and my posting to samba-technical for a more complete explanation. (This used to be commit debb471267960e56005a741817ebd227ecfc512a)
author: Andrew Tridgell <tridge@samba.org> 2001-07-04 07:15:53 +0000
committer: Andrew Tridgell <tridge@samba.org> 2001-07-04 07:15:53 +0000
commit: 87fbb7092b8f8b2f0db0f361c3d625e19de57cd9 (patch)
tree: 3c302f710cbaa03e3c0d46549e8982771b12b8a5 /source3/internals.doc
parent: 9e9e73303ec10a64bd744b9b33f4e6cd7d394f03 (diff)
download: samba-87fbb7092b8f8b2f0db0f361c3d625e19de57cd9.tar.gz
samba-87fbb7092b8f8b2f0db0f361c3d625e19de57cd9.tar.bz2
samba-87fbb7092b8f8b2f0db0f361c3d625e19de57cd9.zip
1 files changed, 69 insertions, 0 deletions
diff --git a/source3/internals.doc b/source3/internals.doc
index 971f256738..c8cc6dd136 100644
--- a/source3/internals.doc
+++ b/source3/internals.doc
@@ -6,6 +6,75 @@ understood by anyone wishing to add features to Samba.
 
 
 
+=============================================================================
+This section describes character set handling in Samba, as implemented in
+Samba 3.0 and above.
+
+In the past Samba had very ad-hoc character set handling. Scattered
+throughout the code were numerous calls which converted particular
+strings to/from DOS codepages. The problem was that there was no way of
+telling whether a particular char* was in a DOS codepage or the unix
+codepage. This led to a nightmare of code that tried to cope with
+particular cases without handling the general case.
+
+The new system works like this:
+
+- all char* strings inside Samba are "unix" strings. These are
+  multi-byte strings that are in the charset defined by the "unix
+  charset" option in smb.conf. 
+
+- there is no single fixed character set for unix strings, but any
+  character set that is used does need the following properties:
+    * must not contain NUL bytes except for termination
+    * must be 7-bit compatible with C strings, so that a constant
+      string or character in C will be byte-for-byte identical to the
+      equivalent string in the chosen character set. 
+    * when you uppercase or lowercase a string it does not become
+      longer than the original string
+    * must be able to correctly hold all characters that your client
+      will throw at it
+  For example, UTF-8 is fine, and most multi-byte Asian character sets
+  are fine, but UCS2 could not be used for unix strings as UCS2 strings
+  contain NUL bytes.
+
+- when you need to put a string into a buffer that will be sent on the
+  wire, or you need a string in a character set format that is
+  compatible with the client's character set, then you need to use a
+  pull_ or push_ function. The pull_ functions pull a string from a
+  wire buffer into a (multi-byte) unix string. The push_ functions
+  push a string out to a wire buffer. 
+
+- the two main pull_ and push_ functions you need to understand are
+  pull_string and push_string. These functions take a base pointer
+  that should point at the start of the SMB packet that the string is
+  in. The functions will check the flags field in this packet to
+  automatically determine if the packet is marked as a unicode packet,
+  and they will choose whether to use unicode for this string based on
+  that flag. You may also force this decision using the STR_UNICODE or
+  STR_ASCII flags. For use in smbd/ and libsmb/ there are wrapper
+  functions clistr_ and srvstr_ that call the pull_/push_ functions
+  with the appropriate first argument.
+
+  You may also call the pull_ascii/pull_ucs2 or push_ascii/push_ucs2
+  functions if you know that a particular string is ascii or
+  unicode. There are also a number of other convenience functions in
+  charcnv.c that call the pull_/push_ functions with particularly
+  common arguments, such as pull_ascii_pstring().
+
+The biggest thing to remember is that internal (unix) strings in Samba
+may now contain multi-byte characters. This means you cannot assume
+that characters are always 1 byte long. Often this means that you will
+have to convert strings to ucs2 and back again in order to do some
+(seemingly) simple task. For examples of how to do this see functions
+like strchr_m(). I know this is very slow, and we will eventually
+speed it up, but right now we want this stuff correct, not fast.
+
+Other rules:
+
+  - all lp_ functions now return unix strings. The magic "DOS" flag on
+    parameters is gone.
+  - all vfs functions take unix strings. Don't convert when passing to
+    them.
 
 
 =============================================================================
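
Note on the pull_/push_ rule in the added text: the way a push_ function
picks its wire encoding can be sketched roughly as below. This is only an
illustration and not Samba's code. The name sketch_push_string, the
SKETCH_STR_* flag values and the packet_is_unicode argument are all
assumptions made for this note; the real functions read the flags field of
the SMB packet and convert through iconv, whereas this sketch accepts
7-bit ASCII input only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag values for this sketch only; not Samba's definitions. */
#define SKETCH_STR_ASCII   0x1
#define SKETCH_STR_UNICODE 0x2

/*
 * Copy a 7-bit ASCII string, including its terminator, into a wire buffer,
 * either as single bytes or as little-endian UCS-2.
 *
 * packet_is_unicode stands in for "the packet's flags field says unicode".
 * The flags argument can force the choice; if both flags are set, ASCII wins.
 *
 * Returns the number of bytes written, or 0 if the string does not fit or
 * contains a byte above 0x7f (which this sketch does not convert).
 */
size_t sketch_push_string(int packet_is_unicode, uint8_t *dest,
                          size_t dest_len, const char *src, int flags)
{
    int use_unicode = packet_is_unicode;
    size_t n, i, width, need;

    assert(dest != NULL && src != NULL);

    if (flags & SKETCH_STR_UNICODE) {
        use_unicode = 1;
    }
    if (flags & SKETCH_STR_ASCII) {
        use_unicode = 0;
    }

    n = strlen(src) + 1;            /* characters, including terminator */
    width = use_unicode ? 2 : 1;    /* bytes per character on the wire */
    need = n * width;
    if (need > dest_len) {
        return 0;
    }

    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c > 0x7f) {
            return 0;               /* would need a real charset conversion */
        }
        dest[i * width] = c;        /* low byte first (little-endian) */
        if (use_unicode) {
            dest[i * width + 1] = 0;
        }
    }
    return need;
}
```

With a 16-byte buffer, pushing "hi" writes 3 bytes for an ASCII packet and
6 bytes for a unicode packet. Passing SKETCH_STR_ASCII forces the 3-byte
form even when the packet is marked as unicode.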
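
Note on the multi-byte warning in the added text: the "convert to ucs2 and
back" approach that the document attributes to strchr_m() can be sketched
as below. This is a hedged illustration, not the Samba implementation. It
assumes the unix charset is UTF-8, handles only characters in the Basic
Multilingual Plane, and uses a small hand-written decoder in place of the
iconv-based conversion. The names sketch_strchr_m, utf8_decode, utf8_len
and SKETCH_MAX_CHARS are made up for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CHARS 256    /* arbitrary limit for this sketch */

/*
 * Decode one well-formed UTF-8 sequence of 1 to 3 bytes into a UCS-2 code
 * unit. Returns the number of bytes consumed, or 0 for anything this
 * sketch does not handle (4-byte, overlong or truncated sequences).
 */
static size_t utf8_decode(const unsigned char *s, uint16_t *out)
{
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *out = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return *out >= 0x80 ? 2 : 0;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *out = (uint16_t)(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) |
                          (s[2] & 0x3F));
        return *out >= 0x800 ? 3 : 0;
    }
    return 0;
}

/* Number of UTF-8 bytes that encode a given UCS-2 code unit. */
static size_t utf8_len(uint16_t c)
{
    return c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
}

/*
 * Find the first occurrence of the character `wanted` in the UTF-8 string s.
 * Returns a pointer to the first byte of that character, or NULL if it is
 * absent, the string is too long, or the string cannot be decoded.
 */
const char *sketch_strchr_m(const char *s, uint16_t wanted)
{
    uint16_t wide[SKETCH_MAX_CHARS];
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0, i, offset = 0;

    assert(s != NULL);

    /* Step 1: convert the multi-byte string to UCS-2. */
    while (*p != 0) {
        size_t used;
        if (count == SKETCH_MAX_CHARS) {
            return NULL;
        }
        used = utf8_decode(p, &wide[count]);
        if (used == 0) {
            return NULL;
        }
        p += used;
        count++;
    }

    /* Step 2: search character by character, and convert the position
     * back to a byte offset in the original multi-byte string. */
    for (i = 0; i < count; i++) {
        if (wide[i] == wanted) {
            return s + offset;
        }
        offset += utf8_len(wide[i]);
    }
    return NULL;
}
```

For the string "caf\xc3\xa9/x", searching for 0x00E9 returns a pointer to
byte offset 3, and searching for '/' returns offset 5 rather than 4,
because the preceding character occupies two bytes.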