ntdb: remove hash table trees.

TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1
author: Rusty Russell <rusty@rustcorp.com.au> 2012-06-19 12:43:04 +0930
committer: Rusty Russell <rusty@rustcorp.com.au> 2012-06-19 05:38:07 +0200
commit: dd42962878ab7c9ddfa79d7c32094fb6748017b8 (patch)
tree: a614af427c5ad0d962db77a58f133cb39c9bd057 /lib/ntdb/private.h
parent: f986554b1e38d8dd40b4bf4748d4aeb470e27d2e (diff)
download: samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.tar.gz
samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.tar.bz2
samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.zip
1 files changed, 56 insertions, 83 deletions
diff --git a/lib/ntdb/private.h b/lib/ntdb/private.h
index 488e99a0f9..957d85e494 100644
--- a/lib/ntdb/private.h
+++ b/lib/ntdb/private.h
@@ -48,6 +48,7 @@
 #include <utime.h>
 #include <unistd.h>
 #endif
+#include <assert.h>
 
 #ifndef TEST_IT
 #define TEST_IT(cond)
@@ -114,30 +115,17 @@ typedef int ntdb_bool_err;
 /* Hash chain locks. */
 #define NTDB_HASH_LOCK_START 64
 
-/* Range for hash locks. */
-#define NTDB_HASH_LOCK_RANGE_BITS 30
-#define NTDB_HASH_LOCK_RANGE (1 << NTDB_HASH_LOCK_RANGE_BITS)
-
-/* We have 1024 entries in the top level. */
-#define NTDB_TOPLEVEL_HASH_BITS 10
-/* And 64 entries in each sub-level: thus 64 bits exactly after 9 levels. */
-#define NTDB_SUBLEVEL_HASH_BITS 6
-/* And 8 entries in each group, ie 8 groups per sublevel. */
-#define NTDB_HASH_GROUP_BITS 3
-/* This is currently 10: beyond this we chain. */
-#define NTDB_MAX_LEVELS (1+(64-NTDB_TOPLEVEL_HASH_BITS) / NTDB_SUBLEVEL_HASH_BITS)
-
 /* Extend file by least 100 times larger than needed. */
 #define NTDB_EXTENSION_FACTOR 100
 
-/* We steal bits from the offsets to store hash info. */
-#define NTDB_OFF_HASH_GROUP_MASK ((1ULL << NTDB_HASH_GROUP_BITS) - 1)
 /* We steal this many upper bits, giving a maximum offset of 64 exabytes. */
 #define NTDB_OFF_UPPER_STEAL 8
-#define   NTDB_OFF_UPPER_STEAL_EXTRA 7
-/* The bit number where we store extra hash bits. */
-#define NTDB_OFF_HASH_EXTRA_BIT 57
-#define NTDB_OFF_UPPER_STEAL_SUBHASH_BIT 56
+
+/* And we use the lower bit, too. */
+#define NTDB_OFF_CHAIN_BIT	0
+
+/* Hash table sits just after the header. */
+#define NTDB_HASH_OFFSET (sizeof(struct ntdb_header))
 
 /* Additional features we understand.  Currently: none. */
 #define NTDB_FEATURE_MASK ((uint64_t)0)
@@ -145,7 +133,7 @@ typedef int ntdb_bool_err;
 /* The bit number where we store the extra hash bits. */
 /* Convenience mask to get actual offset. */
 #define NTDB_OFF_MASK							\
-	(((1ULL << (64 - NTDB_OFF_UPPER_STEAL)) - 1) - NTDB_OFF_HASH_GROUP_MASK)
+	(((1ULL << (64 - NTDB_OFF_UPPER_STEAL)) - 1) - (1<<NTDB_OFF_CHAIN_BIT))
 
 /* How many buckets in a free list: see size_to_bucket(). */
 #define NTDB_FREE_BUCKETS (64 - NTDB_OFF_UPPER_STEAL)
@@ -157,12 +145,14 @@ typedef int ntdb_bool_err;
 /* Indicates this entry is not on an flist (can happen during coalescing) */
 #define NTDB_FTABLE_NONE ((1ULL << NTDB_OFF_UPPER_STEAL) - 1)
 
+/* By default, hash is 64k bytes */
+#define NTDB_DEFAULT_HBITS 13
+
 struct ntdb_used_record {
 	/* For on-disk compatibility, we avoid bitfields:
 	   magic: 16,        (highest)
 	   key_len_bits: 5,
 	   extra_padding: 32
-	   hash_bits: 11
 	*/
         uint64_t magic_and_meta;
 	/* The bottom key_len_bits*2 are key length, rest is data length. */
@@ -189,11 +179,6 @@ static inline uint64_t rec_extra_padding(const struct ntdb_used_record *r)
 	return (r->magic_and_meta >> 11) & 0xFFFFFFFF;
 }
 
-static inline uint32_t rec_hash(const struct ntdb_used_record *r)
-{
-	return r->magic_and_meta & ((1 << 11) - 1);
-}
-
 static inline uint16_t rec_magic(const struct ntdb_used_record *r)
 {
 	return (r->magic_and_meta >> 48);
@@ -236,17 +221,12 @@ struct ntdb_recovery_record {
 	uint64_t eof;
 };
 
-/* If we bottom out of the subhashes, we chain. */
-struct ntdb_chain {
-	ntdb_off_t rec[1 << NTDB_HASH_GROUP_BITS];
-	ntdb_off_t next;
-};
-
 /* this is stored at the front of every database */
 struct ntdb_header {
 	char magic_food[64]; /* for /etc/magic */
 	/* FIXME: Make me 32 bit? */
 	uint64_t version; /* version of the code */
+	uint64_t hash_bits; /* bits for toplevel hash table. */
 	uint64_t hash_test; /* result of hashing HASH_MAGIC. */
 	uint64_t hash_seed; /* "random" seed written at creation time. */
 	ntdb_off_t free_table; /* (First) free table. */
@@ -260,8 +240,12 @@ struct ntdb_header {
 	ntdb_off_t capabilities; /* Optional linked list of capabilities. */
 	ntdb_off_t reserved[22];
 
-	/* Top level hash table. */
-	ntdb_off_t hashtable[1ULL << NTDB_TOPLEVEL_HASH_BITS];
+	/*
+	 * Hash table is next:
+	 *
+	 * struct ntdb_used_record htable_hdr;
+	 * ntdb_off_t htable[1 << hash_bits];
+	 */
 };
 
 struct ntdb_freetable {
@@ -280,33 +264,15 @@ struct ntdb_capability {
 /* Information about a particular (locked) hash entry. */
 struct hash_info {
 	/* Full hash value of entry. */
-	uint64_t h;
-	/* Start and length of lock acquired. */
-	ntdb_off_t hlock_start;
-	ntdb_len_t hlock_range;
-	/* Start of hash group. */
-	ntdb_off_t group_start;
-	/* Bucket we belong in. */
-	unsigned int home_bucket;
+	uint32_t h;
+	/* Start of hash table / chain. */
+	ntdb_off_t table;
+	/* Number of entries in this table/chain. */
+	ntdb_off_t table_size;
 	/* Bucket we (or an empty space) were found in. */
-	unsigned int found_bucket;
-	/* How many bits of the hash are already used. */
-	unsigned int hash_used;
-	/* Current working group. */
-	ntdb_off_t group[1 << NTDB_HASH_GROUP_BITS];
-};
-
-struct traverse_info {
-	struct traverse_level {
-		ntdb_off_t hashtable;
-		/* We ignore groups here, and treat it as a big array. */
-		unsigned entry;
-		unsigned int total_buckets;
-	} levels[NTDB_MAX_LEVELS + 1];
-	unsigned int num_levels;
-	unsigned int toplevel_group;
-	/* This makes delete-everything-inside-traverse work as expected. */
-	ntdb_off_t prev;
+	ntdb_off_t bucket;
+	/* Old value that was in that entry (if not found) */
+	ntdb_off_t old_val;
 };
 
 enum ntdb_lock_flags {
@@ -375,40 +341,46 @@ struct ntdb_methods {
 /*
   internal prototypes
 */
+/* Get bits from a value. */
+static inline uint32_t bits_from(uint64_t val, unsigned start, unsigned num)
+{
+	assert(num <= 32);
+	return (val >> start) & ((1U << num) - 1);
+}
+
+
 /* hash.c: */
-uint64_t ntdb_jenkins_hash(const void *key, size_t length, uint64_t seed,
+uint32_t ntdb_jenkins_hash(const void *key, size_t length, uint32_t seed,
 			   void *unused);
 
 enum NTDB_ERROR first_in_hash(struct ntdb_context *ntdb,
-			      struct traverse_info *tinfo,
+			      struct hash_info *h,
 			      NTDB_DATA *kbuf, size_t *dlen);
 
 enum NTDB_ERROR next_in_hash(struct ntdb_context *ntdb,
-			     struct traverse_info *tinfo,
+			     struct hash_info *h,
 			     NTDB_DATA *kbuf, size_t *dlen);
 
 /* Hash random memory. */
-uint64_t ntdb_hash(struct ntdb_context *ntdb, const void *ptr, size_t len);
-
-/* Hash on disk. */
-uint64_t hash_record(struct ntdb_context *ntdb, ntdb_off_t off);
+uint32_t ntdb_hash(struct ntdb_context *ntdb, const void *ptr, size_t len);
 
 /* Find and lock a hash entry (or where it would be). */
 ntdb_off_t find_and_lock(struct ntdb_context *ntdb,
 			 NTDB_DATA key,
 			 int ltype,
 			 struct hash_info *h,
-			 struct ntdb_used_record *rec,
-			 struct traverse_info *tinfo);
+			 struct ntdb_used_record *rec);
 
 enum NTDB_ERROR replace_in_hash(struct ntdb_context *ntdb,
-				struct hash_info *h,
+				const struct hash_info *h,
 				ntdb_off_t new_off);
 
-enum NTDB_ERROR add_to_hash(struct ntdb_context *ntdb, struct hash_info *h,
+enum NTDB_ERROR add_to_hash(struct ntdb_context *ntdb,
+			    const struct hash_info *h,
 			    ntdb_off_t new_off);
 
-enum NTDB_ERROR delete_from_hash(struct ntdb_context *ntdb, struct hash_info *h);
+enum NTDB_ERROR delete_from_hash(struct ntdb_context *ntdb,
+				 const struct hash_info *h);
 
 /* For ntdb_check */
 bool is_subhash(ntdb_off_t val);
@@ -424,7 +396,7 @@ ntdb_off_t next_ftable(struct ntdb_context *ntdb, ntdb_off_t ftable);
 
 /* This returns space or -ve error number. */
 ntdb_off_t alloc(struct ntdb_context *ntdb, size_t keylen, size_t datalen,
-		 uint64_t hash, unsigned magic, bool growing);
+		 unsigned magic, bool growing);
 
 /* Put this record in a free list. */
 enum NTDB_ERROR add_free_record(struct ntdb_context *ntdb,
@@ -436,7 +408,7 @@ enum NTDB_ERROR add_free_record(struct ntdb_context *ntdb,
 enum NTDB_ERROR set_header(struct ntdb_context *ntdb,
 			   struct ntdb_used_record *rec,
 			   unsigned magic, uint64_t keylen, uint64_t datalen,
-			   uint64_t actuallen, unsigned hashlow);
+			   uint64_t actuallen);
 
 /* Used by ntdb_check to verify. */
 unsigned int size_to_bucket(ntdb_len_t data_len);
@@ -504,13 +476,12 @@ enum NTDB_ERROR owner_conflict(struct ntdb_context *ntdb, const char *call);
 /* If we fork, we no longer really own locks. */
 bool check_lock_pid(struct ntdb_context *ntdb, const char *call, bool log);
 
-/* Lock/unlock a range of hashes. */
-enum NTDB_ERROR ntdb_lock_hashes(struct ntdb_context *ntdb,
-				 ntdb_off_t hash_lock, ntdb_len_t hash_range,
-				 int ltype, enum ntdb_lock_flags waitflag);
-enum NTDB_ERROR ntdb_unlock_hashes(struct ntdb_context *ntdb,
-				   ntdb_off_t hash_lock,
-				   ntdb_len_t hash_range, int ltype);
+/* Lock/unlock a hash bucket. */
+enum NTDB_ERROR ntdb_lock_hash(struct ntdb_context *ntdb,
+			       unsigned int hbucket,
+			       int ltype);
+enum NTDB_ERROR ntdb_unlock_hash(struct ntdb_context *ntdb,
+				 unsigned int hash, int ltype);
 
 /* For closing the file. */
 void ntdb_lock_cleanup(struct ntdb_context *ntdb);
@@ -588,9 +559,11 @@ struct ntdb_context {
 	struct ntdb_file *file;
 
 	/* Hash function. */
-	uint64_t (*hash_fn)(const void *key, size_t len, uint64_t seed, void *);
+	uint32_t (*hash_fn)(const void *key, size_t len, uint32_t seed, void *);
 	void *hash_data;
-	uint64_t hash_seed;
+	uint32_t hash_seed;
+	/* Bits in toplevel hash table. */
+	unsigned int hash_bits;
 
 	/* Allocate and free functions. */
 	void *(*alloc_fn)(const void *owner, size_t len, void *priv_data);
author	Rusty Russell <rusty@rustcorp.com.au>	2012-06-19 12:43:04 +0930
committer	Rusty Russell <rusty@rustcorp.com.au>	2012-06-19 05:38:07 +0200
commit	dd42962878ab7c9ddfa79d7c32094fb6748017b8 (patch)
tree	a614af427c5ad0d962db77a58f133cb39c9bd057 /lib/ntdb/private.h
parent	f986554b1e38d8dd40b4bf4748d4aeb470e27d2e (diff)
download	samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.tar.gz samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.tar.bz2 samba-dd42962878ab7c9ddfa79d7c32094fb6748017b8.zip